AlphaGo Zero taught itself Go — and you can build one too (Part 3)


Three processes working as one
AlphaGo Zero's training is split into three processes, and they run asynchronously.
The first is self-play, which generates the training data.
def self_play():
    while True:
        new_player, checkpoint = load_player()
        if new_player:
            player = new_player

        ## Create the self-play match queue of processes
        results = create_matches(player, cores=PARALLEL_SELF_PLAY,
                                 match_number=SELF_PLAY_MATCH)
        for _ in range(SELF_PLAY_MATCH):
            result = results.get()
            db.insert({
                "game": result,
                "id": game_id
            })
            game_id += 1
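To make the stored data concrete, here is a rough sketch of what a single self-play game could produce. Every helper below (new_board, mcts_search, sample_move) is a hypothetical placeholder for illustration, not the repo's actual API: each position is paired with the MCTS visit-count distribution, and the final winner is attached to every sample once the game ends.

# Hypothetical sketch of one self-play game (helper names invented for illustration):
# every position is stored with the MCTS policy, and the final winner is
# attached to each sample after the game is over.
def play_one_game(player):
    board = new_board()                      # hypothetical board constructor
    history = []
    while not board.is_over():
        probas = player.mcts_search(board)   # visit counts normalized to a distribution
        move = sample_move(probas)           # stochastic in the opening, greedy later
        history.append((board.state(), probas))
        board.play(move)
    winner = board.winner()
    return [(state, probas, winner) for state, probas in history]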
The second is training, which takes the freshly generated data and uses it to improve the current neural network.
def train():
    criterion = AlphaLoss()
    dataset = SelfPlayDataset()
    player, checkpoint = load_player(current_time, loaded_version)
    optimizer = create_optimizer(player, lr,
                                 param=checkpoint['optimizer'])
    best_player = deepcopy(player)
    dataloader = DataLoader(dataset, collate_fn=collate_fn,
                            batch_size=BATCH_SIZE, shuffle=True)

    while True:
        for batch_idx, (state, move, winner) in enumerate(dataloader):

            ## Evaluate a copy of the current network
            if total_ite % TRAIN_STEPS == 0:
                pending_player = deepcopy(player)
                result = evaluate(pending_player, best_player)

                if result:
                    best_player = pending_player

            example = {
                'state': state,
                'winner': winner,
                'move' : move
            }
            optimizer.zero_grad()
            winner, probas = pending_player.predict(example['state'])

            loss = criterion(winner, example['winner'],
                             probas, example['move'])
            loss.backward()
            optimizer.step()

            ## Fetch new games
            if total_ite % REFRESH_TICK == 0:
                last_id = fetch_new_games(collection, dataset, last_id)
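The dataloader above yields batches of (state, move, winner) tensors. As an illustration only, an assumed, simplified collate function (not the repo's exact SelfPlayDataset/collate_fn) could look roughly like this:

import numpy as np
import torch

# Assumed, simplified collate function: stack the sampled positions, MCTS
# policies and game outcomes from the replay dataset into batched tensors.
def collate_fn(batch):
    states, moves, winners = zip(*batch)
    return (torch.tensor(np.stack(states), dtype=torch.float),
            torch.tensor(np.stack(moves), dtype=torch.float),
            torch.tensor(np.array(winners), dtype=torch.float).view(-1, 1))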
The loss function used for training looks like this:
class AlphaLoss(torch.nn.Module):
    def __init__(self):
        super(AlphaLoss, self).__init__()

    def forward(self, pred_winner, winner, pred_probas, probas):
        value_error = (winner - pred_winner) ** 2
        policy_error = torch.sum((-probas *
                                  (1e-6 + pred_probas).log()), 1)
        total_error = (value_error.view(-1) + policy_error).mean()
        return total_error
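This is the value-plus-policy loss from the AlphaGo Zero paper, (z − v)² − πᵀ log p, without the L2 regularization term (which can be left to the optimizer's weight decay). A minimal usage sketch, with made-up batch shapes and random tensors standing in for the network outputs and self-play targets:

import torch

criterion = AlphaLoss()
batch_size, num_moves = 4, 19 * 19 + 1   # hypothetical: 19x19 board plus pass

pred_winner = torch.tanh(torch.randn(batch_size, 1))                    # value head, in [-1, 1]
winner      = torch.tensor([[1.0], [-1.0], [1.0], [-1.0]])              # game outcomes
pred_probas = torch.softmax(torch.randn(batch_size, num_moves), dim=1)  # policy head
probas      = torch.softmax(torch.randn(batch_size, num_moves), dim=1)  # MCTS visit distribution

loss = criterion(pred_winner, winner, pred_probas, probas)
print(loss.item())   # a single scalar to backpropagate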
The third is evaluation, which checks whether the newly trained agent has become stronger than the agent currently generating data (the winner goes back to step one and keeps generating data).
def evaluate(player, new_player):
    results = play(player, opponent=new_player)
    black_wins = 0
    white_wins = 0

    for result in results:
        if result[0] == 1:
            white_wins += 1
        elif result[0] == 0:
            black_wins += 1

    ## Check if the trained player (black) is better than
    ## the current best player depending on the threshold
    if black_wins >= EVAL_THRESH * len(results):
        return True
    return False
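For reference, the AlphaGo Zero paper evaluates candidate networks over 400 games and promotes a network only if it wins at least 55% of them. An illustrative value for the constant above (an assumption; the repo's default may differ):

# Illustrative value (an assumption; the repo's default may differ):
EVAL_THRESH = 0.55   # the pending player must win >= 55% of evaluation games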
The third part matters a great deal: only by continually selecting the best network can we keep generating high-quality data, and that is what raises the AI's playing strength.
The three stages cycle round and round, asynchronously, until a strong player is forged (one way to wire them together is sketched below).
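A rough sketch of that wiring, assuming a plain multiprocessing setup (this is not the repo's actual entry point): self-play and training run in separate processes and exchange games through the shared database, while evaluation is triggered from inside train() as shown above.

from multiprocessing import Process

if __name__ == "__main__":
    workers = [
        Process(target=self_play),   # keeps generating games with the current best player
        Process(target=train),       # consumes games and periodically calls evaluate()
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()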
If you are keen on AI Go, give this PyTorch implementation a try.
This article is adapted from 量子位 (QbitAI); the original author is Dylan Djian.
Code implementation:
(web link)
Original tutorial:
(web link)
AlphaGo Zero paper:
(web link)
