I describe here my implementation of the AlphaGo Zero algorithm, available on Github,
written in Python with custom Tensorflow GPU operations and a few accessory functions in C for the tree search.
The algorithm has gone through three main iterations: first called AlphaGo, then improved to require no pre-training on human games and called AlphaGo Zero, and finally generalized further to other games and called AlphaZero. My implementation is most similar to AlphaZero; however, all variants are relatively similar.
I have also implemented similar GPU functions for the game Reversi, but at the moment only my Go GPU code is available (I could release the Reversi code too if anyone is interested).
My motivation for writing this has been to apply the algorithm to other board games on relatively minimal hardware (a single GPU). As a result, I have not attempted to implement the algorithm word-for-word (for example, all games are played for a pre-set number of turns instead of using an automated procedure to have the algorithm resign near the end-game).
I release this code, descriptions, and all other material in the public domain. If you find it useful, I'd be happy to hear about it. I'd also be happy to help, to the extent that I can and time allows, if you have any problems understanding my code. My email is X@Y.com, where X is cody2007 and Y is protonmail.com.
How good is the network? Not that great--I can beat it essentially all the time and I'm a complete novice at Go. However, the implementation does seem to work--its score against GNU Go does steadily increase while training (the network starts to beat GNU Go about 50% of the time for 20-turn games using a simple scoring system) and it can at least beat a completely random player about 100% of the time.
One caveat is that the network takes shortcuts to exploit the scoring system I used. I score the game by counting all the stones each player has on the board, plus all the empty positions that each player has completely surrounded. For this reason, the neural network seems to like (especially near the end-game) placing stones in territories safely controlled by the other player. This substantially reduces the other player's score (they no longer get any points for that territory aside from the stones on its border), even though the strategy wouldn't hold up if games were played out beyond 20 turns. See the section below about improving performance.
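To make that scoring rule concrete, here is a minimal numpy/scipy sketch of it (this is not the GPU code from the repository; the board encoding of 0 = empty, 1 and 2 = the two players is assumed purely for illustration):

import numpy as np
from scipy.ndimage import label

def simple_score(board):
    # board: 2D array, 0 = empty, 1 = player one's stones, 2 = player two's stones
    scores = {1: int((board == 1).sum()), 2: int((board == 2).sum())}  # one point per stone on the board
    empty_regions, n_regions = label(board == 0)   # connected regions of empty positions
    for region in range(1, n_regions + 1):
        mask = empty_regions == region
        # positions adjacent (up/down/left/right) to this empty region
        neighbor = np.zeros_like(mask)
        neighbor[:-1] |= mask[1:]; neighbor[1:] |= mask[:-1]
        neighbor[:, :-1] |= mask[:, 1:]; neighbor[:, 1:] |= mask[:, :-1]
        bordering = set(np.unique(board[neighbor & (board != 0)]))
        if bordering == {1}:      # empty territory completely surrounded by player one
            scores[1] += int(mask.sum())
        elif bordering == {2}:    # ... or completely surrounded by player two
            scores[2] += int(mask.sum())
    return scores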
This section assumes you have already installed Python 2.7, Tensorflow, the Nvidia toolkit, the GNU C compiler (GCC) and standard development environment, and some basic Python packages: numpy, scipy, colorama. It may also be useful to install and configure an IPython (Jupyter) notebook server for visualizing training progress (described more below). If you'd like to play against the network graphically, you might also want to install the Pygame Python package (also described more below).
All code has been written and tested on Ubuntu 18.04 using Python 2.7, Tensorflow v1.7.0, and compiled with NVCC V9.0.176 (the Nvidia Cuda compiler). I run and test all code on a single Nvidia Titan X card, with a quad-core Intel i5-6600K CPU @ 3.50GHz and 16 GB of RAM. I have not tested it on other configurations, but suspect it will likely compile and run with different versions of the Cuda toolkit and Tensorflow.
To compile the code:

cd ~/alpha_go_zero_implementation
./build.sh
You may need to adjust the -arch=sm_52 flag in the build.sh script to match the compute capability of your card.
You can find the compute capability of your card in several ways, but Tensorflow can easily tell you; in Python, run:
import tensorflow as tf
sess = tf.InteractiveSession()
When the session is created, Tensorflow prints a device line; the compute capability appears at the end:

Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11435 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0, compute capability: 5.2)
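If you would rather not fish it out of the session log, Tensorflow's device list reports the same information (this is just an optional alternative, not part of the original instructions):

from tensorflow.python.client import device_lib
print [d.physical_device_desc for d in device_lib.list_local_devices() if d.device_type == 'GPU']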
The code, as currently configured, requires approximately 7.8 GB of RAM (almost all of it for the tree search buffers). This memory usage can be scaled down (or up, if you want to run larger searches) by adjusting the constants TREE_BUFFER_SZ and MV_BUFFER_SZ in py_util/includes.h. Scaling down comes at the expense of also having to reduce N_SIM (the number of simulated games per turn used to construct the tree) in bp_tree.py, because the tree buffers become smaller.
To train a model:

cd ~/alpha_go_zero_implementation
./run.sh
Which model is trained is controlled by the save_nm variable at the top of the script. If save_nm is set to None, the script will start training from scratch. So, when starting to train a new model, you should update save_nm to the new model's file name once you start running the script. If you don't, training will start again from scratch each time the script (auto-)restarts, which it does whenever the tree search buffers overflow. In theory, this could be fixed by changing the code to recover from the tree buffer overflow instead of simply terminating, but I've not found it enough of a nuisance to justify the time of implementing that behavior just yet.
If you launch the script inside a GNU screen session, you can detach from the terminal and later use screen -r to reconnect.
To play against a trained network, set save_nm in the script to the name of the model file. You can play the network in two modes: running the network only (fastest, no tree search), or with the tree search. Setting run_net to True allows you to play without the tree search, and setting it to False enables the tree search.
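In other words, the relevant settings at the top of the script look roughly like this (the model file name below is a placeholder, not a file shipped with the repository):

save_nm = 'your_trained_model.npy'  # model file to load
run_net = True   # True: network only (fastest, no tree search); False: use the tree search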
The code keeps the game state in a set of GPU buffers, described next. Unless noted otherwise, the board variables have shape [BATCH_SZ, MAP_SZ_X, MAP_SZ_Y].

board represents the current game boards being evaluated.

board2 represents a backup of the game for use while a tree search evaluation is occurring (code for backing up and restoring the board is in kernels/session_backup.cu.cc).

board_prev and board_pprev represent the board one and two turns before the current one, to prevent repetition of moves (and, again, board_prev2 and board_pprev2 are backups similar to board2).

n_captures and n_captures2 (shape [N_PLAYERS, BATCH_SZ]) indicate the number of captures each player has made in the current game. This is for statistical use only and is not actually needed for the algorithm.

ai_to_coord (shape [BATCH_SZ]) contains the coordinates chosen by the random player and is also not needed for the algorithm (it is the output of kernels/move_random_ai.cu and an input to kernels/move_unit.cu).

valid_mv_map_internal (shape [BATCH_SZ, MAP_SZ]) is a binary map for each board indicating which positions are valid moves (1) and which are not (0). It is the output of kernels/create_batch.cu and is used by kernels/move_unit.cu to ensure only valid moves are made.
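As an aside, one common way such a binary map is used is to zero out (mask) the network's move probabilities at illegal positions before a move is chosen; a tiny illustration (not the kernel code itself):

import numpy as np

def mask_policy(pol_flat, valid_mv_map):
    # pol_flat, valid_mv_map: 1D arrays of length MAP_SZ for a single board
    masked = pol_flat * valid_mv_map          # zero probability at illegal positions
    return masked / max(masked.sum(), 1e-8)   # renormalize over the legal moves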
The tree search state is kept in the following buffers.

tree_sz (shape [BATCH_SZ]) represents the number of nodes (with children) in each game's tree. It should never exceed TREE_BUFFER_SZ, or else we have a memory overflow.

tree_start (shape [BATCH_SZ]) is the index of the node representing the current board state (it should always be >= 0 and less than tree_sz).

tree_player (shape [BATCH_SZ, TREE_BUFFER_SZ]) represents, for each node, the player who makes the next move.

tree_parent (shape [BATCH_SZ, TREE_BUFFER_SZ]) represents, for each node, the parent node index. A value of -1 indicates we are at the root of the tree.

tree_list_sz (shape [BATCH_SZ, TREE_BUFFER_SZ]) represents, for each node (game state), the number of children (possible moves) available. For all initialized nodes this value will always be at least one (indicating "no move") and never more than 1 + MAP_SZ (it's not possible to have more valid moves than the board size).

tree_list_start (shape [BATCH_SZ, TREE_BUFFER_SZ]) represents, for each node, the index where we can find the children and their statistics (visit counts, sums of Q values). This should never exceed MV_BUFFER_SZ, or we have a memory overflow.
The children and their statistics are stored in a set of list buffers. Each node points (via tree_list_start) within this buffer to a contiguous set of entries (tree_list_sz gives the number of entries).

list_sz (shape [BATCH_SZ]) contains the number of active elements in the list. As the list grows, elements are added at the end and the list_sz entry for each game is incremented accordingly. It should never exceed MV_BUFFER_SZ for any given game, because that is the size of the list buffers described below.
list_valid_mv_inds (shape [BATCH_SZ, MV_BUFFER_SZ]) contains, for each game, the valid move indices for the player who next makes a move.

list_valid_tree_inds (shape [BATCH_SZ, MV_BUFFER_SZ]) contains, for each game, the valid node indices (a pointer into the structures described above--the variables starting with tree_). With the node index, one can then look up the next set of leaves from the tree_list_start index. list_valid_tree_inds will be -1 if it is uninitialized. When the network moves to an uninitialized node, it is initialized by the py_util/register_mv.c function.

list_q_total (shape [BATCH_SZ, MV_BUFFER_SZ]) contains, for each game, the sum of Q values from all states visited at this move and onward in the game. It is divided by the visit count to arrive at the mean.

list_prob (shape [BATCH_SZ, MV_BUFFER_SZ]) contains, for each game, the network's estimated probability of winning the game for each move. This value only needs to be set once (unlike the visit counts and mean Q values) and is set in py_util/choose_moves.c.

list_visit_count (shape [BATCH_SZ, MV_BUFFER_SZ]) contains, for each game, the number of times this move, and any move following it, has been played.
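To make the indexing concrete, the sketch below walks the children of the current node for one game in the batch, using the arrays just described. It is a reading aid only (the actual traversal is done by the C functions in py_util/), written as a function so the buffers are passed in explicitly:

def inspect_children(gm, tree_start, tree_player, tree_list_start, tree_list_sz,
                     list_valid_mv_inds, list_valid_tree_inds,
                     list_q_total, list_visit_count, list_prob):
    node = tree_start[gm]                   # node holding game gm's current board state
    player = tree_player[gm, node]          # player who moves next from this node
    first = tree_list_start[gm, node]       # first child entry in the list buffers
    children = []
    for c in range(first, first + tree_list_sz[gm, node]):
        move = list_valid_mv_inds[gm, c]    # board coordinate of this candidate move
        child = list_valid_tree_inds[gm, c] # child node index (-1 = not yet expanded)
        visits = list_visit_count[gm, c]    # times this move (or any move after it) was played
        q_mean = list_q_total[gm, c] / max(visits, 1)   # mean backed-up Q value
        prior = list_prob[gm, c]            # network's probability estimate for this move
        children.append((move, child, visits, q_mean, prior))
    return player, children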
Next, let's walk through the main training script (bp_tree.py). First, the imports and the directory where models are saved and loaded:

import tensorflow as tf
import numpy as np
import time
import os
import copy
import random
import global_vars as gv
from scipy.stats import pearsonr
from colorama import Fore, Style
from datetime import datetime
import architectures.tree_tf_op as arch
import py_util.py_util as pu
from scipy.special import gamma
import gnu_go_test as gt

sdir = 'models/' # directory to save and load models
Next we set the configuration, including whether to load a previously saved model or start from scratch (the save_nm variable). We also create a list of variables that we want to save in our model file, and variables we will use for logging performance.

################################### configuration:

#### load previous model or start from scratch?
#save_nm = None # this results in the optimization starting from scratch (comment out line below)
save_nm = 'go_cpu_tree_0.200000EPS_7GMSZ_1000N_SIM_0.001000L2_LAMBDA_0.900000MOMENTUM_0.025000VAL_LAMBDA_1.000000CPUCT_20N_TURNS_128N_FILTERS_EPS0.110000_EPS0.020000.npy'

###### variables to save
save_vars = ['LSQ_LAMBDA', 'LSQ_REG_LAMBDA', 'POL_CROSS_ENTROP_LAMBDA', 'VAL_LAMBDA', 'VALR_LAMBDA', 'L2_LAMBDA',
    'FILTER_SZS', 'STRIDES', 'N_FILTERS', 'N_FC1', 'EPS', 'MOMENTUM', 'SAVE_FREQ', 'N_SIM', 'N_TURNS', 'CPUCT',
    'N_EVAL_NN_GMS', 'N_EVAL_NN_GNU_GMS', 'N_EVAL_TREE_GMS', 'N_EVAL_TREE_GNU_GMS', 'CHKP_FREQ', 'save_nm',
    'DIR_A', 'start_time', 'EVAL_FREQ', 'boards', 'scores']

logs = ['val_mean_sq_err', 'pol_cross_entrop', 'pol_max_pre', 'pol_max', 'val_pearsonr', 'opt_batch', 'eval_batch']
print_logs = ['val_mean_sq_err', 'pol_cross_entrop', 'pol_max', 'val_pearsonr']

for nm in ['tree', 'nn']:
    for suffix in ['', '_gnu']:
        for key in ['win', 'n_captures', 'n_captures_opp', 'score', 'n_mvs', 'boards']:
            logs += ['%s_%s%s' % (key, nm, suffix)]

state_vars = ['log', 'run_time', 'global_batch', 'global_batch_saved', 'global_batch_evald', 'save_counter', 'boards', 'save_t'] # updated each save
To change the learning rate when restoring a previously saved model, adjust the EPS variable in the else branch of the if statement below.
################### start from scratch
restore = False
if save_nm is None:
    ##### weightings on individual loss terms:
    LSQ_LAMBDA = 0
    LSQ_REG_LAMBDA = 0
    POL_CROSS_ENTROP_LAMBDA = 1
    VAL_LAMBDA = .025 #.05
    VALR_LAMBDA = 0
    L2_LAMBDA = 1e-3 # weight regularization
    DIR_A = 0
    CPUCT = 1

    ##### model parameters
    N_LAYERS = 5 # number of model layers
    FILTER_SZS = [3]*N_LAYERS
    STRIDES = [1]*N_LAYERS
    F = 128 # number of filters
    N_FILTERS = [F]*N_LAYERS
    N_FC1 = 128 # number of units in fully connected layer

    EPS = 2e-1 # backprop step size
    MOMENTUM = .9

    N_SIM = 1000 # number of simulations at each turn
    N_TURNS = 20 # number of moves per player per game

    ##### number of batch evaluations for testing model
    N_EVAL_NN_GMS = 1 # model evaluation for printing
    N_EVAL_NN_GNU_GMS = 1
    N_EVAL_TREE_GMS = 0 # model eval
    N_EVAL_TREE_GNU_GMS = 0

    ######### save and checkpoint frequency
    SAVE_FREQ = N_TURNS*1
    EVAL_FREQ = SAVE_FREQ*1
    CHKP_FREQ = 60*60

    start_time = datetime.now()
    save_t = datetime.now()

    save_nm = 'go_cpu_tree_%fEPS_%iGMSZ_%iN_SIM_%fL2_LAMBDA_%fMOMENTUM_%fVAL_LAMBDA_%fCPUCT_%iN_TURNS_%iN_FILTERS.npy' % (EPS, gv.n_rows, N_SIM, L2_LAMBDA, MOMENTUM, VAL_LAMBDA, CPUCT, N_TURNS, N_FILTERS[0])

    boards = {}; scores = {} # eval
    save_d = {}
    for key in save_vars:
        exec('save_d["%s"] = %s' % (key,key))
    save_d['script_nm'] = __file__

    global_batch = 0
    global_batch_saved = 0
    global_batch_evald = 0
    save_counter = 0

    log = {}
    for key in logs:
        log[key] = []

else:
    restore = True
    save_d = np.load(sdir + save_nm).item()

    for key in save_vars + state_vars:
        if key == 'save_nm':
            continue
        exec('%s = save_d["%s"]' % (key,key))

    EPS_ORIG = EPS
    EPS = 2e-2

    save_nm_orig = save_nm

    # append new learning rate to file name if not already added
    append_txt = '_EPS%f.npy' % EPS
    if EPS_ORIG != EPS and save_nm.find(append_txt) == -1:
        save_nm = save_nm.split('.npy')[0] + append_txt
    print 'saving to:'
    print save_nm

########################################################################################
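For reference, the *_LAMBDA values above weight the individual terms of an AlphaGo Zero style loss: a policy cross-entropy term, a value mean-squared-error term, and L2 weight regularization. The actual graph is built in architectures/tree_tf_op.py, so the sketch below (TF 1.x style) is only an illustration of how such a loss is typically combined, not the repository's exact code:

import tensorflow as tf

def alphago_zero_loss(pol, val, pol_target, val_target, weights,
                      pol_lambda, val_lambda, l2_lambda):
    # simplified sketch only -- the real graph also has terms for the LSQ_* and
    # VALR_* weightings (which are set to 0 in the configuration above)
    # policy term: cross-entropy between the tree-search move distribution and the network policy
    pol_cross_entrop = -tf.reduce_mean(tf.reduce_sum(pol_target * tf.log(pol + 1e-8), axis=1))
    # value term: squared error between the predicted value and the actual game outcome
    val_mean_sq_err = tf.reduce_mean(tf.square(val - val_target))
    # L2 regularization over the model weights
    l2_reg = tf.add_n([tf.nn.l2_loss(w) for w in weights])
    return pol_lambda*pol_cross_entrop + val_lambda*val_mean_sq_err + l2_lambda*l2_reg

The run_sim function below plays the tree-search simulations forward from the current turn: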
def run_sim(turn): # simulate game forward
    arch.sess.run(arch.session_backup)
    pu.session_backup()
For each simulated game, at every turn we run the network to get its estimated move probabilities (the pol variable in the script). We also get its Q value estimate of the current game state (the val variable in the script) and the list of valid moves the current player can make (generated by the kernels/create_batch.cu function). If we are not at the first move of the simulation, the script then backs up the Q value to all game states that led to the current one.
    for sim in range(N_SIM): # backup then make next move
        for turn_sim in range(turn, N_TURNS+1):
            for player in [0,1]:
                # get valid moves, network policy and value estimates:
                valid_mv_map, pol, val = arch.sess.run([arch.valid_mv_map, arch.pol, arch.val], feed_dict=ret_d(player))

                # backup visit Q values
                if turn_sim != turn:
                    pu.backup_visit(player, val)
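Backing up here means crediting every move on the path from the current node back to the root with the value, so later selections can use a running mean. pu.backup_visit does this in C over the buffers described earlier; the sketch below is only a rough picture of the idea (the per-player bookkeeping and the child_slot lookup helper are hypothetical, not names from the repository):

def backup_visit_sketch(gm, node, val, tree_parent, child_slot, list_q_total, list_visit_count):
    # walk from `node` back to the root, crediting the move leading into each node
    while node != -1:
        parent = tree_parent[gm, node]
        if parent != -1:
            c = child_slot(gm, parent, node)   # hypothetical helper: list index of `node` under `parent`
            list_q_total[gm, c] += val         # accumulate value (divided by visits later to get the mean)
            list_visit_count[gm, c] += 1
        node = parent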
The valid moves are then registered in the tree with the pu.add_valid_mvs function, which comes from the Python C module compiled above. Next, the tree is queried to find the next move based on the network's probability and Q value estimates. We then register the chosen moves in the tree and make the move (which updates the GPU memory structures containing the board state: the line beginning with arch.sess.run, which ultimately calls the kernels/move_unit.cu function).
                pu.add_valid_mvs(player, valid_mv_map) # register valid moves in tree

                to_coords = pu.choose_moves(player, pol, CPUCT)[0] # choose moves based on policy and Q values (latter of which already stored in tree)

                pu.register_mv(player, to_coords) # register move in tree

                arch.sess.run(arch.move_frm_inputs, feed_dict={arch.moving_player: player, arch.to_coords_input: to_coords}) # move network (update GPU vars)
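The selection inside pu.choose_moves presumably follows the standard AlphaGo Zero (PUCT) rule, balancing the mean backed-up Q value against the network prior and visit counts; the exact computation in py_util/choose_moves.c may differ, so treat this numpy version as a sketch of the idea only:

import numpy as np

def puct_select(q_total, visit_count, prior, cpuct):
    # q_total, visit_count, prior: 1D arrays over the candidate moves of one node
    q_mean = q_total / np.maximum(visit_count, 1)                          # mean backed-up value
    u = cpuct * prior * np.sqrt(visit_count.sum()) / (1. + visit_count)    # exploration bonus
    return int(np.argmax(q_mean + u))                                      # move to explore next

Once a simulated game reaches its final turn, the winner is backed up through the tree and the saved game state is restored: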
        # backup terminal state
        for player in [0,1]:
            winner = arch.sess.run(arch.winner, feed_dict=ret_d(player))
            pu.backup_visit(player, winner)

        # return move back to previous node in tree
        arch.sess.run(arch.session_restore)
        pu.session_restore()
In the main training loop, each turn starts by calling the run_sim function above to simulate N_SIM (e.g., 1000) games. Then, for each player, we save the board state (for use later as training input) and get the list of valid moves.
######################################### training loop:
while True:
    ######### generate batches
    buffer_loc = 0
    arch.sess.run(arch.init_state)
    pu.init_tree()
    turn_start_t = time.time()
    for turn in range(N_TURNS):
        run_sim(turn)

        ### make move
        for player in [0,1]:
            inds = buffer_loc + np.arange(gv.BATCH_SZ)
            board[inds], valid_mv_map, pol = arch.sess.run([arch.imgs, arch.valid_mv_map, arch.pol], feed_dict = ret_d(player)) # generate batch and valid moves
As in the run_sim function, we register the valid moves in the tree and then use the tree to choose the current player's next move. However, this time, instead of choosing the move with the maximum tree search value (which reflects the visit counts, Q values, and network probability estimates), we move probabilistically in proportion to the number of times the tree search visited each move. Then, as before, we register the move in the tree. Finally, we prune the tree (py_util/prune_tree.c), which means removing all the nodes and leaves that are no longer accessible (those representing branches that are no longer possible after the chosen move).
            #########
            pu.add_valid_mvs(player, valid_mv_map) # register valid moves in tree

            visit_count_map = pu.choose_moves(player, pol, CPUCT)[-1] # get number of times each node was visited

            to_coords = arch.sess.run([arch.tree_prob_visit_coord, arch.tree_prob_move_unit], feed_dict={arch.moving_player: player, arch.visit_count_map: visit_count_map, arch.dir_pre: dir_pre, arch.dir_a: DIR_A})[0] # make move in proportion to visit counts

            pu.register_mv(player, to_coords) # register move in tree

            ###############
            buffer_loc += gv.BATCH_SZ
            pu.prune_tree()
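Moving in proportion to visit counts simply means sampling from the normalized visit-count distribution. In the script this happens on the GPU (arch.tree_prob_visit_coord / arch.tree_prob_move_unit); a one-board numpy equivalent, purely for illustration:

import numpy as np

def sample_in_proportion_to_visits(visit_counts):
    # visit_counts: 1D array of tree-search visit counts over board positions for one game
    probs = visit_counts.astype(np.float64) / visit_counts.sum()
    return int(np.random.choice(len(probs), p=probs))   # index of the chosen move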
Once all turns have been played, we determine who won each game (the value the network is trained to predict; the val variable in the script) and determine the probability that the tree search would choose each move in a given game state (which we also train the network to predict; the pol variable in the script).
The py_util/return_probs_map.c function returns the probabilities for each turn, player, game, and move from our previous simulations. Its dimensions are [N_TURNS, N_PLAYERS, BATCH_SZ, MAP_SZ_X, MAP_SZ_Y].
    ##### create prob maps
    for player in [0,1]:
        winner[:, player] = arch.sess.run(arch.winner, feed_dict={arch.moving_player: player})

    tree_probs = pu.return_probs_map()
    #############################
    # train
    random.shuffle(inds_total)
    for batch in range(N_TURNS):
        inds = inds_total[batch*gv.BATCH_SZ + np.arange(gv.BATCH_SZ)]

        board2, tree_probs2 = pu.rotate_reflect_imgs(board[inds], tree_probs[inds]) # rotate and reflect board randomly
        train_dict = {arch.imgs: board2,
            arch.pol_target: tree_probs2,
            arch.val_target: winner.ravel()[inds]}

        val_mean_sq_err_tmp, pol_cross_entrop_err_tmp, val_pearsonr_tmp = arch.sess.run(bp_eval_nodes, feed_dict=train_dict)[1:]

        # update logs
        val_mean_sq_err += val_mean_sq_err_tmp
        pol_cross_entrop_err += pol_cross_entrop_err_tmp
        val_pearsonr += val_pearsonr_tmp

        global_batch += 1
        err_denom += 1