
AlphaGo Zero implementation and tutorial

I describe here my implementation of the AlphaGo Zero algorithm, available on GitHub, written in Python with custom Tensorflow GPU operations and a few accessory functions in C for the tree search. The algorithm has gone through three main iterations: it was first called AlphaGo, then improved to not require any pre-training and called AlphaGo Zero, and finally generalized further to other games and called AlphaZero. My implementation is most similar to AlphaZero; however, all variants are relatively similar. I implemented similar GPU functions for the game Reversi, but as of now only my Go GPU code is available (I could make the Reversi code available if anyone's interested, though).

My motivation for writing this has been to use the algorithm on other board games using relatively minimal hardware (a single GPU)--as a result, I have not attempted to implement the algorithm word-for-word (for example, all games are played with a pre-set number of turns instead of using an automated procedure for having the algorithm auto-resign near the end-game).

I release this code, descriptions, and all other material into the public domain. If you find it useful, I'd be happy to hear about it. I'd also be happy to help, to the extent that I can and time allows, if you have any problems understanding my code. My email is X@Y, where X is cody2007 and Y is

How good is the network? Not that great--I can beat it essentially every time, and I'm a complete novice at Go. However, the implementation does seem to work: its score against GNU Go steadily increases during training (the network comes to beat GNU Go about 50% of the time for 20-turn games using a simple scoring system), and it beats a completely random player essentially 100% of the time.

One problem, though, is that it takes shortcuts that abuse the scoring system I used. I score the game by counting all the stones each player has on the board plus all the empty positions that each player has completely surrounded. For this reason, the neural network seems to like placing stones (especially near the end-game) in territories safely controlled by the other player. This substantially reduces the other player's score (that player no longer gets any points for the territory aside from the stones on its border) even though the strategy wouldn't hold up if games were played out beyond 20 turns. See the section below about improving performance.
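Concretely, the scoring rule just described can be sketched as follows: a simplified area count, assuming a board encoded as a 2-D array with 0 = empty, 1 = black, 2 = white. The names here are illustrative and not those used in the repository.

```python
import numpy as np

EMPTY, BLACK, WHITE = 0, 1, 2

def score(board):
    """Count each player's stones plus empty regions bordered only by that player."""
    board = np.asarray(board)
    scores = {BLACK: int((board == BLACK).sum()),
              WHITE: int((board == WHITE).sum())}
    visited = np.zeros(board.shape, dtype=bool)
    for x in range(board.shape[0]):
        for y in range(board.shape[1]):
            if board[x, y] != EMPTY or visited[x, y]:
                continue
            # flood-fill the empty region, recording which colors border it
            stack, region, borders = [(x, y)], [], set()
            visited[x, y] = True
            while stack:
                i, j = stack.pop()
                region.append((i, j))
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if not (0 <= ni < board.shape[0] and 0 <= nj < board.shape[1]):
                        continue
                    if board[ni, nj] == EMPTY:
                        if not visited[ni, nj]:
                            visited[ni, nj] = True
                            stack.append((ni, nj))
                    else:
                        borders.add(int(board[ni, nj]))
            if len(borders) == 1:  # region completely surrounded by one player
                scores[borders.pop()] += len(region)
    return scores
```

Under this rule, a single opposing stone placed inside a territory removes the surrounding player's claim to the entire region, which is exactly the shortcut the network learns to exploit.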

Table of contents

Compiling and running the code

This section assumes you have already installed Python 2.7, Tensorflow, the Nvidia CUDA toolkit, the GNU C compiler (GCC) and a standard development environment, and some basic Python packages: numpy, scipy, colorama. It may also be useful to install and configure an IPython (Jupyter) notebook server for visualizing training progress (described more below). If you'd like to play against the network graphically, you may also want to install the Pygame Python package (also described more below).

All code has been written and tested on Ubuntu 18.04 using Python 2.7 and Tensorflow v1.7.0, compiled with NVCC V9.0.176 (the Nvidia CUDA compiler). I run and test all code on a single Nvidia Titan X card, with a quad-core Intel i5-6600K CPU @ 3.50GHz and 16 GB of RAM. I have not tested it on other configurations, but suspect it will likely compile and run with different versions of the CUDA toolkit and Tensorflow.

Compiling the code

Compilation involves building the Tensorflow operations and the Python C module. I've written the compilation commands into a shell script included with the code. So, to compile, you should be able to change to the directory where you downloaded the code and simply run that script:

cd ~/alpha_go_zero_implementation

If you have an older Nvidia GPU card than mine (or even a newer one and want better performance), you'll need to adjust the -arch=sm_52 flag in the build script to match the compute capability of your card. There are probably several ways to find your card's compute capability, but Tensorflow can easily tell you--in Python you can run:

import tensorflow as tf
sess = tf.InteractiveSession()

This generates a fair amount of text, but the last line or so should look something like this:

Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11435 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0, compute capability: 5.2)

(Notice the "compute capability: 5.2" statement above, so sm_52 is the best option for me.)
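For reference, the -arch value is simply the compute capability with the dot removed. A trivial helper (illustrative, not part of the repository) makes the mapping explicit:

```python
def arch_flag(compute_capability):
    """Convert a compute capability string like '5.2' to an NVCC -arch value."""
    major, minor = compute_capability.split('.')
    return '-arch=sm_%s%s' % (major, minor)

print(arch_flag('5.2'))  # -arch=sm_52
```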

Adjusting the code's memory requirements (if needed)

The code, as currently configured, requires approximately 7.8 GB of RAM (almost all of it for the tree search buffers). However, this memory usage can be scaled back (or up, if you want to run larger searches) by adjusting the constants TREE_BUFFER_SZ and MV_BUFFER_SZ in py_util/includes.h--at the expense of also having to reduce N_SIM (the constant in the training script controlling the number of simulated games per turn used to construct the tree) because of the decreased size of the tree buffers.
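Before editing py_util/includes.h, it can help to estimate roughly how the buffer constants translate into memory. The per-entry sizes below are illustrative guesses, not the actual structure sizes defined in that header:

```python
def tree_mem_gb(tree_buffer_sz, mv_buffer_sz, bytes_per_node=64, bytes_per_mv=16):
    """Rough tree-buffer memory estimate in GB; entry sizes are illustrative guesses."""
    total_bytes = tree_buffer_sz * bytes_per_node + mv_buffer_sz * bytes_per_mv
    return total_bytes / float(1024**3)
```

Scaling TREE_BUFFER_SZ and MV_BUFFER_SZ down by half should roughly halve this figure, which is why N_SIM must shrink along with them.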

Running the code

Occasionally, the tree search exceeds the buffer size, at which point the code should detect this and terminate. For this reason, I created a shell script that runs the main training script in a loop--if training terminates, the script restarts it. So, from the command line, you would change your current directory to where you downloaded the code and execute that shell script. For example:

cd ~/alpha_go_zero_implementation

Because the script occasionally terminates, you need to be careful when starting to train a new model. The training script works by loading the model file specified in the variable save_nm at the top of the script; if save_nm is set to None, the script starts from scratch. So, when starting to train a new model, you should update save_nm to the new model's file name once the script is running. If you don't, training will start over from scratch each time the script (auto) restarts. In theory, this could be fixed by changing the code to recover from the tree buffer overflow instead of simply terminating, but I've not found it enough of a nuisance to justify the time of implementing that behavior just yet.

While not necessary, I often run all of this in a GNU Screen session so that I can close my console and have the code continue running (use screen -r to reconnect).

Visualizing the algorithm training

I've included a Jupyter notebook for visualizing network training progress and also plotting example games played by the neural network (see the file notebooks/training_visualizations.ipynb). Below I show example outputs in the notebook.

Black = neural network; white = opponent
The network playing against a random opponent and GNU Go

Black = neural network; white = opponent
The end game board positions of the network playing against a random opponent and GNU Go

Playing the trained network

An included script allows you to play against the network via a simple GUI using Pygame. Set the variable save_nm in the script to the name of the model file. You can play the network in two modes: running the network only (fastest, no tree search) or with the tree search. Setting run_net to True lets you play without the tree search; setting it to False enables the tree search.

Overview of the algorithm

The best descriptions of the algorithm are in the original papers (AlphaGo, AlphaGo Zero, and AlphaZero); however, I'll briefly summarize. The idea is to use a tree search guided by a neural network that is trained to estimate likely good moves. A tree search is simply a brute-force approach where, at each move, multiple games are played out from the current position to the end-game to determine how good a particular move is. The tree search chooses moves based on the output of the neural network, and as the neural network is trained, the tree search becomes more focused on exploring successful strategies. How is the neural network trained? It is trained to estimate a set of summary statistics from the tree search (how often each move was visited by the tree search, and how likely each move was to result in the network winning the game). During training, the network plays against itself and requires no human training data.
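The rule the tree search uses to pick a move during a simulation can be sketched with the PUCT formula from the AlphaGo Zero paper (the variable names below are illustrative, not those in the repository):

```python
import numpy as np

def select_move(Q, N, P, cpuct=1.0):
    """Pick the move maximizing Q + cpuct * P * sqrt(sum(N)) / (1 + N).

    Q: mean value of each candidate move, N: visit counts,
    P: network prior probabilities, cpuct: exploration constant.
    """
    Q = np.asarray(Q, dtype=float)
    N = np.asarray(N, dtype=float)
    P = np.asarray(P, dtype=float)
    U = cpuct * P * np.sqrt(N.sum()) / (1. + N)  # exploration bonus
    return int(np.argmax(Q + U))
```

Unvisited moves with a high prior get a large exploration bonus U, so early simulations follow the network's suggestions; as visits accumulate, the measured Q values dominate.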

Code overview

I've placed all code governing the Go rules and movement in Tensorflow GPU operations (in the subdirectory kernels/). The tree structures I've kept on the CPU (in the subdirectory py_util/). I previously implemented the tree search as Tensorflow GPU operations as well, but decided to move it to the CPU for two reasons. The first is that I was starting to run into GPU memory limitations when running tree searches with more than 1000 simulations. The second is that I found about a 2x speed improvement keeping the tree search code on the CPU (although I had not invested much time in optimizing the GPU tree search implementation).

GPU game data-structures

All GPU data structures are in row-major-order and are defined in cuda_includes.h.

The board is stored with dimensions [BATCH_SZ, MAP_SZ_X, MAP_SZ_Y].
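Row-major order means the last dimension varies fastest, so the flat offset of board position (x, y) in game b is b*MAP_SZ_X*MAP_SZ_Y + x*MAP_SZ_Y + y. A quick check (the sizes here are illustrative; the real constants live in cuda_includes.h):

```python
import numpy as np

BATCH_SZ, MAP_SZ_X, MAP_SZ_Y = 128, 7, 7  # illustrative sizes only

def flat_index(b, x, y):
    """Offset of element (b, x, y) in a row-major [BATCH_SZ, MAP_SZ_X, MAP_SZ_Y] buffer."""
    return b * MAP_SZ_X * MAP_SZ_Y + x * MAP_SZ_Y + y

# numpy arrays are row-major (C order) by default, so the two views agree:
board = np.arange(BATCH_SZ * MAP_SZ_X * MAP_SZ_Y).reshape(BATCH_SZ, MAP_SZ_X, MAP_SZ_Y)
assert board[3, 2, 5] == flat_index(3, 2, 5)
```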

CPU tree data-structures

All CPU tree data structures (used within the py_util/_py_util.c Python module) are also in row-major order. They are defined in py_util/includes.h and are pre-allocated to a fixed size to avoid constantly allocating and freeing memory. Care must be taken not to exceed the buffer sizes (the code terminates if they are exceeded).

Node data structures:
Each node represents a board state and contains a pointer to lists of: valid moves, their visit counts, mean Q values, and the win probabilities assigned by the neural network. These values (visit counts and mean Q values) are updated each time the network makes a move--all game states leading to the current state are updated via the py_util/backup_visit.c function.

List data structures:
As mentioned above, each game state (which I call a node) is associated with a list of valid moves, visit counts, mean Q values, and probabilities. These structures are each stored in a single large buffer, and the node data structures contain a pointer (tree_list_start) within this buffer to a contiguous set of entries (tree_list_sz represents the number of entries).

Model architecture

All my testing has focused on a five-layer residual convolutional neural network with 128 filters at each layer (although these parameters can easily be changed in the training script). This is a smaller network than those reported in the AlphaGo papers, a choice influenced by my own hardware and time constraints.
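The basic building block (two 3x3 convolutions with a skip connection) can be sketched in numpy purely to illustrate the shapes; the actual network is built in Tensorflow in the architectures/ file and includes details omitted here:

```python
import numpy as np

def conv2d_same(x, w):
    """3x3 'same' convolution (deep-learning convention): x is [C_in, H, W], w is [C_out, C_in, 3, 3]."""
    c_out = w.shape[0]
    h, wd = x.shape[1:]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode='constant')
    out = np.zeros((c_out, h, wd))
    for i in range(3):
        for j in range(3):
            # accumulate each kernel tap over all input channels
            out += np.einsum('oc,chw->ohw', w[:, :, i, j], xp[:, i:i+h, j:j+wd])
    return out

def residual_block(x, w1, w2):
    """Residual block: relu(x + conv(relu(conv(x)))), preserving the board shape."""
    h = np.maximum(conv2d_same(x, w1), 0)
    return np.maximum(x + conv2d_same(h, w2), 0)
```

Because the spatial shape is preserved, blocks like this can be stacked five deep (with 128 filters) while the skip connections keep gradients flowing to the early layers.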

Walkthrough of the training script

Here is a walk-through of the training script. Generally, it sets up and loads general variables and state information. It then enters the main loop, where it first generates training batches by self-play, then trains on them in a randomly shuffled order while also randomly reflecting and rotating the board. At the end of the loop, it saves and evaluates the network, and the cycle repeats.

The first part of the script simply imports modules and sets the directory where the neural network model is to be loaded and saved. Two files worth mentioning are the global-variables file (imported as gv below), which contains, for example, the batch size, and the architectures/ file, which constructs the neural network in Tensorflow using parameters we define later in the present script.

import tensorflow as tf
import numpy as np
import time
import os
import copy
import random
import global_vars as gv
from scipy.stats import pearsonr
from colorama import Fore, Style
from datetime import datetime
import architectures.tree_tf_op as arch
import py_util.py_util as pu
from scipy.special import gamma
import gnu_go_test as gt
sdir = 'models/' # directory to save and load models

Next, we set if we want the script to start training a new network or load and resume training an existing one (by setting the save_nm variable). Also, we create a list of variables that we want to save in our model file and variables we will use for logging performance.

################################### configuration: 
#### load previous model or start from scratch?
#save_nm = None # this results in the optimization starting from scratch (comment out line below)
save_nm = 'go_cpu_tree_0.200000EPS_7GMSZ_1000N_SIM_0.001000L2_LAMBDA_0.900000MOMENTUM_0.025000VAL_LAMBDA_1.000000CPUCT_20N_TURNS_128N_FILTERS_EPS0.110000_EPS0.020000.npy'

###### variables to save
save_vars = ['save_nm', 'DIR_A', 'start_time', 'EVAL_FREQ', 'boards', 'scores']
logs = ['val_mean_sq_err', 'pol_cross_entrop', 'pol_max_pre', 'pol_max', 'val_pearsonr','opt_batch','eval_batch']
print_logs = ['val_mean_sq_err', 'pol_cross_entrop', 'pol_max', 'val_pearsonr']

for nm in ['tree', 'nn']:
	for suffix in ['', '_gnu']:
		for key in ['win', 'n_captures', 'n_captures_opp', 'score', 'n_mvs', 'boards']:
			logs += ['%s_%s%s' % (key, nm, suffix)]

state_vars = ['log', 'run_time', 'global_batch', 'global_batch_saved', 'global_batch_evald', 'save_counter','boards', 'save_t'] # updated each save

Now we either initialize a new model or load it. This is where you'd change model or training parameters, such as the number of layers, filters, loss function weightings, or gradient descent step size. Also, if you ever want to reduce the gradient descent step size, you'd set the EPS variable in the else branch of the if statement.

################### start from scratch
restore = False
if save_nm is None:
	##### weightings on individual loss terms:
	VAL_LAMBDA = .025 #.05
	L2_LAMBDA = 1e-3 # weight regularization 
	DIR_A = 0
	CPUCT = 1

	##### model parameters
	N_LAYERS = 5 # number of model layers
	F = 128 # number of filters
	N_FC1 = 128 # number of units in fully connected layer
	EPS = 2e-1 # backprop step size

	N_SIM = 1000 # number of simulations at each turn
	N_TURNS = 20 # number of moves per player per game

	##### number of batch evaluations for testing model
	N_EVAL_NN_GMS = 1 # model evaluation for printing
	N_EVAL_TREE_GMS = 0 # model eval

	######### save and checkpoint frequency
	CHKP_FREQ = 60*60

	start_time =
	save_t =


	boards = {}; scores = {} # eval
	save_d = {}
	for key in save_vars:
		exec('save_d["%s"] = %s' % (key,key))
	save_d['script_nm'] = __file__

	global_batch = 0
	global_batch_saved = 0
	global_batch_evald = 0
	save_counter = 0

	log = {}
	for key in logs:
		log[key] = []

else: #################### load previous model
	restore = True
	save_d = np.load(sdir + save_nm).item()

	for key in save_vars + state_vars:
		if key == 'save_nm':
			continue # don't overwrite the file name set above
		exec('%s = save_d["%s"]' % (key,key))

	EPS_ORIG = EPS
	EPS = 2e-2

	save_nm_orig = save_nm

	# append new learning rate to file name if not already added
	append_txt = '_EPS%f.npy' % EPS
	if EPS_ORIG != EPS and save_nm.find(append_txt) == -1:
		save_nm = save_nm.split('.npy')[0] + append_txt
	print 'saving to:'
	print save_nm


Next come functions to save and restore the current tree position (omitted on this webpage for brevity) and a simple function for generating the feed dictionary for input into Tensorflow. After that comes the main function for executing the tree search. First, the current tree state is backed up:

def run_sim(turn): # simulate game forward

Then we enter the main simulation and turn loops, in which each player makes a move. We run the neural network forward to get its estimate of the probability that each move will result in a win for it (the pol variable in the script), its Q value estimate of the current game state (the val variable in the script), and the list of valid moves the current player can make (generated by a Tensorflow operation in kernels/). If we are not at the first move of the simulation, the script then backs up the Q value to all game states that led to the current one.

	for sim in range(N_SIM):
		# backup then make next move
		for turn_sim in range(turn, N_TURNS+1):
			for player in [0,1]:
				# get valid moves, network policy and value estimates:
				valid_mv_map, pol, val =[arch.valid_mv_map, arch.pol, arch.val], feed_dict=ret_d(player))

				# backup visit Q values
				if turn_sim != turn:
					pu.backup_visit(player, val)

With the list of valid moves, we then update the tree by calling the pu.add_valid_mvs function, which is from the Python C module compiled above. Next, the tree is queried to find the next move based on the network's probability and Q value estimates. We then register the move in the tree and make the move, which updates the GPU memory structures containing the board state by ultimately calling the move-making function in kernels/.

				pu.add_valid_mvs(player, valid_mv_map) # register valid moves in tree
				to_coords = pu.choose_moves(player, pol, CPUCT)[0] # choose moves based on policy and Q values (latter of which already stored in tree)
				pu.register_mv(player, to_coords) # register move in tree
	, feed_dict={arch.moving_player: player, arch.to_coords_input: to_coords}) # move network (update GPU vars)

Finally, at the end of each simulation we must backup who won each game (the network is playing itself, but this is necessary to know which set of moves won for training). We also restore our previous state of where we were in the tree.

		# backup terminal state
		for player in [0,1]:
			winner =, feed_dict=ret_d(player))
			pu.backup_visit(player, winner)
		# return move back to previous node in tree

Next comes some additional initialization code (not shown here). After that comes the main training loop. First we call a Tensorflow operation in kernels/ that initializes a new set of game boards on the GPU. We also initialize the tree data structures by calling py_util/init_tree.c. Next, for each turn, we run our simulation code to play out N_SIM (e.g., 1000) games. Then, for each player, we save the board state (for later use as training inputs) and get the list of valid moves.

######################################### training loop:
while True:
	######### generate batches
	buffer_loc = 0
	turn_start_t = time.time()
	for turn in range(N_TURNS):
		### make move
		for player in [0,1]:
			inds = buffer_loc + np.arange(gv.BATCH_SZ)
			board[inds], valid_mv_map, pol =[arch.imgs, arch.valid_mv_map, arch.pol], feed_dict = ret_d(player)) # generate batch and valid moves

As before with the run_sim function, we register the valid moves in the tree and then use the tree to choose the current player's next move. This time, however, instead of choosing the move with the maximum tree search value (which reflects the visit counts, Q values, and network probability estimates), we move probabilistically in proportion to the number of times the tree search visited each move. Then, as before, we register the move in the tree. Finally, we prune the tree (py_util/prune_tree.c), meaning we remove all nodes and leaves that are no longer accessible (those representing branches no longer reachable from the current move).

			pu.add_valid_mvs(player, valid_mv_map) # register valid moves in tree
			visit_count_map = pu.choose_moves(player, pol, CPUCT)[-1] # get number of times each node was visited
			to_coords =[arch.tree_prob_visit_coord, arch.tree_prob_move_unit], feed_dict={arch.moving_player: player, 
				arch.visit_count_map: visit_count_map, arch.dir_pre: dir_pre, arch.dir_a: DIR_A})[0] # make move in proportion to visit counts

			pu.register_mv(player, to_coords) # register move in tree

			buffer_loc += gv.BATCH_SZ
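Stripped of the Tensorflow machinery, the proportional move choice amounts to sampling from the normalized visit counts. A numpy version of the idea (the script does this with a Tensorflow op, and its dir_pre / dir_a parameters are not modeled here):

```python
import numpy as np

def sample_move_from_visits(visit_counts, rng=np.random):
    """Sample a move index with probability proportional to its visit count."""
    visit_counts = np.asarray(visit_counts, dtype=float)
    probs = visit_counts / visit_counts.sum()
    return int(rng.choice(len(probs), p=probs))
```

Sampling rather than taking the argmax adds exploration, so successive self-play games cover a wider variety of positions for training.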


Now, for the purpose of constructing our training batches, we determine who won each game (which we train the network to predict--the Q value, or val variable in the script) and determine the probability that the tree search would choose each move for a given game state (which we also train the network to predict--the pol variable in the script). The py_util/return_probs_map.c function returns the probabilities for each turn, player, game, and move from our previous simulations. Its dimensions are: [N_TURNS, N_PLAYERS, BATCH_SZ, MAP_SZ_X, MAP_SZ_Y].

	##### create prob maps
	for player in [0,1]:
		winner[:, player] =, feed_dict={arch.moving_player: player})
	tree_probs = pu.return_probs_map()

Finally, we are ready to train the network. We shuffle the order of the training steps and then rotate and reflect the boards, giving the network a larger variety of inputs to train on and preventing it from overfitting to any one game.

	# train
	for batch in range(N_TURNS):
		inds = inds_total[batch*gv.BATCH_SZ + np.arange(gv.BATCH_SZ)]
		board2, tree_probs2 = pu.rotate_reflect_imgs(board[inds], tree_probs[inds]) # rotate and reflect board randomly
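The augmentation draws from the eight symmetries of the square board. A numpy sketch of the idea (pu.rotate_reflect_imgs does this in C, applying the same transform to the board and its target probability map):

```python
import numpy as np

def dihedral_transforms(board):
    """Return the 8 rotations/reflections of a [H, W] board."""
    out = []
    for k in range(4):
        r = np.rot90(board, k)
        out.append(r)
        out.append(np.fliplr(r))
    return out

def random_rotate_reflect(board, probs, rng=np.random):
    """Apply one randomly chosen symmetry to a board and its target prob map."""
    k = rng.randint(4)
    flip = rng.randint(2)
    b, p = np.rot90(board, k), np.rot90(probs, k)
    if flip:
        b, p = np.fliplr(b), np.fliplr(p)
    return b, p
```

Each position therefore yields up to eight distinct training examples, all with identical game value, which is a cheap way to multiply the effective size of the training set.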

We then perform the actual backpropagation steps. For diagnostics, we also load the loss function values to monitor training progress. These values are accumulated in temporary variables and then printed, saved, and reset later in the script.

		train_dict = {arch.imgs: board2,
				arch.pol_target: tree_probs2,
				arch.val_target: winner.ravel()[inds]}

		val_mean_sq_err_tmp, pol_cross_entrop_err_tmp, val_pearsonr_tmp =, feed_dict=train_dict)[1:]

		# update logs
		val_mean_sq_err += val_mean_sq_err_tmp
		pol_cross_entrop_err += pol_cross_entrop_err_tmp
		val_pearsonr += val_pearsonr_tmp
		global_batch += 1
		err_denom += 1

The remainder of the script (not shown here) will then save the model and evaluate it against a random opponent and the GNU Go algorithm. Also included in that code are print statements for seeing some of the loss function outputs.

Training speed

On my setup (see the top of the page for the hardware description), it takes about 40 minutes to generate 128 games played to 20 turns at 1000 simulations per turn per game. My code randomly rotates and reflects the board and trains on each move made over all turns (so in 40 minutes it generates 20 training batches). The plots I show on this page represent about one month of training: (20,000 training steps * (40 minutes / 20 training steps)) / (60 minutes/hour * 24 hours/day) ≈ 27.8 days.

Improving performance further

I've noticed improved performance when increasing the number of game simulations per turn from 500 to 1000 (see below; blue = 1000 simulations, red = 500 simulations). Notice the modest increase in win rates against the GNU Go algorithm. I suspect further increases in simulations may also improve performance, but I've not tested this. Oddly, the AlphaZero paper used only 800 simulations; however, there were other differences that may have given them better performance, such as hyperparameter tuning (which I've not done systematically) and increased network size. In some older tests with the game Reversi, I noticed a more prominent performance increase from raising the number of simulations, and modest to no improvement from increasing network size (layers and number of filters). These results may not carry over to Go, though, given that the state space and number of moves per turn are drastically larger. If anyone else has thoughts on this, I'd be happy to hear them (my email is at the top of the page).

Comparing 1000 vs. 500 simulations

blue = 1000 simulations; red = 500 simulations
(Darker colors are games played against GNU Go, lighter colors are against an opponent making random moves)

Modified Dec-7-2018
