A PyTorch implementation of supervised and reinforcement learning for game playing, following AlphaZero.
We will be using Miniconda as the environment manager, but you can adapt the steps for any similar tool you might prefer.
Ensure Miniconda is installed on your system. If not installed, you can download it from Miniconda's official website. This project is developed using Python 3.11, so it is advisable to use a compatible version of Miniconda.
Navigate to the root directory of the project, where you can find environment.yml. This file lists all the necessary packages and their specific versions required to run the application.
Create the Conda environment using the following command:
conda env create -f environment.yml
Activate the environment:
conda activate sigmazero
If PyTorch does not work, try reinstalling it with the following command from the PyTorch website:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
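A quick sanity check (a minimal sketch, not part of the repository) can confirm that PyTorch is installed and can see your GPU:

```python
# illustrative sanity check for the PyTorch install (not a project script)
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```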
Run the Application:
Start the application using Streamlit by running:
streamlit run Home.py
- Dependency Errors: If you encounter errors related to missing packages or version conflicts, ensure that the environment.yml file includes all necessary dependencies with correct versions, and that you are using the right version of Streamlit:
pip install streamlit==1.33.0
- Environment Activation: Make sure you activate the correct Conda environment before attempting to run the application. If the environment name is incorrect, check the name specified in the environment.yml file.
Download Standard Chess data for players rated above 2000 ELO from FICS (Blitz and Lightning Chess data will contain non-optimal moves).
Place the downloaded .pgn.bz2 file in the saves folder, then set the file path and the number of games to generate in generate_training_supervised.py and run:
python generate_training_supervised.py
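The exact preprocessing lives in generate_training_supervised.py; the sketch below only illustrates, under the assumption that python-chess is used, how the compressed FICS dump can be iterated (the file name and game count are placeholders):

```python
# illustrative sketch: iterate games from a FICS .pgn.bz2 dump
import bz2
import chess.pgn

PGN_PATH = "saves/fics_standard_2000.pgn.bz2"  # placeholder path
NUM_GAMES = 1000                               # placeholder game count

with bz2.open(PGN_PATH, "rt", encoding="utf-8", errors="ignore") as handle:
    for _ in range(NUM_GAMES):
        game = chess.pgn.read_game(handle)
        if game is None:  # end of file
            break
        board = game.board()
        for move in game.mainline_moves():
            # each (position, move) pair can become a supervised training sample
            board.push(move)
```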
Train the supervised learning model:
python train_supervised.py
Reinforcement learning can be run with the following:
python train_RL.py
Hyperparameters for training can be set with the args dictionary in the file:
args = {
    'C': 2,                          # exploration constant c in the UCB formula
    'num_searches': 100,             # MCTS searches per move
    'num_iterations': 3,             # self-play + training iterations
    'num_selfPlay_iterations': 500,  # self-play games per iteration
    'num_epochs': 30,                # training epochs per iteration
    'batch_size': 128,               # training batch size
    'start_epoch': 0,                # epoch to resume training from
    'chess960': True,                # use Chess960 (Fischer random) starting positions
}
We test our models against each level of Stockfish in a best-of-5 format; results can be found in logs/log.txt. Set the path to your model weights and to your Stockfish engine in eval.py.
You can download the model weights here.
python eval.py
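The match loop itself lives in eval.py; as a rough sketch of the Stockfish side only (assuming python-chess's UCI engine API and a local Stockfish binary, both of which are assumptions here), a single engine move at a given skill level looks like this:

```python
# illustrative sketch: ask Stockfish for a move at a given skill level
import chess
import chess.engine

STOCKFISH_PATH = "/path/to/stockfish"  # placeholder; set the real path in eval.py

board = chess.Board()
engine = chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH)
engine.configure({"Skill Level": 3})                              # skill level 0-20
result = engine.play(board, chess.engine.Limit(time=1, depth=5))  # time/depth limits
board.push(result.move)
engine.quit()
```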
Our network was trained on 15,000 games of >2000 ELO Standard Chess data for 60 epochs, and the best results are shown in the table below.
The results are obtained from best-of-5 games against different levels of Stockfish; a win is awarded 1 point and a draw 0.5 points. The Sigmazero AI and Stockfish take turns playing White, and Sigmazero advances to the next level if it scores 2.5 points or more.
Stockfish Skill Level | Time Limit (s) | Search Depth | Estimated ELO |
---|---|---|---|
0 | 1 | 5 | 1376 |
1 | 1 | 5 | 1462 |
2 | 1 | 5 | 1547 |
3 | 1 | 5 | 1596 |
4 | 1 | 5 | 1718 |
5 | 1 | 5 | 1804 |
6 | 1 | 5 | 1993 |
7 | 1 | 5 | 2012 |
8 | 1 | 6 | 2127 |
9 | 2 | 7 | 2270 |
20 | 10 | 50 | 3100 |
Model | MCTS Simulations | SF Level | Estimated ELO | Model Wins | Model Losses | Model Draws | Game Sequence | Score |
---|---|---|---|---|---|---|---|---|
48k_supervised | 800 | 3 | 1596 | 3 | 1 | 0 | WLWW | 3.0 |
48k_supervised | 800 | 4 | 1718 | 3 | 0 | 0 | WWW | 3.0 |
48k_supervised | 800 | 5 | 1804 | 2 | 0 | 1 | DWW | 2.5 |
48k_supervised | 800 | 6 | 1993 | 2 | 0 | 1 | DWW | 2.5 |
48k_supervised | 800 | 7 | 2012 | 3 | 1 | 0 | WLWW | 3.0 |
48k_supervised | 800 | 8 | 2127 | 2 | 1 | 1 | WLDW | 2.5 |
48k_supervised | 800 | 9 | 2270 | 0 | 3 | 2 | LDLDL | 1.0 |
Coming Soon
- $N_i$: the number of times the node has been selected, i.e. the number of times the node has been through the simulation (integer)
- $W_i$: the sum of expected values of the node (not necessarily an integer; informally, "the number of wins for the node")
- $p$: the policy values of the child nodes
- $s$: the representation of the board state (an 8x8xN tensor)
- Selection: Start from the root node (the current game state) and select successive child nodes based on the Upper Confidence Bound (UCB) criterion until a leaf node L (any node that has a potential child from which no simulation has yet been initiated) or a terminal node is reached (a code sketch of this selection rule follows the list). The UCB is computed as
  $$\text{UCB} = \frac{w_i}{n_i} + p_i c \frac{\sqrt{N_i}}{1+n_i}$$
  where $c$ is a constant, $p_i$ is the policy value of the child node, $n_i$ is its simulation count, and $N_i$ is the simulation count of its parent.
- Expansion: Unless L ends the game decisively for either player, randomly initialize an unexplored child node.
- Backpropagation: Using the value generated by the neural network $f_\theta$, update the N and W values of the current node and all of its parent nodes.
- Repeat steps 1 to 3 for N iterations.
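As an illustration of the selection rule only (the attribute and function names below are hypothetical, not the repo's), UCB-based child selection can be sketched as:

```python
import math

def select_child(children, c=2.0):
    """Pick the child node with the highest UCB score (illustrative sketch).

    Each child is assumed to expose n (visit count), w (total value)
    and p (prior policy value from the network).
    """
    parent_visits = sum(child.n for child in children)  # visit count of the parent

    def ucb(child):
        q = child.w / child.n if child.n > 0 else 0.0               # exploitation: mean value
        u = child.p * c * math.sqrt(parent_visits) / (1 + child.n)  # exploration term
        return q + u

    return max(children, key=ucb)
```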
- Self-Play until the game ends using MCTS and $f_\theta$.
- Store the action chosen at each state and the value of the node (-1, 0, 1), depending on the player and whether they won or lost the game. One training sample contains: (board state $s$, the chosen action $\pi$, the value of the node $z$).
- Minimize the loss function over the training samples in the batch (see the sketch below).
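The loss here is the standard AlphaZero objective: cross-entropy between the predicted policy and the MCTS target $\pi$, plus mean squared error between the predicted value and the game outcome $z$. A minimal PyTorch sketch (function and argument names are illustrative, not the repo's):

```python
import torch.nn.functional as F

def alphazero_loss(policy_logits, value_pred, pi_target, z_target):
    """Policy cross-entropy + value MSE (illustrative sketch).

    policy_logits: (B, 4672) raw policy head outputs
    value_pred:    (B, 1)   value head outputs in [-1, 1]
    pi_target:     (B, 4672) MCTS visit-count distribution
    z_target:      (B, 1)   game outcome in {-1, 0, 1}
    """
    policy_loss = -(pi_target * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    value_loss = F.mse_loss(value_pred, z_target)
    return policy_loss + value_loss
```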
From each player's perspective, the tensor looks as follows; the board orientation changes according to the current player.
White's View:
Black's View:
The board is represented as a (119, 8, 8) tensor, calculated as MT + L, where M = 14, T = 8, and L = 7.
M represents the number of planes recorded per board state. In our implementation, we mimic AlphaZero's approach of keeping track of all 12 piece types plus 2 repetition planes. The order of the planes is as follows (a minimal sketch of filling the piece planes follows this list):
- White Pawns
- White Knights
- White Bishops
- White Rooks
- White Queens
- White King
- Planes 7 to 12 are the same as 1 to 6, but for black pieces.
- 1-fold repetition plane, a constant binary value
- 2-fold repetition plane
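As a minimal sketch of how the 12 piece planes for one half-move could be filled with python-chess (the plane order follows the list above, but the orientation and exact layout are assumptions; the repo's implementation may differ):

```python
import numpy as np
import chess

PIECE_ORDER = [chess.PAWN, chess.KNIGHT, chess.BISHOP,
               chess.ROOK, chess.QUEEN, chess.KING]

def piece_planes(board: chess.Board) -> np.ndarray:
    """Return a (12, 8, 8) array: 6 white piece planes followed by 6 black ones."""
    planes = np.zeros((12, 8, 8), dtype=np.float32)
    for colour_offset, colour in enumerate([chess.WHITE, chess.BLACK]):
        for plane, piece_type in enumerate(PIECE_ORDER):
            for square in board.pieces(piece_type, colour):
                row = chess.square_rank(square)
                col = chess.square_file(square)
                planes[colour_offset * 6 + plane, row, col] = 1.0
    return planes
```

The full (119, 8, 8) input stacks 8 such snapshots together with their repetition planes and appends the 7 constant L planes described below.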
T represents the number of half-moves that are kept track of. In this case we keep track of 8 half-moves, or 4 full turns, with the most recent half-move recorded in the first group of planes.
L is not time-tracked; it is a constant set of 7 planes that represent special properties of the board regardless of time. The order is as follows:
- Current player's color
- Total Moves that have been played to understand depth
- White kingside castling rights
- White queenside castling rights
- Black kingside castling rights
- Black queenside castling rights
- No progress plane, for 50-move rule
The actions are represented with an 8x8x73 tensor, which can be flattened into a 4672-dimensional vector. The 8x8 spatial dimensions identify the square from which the chess piece is picked up, while the 73 planes encode the type of move (see the sketch after this list).
- The first 8x7 = 56 channels/planes represent queen-style moves: the direction (8 directions) and the number of squares to move (1 to 7) for the queen, rook, bishop, pawn, and king. (A pawn move from the 7th rank is assumed to be a promotion to a queen.)
- The next 8 channels/planes represent the direction to move the knight
- The last 9 channels represent underpromotions of the pawn to a knight, bishop, or rook respectively (via a one-square push from the 7th rank or a diagonal capture from the 7th rank).
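As a small illustration of the index arithmetic only (the actual ordering of the 73 move-type planes and the flattening convention are defined in the repo, so the helpers below are hypothetical):

```python
# illustrative sketch of flattening a (row, col, move-plane) action to a single index
NUM_MOVE_PLANES = 56 + 8 + 9           # queen-style + knight + underpromotion planes = 73
ACTION_SIZE = 8 * 8 * NUM_MOVE_PLANES  # 4672 possible actions

def encode_action(row: int, col: int, plane: int) -> int:
    """Map the pick-up square (row, col) and move-type plane to a flat index."""
    return (row * 8 + col) * NUM_MOVE_PLANES + plane

def decode_action(index: int) -> tuple[int, int, int]:
    """Inverse of encode_action."""
    square, plane = divmod(index, NUM_MOVE_PLANES)
    return square // 8, square % 8, plane
```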