An implementation of the AlphaZero algorithm for the game of Gomoku (Five in a Row), featuring self-play reinforcement learning and Monte Carlo Tree Search.
Training loss curve over 22 iterations (4 days on a single RTX 3090 Ti):
MCTS simulations=2000, cpuct=4.0
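For context, cpuct controls the exploration term in the PUCT rule that MCTS uses to pick which child node to explore next. A minimal sketch of that rule (a hypothetical helper for illustration, not the project's actual code):

```python
import math

def puct_score(q, p, parent_n, child_n, cpuct=4.0):
    """PUCT score for selecting a child node during MCTS search.

    q: mean value of the child, p: prior probability from the policy
    network, parent_n / child_n: visit counts, cpuct: exploration
    constant (4.0 matches the setting quoted above).
    """
    # Exploitation term (Q) plus an exploration bonus that is large for
    # high-prior, rarely visited children and shrinks as visits grow.
    return q + cpuct * p * math.sqrt(parent_n) / (1 + child_n)

# A child with few visits and a high prior gets a large exploration bonus.
print(puct_score(q=0.1, p=0.5, parent_n=100, child_n=1, cpuct=4.0))
```

A larger cpuct biases the search toward exploring the policy network's suggestions; a smaller one trusts the accumulated Q values more.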
TODOs:
- Refactor this project and improve code quality to serve as nano alphazero
- ...
Features:
- Complete AlphaZero algorithm implementation with MCTS and policy-value network
- Self-play training with experience replay buffer
- 1cycle learning rate scheduling for stable training
- Arena evaluation mechanism for model selection
- Support for different board sizes (9x9, 15x15)
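The 1cycle schedule mentioned above ramps the learning rate up from a low value to a peak and then anneals it back down over the run. A minimal sketch of that shape (a hypothetical helper; the project's actual schedule may differ in warmup fraction and final value):

```python
import math

def one_cycle_lr(step, total_steps, min_lr=1e-4, max_lr=1e-2, pct_warmup=0.3):
    """1cycle schedule: linear warmup from min_lr to max_lr, then
    cosine annealing back down to min_lr."""
    warmup_steps = int(total_steps * pct_warmup)
    if step < warmup_steps:
        # Linear warmup phase.
        frac = step / max(1, warmup_steps)
        return min_lr + frac * (max_lr - min_lr)
    # Cosine annealing phase back toward min_lr.
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * frac))

# LR peaks at max_lr after warmup, then decays back toward min_lr.
print(one_cycle_lr(30, 100), one_cycle_lr(100, 100))
```

PyTorch ships an equivalent built-in, `torch.optim.lr_scheduler.OneCycleLR`, which is likely what a PyTorch implementation would use in practice.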
Installation:
pip install torch numpy tqdm wandb pygame

You can download the best pre-trained model so far from my huggingface repo and put it in the temp folder.

Play against the trained model:
python alphazero.py --play \
--round=2 \
--player1=human \
--player2=alphazero \
--ckpt_file=best.pth.tar \
--verbose

Train from scratch (with optional wandb logging):
python alphazero.py --train --wandb

Key hyperparameters:
- numMCTSSims: Number of MCTS simulations per move (default: 400)
- numEps: Number of self-play games per iteration (default: 100)
- maxlenOfQueue: Size of replay buffer (default: 200000)
- cpuct: Exploration constant in MCTS (default: 1.0)
- min_lr/max_lr: Learning rate bounds for 1cycle schedule (1e-4 to 1e-2)
- board_size: Board size, directly affects model size & training speed (default: 9)
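The replay buffer bounded by maxlenOfQueue can be pictured as a fixed-size queue that silently discards the oldest self-play examples once full. A tiny illustration using Python's `collections.deque` (hypothetical sketch, not the project's actual buffer code):

```python
from collections import deque

# A bounded deque drops the oldest examples once maxlen is reached.
maxlenOfQueue = 5  # tiny bound for illustration (project default: 200000)
buffer = deque(maxlen=maxlenOfQueue)

# Each self-play step would append (board_state, policy_target, value_target).
for step in range(8):
    buffer.append((f"board_{step}", [0.1] * 81, 1.0))

# Only the 5 most recent examples remain; board_0..board_2 were evicted.
print(len(buffer), buffer[0][0])  # → 5 board_3
```

Keeping the buffer bounded means training always draws from recent, stronger self-play games rather than stale early-iteration data.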
Project files:
- alphazero.py: Main implementation including MCTS and neural network
- game.py: Gomoku game logic & rules, including board state rendering, move generation, and game end detection
This project is inspired by and has greatly benefited from the work of schinger/AlphaZero.

