Hybrid reinforcement learning and minimax agent for the Tablut Challenge. Combines PPO-trained value networks with alpha-beta search for competitive play.
The agent uses a hybrid approach:
- Minimax search with alpha-beta pruning and iterative deepening
- PPO value network trained through self-play (5M timesteps) for leaf evaluation
- Heuristics for move ordering and fallback evaluation
- TCP socket communication with Java referee server (JSON protocol)
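The sketch below shows how these pieces fit together: alpha-beta search explores the game tree, leaves are scored by the PPO value network when one is loaded, and heuristics serve as fallback evaluation and move ordering. In the real agent this search is additionally wrapped in iterative deepening to respect the move timeout. All names here (evaluate_leaf, order_moves, the state methods) are illustrative stand-ins, not the actual interfaces of minimax_agent.py, heuristics.py, or rl_value_wrapper.py.

```python
# Illustrative sketch of the hybrid search; names and interfaces are
# hypothetical stand-ins for minimax_agent.py / heuristics.py / rl_value_wrapper.py.
import math

def heuristic_score(state):
    """Placeholder for the handcrafted evaluation in heuristics.py."""
    return 0.0

def order_moves(state, moves):
    """Placeholder for heuristic move ordering (better ordering = more pruning)."""
    return moves

def evaluate_leaf(state, value_net=None):
    """Score a leaf with the PPO value network if loaded, else heuristics."""
    if value_net is not None:
        return value_net.predict_value(state)  # assumed wrapper method
    return heuristic_score(state)

def alphabeta(state, depth, alpha, beta, maximizing, value_net=None):
    if depth == 0 or state.is_terminal():
        return evaluate_leaf(state, value_net)
    moves = order_moves(state, state.legal_moves())
    best = -math.inf if maximizing else math.inf
    for move in moves:
        score = alphabeta(state.apply(move), depth - 1, alpha, beta,
                          not maximizing, value_net)
        if maximizing:
            best = max(best, score)
            alpha = max(alpha, best)
        else:
            best = min(best, score)
            beta = min(beta, best)
        if beta <= alpha:  # prune: remaining siblings cannot change the result
            break
    return best
```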
Prerequisites: Python 3.10+, Java Runtime (for referee server)
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
The easiest way to launch your agent is the runmyplayer.sh script. It automatically activates the virtual environment and runs the agent with the trained PPO model.
Make sure the Java server is running first!
In a separate terminal, start the Java server:
cd Tablut/Executables
java -jar Server.jar
Or with GUI (recommended for visualization):
cd Tablut/Executables
java -jar Server.jar -g
Once the server is running, use the runmyplayer.sh script to launch your agent:
./runmyplayer.sh white 60 127.0.0.1
Command format:
./runmyplayer.sh <player> <timeout> <server_ip>
Parameters:
- player: white or black (case-insensitive)
- timeout: Time limit per move in seconds (e.g., 60)
- server_ip: Server IP address (use 127.0.0.1 for localhost)
Examples:
# Play as white player
./runmyplayer.sh white 60 127.0.0.1
# Play as black player
./runmyplayer.sh black 60 127.0.0.1
# With different timeout
./runmyplayer.sh white 120 127.0.0.1
Note: The script automatically:
- Activates the virtual environment
- Loads the trained PPO model (models/rl_value_net_5M.zip)
- Sets search depth to 4
- Connects to the appropriate port (5800 for white, 5801 for black)
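For context on the port mapping, a rough sketch of the connection logic follows. The length-prefixed framing and message shape are assumptions (the referee is a Java server, so writeUTF-style framing is plausible); agent.py is the authoritative implementation of the actual protocol.

```python
# Hedged sketch of connecting to the referee; the exact message framing and
# field layout are assumptions -- see agent.py for the real protocol code.
import json
import socket
import struct

PORTS = {"white": 5800, "black": 5801}  # ports stated above

def connect(color: str, server_ip: str = "127.0.0.1") -> socket.socket:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((server_ip, PORTS[color.lower()]))
    return sock

def send_json(sock: socket.socket, payload: dict) -> None:
    # Assumes a Java writeUTF-style 2-byte big-endian length prefix.
    data = json.dumps(payload).encode("utf-8")
    sock.sendall(struct.pack(">H", len(data)) + data)

def recv_json(sock: socket.socket) -> dict:
    (length,) = struct.unpack(">H", sock.recv(2))
    buf = b""
    while len(buf) < length:
        buf += sock.recv(length - len(buf))
    return json.loads(buf.decode("utf-8"))
```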
Basic (heuristics only):
python -m python_client.agent WHITE 60 127.0.0.1
With PPO model:
python -m python_client.agent WHITE 60 127.0.0.1 --model models/rl_value_net_5M.zip --depth 4
Parameters:
- player: WHITE or BLACK
- timeout: Time limit per move (seconds)
- server_ip: Referee server address
- --model: Path to PPO model (.zip file)
- --depth: Minimax search depth (default: 4)
- --seed: Random seed for reproducibility
The value network is trained with PPO through self-play. The value function is then extracted from the trained policy and used for minimax leaf evaluation.
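A minimal sketch of that extraction, assuming a stable-baselines3 checkpoint and an observation encoded the same way as during training; rl_value_wrapper.py is the actual implementation.

```python
# Minimal sketch of reusing the PPO critic as a leaf evaluator.
# Assumes the observation layout matches the one used during training
# (see tablut_env.py); rl_value_wrapper.py is the real implementation.
import numpy as np
import torch
from stable_baselines3 import PPO
from stable_baselines3.common.utils import obs_as_tensor

class PPOValueFunction:
    def __init__(self, model_path: str, device: str = "cpu"):
        self.model = PPO.load(model_path, device=device)

    def predict_value(self, observation: np.ndarray) -> float:
        # predict_values runs only the critic head of the actor-critic policy
        obs = obs_as_tensor(observation[None, ...], self.model.device)
        with torch.no_grad():
            value = self.model.policy.predict_values(obs)
        return float(value.item())

# Usage (hypothetical): value_fn = PPOValueFunction("models/rl_value_net_5M.zip")
#                       score = value_fn.predict_value(encoded_board_state)
```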
pip install stable-baselines3
# 5M timesteps (recommended for competition)
python -m python_client.trainer \
--algo ppo \
--timesteps 5000000 \
--save models/rl_value_net_5M.zip \
--device cpu \
--checkpoint-interval 250000 \
--wandb
PPO Configuration:
- Policy: MlpPolicy (multi-layer perceptron)
- Learning rate: 3e-4, Batch size: 64, Steps: 2048
- Epochs: 10, Gamma: 0.99, GAE lambda: 0.95
- Action masking for legal moves only
- Self-play training with automatic checkpointing
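In stable-baselines3 terms, that configuration corresponds roughly to the construction below. The TablutEnv import path and the way action masking is applied are assumptions; trainer.py is the source of truth.

```python
# Rough stable-baselines3 equivalent of the listed hyperparameters.
# TablutEnv and its import path are assumed; action masking is treated here
# as the environment's responsibility (trainer.py may use a different scheme).
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor

from python_client.tablut_env import TablutEnv  # assumed module path

env = Monitor(TablutEnv())  # self-play Gymnasium env with legal-move masking
model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    verbose=1,
)
model.learn(total_timesteps=5_000_000)
model.save("models/rl_value_net_5M.zip")
```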
The project uses Weights & Biases (wandb) for tracking and visualizing training metrics in real-time.
Online Mode (Recommended):
- Install and login to wandb:
pip install wandb
wandb login
- Run training with online mode:
WANDB_MODE=online python -m python_client.trainer \
  --algo ppo \
  --timesteps 5000000 \
  --save models/rl_value_net_5M.zip \
  --wandb
- View dashboard: Visit https://wandb.ai and navigate to the tablut-rl project
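Inside the trainer, initializing the run for this project typically looks like the snippet below; the config keys are illustrative, and passing --wandb triggers the real setup in trainer.py.

```python
# Illustrative wandb initialization; config keys are assumptions,
# trainer.py performs the real setup when --wandb is passed.
import wandb

run = wandb.init(
    project="tablut-rl",
    config={
        "algo": "ppo",
        "timesteps": 5_000_000,
        "learning_rate": 3e-4,
        "batch_size": 64,
    },
)
```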
Offline Mode (Default):
- Logs are saved locally in the wandb/ directory
- Sync later with:
wandb sync wandb/offline-run-*
The following metrics are logged to WandB during training:
- Episode rewards: Average and per-episode rewards
- Episode lengths: Number of steps per episode
- Hyperparameters: Learning rate, batch size, gamma, etc.
- Training progress: Timesteps, iterations, checkpoints
- Model artifacts: Final and best model checkpoints
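A minimal sketch of how such metrics can be pushed to wandb from a stable-baselines3 callback, assuming the environment is wrapped in Monitor so episode statistics show up in the step infos; the trainer's built-in logging may differ.

```python
# Sketch of logging episode metrics to wandb from an SB3 callback.
# Assumes the training env is wrapped in Monitor so "episode" infos exist;
# trainer.py's own logging may differ.
import wandb
from stable_baselines3.common.callbacks import BaseCallback

class WandbEpisodeLogger(BaseCallback):
    def _on_step(self) -> bool:
        for info in self.locals.get("infos", []):
            episode = info.get("episode")
            if episode is not None:
                wandb.log(
                    {
                        "episode/reward": episode["r"],
                        "episode/length": episode["l"],
                    },
                    step=self.num_timesteps,
                )
        return True  # returning False would stop training early

# Usage: model.learn(total_timesteps=5_000_000, callback=WandbEpisodeLogger())
```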
Use the provided script to generate plots from WandB data:
# Make sure you're logged in to wandb
wandb login
# Generate plots from latest run
python scripts/generate_wandb_plots.py --project tablut-rl
# Or specify a specific run
python scripts/generate_wandb_plots.py --project tablut-rl --run-id <run_id>
This will create plots in docs/images/ that you can add to the README.
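If you want to build custom plots instead, the public wandb API exposes the same run history the script reads. A sketch, where the entity path and metric key are placeholders that depend on your wandb account and on what the trainer logs:

```python
# Sketch of pulling run history via the wandb API and plotting it.
# "your-entity" and the metric key are placeholders; actual keys depend on
# what the trainer logs (see scripts/generate_wandb_plots.py).
import matplotlib.pyplot as plt
import wandb

api = wandb.Api()
run = api.runs("your-entity/tablut-rl")[0]  # most recent run first
history = run.history(keys=["rollout/ep_rew_mean"])

plt.plot(history["_step"], history["rollout/ep_rew_mean"])
plt.xlabel("timesteps")
plt.ylabel("mean episode reward")
plt.savefig("docs/images/episode_reward.png")
```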
When training with --wandb, you can monitor:
- Real-time training curves (rewards, episode lengths)
- Hyperparameter configurations
- System metrics (CPU, memory usage)
- Model checkpoints and artifacts
For detailed training documentation, see docs/model-training/TRAINING_GUIDE.md.
- agent.py: Main entrypoint, socket communication, game loop
- minimax_agent.py: Alpha-beta search with iterative deepening
- rl_value_wrapper.py: Extracts value function from PPO policy
- trainer.py: PPO self-play training with stable-baselines3
- tablut_env.py: Gymnasium environment for RL training
- heuristics.py: Move ordering and fallback evaluation
Java not found: Install with brew install openjdk@17 (macOS) or sudo apt-get install default-jdk (Linux)
Socket connection refused: Ensure Java server is running on port 5800 (white) or 5801 (black)
Timeout errors: Reduce --depth parameter
Import errors: Activate virtual environment and install dependencies
- Training Guide: PPO training details
- Deployment: Competition environment setup
- API Reference: Code examples and interfaces
- Fashad Ahmed Siddique - fashad.ahmedsiddique@studio.unibo.it | GitHub
- Andrea Pantieri - andrea.pantieri@studio.unibo.it | GitHub
- Giacomo Boschi - giacomo.boschi7@studio.unibo.it | GitHub
- Massimiliano Bolognini - massimilia.bolognini@studio.unibo.it | GitHub
Tablut Challenge Awards
In addition to the final ranking, several special prizes were awarded to recognize outstanding performances, strategies, and originality:
Best Name Award The Four Horsemen of Tabluting – for unquestionable coolness.
Secret Agents Award PythonAgen, ReplayAgent, MyAIPlayer (a.k.a. The Four Horsemen of Tabluting, techloria, Lions) – for never revealing their true identities.


