A reinforcement learning agent that learns to solve mazes using Group Relative Policy Optimization (GRPO). Watch an AI learn to navigate through mazes in real-time!
This project demonstrates how an AI agent can learn to solve mazes through trial and error using GRPO—a variant of Proximal Policy Optimization (PPO) that leverages group-relative advantage estimation and iterative policy updates. The agent begins with no knowledge of the maze, collects trajectories, and updates its policy based on group comparisons. Additionally, a reward model is trained using a replay mechanism to progressively refine reward estimations.
-
Observation and Action:
The agent (displayed as a red square) observes its current position and the target location, then selects an action (up, down, left, or right) using its policy network. -
Trajectory Collection:
Multiple trajectories are collected in each episode. Each trajectory's cumulative reward is computed based on:- A positive reward (e.g., 1000) when the goal is reached.
- Negative rewards for hitting walls or not making progress.
- A small bonus per step to encourage faster solutions.
-
Group Relative Advantage Estimation:
After collecting a batch of trajectories, the agent computes relative advantages by comparing each trajectory's reward to the group average. This group-relative advantage helps guide the policy update. -
Iterative GRPO Update (Equation 21):
The policy is updated over multiple GRPO iterations per trajectory update. The update uses:- A surrogate loss similar to PPO with clipping.
- A KL divergence penalty against a reference policy to maintain stability.
- Configurable hyperparameters such as the clipping parameter (
$\epsilon$ ), KL coefficient ($\beta$ ), and the number of GRPO iterations ($\mu$ ).
-
Reward Model Update:
The reward model is continuously updated via a replay mechanism using the experiences stored in an experience buffer. An MSE loss is used (which can be replaced with a contrastive loss) to refine reward predictions over time.
-
Real-time Visualization:
Watch the agent learn in real-time using PyGame. -
Iterative GRPO Updates:
The agent performs multiple GRPO iterations per trajectory update to efficiently optimize its policy based on group-relative advantages. -
Reward Model Replay:
The reward model is updated using a replay buffer to ensure robust reward estimation. -
Configurable Hyperparameters:
Easily adjust key parameters such as the clipping parameter ($\epsilon$ ), KL coefficient ($\beta$ ), and GRPO iteration count ($\mu$ ). -
Performance Monitoring:
Training progress is logged via TensorBoard, showing metrics such as average reward, best reward, success rate, and steps per episode.
python >= 3.8
torch
pygame
numpy
tensorboard-
Clone the repository:
git clone https://github.com/jeffasante/grpo-maze-solver cd grpo-maze-solver -
Install dependencies:
pip install torch numpy pygame tensorboard
To train a new agent with GRPO:
python maze_grpo.pyThe script will automatically detect the best available device (CUDA, MPS, or CPU). On Apple Silicon Macs, it will use MPS for accelerated training.
Note:
The training script now accepts a new parameter grpo_iterations (default is 3) that controls how many GRPO updates are performed per trajectory update. You can modify this (and other hyperparameters such as epsilon and beta) directly in the code.
An example of the training output might look like:
Using device: mps
Starting Iteration 1/3
Starting Episode 1/5000
Trajectory 1: Success! Steps = 231, Final Reward = 1057.52
Trajectory 2: Success! Steps = 214, Final Reward = 1055.43
Trajectory 3: Success! Steps = 790, Final Reward = 1202.52
Trajectory 4: Success! Steps = 367, Final Reward = 1095.95
Trajectory 5: Success! Steps = 880, Final Reward = 1227.65
Trajectory 6: Success! Steps = 737, Final Reward = 1187.62
...
New best model saved! Reward: 1227.65
To test a trained agent:
python test_maze.py-
Policy Network (GRPONetwork):
Receives the state input ([x, y, target_x, target_y]) and outputs action probabilities for each movement direction. -
Reward Network (GRPORewardNetwork):
Combines the state and one-hot encoded action to predict a reward value.
The GRPO update follows a PPO-like objective with an iterative update loop:
ratio = exp(new_log_probs - old_log_probs)
surr1 = ratio * advantage
surr2 = clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage
policy_loss = -min(surr1, surr2).mean()
kl_loss = beta * KL(π_θ || π_ref)
total_loss = policy_loss + kl_lossThis objective is iteratively optimized (controlled by grpo_iterations) to better align the policy with group-relative advantages.
The reward model is updated using a replay mechanism:
- Experiences (state, action, reward) are stored in an ExperienceBuffer.
- When the buffer is sufficiently large, a batch is sampled and used to update the reward model via MSE loss (or another loss function as desired).
Key parameters include:
num_episodes = 5000 # Total training episodes
group_size = 64 # Number of trajectories per update
grpo_iterations = 3 # GRPO update iterations (μ)
learning_rate = 1e-4 # Learning rate for policy updates
gamma = 0.99 # Discount factor
epsilon = 0.2 # Clipping parameter for PPO-style loss
beta = 0.04 # KL divergence coefficientYou can adjust these values in the source code to suit your training requirements.
Training metrics are logged via TensorBoard. To view the logs:
tensorboard --logdir=runs/maze_training- Maze environment based on: FrankRuis's Maze Implementation
- Visualization using PyGame.
- Shao, Z., Wang, P., Zhu, Q., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models – Introduces the GRPO algorithm.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
This project is licensed under the MIT License – see the LICENSE file for details.
Author: Jeffrey Asante
Date: 2025-02-08
GitHub: github.com/jeffasante