Girish Krishnan | LinkedIn | GitHub
This project demonstrates Single- and Multi-Agent Reinforcement Learning using Q-Learning on a grid-world environment. Agents start at position $(0,0)$ and must reach a goal flag while avoiding mines. Three implementations are provided:
- Python (NumPy)
- C++ (single-threaded)
- C++ with CUDA (GPU-accelerated)
Q-Learning is a model-free reinforcement learning algorithm in which we learn an action-value function $Q(s,a)$ with the update rule

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

where:

- $s$ is the current state
- $a$ is the action taken
- $r$ is the reward received
- $s'$ is the next state
- $\alpha$ is the learning rate
- $\gamma$ is the discount factor
- $Q(s,a)$ is the Q-value for the current state-action pair
- $\max_{a'} Q(s',a')$ is the maximum Q-value over actions in the next state $s'$
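To make the update concrete, here is a minimal NumPy sketch of a single tabular update. The state encoding, reward value, and action index are illustrative assumptions, not values taken from the repository's code.

```python
import numpy as np

# Illustrative values only: a 32x32 grid with 4 actions (up, down, left, right).
grid_size, n_actions = 32, 4
alpha, gamma = 0.1, 0.9                      # learning rate, discount factor
Q = np.zeros((grid_size, grid_size, n_actions))

s = (0, 0)         # current state (row, column)
a = 3              # action taken (index into the 4 actions)
r = -1.0           # hypothetical reward received
s_next = (0, 1)    # next state

# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
td_target = r + gamma * np.max(Q[s_next])
Q[s][a] += alpha * (td_target - Q[s][a])
```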
The dependencies for each implementation are:

- Python (for single_agent.py and multi_agent.py)
  - Python 3.x
  - numpy, matplotlib, tqdm, argparse
- C++ (for single_agent.cpp and multi_agent.cpp)
  - A C++11 (or later) compiler
  - ImageMagick (Magick++) if you want to generate GIFs of the policy in action
- CUDA (for single_agent.cu and multi_agent.cu)
  - An NVIDIA GPU with the CUDA toolkit installed
  - A C++ compiler that supports CUDA (e.g. nvcc)
  - ImageMagick (Magick++) if you want to generate GIFs of the policy in action
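The Python packages can typically be installed with pip, e.g. `pip install numpy matplotlib tqdm` (argparse is part of the Python standard library).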
Below are the command-line arguments for each program; the defaults listed match those in the code. First, the Python implementation (single_agent.py):
| Argument | Type | Default | Description |
|---|---|---|---|
| --size | int | 32 | Grid size (square). |
| --n_mines | int | 40 | Number of mines in the environment. |
| --flag_x | int | 31 | X-coordinate of the goal position. |
| --flag_y | int | 31 | Y-coordinate of the goal position. |
| --episodes | int | 20000 | Number of episodes to train the agent. |
| --alpha | float | 0.1 | Learning rate. |
| --gamma | float | 0.9 | Discount factor. |
| --epsilon | float | 0.1 | Exploration rate. |
| --gif_path | str | "" | Path to save the GIF of the policy in action. If not provided, the GIF is not saved. |
| --max_steps | int | 100 | Maximum number of steps for visualization (GIF). |
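For reference, these flags correspond to a standard argparse interface; a minimal sketch of such a parser (an illustration of the table above, not the repository's actual source) could look like:

```python
import argparse

# Illustrative parser mirroring the argument table above.
parser = argparse.ArgumentParser(description="Single-agent Q-learning on a grid world")
parser.add_argument("--size", type=int, default=32, help="Grid size (square)")
parser.add_argument("--n_mines", type=int, default=40, help="Number of mines")
parser.add_argument("--flag_x", type=int, default=31, help="X-coordinate of the goal")
parser.add_argument("--flag_y", type=int, default=31, help="Y-coordinate of the goal")
parser.add_argument("--episodes", type=int, default=20000, help="Training episodes")
parser.add_argument("--alpha", type=float, default=0.1, help="Learning rate")
parser.add_argument("--gamma", type=float, default=0.9, help="Discount factor")
parser.add_argument("--epsilon", type=float, default=0.1, help="Exploration rate")
parser.add_argument("--gif_path", type=str, default="", help="Where to save the policy GIF")
parser.add_argument("--max_steps", type=int, default=100, help="Max steps for the GIF rollout")
args = parser.parse_args()
```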
Run Example:

```bash
python single_agent.py --size 32 --n_mines 40 --episodes 20000 --gif_path single_agent_policy.gif
```

The C++ implementation (single_agent.cpp) takes the following arguments:

| Argument | Type | Default | Description |
|---|---|---|---|
| --size | int | 32 | Grid size (square). |
| --n_mines | int | 40 | Number of mines in the environment. |
| --flag_x | int | 31 | X-coordinate of the goal position. |
| --flag_y | int | 31 | Y-coordinate of the goal position. |
| --episodes | int | 20000 | Number of episodes to train the agent. |
| --alpha | float | 0.1 | Learning rate. |
| --gamma | float | 0.9 | Discount factor. |
| --epsilon | float | 0.1 | Exploration rate. |
| --max_steps | int | 200 | Maximum number of steps for visualization (GIF). |
| --cell_size | int | 20 | Size of each cell (in pixels) in the visualization. |
| --gif_path | str | "" | Path to save the GIF of the policy in action. If not provided, the GIF is not saved. |
Compilation (example):

Note: You may need to use the -I flag to specify the path to the ImageMagick headers.

```bash
g++ -std=c++11 single_agent.cpp -o single_agent -lMagick++ -lMagickCore -lMagickWand
```

Run Example:

```bash
./single_agent --size 32 --episodes 20000 --gif_path single_agent_policy.gif
```

The CUDA implementation (single_agent.cu) takes the following arguments:

| Argument | Type | Default | Description |
|---|---|---|---|
| --size | int | 32 | Grid size (square). |
| --n_mines | int | 40 | Number of mines in the environment. |
| --flag_x | int | 31 | X-coordinate of the goal position. |
| --flag_y | int | 31 | Y-coordinate of the goal position. |
| --episodes | int | 20000 | Number of episodes to train the agent. |
| --alpha | float | 0.1 | Learning rate. |
| --gamma | float | 0.9 | Discount factor. |
| --epsilon | float | 0.1 | Exploration rate. |
| --max_steps | int | 1000 | Max steps per episode. |
| --blocks | int | 128 | Number of blocks in the CUDA grid. |
| --threads_per_block | int | 128 | Number of threads per block in CUDA. |
| --gif_path | str | "" | Path to save the GIF of the policy in action. If not provided, the GIF is not saved. |
| --cell_size | int | 20 | Size of each cell (in pixels) in the visualization. |
| --gif_max_steps | int | 200 | Maximum number of steps for visualization (GIF). |
Compilation (example):

```bash
nvcc single_agent.cu -o single_agent_cuda -lMagick++ -lMagickCore -lMagickWand
```

Run Example:

```bash
./single_agent_cuda --episodes 20000 --blocks 128 --threads_per_block 128 --gif_path single_agent_policy.gif
```

The multi-agent Python implementation (multi_agent.py) takes the following arguments:

| Argument | Type | Default | Description |
|---|---|---|---|
| --size | int | 46 | Grid size (square). |
| --n_mines | int | 96 | Number of mines in the environment. |
| --flag_x | int | 45 | X-coordinate of the goal position. |
| --flag_y | int | 45 | Y-coordinate of the goal position. |
| --n_agents | int | 512 | Number of agents in the environment. |
| --episodes | int | 1000 | Number of episodes to train the agents. |
| --alpha | float | 0.1 | Learning rate. |
| --gamma | float | 0.9 | Discount factor. |
| --epsilon | float | 0.1 | Exploration rate. |
| --gif_path | str | "" | Path to save the GIF of the policy in action. If not provided, the GIF is not saved. |
| --max_steps | int | 100 | Maximum number of steps for visualization (GIF). |
Run Example:

```bash
python multi_agent.py --size 46 --n_agents 512 --episodes 1000 --gif_path multiagent_policy.gif
```

The multi-agent C++ implementation (multi_agent.cpp) takes the following arguments:

| Argument | Type | Default | Description |
|---|---|---|---|
| --size | int | 46 | Grid size (square). |
| --n_mines | int | 96 | Number of mines in the environment. |
| --flag_x | int | 45 | X-coordinate of the goal position. |
| --flag_y | int | 45 | Y-coordinate of the goal position. |
| --n_agents | int | 512 | Number of agents in the environment. |
| --episodes | int | 1000 | Number of episodes to train the agents. |
| --alpha | float | 0.1 | Learning rate. |
| --gamma | float | 0.9 | Discount factor. |
| --epsilon | float | 0.1 | Exploration rate. |
| --max_steps | int | 200 | Maximum number of steps for visualization (GIF). |
| --cell_size | int | 20 | Size of each cell (in pixels) in the visualization. |
| --gif_path | str | "" | Path to save the GIF of the policy in action. If not provided, the GIF is not saved. |
Compilation (example):

```bash
g++ -std=c++11 multi_agent.cpp -o multi_agent -lMagick++ -lMagickCore -lMagickWand
```

Run Example:

```bash
./multi_agent --size 46 --n_agents 512 --episodes 1000 --gif_path multiagent_policy.gif
```

The multi-agent CUDA implementation (multi_agent.cu) takes the following arguments:

| Argument | Type | Default | Description |
|---|---|---|---|
| --size | int | 46 | Grid size (square). |
| --n_mines | int | 96 | Number of mines in the environment. |
| --flag_x | int | 45 | X-coordinate of the goal position. |
| --flag_y | int | 45 | Y-coordinate of the goal position. |
| --n_agents | int | 512 | Number of agents in the environment. |
| --episodes | int | 1000 | Number of episodes to train the agents. |
| --alpha | float | 0.1 | Learning rate. |
| --gamma | float | 0.9 | Discount factor. |
| --epsilon | float | 0.1 | Exploration rate. |
| --max_steps_per_episode | int | 1000 | Max steps per episode. |
| --threads_per_block | int | 256 | Number of threads per block in CUDA. |
| --gif_path | str | "" | Path to save the GIF of the policy in action. If not provided, the GIF is not saved. |
| --cell_size | int | 20 | Size of each cell (in pixels) in the visualization. |
| --gif_max_steps | int | 200 | Maximum number of steps for visualization (GIF). |
Note: there is no --blocks argument here because the number of blocks is computed as `blocks = (n_agents + threads_per_block - 1) / threads_per_block;`.
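For example, with the defaults above (--n_agents 512 and --threads_per_block 256), this works out to (512 + 255) / 256 = 2 blocks under integer division.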
Compilation (example):

```bash
nvcc multi_agent.cu -o multi_agent_cuda -lMagick++ -lMagickCore -lMagickWand
```

Run Example:

```bash
./multi_agent_cuda --n_agents 512 --episodes 1000 --threads_per_block 256 --gif_path multiagent_policy.gif
```

Here is an example output visualization obtained from running one of the Python programs:
Here is an example output visualization obtained from running one of the C++ programs. The visualization is slightly different because the C++ version uses ImageMagick instead of Matplotlib for generating the GIF.
The single-agent Q-learning procedure is:

1. Initialize a Q-table of size $\text{gridSize} \times \text{gridSize} \times 4$, where the 4 represents the 4 possible actions (up, down, left, right).
2. Start the agent at position $(0,0)$.
3. At each step, choose an action $a$ using an $\epsilon$-greedy policy, where $\epsilon$ is the exploration rate.
4. Execute the action, observe the next state $s'$ and the reward $r$.
5. Update the Q-value using the Q-learning update rule: $$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$
6. Repeat until the agent reaches the flag or hits a mine (episode ends).
7. Repeat steps 2-6 for a fixed number of episodes.
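Putting the steps together, a compact NumPy sketch of the training loop might look like the following. The step helper, reward scheme, and terminal handling are illustrative assumptions, not the repository's exact code.

```python
import numpy as np

rng = np.random.default_rng(0)
grid_size, n_actions = 32, 4
alpha, gamma, epsilon, episodes = 0.1, 0.9, 0.1, 20000
Q = np.zeros((grid_size, grid_size, n_actions))             # step 1: Q-table

def train(step):
    # step(state, action) is an assumed helper returning (next_state, reward, done),
    # e.g. a positive reward at the flag and a negative reward on a mine.
    for _ in range(episodes):                                # step 7: repeat over episodes
        state, done = (0, 0), False                          # step 2: start at (0, 0)
        while not done:
            if rng.random() < epsilon:                       # step 3: epsilon-greedy action
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = step(state, action)   # step 4: act and observe
            td_target = reward + gamma * np.max(Q[next_state])
            Q[state][action] += alpha * (td_target - Q[state][action])  # step 5: update
            state = next_state                               # step 6: until flag or mine
```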
In the multi-agent version:
- Multiple agents each maintain a position in the grid.
- Each agent picks an action and updates the global Q-table with the same rule as single-agent Q-learning.
- Once an agent dies (hits a mine) or finishes (reaches flag), it becomes inactive.
- If fewer than 20% of the agents are active, reset the environment and start a new episode.
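A rough Python sketch of this shared-Q-table scheme, under the same illustrative assumptions as above (the env_step helper and reward values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
grid_size, n_actions, n_agents = 46, 4, 512
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = np.zeros((grid_size, grid_size, n_actions))     # one Q-table shared by all agents

def run_episode(env_step):
    # env_step(state, action) is an assumed helper returning (next_state, reward, done).
    states = [(0, 0)] * n_agents
    active = np.ones(n_agents, dtype=bool)
    while active.mean() >= 0.2:                      # reset once < 20% of agents are active
        for i in np.flatnonzero(active):
            s = states[i]
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, reward, done = env_step(s, a)
            Q[s][a] += alpha * (reward + gamma * np.max(Q[s_next]) - Q[s][a])
            states[i] = s_next
            if done:                                 # the agent hit a mine or reached the flag
                active[i] = False
```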
The CUDA implementations parallelize this as follows:
- In single-agent CUDA, each thread in a CUDA block runs an entire episode independently.
- In multi-agent CUDA, all agents share a global Q-table. A kernel reset initializes agent states in parallel, and then each agent chooses actions concurrently, updating Q using atomic operations.
- This allows many episodes (or many agents) to run in parallel on the GPU, significantly speeding up training.
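For instance, with the single-agent CUDA defaults (--blocks 128 and --threads_per_block 128), 128 × 128 = 16,384 threads run episodes independently in parallel.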
This project is licensed under the MIT License - see the LICENSE file for more details.


