A simulation framework, written in Python with pygame, for studying how a
swarm of differential-drive robots can learn to keep a formation while
travelling toward a goal. The headline contribution is a multi-verse (parallel
worlds) variant of Q-learning: several independent simulations run in parallel
and periodically exchange the best Q-tables when a single world is the leader
on both the formation-disruption and trajectory-disruption metrics.
The project was developed as part of a master's thesis. It includes both a classical, distance-based formation controller (used as a baseline / reference behaviour) and a learning controller trained from scratch with tabular Q-learning.
thesis_robotics/
├── base/ # Active codebase
│ ├── main.py # Entry point (training / demo / baseline)
│ ├── q_learn.py # QLearn(Simulation): the Q-learning loop
│ ├── simulation.py # Simulation: classical/baseline simulator
│ ├── formation.py # Formation + Trajectory definitions
│ ├── state.py # State representation seen by the agents
│ ├── sensor.py # Ultrasonic ray-cast sensor
│ ├── graphics.py # pygame rendering
│ ├── equilateral.py # Helper to compute the third vertex of a triangle
│ ├── results.py # Plot stored reward curves
│ ├── simple.job # Slurm batch script (headless training)
│ ├── dependencies # Plain-text list of Python deps
│ ├── robots/
│ │ ├── robot.py # Base Robot (kinematics, sensors, paths)
│ │ ├── distance_robot.py # Potential-field distance controller (baseline)
│ │ ├── learn_robot.py # Q-learning agent
│ │ └── swarm.py # Swarm + LearnSwarm aggregators
│ ├── utils/
│ │ ├── constants.py # Action / direction / spacing constants
│ │ ├── dimensions.py # Map, robot, sensor sizes
│ │ ├── counter.py # Default-zero dict used as Q-table
│ │ ├── utils.py # Geometry / heading / colour helpers
│ │ ├── qlearn_utils.py # Pickle save/load, plotting, spacing helpers
│ │ └── simulation_utils.py # Distance logs, pygame quit handling
│ ├── sprites/ # Map, robot and area images (.png/.kra)
│ └── TRAINED_FILES/ # Pickled Q-tables and reward/distance logs
├── v3_backup/ # Older snapshot of main.py / learn_robot.py + logs
├── .vscode/
└── .gitignore
The active code lives under base/. v3_backup/ keeps a previous iteration
of the same files for reference and is excluded by .gitignore.
Python 3.8+ and the libraries listed in base/dependencies:
- pygame
- numpy
- scipy
- scikit-learn
- matplotlib
Install with:
pip install pygame numpy scipy scikit-learn matplotlib
The simulator opens a pygame window by default. On a server / cluster, run
with --headless to use the SDL dummy video driver.
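With SDL, headless operation usually comes down to setting the SDL_VIDEODRIVER environment variable before pygame is initialised. The sketch below illustrates that idea only; `enable_headless` is a hypothetical helper, and how main.py actually wires up --headless is not shown in this README.

```python
import os


def enable_headless(env=os.environ):
    """Select SDL's dummy video driver (hypothetical helper).

    Must be called before pygame.init() / pygame.display.set_mode(),
    otherwise SDL has already picked a real video driver.
    """
    env["SDL_VIDEODRIVER"] = "dummy"
    return env
```

Passing a plain dict instead of `os.environ` makes the helper easy to try out without touching the real process environment.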
All commands are run from inside base/:
cd base
Demo mode loads the pickled Q-tables in TRAINED_FILES/trained_controller and replays
each learned world (or a random one if ALL = False in main.py):
python3 main.py --demo
To run the baseline, set Q_LEARN = False in main.py and run any version. The baseline uses the
distance / potential-field controller from robots/distance_robot.py:
python3 main.py -v 0
main.py exposes four behaviour modes via -v/--version (mapped to the
PROGRESS variable):
| -v | Mode |
|---|---|
| -1 | Demo (same as --demo) |
| 0 | Single-world training, optionally pooled |
| 1 | Parallel worlds, no info exchange |
| 2 | Parallel worlds with best-world info exchange |
The default (no flag) is -v 2, the multi-verse variant studied in the
thesis. To resume from the last checkpoint in TRAINED_FILES/:
python3 main.py -v 2 --resume
To run on a headless machine (no display):
python3 main.py -v 2 --headless
A Slurm batch script (simple.job) is provided for cluster runs:
sbatch simple.job
After (or during) training, regenerate REWARDS.png from the pickled reward
log:
python3 results.py
The map (sprites/MAP.png) is 800 × 300 px and contains black obstacles
that the robots' ultrasonic ray-cast sensors detect by reading pixel colours
(sensor.py). Three robots are spawned in a tight triangle on the left edge
and must reach a goal point on the right while keeping their relative
distances close to ideal_dist = 50 px.
- Robot (robots/robot.py): differential-drive kinematics, an Ultrasonic sensor with n_rays = 9 over a 180° fan, and a basic obstacle-avoidance controller that biases left/right wheel speeds based on ray distances.
- DistanceRobot: overrides the controller with an artificial potential field (_ro_ij, _p_ij_tilda in utils/utils.py) that pulls each robot toward the desired inter-robot distance. Used as the non-learning baseline.
- LearnRobot: Q-learning agent. Holds its own Counter-backed Q-table (utils/counter.py), an action history, and a small set of discrete actions: a hard heading change to one of 8 compass directions, plus ACCELERATE / DECELERATE / STRAIGHT.
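For readers unfamiliar with differential-drive kinematics, the textbook pose update can be sketched as follows. This is a generic illustration, not the code in robots/robot.py: the function name, the `wheel_base` parameter and the sign convention for turning are all assumptions (pygame's y-axis points down, so the real code may flip the turn direction).

```python
import math


def diff_drive_step(x, y, theta, v_left, v_right, wheel_base, dt):
    """Advance a differential-drive pose by one time step (generic sketch).

    v_left / v_right are wheel speeds, wheel_base is the distance
    between the wheels, theta is the heading in radians.
    """
    v = (v_left + v_right) / 2.0             # forward speed of the body
    omega = (v_right - v_left) / wheel_base  # turn rate from the speed difference
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta = (theta + omega * dt) % (2 * math.pi)
    return x, y, theta
```

Equal wheel speeds drive the robot straight; opposite speeds spin it in place. That is why a controller that biases the left and right wheel speeds is enough to steer around obstacles.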
For each robot (state.py):
[ self.heading,
towards_goal (bool),
spacing(other₁) ∈ {IN_RANGE, TOO_FAR, TOO_CLOSE},
relative_direction(other₁),
other₁.heading,
spacing(other₂),
relative_direction(other₂),
other₂.heading ]
The continuous heading is bucketed into 8 directions
(get_direction_from_heading) and inter-robot distances are bucketed into
3 spacing classes around ideal_dist (±15 px tolerance).
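The two bucketing steps can be sketched as below. The function names `bucket_heading` and `classify_spacing` are hypothetical, and several details are assumptions: headings are in radians, the 8 sectors are centred on the compass directions, and the ±15 px boundaries are inclusive. The project's own get_direction_from_heading may number or orient the directions differently.

```python
import math

IDEAL_DIST = 50.0  # px, from the README
TOLERANCE = 15.0   # px, from the README


def bucket_heading(heading):
    """Map a heading in radians to one of 8 sector indices (0..7).

    Sectors are 45 degrees wide and centred on multiples of 45 degrees,
    so headings just below 2*pi wrap back to sector 0.
    """
    sector = math.pi / 4
    return int(((heading % (2 * math.pi)) + sector / 2) // sector) % 8


def classify_spacing(dist, ideal=IDEAL_DIST, tol=TOLERANCE):
    """Classify an inter-robot distance into one of 3 spacing classes."""
    if dist < ideal - tol:
        return "TOO_CLOSE"
    if dist > ideal + tol:
        return "TOO_FAR"
    return "IN_RANGE"
```

Under these assumptions each robot's state takes values from a finite set (8 directions, 3 spacing classes, one boolean), which keeps a tabular Q-table tractable.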
Implemented in LearnRobot.compute_reward:
- +30000 / dist_to_endpoint: pull toward the goal.
- +100 per neighbour that is IN_RANGE.
- -10 per neighbour that is TOO_FAR or TOO_CLOSE.
- -1 per step (penalises slow solutions).
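As a worked illustration of how the four terms combine, here is a minimal sketch. `sketch_reward` is a hypothetical stand-in, not LearnRobot.compute_reward itself; the string labels and the small guard against division by zero are assumptions.

```python
def sketch_reward(dist_to_endpoint, neighbour_spacings):
    """Combine the four reward terms listed above (illustrative only)."""
    # Goal attraction; the max() guard against a zero distance is an assumption.
    reward = 30000.0 / max(dist_to_endpoint, 1e-6)
    for spacing in neighbour_spacings:
        # +100 for a well-spaced neighbour, -10 for TOO_FAR / TOO_CLOSE.
        reward += 100 if spacing == "IN_RANGE" else -10
    reward -= 1  # per-step penalty
    return reward
```

For a robot 300 px from the goal with one neighbour in range and one too far, this gives 100 + 100 - 10 - 1 = 189. The goal term grows as the distance shrinks, so it outweighs the formation terms close to the endpoint.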
q_learn.py runs the standard tabular update
Q(s,a) ← Q(s,a) + α · ( r + γ · maxₐ' Q(s',a') − Q(s,a) )
at a fixed timer step (0.05 s / training_speed) inside the pygame loop.
Default hyper-parameters in main.py:
- α = 0.7 (learning rate)
- γ = 0.9 (discount)
- ρ = 0.2 (ε-greedy exploration probability)
- train_iterations = 1000
- training_speed = 10 (simulation frames per real frame)
- sim_duration = 1500 ticks per episode
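The update rule and the ε-greedy choice can be sketched with the default hyper-parameters as follows. This is a generic tabular Q-learning illustration, not the code in q_learn.py: `choose_action` and `q_update` are hypothetical names, and `defaultdict(float)` stands in for the default-zero Counter in utils/counter.py.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, RHO = 0.7, 0.9, 0.2  # defaults quoted above


def choose_action(q, state, actions, rho=RHO, rng=random):
    """Epsilon-greedy: explore with probability rho, else pick the best action."""
    if rng.random() < rho:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def q_update(q, state, action, reward, next_state, actions,
             alpha=ALPHA, gamma=GAMMA):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])


# States must be hashable, e.g. a tuple of the bucketed features.
q_table = defaultdict(float)
q_update(q_table, "s", "a", 10.0, "s2", ["a", "b"])
```

Starting from an all-zero table, the single update above moves Q("s", "a") to 0.7 · 10 = 7.0, which shows how strongly α = 0.7 weights new experience.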
A "STRAIGHT_START" warm-up forces every agent to drive straight for the first
train_iterations / 8 ticks so that the early Q-table is bootstrapped from
sensible trajectories instead of random spinning.
Each iteration spins up learning_worlds independent QLearn simulations
in parallel via multiprocessing.Pool. After every iteration each world
returns:
- formation_disr: accumulated triangle-area deviation (formation quality).
- traj_disr: variance of trapezoidal areas under each robot's path (trajectory quality, computed in Trajectory.compute_total_traj_disruption).
- the updated Q-tables, total rewards and distance logs.
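One plausible reading of the two metrics is sketched below. The exact definitions live in formation.py and may differ. The assumptions here: the reference shape is an equilateral triangle with side ideal_dist, the deviation is an absolute area difference, and the variance is the population variance. All function names are hypothetical.

```python
import math
from statistics import pvariance


def triangle_area(p1, p2, p3):
    """Area of the triangle spanned by three (x, y) points."""
    return abs((p2[0] - p1[0]) * (p3[1] - p1[1])
               - (p3[0] - p1[0]) * (p2[1] - p1[1])) / 2.0


def formation_deviation(p1, p2, p3, ideal_dist=50.0):
    """Per-tick deviation from the area of the ideal equilateral triangle."""
    ideal_area = math.sqrt(3) / 4.0 * ideal_dist ** 2
    return abs(triangle_area(p1, p2, p3) - ideal_area)


def area_under_path(path):
    """Trapezoidal-rule area under a polyline of (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(path, path[1:]))


def trajectory_disruption(paths):
    """Variance of the per-robot areas; 0 when all paths enclose equal area."""
    return pvariance([area_under_path(p) for p in paths])
```

Summing `formation_deviation` over every tick of an episode would give an accumulated score in the spirit of formation_disr. Under this reading, lower values on both metrics mean a better-kept formation.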
If the same world is ranked first on both formation_disr and traj_disr,
its Q-tables are broadcast to every other world before the next iteration —
otherwise each world keeps its own tables. The exchange counter and the
iteration index of every exchange are persisted to
TRAINED_FILES/info_exch[_counter].
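The exchange rule can be sketched as follows. This is an illustration under assumptions, not the code in main.py: each world is a dict with hypothetical keys, "ranked first" is taken to mean the lowest disruption value, and the leader's tables are deep-copied so the worlds do not share mutable state afterwards.

```python
import copy


def exchange_q_tables(worlds):
    """Broadcast the leader's Q-tables if one world leads on both metrics.

    worlds: list of dicts with keys 'formation_disr', 'traj_disr', 'q_tables'.
    Returns the leader's index, or None when no exchange takes place.
    """
    indices = range(len(worlds))
    best_form = min(indices, key=lambda i: worlds[i]["formation_disr"])
    best_traj = min(indices, key=lambda i: worlds[i]["traj_disr"])
    if best_form != best_traj:
        return None  # split leadership: every world keeps its own tables
    for i, world in enumerate(worlds):
        if i != best_form:
            world["q_tables"] = copy.deepcopy(worlds[best_form]["q_tables"])
    return best_form
```

Requiring the same world to lead on both metrics is a conservative gate: a world that is good at only one objective does not overwrite the others.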
Checkpoints are written every 10 iterations to TRAINED_FILES/:
| File | Contents |
|---|---|
| trained_controller | List of Q-tables, one per learning world |
| tot_avg_rewards | Mean rewards per iteration across worlds |
| tot_min_rewards | Min rewards per iteration |
| tot_max_rewards | Max rewards per iteration |
| all_dists_logs | Per-pair inter-robot distance averages |
| info_exch | Iterations on which an exchange happened |
| info_exch_counter | Total number of exchanges |
| iter_counter | Cumulative iteration counter |
--resume reloads all of the above and continues training.
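Checkpointing of this kind can be done with the standard pickle module. The sketch below shows the general pattern; `save_checkpoint` and `load_checkpoint` are hypothetical names, and the project's own helpers in utils/qlearn_utils.py may have different signatures.

```python
import pickle
from pathlib import Path


def save_checkpoint(directory, name, obj):
    """Pickle obj to <directory>/<name>, creating the directory if needed."""
    path = Path(directory) / name
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_checkpoint(directory, name, default=None):
    """Load a pickled object, or return default if the file is missing."""
    path = Path(directory) / name
    if not path.exists():
        return default
    with open(path, "rb") as f:
        return pickle.load(f)
```

A resume step could then call, for example, `load_checkpoint("TRAINED_FILES", "iter_counter", default=0)` so that a fresh run and a resumed run share the same code path. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.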
main.py
└── QLearn(Simulation) q_learn.py
├── LearnSwarm(Swarm) robots/swarm.py
│ └── LearnRobot(Robot) ×3 robots/learn_robot.py
│ └── State state.py
├── Formation formation.py
│ └── Trajectory ×3
└── Graphics (pygame) graphics.py
The most useful knobs live near the top of base/main.py:
Q_LEARN = False # baseline vs Q-learn
training_speed = 10
exploration_rho = 0.2 # ε
lr_alpha = 0.7 # α
discount_rate_gamma = 0.9 # γ
train_iterations = 1000
RANDOM_START = False
STRAIGHT_START = True
learning_worlds = 3 # parallel worlds (Pool size)
formation_discount = 0.9
trajectory_discount = 0.7
Map / sensor / robot sizes live in base/utils/dimensions.py and the action
& direction vocabularies in base/utils/constants.py.
- Some files include an OS-specific Python match statement that has been commented out and replaced with if/elif chains, for compatibility with Python ≤ 3.9. Re-enable the match versions if you are on 3.10+.
- utils/qlearn_utils.py redefines GLOBAL_DIRECTIONS and LOCAL_DIRECTIONS with values that differ from utils/constants.py. The constants from constants.py are the ones actually used by the agents; the duplicates at the bottom of qlearn_utils.py are dead code.
- main.py contains two if __name__ == "__main__" and PROGRESS == 0 blocks back-to-back; only the first one ever runs.
- The Slurm script in simple.job requests the red/brown partition and a 64-core node; adapt to your cluster before submitting.
- v3_backup/ is an older snapshot kept for reference; it is excluded by .gitignore and is not the code that is executed.
base/REWARDS.png and base/lol.png in the repository were produced by
running results.py against the checkpoints under base/TRAINED_FILES/.
After a training run, regenerate them with:
cd base
python3 results.py