This is the companion code for the benchmarking study reported in the paper "Scaling RL for Autonomous Driving Is Not Enough: Behavior Benchmark for True Generalization", submitted to NeurIPS 2026. The paper can be found at http://arxiv.org/abs/xxxx.xxxx. The code allows users to reproduce and extend the results reported in the study. Please cite the above paper when reporting, reproducing, or extending the results.
This software is a research prototype, solely developed for and published as part of the publication above.
The companion code is a fork of PufferDrive. Below are instructions for setting up PufferDrive as well as the BehaviorBench additions.
PufferDrive is a fast and friendly driving simulator to train and test RL-based models.
Docs: https://emerge-lab.github.io/PufferDrive
Clone the repo
git clone https://github.com/Emerge-Lab/PufferDrive.git
Make a venv (uv venv) and activate it
source .venv/bin/activate
Inside the venv, install the dependencies
uv pip install -e .
Compile the C code
python setup.py build_ext --inplace --force
Run this while your virtual environment is active so the extension is built against the right interpreter.
To test your setup, you can run
puffer train puffer_drive
See also the puffer docs.
Start a training run
puffer train puffer_drive
| Document | Description |
|---|---|
| EVALUATION.md | Planner evaluation framework (eval.py) - ego planners vs traffic controllers |
| REALISM_EVALUATION.md | WOSAC realism evaluation (eval_realism.py) - distributional realism metrics |
| BENCHMARK.md | Interactive benchmark extraction from Waymo data |
| TRAINING.md | PufferRL PPO training configuration and usage |
| PREDICTION.md | SMART prediction model - architecture, training, evaluation |
| NUPLAN.md | nuPlan integration - evaluate PufferDrive planners on nuPlan |
PufferDrive includes a config-driven evaluation framework for testing different planning algorithms as ego planners and traffic controllers. See EVALUATION.md for comprehensive documentation.
The repository includes pre-trained weights in weights/:
| File | Description |
|---|---|
| `weights/simple_ppo.pt` | PPO policy trained with self-play and a simple reward on WOMD |
| `weights/conditioned_ppo.pt` | Conditioned PPO policy trained with self-play on WOMD |
| `weights/smart_epoch_030.pt` | SMART prediction model (1M params, 30 epochs) |
We provide two curated evaluation splits:
| Split | Name | Description |
|---|---|---|
| `interactive1k` | Interactive1k | 1,000 most interactive WOMD validation scenarios |
| `random1k` | Random1k | 1,000 random WOMD validation scenarios |
See BENCHMARK.md for data preprocessing, download instructions and benchmark details.
| Planner | Description |
|---|---|
| `pdm` | Predictive Driver Model. Proposes trajectory candidates via IDM with different velocities/offsets and selects the best. |
| `ppo` | Pre-trained RL policy with LSTM. Requires checkpoint weights. |
| `smart` | SMART autoregressive trajectory prediction. Requires trained weights. |
| `idm` | Intelligent Driver Model. Rule-based lane-following. |
| `hybrid` | PPO + PDM |
| `conditioned_aggr` / `conditioned_normal` / `conditioned_caut` | Reward-conditioned PPO variants. |
| `constant_velocity` | Baseline that maintains current velocity. |
| Traffic Agent | Description |
|---|---|
| `idm` | Intelligent Driver Model. Default traffic controller, rule-based lane-following. |
| `pdm` | Predictive Driver Model used as traffic. |
| `ppo` | Pre-trained RL policy used for traffic agents. Requires checkpoint weights. |
| `smart` | SMART autoregressive trajectory prediction for traffic. Requires trained weights. |
| `conditioned_mix` / `conditioned_aggr` / `conditioned_normal` / `conditioned_caut` | Reward-conditioned PPO traffic variants. |
| `expert` | Ground-truth trajectory replay from the Waymo dataset (traffic only). |
| `constant_velocity` | Baseline that maintains current velocity. |
export DRIVE_BINARIES_DATA_ROOT=/path/to/binaries
# PDM ego vs IDM traffic (default)
python pufferlib/ocean/benchmark/eval.py --map-ids 0-10
# SMART ego planner (using provided weights)
python pufferlib/ocean/benchmark/eval.py --planner.type smart \
--planner.smart.weights-path weights/smart_1M_epoch_029.pt --map-ids 0-10
# PPO ego vs SMART traffic (using provided weights)
python pufferlib/ocean/benchmark/eval.py --planner.type ppo \
--planner.ppo.weights-path weights/ppo_self_play.pt \
--traffic.type smart --traffic.smart.weights-path weights/smart_1M_epoch_029.pt
# Enable visualization
python pufferlib/ocean/benchmark/eval.py --eval.viz True --map-ids 5

PufferDrive uses PufferLib for RL training with PPO (Proximal Policy Optimization). Training is configured via INI config files and command-line overrides.
# Start training with default config
puffer train puffer_drive
# Train with custom parameters
puffer train puffer_drive --train.learning-rate 0.001 --train.batch-size 262144
# Resume from checkpoint
puffer train puffer_drive --load-model-path experiments/puffer_drive/checkpoints/model_001000.pt
# Evaluate a trained model
puffer eval puffer_drive --load-model-path experiments/puffer_drive/checkpoints/model_001000.pt

- Environment Creation: Multiple vectorized Drive environments are created across `num_workers` processes, each managing `num_envs` environments
- Experience Collection: The policy collects rollouts of length `bptt_horizon` across all environments simultaneously, filling a batch of size `batch_size`
- Policy Update: The batch is split into `minibatch_size` chunks and PPO gradient updates are applied
- Repeat: Steps 2-3 repeat until `total_timesteps` is reached
- Checkpointing: Model weights are saved every `checkpoint_interval` updates to `experiments/puffer_drive/checkpoints/` (a toy sketch of this loop follows the list)
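The loop above can be pictured with the toy sketch below. The environment interaction is replaced by random tensors and the PPO loss by a plain squared error, so only the structure (collect, split into minibatches, update, checkpoint) and the default sizes are meaningful; none of the names are PufferLib APIs.

```python
# Toy sketch of the collect -> update -> checkpoint cycle described above.
# Random stand-ins replace the environment and the PPO loss; not PufferDrive/PufferLib code.
import torch

num_agents, bptt_horizon = 1024, 32
minibatch_size, checkpoint_interval = 32_768, 1000
obs_dim, act_dim = 1120, 2                        # classic dynamics: acceleration + steering

policy = torch.nn.Sequential(torch.nn.Linear(obs_dim, 256), torch.nn.Tanh(),
                             torch.nn.Linear(256, act_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-3)

for update in range(1, 3):                        # a real run repeats until total_timesteps
    # 1. Experience collection: bptt_horizon steps for every agent (random data here).
    obs = torch.randn(bptt_horizon, num_agents, obs_dim)
    targets = torch.randn(bptt_horizon, num_agents)        # placeholder return targets
    batch, targets = obs.reshape(-1, obs_dim), targets.reshape(-1)
    # 2. Policy update: split the batch into minibatches and take gradient steps.
    for start in range(0, batch.shape[0], minibatch_size):
        mb = slice(start, start + minibatch_size)
        loss = (policy(batch[mb]).sum(-1) - targets[mb]).pow(2).mean()  # stand-in for the PPO loss
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    # 3. Checkpointing every checkpoint_interval updates.
    if update % checkpoint_interval == 0:
        torch.save(policy.state_dict(), f"model_{update:06d}.pt")
```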
The training configuration is defined in two INI files:
- `pufferlib/config/default.ini`: Base defaults for all environments
- `pufferlib/config/ocean/drive.ini`: Drive-specific overrides
Any parameter can be overridden via the command line: `--section.parameter value`
| Parameter | Default | Description |
|---|---|---|
| `num_agents` | 1024 | Total agents managed per environment instance (across all loaded maps) |
| `action_type` | `continuous` | Action space type: `discrete` (7x13 = 91 actions) or `continuous` |
| `dynamics_model` | `classic` | Vehicle dynamics: `classic` (acceleration + steering) or `jerk` (jerk-based, 4x3 = 12 actions) |
| `dt` | 0.1 | Simulation timestep in seconds (10 Hz, matching WOMD) |
| `episode_length` | 91 | Steps per episode (91 steps = 9.1 seconds, a full WOMD scenario) |
| `num_maps` | 80000 | Number of map binaries to load |
| `split` | `training` | Dataset split: `training`, `validation`, `testing`, or a custom path |
| `init_steps` | 0 | Initial trajectory steps to skip (0 = start from the beginning) |
| `control_mode` | `control_vehicles` | Which agents to control: `control_vehicles`, `control_agents`, `control_wosac`, `control_sdc_only` |
| `init_mode` | `create_all_valid` | Agent creation: `create_all_valid` (all agents) or `create_only_controlled` |
| `resample_frequency` | 910 | How often to resample new scenarios (in steps) |
| `termination_mode` | 1 | 0 = terminate at `episode_length`, 1 = terminate after all agents are done |
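A few of the defaults above are derived quantities; the short plain-Python check below (not repository code) makes the arithmetic explicit.

```python
# Plain-Python arithmetic for the derived defaults above (not repository code).
dt = 0.1                      # seconds per step (10 Hz)
episode_length = 91           # steps per episode
resample_frequency = 910      # steps between scenario resamples

print(round(episode_length * dt, 1))          # 9.1 -> seconds per full WOMD scenario
print(resample_frequency // episode_length)   # 10  -> full episodes between resamples
print(7 * 13)                                 # 91  -> discrete actions, classic dynamics
print(4 * 3)                                  # 12  -> discrete actions, jerk dynamics
```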
Reward Parameters:
| Parameter | Default | Description |
|---|---|---|
| `reward_vehicle_collision` | -0.5 | Penalty for colliding with another vehicle |
| `reward_offroad_collision` | -0.5 | Penalty for going off-road |
| `reward_goal` | 1.0 | Reward for reaching the goal |
| `reward_goal_post_respawn` | 0.25 | Reward for reaching goals after the first respawn |
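As a reading aid, the sketch below shows one way these terms could combine into a per-step reward. It is an illustrative assumption, not the simulator's C implementation.

```python
# Illustrative sketch of how the reward terms above could combine per step.
# This is an assumption for readability, not the simulator's C implementation.
def step_reward(collided_with_vehicle, went_offroad, reached_goal, has_respawned,
                r_vehicle=-0.5, r_offroad=-0.5, r_goal=1.0, r_goal_post_respawn=0.25):
    reward = 0.0
    if collided_with_vehicle:
        reward += r_vehicle
    if went_offroad:
        reward += r_offroad
    if reached_goal:
        reward += r_goal_post_respawn if has_respawned else r_goal
    return reward

print(step_reward(False, False, True, False))  # 1.0: first goal reached
print(step_reward(False, False, True, True))   # 0.25: goal reached after a respawn
```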
Goal Behavior:
| Parameter | Default | Description |
|---|---|---|
| `goal_behavior` | 3 | What happens when an agent reaches its goal: 0 = respawn at start, 1 = generate new goal, 2 = stop, 3 = remove agent |
| `goal_radius` | 2.0 | Distance threshold (meters) to consider the goal reached |
| `goal_speed` | 100.0 | Maximum target speed towards the goal (m/s) |
| `goal_target_distance` | 30.0 | Distance for newly generated goals (when `goal_behavior=1`) |
Collision/Offroad Handling:
| Parameter | Default | Description |
|---|---|---|
| `collision_behavior` | 2 | On collision: 0 = ignore, 1 = stop agent, 2 = remove agent |
| `offroad_behavior` | 2 | On offroad: 0 = ignore, 1 = stop agent, 2 = remove agent |
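Read together with the goal-behavior table above, these modes amount to a small per-agent state machine. The sketch below is a hypothetical Python rendering of it; the field and function names are illustrative, and the actual logic lives in the C simulator.

```python
# Hypothetical per-agent handling of the goal / collision / offroad modes above.
# Field and function names are illustrative; the simulator implements this in C.
import math

def handle_agent(agent, goal_radius=2.0, goal_behavior=3,
                 collision_behavior=2, offroad_behavior=2):
    dist_to_goal = math.hypot(agent["x"] - agent["goal_x"], agent["y"] - agent["goal_y"])
    if dist_to_goal <= goal_radius:              # goal reached within goal_radius meters
        if goal_behavior == 0: return "respawn_at_start"
        if goal_behavior == 1: return "generate_new_goal"   # ~30 m away by default
        if goal_behavior == 2: return "stop"
        if goal_behavior == 3: return "remove"
    if agent["collided"]:
        return {0: "ignore", 1: "stop", 2: "remove"}[collision_behavior]
    if agent["offroad"]:
        return {0: "ignore", 1: "stop", 2: "remove"}[offroad_behavior]
    return "keep_driving"

print(handle_agent({"x": 0.0, "y": 0.0, "goal_x": 1.0, "goal_y": 1.0,
                    "collided": False, "offroad": False}))  # -> "remove" (goal within 2 m)
```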
| Parameter | Default | Description |
|---|---|---|
| `total_timesteps` | 2,000,000,000 | Total training steps (2 billion) |
| `learning_rate` | 0.003 | Initial learning rate |
| `anneal_lr` | True | Linearly anneal the learning rate to 0 |
| `batch_size` | 524288 | Total samples per training epoch (= `num_agents` * `num_workers` * `bptt_horizon`) |
| `minibatch_size` | 32768 | Mini-batch size for gradient updates |
| `bptt_horizon` | 32 | Backpropagation-through-time horizon (sequence length per update) |
| `gamma` | 0.98 | Discount factor |
| `gae_lambda` | 0.95 | GAE smoothing parameter |
| `clip_coef` | 0.2 | PPO clipping coefficient |
| `vf_coef` | 2.0 | Value function loss weight |
| `vf_clip_coef` | 0.2 | Value function clipping coefficient |
| `ent_coef` | 0.005 | Entropy bonus coefficient (encourages exploration) |
| `max_grad_norm` | 1.0 | Gradient clipping norm |
| `update_epochs` | 1 | Number of PPO epochs per batch |
| `checkpoint_interval` | 1000 | Save model every N updates |
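The defaults above are internally consistent; the plain-Python check below (not repository code) spells out the relationships.

```python
# Plain-Python consistency check for the training defaults above (not repository code).
num_agents, num_workers, bptt_horizon = 1024, 16, 32
batch_size, minibatch_size = 524_288, 32_768
total_timesteps = 2_000_000_000

assert batch_size == num_agents * num_workers * bptt_horizon  # 1024 * 16 * 32 = 524288
print(batch_size // minibatch_size)    # 16 minibatches per batch
print(total_timesteps // batch_size)   # ~3814 batches collected over a full run
```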
Optimizer Parameters:
| Parameter | Default | Description |
|---|---|---|
| `optimizer` | `muon` | Optimizer: `adam` or `muon` |
| `adam_beta1` | 0.9 | Adam beta1 (momentum) |
| `adam_beta2` | 0.999 | Adam beta2 (RMSProp-like) |
| `adam_eps` | 1e-8 | Adam epsilon for numerical stability |
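If you set `optimizer` to `adam`, the three `adam_*` values map onto the standard PyTorch constructor as in the minimal sketch below; the `policy` network here is just a placeholder.

```python
# How the adam_* defaults above map onto torch.optim.Adam (the network is a placeholder).
import torch

policy = torch.nn.Linear(1120, 91)   # placeholder, not the PufferDrive policy
optimizer = torch.optim.Adam(
    policy.parameters(),
    lr=0.003,             # learning_rate
    betas=(0.9, 0.999),   # adam_beta1, adam_beta2
    eps=1e-8,             # adam_eps
)
```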
Advanced Parameters:
| Parameter | Default | Description |
|---|---|---|
| `prio_alpha` | 0.85 | Priority experience sampling alpha |
| `prio_beta0` | 0.85 | Priority experience sampling beta |
| `vtrace_rho_clip` | 1.0 | V-trace rho clipping (importance sampling) |
| `vtrace_c_clip` | 1.0 | V-trace c clipping |
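For context, the two V-trace clips bound per-step importance weights between the behavior and target policies (as in IMPALA). The snippet below is a generic illustration of that clipping, not the repository's implementation.

```python
# Generic illustration of V-trace importance-weight clipping (not repository code).
import torch

log_ratio = torch.tensor([-1.0, 0.0, 2.0])  # log pi_target(a|s) - log pi_behavior(a|s)
ratio = log_ratio.exp()
rho = torch.clamp(ratio, max=1.0)           # vtrace_rho_clip: weights the TD error
c = torch.clamp(ratio, max=1.0)             # vtrace_c_clip: weights the bootstrap trace
print(rho)                                  # tensor([0.3679, 1.0000, 1.0000])
```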
| Parameter | Default | Description |
|---|---|---|
| `num_workers` | 16 | Number of parallel worker processes |
| `num_envs` | 16 | Number of environments per worker |
| `batch_size` | 4 | Environments per batch in vectorized stepping |
| Parameter | Default | Description |
|---|---|---|
| `input_size` | 64 | First hidden layer size |
| `hidden_size` | 256 | Main hidden layer size |
The observation is a flat vector with three components:
- Ego features (7 for classic, 10 for jerk dynamics): position relative to goal, speed, heading, steering, acceleration
- Partner features (7 per agent, max 31 agents = 217): relative position, speed, heading, distance to each nearby agent
- Road features (7 per segment, max 128 segments = 896): relative position, orientation, type for the nearest road segments

Total observation size: 7 + 217 + 896 = 1120 (classic dynamics)
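The plain-Python check below (not repository code) reproduces that count and shows one way the flat vector could be split back into its three blocks, assuming they are laid out in the order listed above.

```python
# Plain-Python check of the observation layout above (not repository code).
import numpy as np

EGO, PARTNER, ROAD = 7, 7 * 31, 7 * 128           # 7 + 217 + 896 = 1120 (classic dynamics)
obs = np.zeros(EGO + PARTNER + ROAD, dtype=np.float32)
print(obs.shape)                                   # (1120,)

ego = obs[:EGO]                                    # 7 ego features
partners = obs[EGO:EGO + PARTNER].reshape(31, 7)   # 7 features for up to 31 partners
roads = obs[EGO + PARTNER:].reshape(128, 7)        # 7 features for up to 128 road segments
print(ego.shape, partners.shape, roads.shape)
```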
# Run a sweep over learning rate, entropy, and gamma
puffer sweep puffer_drive

Sweep ranges are defined in drive.ini under `[sweep.*]` sections.
# Standard evaluation
puffer eval puffer_drive --load-model-path model.pt
# WOSAC realism evaluation (distributional metrics)
puffer eval puffer_drive --eval.wosac-realism-eval True --load-model-path model.pt
# Human replay evaluation (SDC only, others follow logs)
puffer eval puffer_drive --eval.human-replay-eval True --load-model-path model.pt

Downloading and using data
You can download the WOMD data from Hugging Face in two versions:
- Mini dataset: GPUDrive_mini contains 1,000 training files and 300 test/validation files
- Medium dataset: 10,000 files from the training dataset
- Large dataset: GPUDrive contains 100,000 unique scenes
Note: Replace 'GPUDrive_mini' with 'GPUDrive' in your download commands if you want to use the full dataset.
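If you prefer to fetch the data programmatically, a `huggingface_hub` call along the lines of the sketch below should work; the `repo_id` shown is a placeholder to be replaced with the actual dataset path on Hugging Face.

```python
# Sketch of a Hugging Face download; the repo_id below is a placeholder, not the real path.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/GPUDrive_mini",   # replace with the actual repo; use GPUDrive for the full set
    repo_type="dataset",
    local_dir="data/GPUDrive_mini",
)
```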
For more training data compatible with PufferDrive, see ScenarioMax. The GPUDrive data format is fully compatible with PufferDrive.
Dependencies and usage
To launch an interactive renderer, first build:
bash scripts/build_ocean.sh drive local
then launch:
./drive

This will run `demo()` with an existing model checkpoint.
Run the Raylib visualizer on a headless server and export the result as an .mp4. This will roll out the pre-trained policy in the environment.
sudo apt update
sudo apt install ffmpeg xvfb

On HPC systems without root privileges, install into the conda environment instead:
conda install -c conda-forge xorg-x11-server-xvfb-cos6-x86_64
conda install -c conda-forge ffmpeg

- `ffmpeg`: video processing and conversion
- `xvfb`: virtual display for headless environments
- Build the application:
bash scripts/build_ocean.sh visualize local
- Run with virtual display:
xvfb-run -s "-screen 0 1280x720x24" ./visualize

The `-s` flag sets up a virtual screen at 1280x720 resolution with 24-bit color depth.
To force a rebuild, delete the cached executable with
rm ./visualize
We provide a PufferDrive implementation of the Waymo Open Sim Agents Challenge (WOSAC) for fast, easy evaluation of how well your trained agent matches distributional properties of human behavior. See documentation here.
WOSAC evaluation with a random policy:
puffer eval puffer_drive --eval.wosac-realism-eval True

WOSAC evaluation with your checkpoint (must be a .pt file):
puffer eval puffer_drive --eval.wosac-realism-eval True --load-model-path <your-trained-policy>.pt

You may be interested in how compatible your agent is with human partners. For this purpose, we support an eval where your policy controls only the self-driving car (SDC); the rest of the agents in the scene are stepped using the logs. While this is not a perfect eval, since the human partners are non-reactive replays, it still gives you a sense of how closely your agent's behavior matches how people drive. You can run it like this:
puffer eval puffer_drive --eval.human-replay-eval True --load-model-path <your-trained-policy>.pt

Documentation and browser demo
Docs
A browsable documentation site lives under docs/ and is built with mdBook. To preview it locally:
brew install mdbook
mdbook serve --open docs
Open the served URL to see a local version of the docs.
Interactive demo
To edit the browser demo, follow these steps:
- Download emscripten
- Install the latest SDK: `./emsdk install latest`
- Activate: `source emsdk/emsdk_env.sh`
- Run `bash scripts/build_ocean.sh drive web`
- This generates a number of `game*` files; move them to `assets/` to include them on the webpage
If you use PufferDrive in your research, please cite:
@software{pufferdrive2025github,
author = {Daphne Cornelisse* and Spencer Cheng* and Pragnay Mandavilli and Julian Hunt and Kevin Joseph and Waël Doulazmi and Aditya Gupta and Eugene Vinitsky},
title = {{PufferDrive}: A Fast and Friendly Driving Simulator for Training and Evaluating {RL} Agents},
url = {https://github.com/Emerge-Lab/PufferDrive},
version = {2.0.0},
year = {2025},
}