A tiny, inspectable research scaffold for strict-online GRPO on code-repair agents.
The project answers one question:
What is the smallest complete system that can sample multi-turn code-repair trajectories, score them with hidden tests, compute group-relative advantages, and update a policy without replaying stale rollouts?
Most LLM RL examples optimize a single final response. Code agents are different: they read files, edit code, run visible tests, observe failures, and decide when to finish. The reward should come from an external verifier, not from text that appears in the prompt.
This repo keeps that full agentic loop small enough to read:
task
-> current policy samples a trajectory
-> agent emits JSON tool calls
-> environment executes read/write/test/finish
-> hidden-test verifier scores the final code
-> rewards are normalized within the same-task group
-> policy updates once
-> old trajectories are logs, not replay data
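In code, one strict-online step has roughly the shape below. This is a minimal sketch of the interfaces, not the package API: `sample_trajectory`, `hidden_test_reward`, `env_factory`, and `policy.update` are hypothetical stand-ins for the rollout, environment, reward, and GRPO modules.

```python
# Minimal sketch of one strict-online GRPO step (hypothetical names, not the package API).
def strict_online_step(policy, task, env_factory, group_size=8):
    rewards, trajectories = [], []
    for _ in range(group_size):
        env = env_factory(task)                 # fresh workspace per rollout
        traj = sample_trajectory(policy, env)   # current policy emits JSON tool calls
        reward = hidden_test_reward(task, env)  # verifier runs hidden tests on the final code
        trajectories.append(traj)
        rewards.append(reward)

    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    advantages = [(r - mean) / std for r in rewards]  # group-relative, same task only

    policy.update(trajectories, advantages)  # exactly one update per batch of fresh rollouts
    return trajectories                      # kept as logs, never replayed for training
```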
The baseline is intentionally tiny. agent.py is the whole idea in one
dependency-free file. The agentic_code_grpo/ package keeps the same concepts
but separates data, environment, rollout, reward, evaluation, and training so the
toy loop can grow toward real model experiments.
Run the one-file reference first:
```bash
python agent.py
```
You should see a deterministic toy benchmark, a base pass@5, training epochs, and a final pass@5.
Install the package in editable mode and run the integration smoke test:
```bash
python -m pip install -e ".[dev]"
./scripts/smoke.sh
```
The smoke test generates benchmark tasks, evaluates the base policy, trains the fixed-strategy GRPO baseline, then runs the strict-online JSON tool-call loop. Outputs are written under ignored local directories:
benchmarks/mini_project_repair/
trajectories/
runs/
Run the focused unit tests:
```bash
python -m pytest
```
agent.py contains the full training loop in one file:
generate tasks -> sample strategy -> patch code -> run visible tests
-> run hidden tests -> score reward -> compute group advantage -> update logits
It deliberately avoids classes, framework dependencies, model servers, and distributed infrastructure. It is the best file to read when you want the mathematical shape of the loop without any system noise.
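That mathematical shape is small enough to write out. The following is a sketch, not the code in agent.py: the toy update is essentially REINFORCE on a softmax over repair strategies, with the group-normalized reward as the advantage.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def toy_grpo_update(logits, sampled_strategies, rewards, lr=0.1):
    """One group update: normalize rewards within the same-task group, then push
    probability toward strategies whose reward beat the group mean."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    advantages = [(r - mean) / std for r in rewards]
    for strategy, adv in zip(sampled_strategies, advantages):
        probs = softmax(logits)
        # d/d logit_k of log p(strategy) = 1[k == strategy] - probs[k]
        for k in range(len(logits)):
            grad = (1.0 if k == strategy else 0.0) - probs[k]
            logits[k] += lr * adv * grad
    return logits
```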
The package version is still small, but it has real boundaries:
| File | Role |
|---|---|
| data.py | RepairTask schema plus JSONL read/write helpers. |
| tasks.py | Deterministic 200-task mini code-repair benchmark. |
| project_tasks.py | Level-2 200-task multi-file project bug benchmark. |
| data_factory.py | Data generation, quality gates, split writing, and manifest output. |
| env.py | Tool environment for list_files, read_file, write_file, run_tests, finish. |
| policy.py | Softmax policy over repair strategies plus save/load helpers. |
| rollout.py | Fixed-strategy baseline rollout. |
| online_rollout.py | Multi-turn JSON tool-call rollout loop. |
| reward.py | Hidden-test reward, shaping terms, ranking, group advantages. |
| metrics.py | Shared pass@k and trajectory-quality metrics. |
| eval.py | Evaluation CLI for base or saved policies. |
| grpo.py | Minimal grouped policy optimization loop. |
| online_grpo.py | Strict-online GRPO loop with tool-call trajectories. |
The important replacement point is the rollout client. The toy
scripted_client(policy) can be swapped for a real model client while the
environment, verifier, reward, metrics, and strict-online bookkeeping stay the
same.
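The contract a replacement client has to satisfy is small: given the conversation so far, return the next assistant message containing one JSON tool call. The sketch below is illustrative of that seam, not the package's actual interface; it assumes a Hugging Face Transformers causal LM and tokenizer as the backing model.

```python
from typing import Protocol

class RolloutClient(Protocol):
    def next_action(self, messages: list[dict]) -> str:
        """Return the assistant's next message, expected to contain one JSON tool call."""

class TransformersClient:
    """Illustrative wrapper around a Transformers causal LM (assumed dependency)."""
    def __init__(self, model, tokenizer, max_new_tokens=256):
        self.model, self.tokenizer, self.max_new_tokens = model, tokenizer, max_new_tokens

    def next_action(self, messages: list[dict]) -> str:
        prompt = self.tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, tokenize=False
        )
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        out = self.model.generate(**inputs, max_new_tokens=self.max_new_tokens, do_sample=True)
        return self.tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```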
Each task is a tiny code-repair problem:
- task_id: stable id for logging and grouping
- issue: natural-language repair request
- files: starting workspace; Level 2 usually has multiple package files
- visible_tests: tests the agent may trigger during rollout
- hidden_tests: verifier-only tests; never included in the prompt
- bug_type: toy-policy context label
- level: function or project; the data factory defaults to project
- repair_file: file the oracle patch edits, used by the toy scripted policy
- project_kind: small project family, such as checkout_service or analytics_dashboard
bug_type exists only so a no-dependency softmax policy can learn something
measurable. In a real model experiment, the model should infer repairs from the
issue, files, and visible feedback rather than receiving a bug label as an
answer.
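For concreteness, a single task record might look like the following. Only the field names come from the schema above; every value is invented for illustration.

```python
# Hypothetical RepairTask record (all values invented; field names from the schema above).
task = {
    "task_id": "project-checkout_service-0042",
    "issue": "Order totals ignore the discount code; checkout charges full price.",
    "files": {
        "checkout/cart.py": "...buggy source...",
        "checkout/pricing.py": "...supporting module...",
    },
    "visible_tests": ["tests/test_cart_basic.py"],
    "hidden_tests": ["tests/test_cart_discounts.py"],  # verifier-only, never shown to the agent
    "bug_type": "off_by_one_discount",
    "level": "project",
    "repair_file": "checkout/pricing.py",
    "project_kind": "checkout_service",
}
```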
Do not treat the dataset as "some JSONL files." The default data factory now
builds Level-2 project tasks: each task has a small multi-file project, a natural
language issue, visible tests, hidden tests, and one source file that must be
edited. The agent has to list_files, read_file, write_file, run visible
tests, and then let the verifier score the final project.
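Each of those tool calls is a single JSON object naming one of the five environment tools. The parser below is a hedged sketch of the strict-schema step; the exact field names used by the real online_rollout.py may differ.

```python
import json

ALLOWED_TOOLS = {"list_files", "read_file", "write_file", "run_tests", "finish"}

def parse_tool_call(assistant_message: str) -> dict:
    """Parse one JSON tool call, e.g. {"tool": "read_file", "args": {"path": "checkout/pricing.py"}}.
    Anything that is not valid JSON, or not an allowed tool, counts as an invalid action."""
    try:
        call = json.loads(assistant_message)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}")
    if not isinstance(call, dict) or call.get("tool") not in ALLOWED_TOOLS:
        raise ValueError(f"unknown or missing tool: {call!r}")
    if not isinstance(call.get("args", {}), dict):
        raise ValueError("args must be a JSON object")
    return call
```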
Treat the data as a small factory where every training example can be checked:
small project template -> inject one clear bug -> visible tests for agent debugging
-> hidden tests for verifier reward -> quality gates -> manifest audit trail
Each task should prove four things: the buggy code fails, the oracle patch passes, hidden tests never enter the prompt, and train/dev/test splits are stable. That keeps the model learning to repair code for verifier reward instead of learning to memorize test answers.
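A hedged sketch of those gates is below. The helper names (run_hidden_tests, apply_patch, render_prompt) are hypothetical, not the data factory's real function names.

```python
def passes_quality_gates(task, oracle_patch) -> bool:
    """Accept a generated task only if the executable gates hold.
    run_hidden_tests, apply_patch, and render_prompt are hypothetical helpers."""
    buggy_fails = not run_hidden_tests(task["files"], task["hidden_tests"])   # 1. the bug is real
    patched = apply_patch(task["files"], oracle_patch)
    oracle_passes = run_hidden_tests(patched, task["hidden_tests"])           # 2. a known fix exists
    no_leak = all(t not in render_prompt(task)
                  for t in task["hidden_tests"])                              # 3. hidden tests never enter the prompt
    return buggy_fails and oracle_passes and no_leak

# The fourth property, stable train/dev/test splits, comes from deriving the split
# deterministically from task_id rather than from any per-task check.
```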
Generate the benchmark with the data factory:
```bash
python -m agentic_code_grpo.data_factory --level project --out benchmarks/mini_project_repair
```
This writes the split JSONL files plus manifest.json, which records the
generator method, bug-type distribution, difficulty distribution, and quality
gate results. The data factory checks that buggy solutions fail, oracle patches
pass, and hidden tests are not leaked into the policy context.
The lower-level deterministic task generator is still available:
```bash
python -m agentic_code_grpo.tasks --out benchmarks/mini_code_repair
```
To generate the older single-file function tasks through the same quality gates:
```bash
python -m agentic_code_grpo.data_factory --level function --out benchmarks/mini_code_repair
```
The split is deterministic:
120 train / 40 dev / 40 test
The Level-2 data factory is the baseline method. The paper-inspired variants for scaling it are documented under docs/data-method-variants/:
01_template_project_mutation extend the current synthetic project templates
02_repo_test_breaking mutate real repos until existing tests fail
03_commit_backtranslation start from real commits and generate issues
04_stateful_environment_tasks verify final state and collateral damage
05_hybrid_verifier_selection combine execution, static, and trace verifiers
Generated artifacts should go under data_runs/, not directly into git-tracked docs:
data_runs/<method>/raw/ source material and temporary generated assets
data_runs/<method>/candidates/ tasks before quality gates
data_runs/<method>/accepted/ train/dev/test JSONL after quality gates
data_runs/<method>/manifests/ manifest and validation reports
data_runs/<method>/trajectories/ agent rollouts
data_runs/<method>/logs/ generation, verification, and training logs
The repo keeps this directory structure, while large generated files are ignored
by git. Once a run is stable, export its accepted split into benchmarks/ for
smoke tests or training.
The reward is verifier-led:
| Component | Value | Purpose |
|---|---|---|
| Hidden tests pass | +1.00 | Main success signal. |
| Visible tests pass | +0.20 | Small shaping signal. |
| Finish action | +0.05 | Encourages explicit completion. |
| Per turn | -0.02 | Prefers concise successful trajectories. |
| Invalid action | -0.20 | Penalizes malformed or forbidden tool calls. |
Hidden tests never appear in model context. The agent can call visible tests, but the final reward is decided by the hidden-test verifier.
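Put together, the trajectory-level reward is roughly the following. The constants mirror the table above; the argument names are hypothetical, and the real reward.py may compose the terms differently.

```python
def trajectory_reward(hidden_pass: bool, visible_pass: bool, finished: bool,
                      num_turns: int, num_invalid: int) -> float:
    """Verifier-led reward with small shaping terms (constants from the table above)."""
    reward = 0.0
    reward += 1.00 if hidden_pass else 0.0   # main signal: hidden tests on the final code
    reward += 0.20 if visible_pass else 0.0  # shaping: agent-visible tests
    reward += 0.05 if finished else 0.0      # shaping: explicit finish action
    reward -= 0.02 * num_turns               # prefer short successful trajectories
    reward -= 0.20 * num_invalid             # malformed or forbidden tool calls
    return reward
```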
Evaluation reports:
pass@k
avg_success_turns
invalid_action_rate
test_call_efficiency
reward_rank_accuracy
num_tasks
num_trajectories
reward_rank_accuracy checks whether, for the same task, successful trajectories
rank above failing trajectories. It is a sanity check that the reward function is
not rewarding the wrong behavior.
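A hedged sketch of the two headline metrics, assuming trajectories are grouped by task_id and each trajectory carries a boolean success flag and a scalar reward; the real metrics.py may differ in details.

```python
def pass_at_k(groups: dict[str, list[tuple[bool, float]]]) -> float:
    """groups maps task_id -> k (success, reward) pairs; a task counts as passed
    if any of its k trajectories succeeds."""
    return sum(any(ok for ok, _ in trajs) for trajs in groups.values()) / max(len(groups), 1)

def reward_rank_accuracy(groups: dict[str, list[tuple[bool, float]]]) -> float:
    """Over same-task (success, failure) pairs, how often the successful trajectory
    gets the higher reward. 1.0 means the reward never ranks a failure above a success."""
    correct, total = 0, 0
    for trajs in groups.values():
        wins = [r for ok, r in trajs if ok]
        losses = [r for ok, r in trajs if not ok]
        for w in wins:
            for l in losses:
                total += 1
                correct += w > l
    return correct / total if total else 1.0
```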
Evaluate the base policy:
```bash
python -m agentic_code_grpo.eval \
    --tasks benchmarks/mini_project_repair/test.jsonl \
    --policy base \
    --k 1 \
    --out trajectories/base_eval.jsonl
```
Train the simple GRPO baseline:
```bash
python -m agentic_code_grpo.grpo \
    --train benchmarks/mini_project_repair/train.jsonl \
    --test benchmarks/mini_project_repair/test.jsonl \
    --out runs/grpo_demo
```
Train the strict-online tool-call loop:
```bash
python -m agentic_code_grpo.online_grpo \
    --train benchmarks/mini_project_repair/train.jsonl \
    --test benchmarks/mini_project_repair/test.jsonl \
    --out runs/strict_online_grpo
```
Strict-online means every update uses fresh rollouts sampled from the current policy. Saved trajectories are useful for inspection, but their old advantages are not reused for later training.
If you want to understand the system from zero, read in this order:
1. README.md
2. agent.py
3. examples/first_agentic_rl/
4. docs/agentic-rl-explained.md
5. docs/data-engineering.md
6. docs/project-data-synthesis.md
7. docs/data-method-variants/
8. data_runs/README.md
9. agentic_code_grpo/online_rollout.py
10. agentic_code_grpo/online_grpo.py
11. docs/strict-online-rl-roadmap.md
The examples/first_agentic_rl/ folder contains six standalone scripts:
1-make-data.py
2-sandbox-env.py
3-rollout.py
4-reward.py
5-strict-online-train-loop.py
6-stitch-rl-system.py
They show how the full package grows from small pieces.
The intended path is:
agent.py
-> package-level toy policy
-> Transformers model client
-> LoRA GRPO update
-> vLLM or SGLang rollout serving
-> distributed RL framework
When replacing the toy policy, keep these invariants:
- Hidden tests are reward-only and never enter prompts.
- Tool calls are parsed through a strict schema.
- Invalid actions are recorded and penalized.
- GRPO advantages are computed within same-task rollout groups.
- Old trajectories are logs, not a replay buffer for GRPO.
- Loss is applied to assistant action tokens, not tool observations (see the masking sketch after this list).
- Rollouts carry policy/version metadata once generation is distributed.
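The loss-masking invariant is the easiest one to get wrong when moving to a real model. A minimal sketch, assuming a token-level policy-gradient loss over a chat-formatted trajectory whose segments already carry (role, start, end) token spans; the segment format is an assumption, not the package's.

```python
import torch

def action_token_mask(segments: list[dict], seq_len: int) -> torch.Tensor:
    """Build a 0/1 mask over the token sequence: 1 for assistant action tokens,
    0 for system/user text and tool observations. `segments` is assumed to carry
    (role, start, end) token spans recorded while rendering the trajectory."""
    mask = torch.zeros(seq_len)
    for seg in segments:
        if seg["role"] == "assistant":
            mask[seg["start"]:seg["end"]] = 1.0
    return mask

# Policy-gradient loss restricted to assistant action tokens:
# loss = -(advantage * token_logprobs * mask).sum() / mask.sum().clamp(min=1)
```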
See docs/frameworks.md for when TRL, OpenRLHF, verl, slime, NeMo RL, RLite, or RLinf become useful.
This is not a production sandbox for arbitrary untrusted code. The educational environment is useful for explaining interfaces, verifier rewards, and training freshness. A real code-agent RL system still needs stronger process isolation, timeouts, resource limits, network controls, artifact storage, and monitoring.
This is also not a SOTA claim. The mini benchmark is deliberately synthetic. Use it to validate the loop, metrics, and invariants before moving to harder external benchmarks.
- Docs index
- Agentic RL explained from zero
- Project-level data synthesis
- Data method variants
- Data run artifact layout
- Framework guide
- Strict-online RL roadmap
- Experiment plan
MIT