Mini Agentic Code GRPO

A tiny, inspectable research scaffold for strict-online GRPO on code-repair agents.

The project answers one question:

What is the smallest complete system that can sample multi-turn code-repair trajectories, score them with hidden tests, compute group-relative advantages, and update a policy without replaying stale rollouts?


Why This Exists

Most LLM RL examples optimize a single final response. Code agents are different: they read files, edit code, run visible tests, observe failures, and decide when to finish. The reward should come from an external verifier, not from text that appears in the prompt.

This repo keeps that full agentic loop small enough to read:

task
  -> current policy samples a trajectory
  -> agent emits JSON tool calls
  -> environment executes read/write/test/finish
  -> hidden-test verifier scores the final code
  -> rewards are normalized within the same-task group
  -> policy updates once
  -> old trajectories are logs, not replay data
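The group-normalization step above is the heart of GRPO. A minimal sketch of it (an assumed implementation; the repo's reward.py is authoritative) looks like this:

```python
# Sketch of group-relative advantage normalization: rewards from
# rollouts of the SAME task are normalized against each other, so
# each advantage is relative to the group's own baseline.
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one same-task rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts of one task; only the second passed the hidden tests,
# so it gets a positive advantage and the rest go slightly negative.
advs = group_advantages([0.0, 1.0, 0.0, 0.0])
```

Because the baseline is the group mean, no separate value network is needed, and a group where every rollout fails (or every rollout succeeds) produces zero advantage everywhere.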

The baseline is intentionally tiny. agent.py is the whole idea in one dependency-free file. The agentic_code_grpo/ package keeps the same concepts but separates data, environment, rollout, reward, evaluation, and training so the toy loop can grow toward real model experiments.

Quickstart

Run the one-file reference first:

python agent.py

You should see a deterministic toy benchmark, a base pass@5, training epochs, and a final pass@5.

Install the package in editable mode and run the integration smoke test:

python -m pip install -e ".[dev]"
./scripts/smoke.sh

The smoke test generates benchmark tasks, evaluates the base policy, trains the fixed-strategy GRPO baseline, then runs the strict-online JSON tool-call loop. Outputs are written under ignored local directories:

benchmarks/mini_project_repair/
trajectories/
runs/

Run the focused unit tests:

python -m pytest

Two Levels Of The Project

1. agent.py: the minimal reference

agent.py contains the full training loop in one file:

generate tasks -> sample strategy -> patch code -> run visible tests
-> run hidden tests -> score reward -> compute group advantage -> update logits

It deliberately avoids classes, framework dependencies, model servers, and distributed infrastructure. It is the best file to read when you want the mathematical shape of the loop without any system noise.

2. agentic_code_grpo/: the experiment scaffold

The package version is still small, but it has real boundaries:

File               Role
data.py            RepairTask schema plus JSONL read/write helpers.
tasks.py           Deterministic 200-task mini code-repair benchmark.
project_tasks.py   Level-2 200-task multi-file project bug benchmark.
data_factory.py    Data generation, quality gates, split writing, and manifest output.
env.py             Tool environment for list_files, read_file, write_file, run_tests, finish.
policy.py          Softmax policy over repair strategies plus save/load helpers.
rollout.py         Fixed-strategy baseline rollout.
online_rollout.py  Multi-turn JSON tool-call rollout loop.
reward.py          Hidden-test reward, shaping terms, ranking, group advantages.
metrics.py         Shared pass@k and trajectory-quality metrics.
eval.py            Evaluation CLI for base or saved policies.
grpo.py            Minimal grouped policy optimization loop.
online_grpo.py     Strict-online GRPO loop with tool-call trajectories.

The important replacement point is the rollout client. The toy scripted_client(policy) can be swapped for a real model client while the environment, verifier, reward, metrics, and strict-online bookkeeping stay the same.
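The client boundary can be pictured as anything that maps the conversation so far to the next JSON tool call. A hypothetical sketch of that shape (illustrative names; check online_rollout.py for the real interface):

```python
# A toy stand-in for the rollout client. A scripted policy and a
# served model are interchangeable as long as both expose the same
# "messages in, JSON tool call out" signature (names are assumptions,
# not the repo's actual API).
import json

class ScriptedClient:
    """Always runs the visible tests once, then finishes."""
    def __init__(self):
        self.turn = 0

    def next_action(self, messages):
        self.turn += 1
        if self.turn == 1:
            return json.dumps({"tool": "run_tests", "args": {}})
        return json.dumps({"tool": "finish", "args": {}})
```

Swapping in a real model means replacing next_action with a call to an inference server while the environment, verifier, and bookkeeping stay untouched.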

Core Data Model

Each task is a tiny code-repair problem:

task_id        stable id for logging and grouping
issue          natural-language repair request
files          starting workspace; Level 2 usually has multiple package files
visible_tests  tests the agent may trigger during rollout
hidden_tests   verifier-only tests; never included in the prompt
bug_type       toy-policy context label
level          function or project; the data factory defaults to project
repair_file    file the oracle patch edits, used by the toy scripted policy
project_kind   small project family, such as checkout_service or analytics_dashboard

bug_type exists only so a no-dependency softmax policy can learn something measurable. In a real model experiment, the model should infer repairs from the issue, files, and visible feedback rather than receiving a bug label as an answer.
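The schema above can be written out as a dataclass. This is a sketch from the field list in this README; data.py defines the authoritative RepairTask:

```python
# RepairTask as a dataclass (field names from the README's schema
# table; defaults are illustrative assumptions).
from dataclasses import dataclass

@dataclass
class RepairTask:
    task_id: str            # stable id for logging and grouping
    issue: str              # natural-language repair request
    files: dict             # path -> source for the starting workspace
    visible_tests: list     # tests the agent may trigger during rollout
    hidden_tests: list      # verifier-only; never shown to the policy
    bug_type: str           # toy-policy context label
    level: str = "project"        # "function" or "project"
    repair_file: str = ""         # file the oracle patch edits
    project_kind: str = ""        # e.g. "checkout_service"
```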

Data Engineering Method

Do not treat the dataset as "some JSONL files." The default data factory now builds Level-2 project tasks: each task has a small multi-file project, a natural-language issue, visible tests, hidden tests, and one source file that must be edited. The agent has to list_files, read_file, write_file, run visible tests, and then let the verifier score the final project.

Treat the data as a small factory where every training example can be checked:

small project template -> inject one clear bug -> visible tests for agent debugging
-> hidden tests for verifier reward -> quality gates -> manifest audit trail

Each task should prove four things: the buggy code fails, the oracle patch passes, hidden tests never enter the prompt, and train/dev/test splits are stable. That keeps the model learning to repair code for verifier reward instead of learning to memorize test answers.

Generate the benchmark with the data factory:

python -m agentic_code_grpo.data_factory --level project --out benchmarks/mini_project_repair

This writes the split JSONL files plus manifest.json, which records the generator method, bug-type distribution, difficulty distribution, and quality gate results. The data factory checks that buggy solutions fail, oracle patches pass, and hidden tests are not leaked into the policy context.

The lower-level deterministic task generator is still available:

python -m agentic_code_grpo.tasks --out benchmarks/mini_code_repair

To generate the older single-file function tasks through the same quality gates:

python -m agentic_code_grpo.data_factory --level function --out benchmarks/mini_code_repair

The split is deterministic:

120 train / 40 dev / 40 test

Data Method Variants And Artifacts

The Level-2 data factory is the baseline method. The paper-inspired variants for scaling it are documented under docs/data-method-variants/:

01_template_project_mutation   extend the current synthetic project templates
02_repo_test_breaking          mutate real repos until existing tests fail
03_commit_backtranslation      start from real commits and generate issues
04_stateful_environment_tasks  verify final state and collateral damage
05_hybrid_verifier_selection   combine execution, static, and trace verifiers

Generated artifacts should go under data_runs/, not directly into git-tracked docs:

data_runs/<method>/raw/           source material and temporary generated assets
data_runs/<method>/candidates/    tasks before quality gates
data_runs/<method>/accepted/      train/dev/test JSONL after quality gates
data_runs/<method>/manifests/     manifest and validation reports
data_runs/<method>/trajectories/  agent rollouts
data_runs/<method>/logs/          generation, verification, and training logs

The repo keeps this directory structure, while large generated files are ignored by git. Once a run is stable, export its accepted split into benchmarks/ for smoke tests or training.

Reward And Metrics

The reward is verifier-led:

Component           Value   Purpose
Hidden tests pass   +1.00   Main success signal.
Visible tests pass  +0.20   Small shaping signal.
Finish action       +0.05   Encourages explicit completion.
Per turn            -0.02   Prefers concise successful trajectories.
Invalid action      -0.20   Penalizes malformed or forbidden tool calls.

Hidden tests never appear in model context. The agent can call visible tests, but the final reward is decided by the hidden-test verifier.

Evaluation reports:

pass@k
avg_success_turns
invalid_action_rate
test_call_efficiency
reward_rank_accuracy
num_tasks
num_trajectories

reward_rank_accuracy checks whether, for the same task, successful trajectories rank above failing trajectories. It is a sanity check that the reward function is not rewarding the wrong behavior.
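One plausible reading of that metric (a hypothetical implementation; metrics.py is authoritative): over all same-task (success, failure) trajectory pairs, the fraction where the successful trajectory received the higher reward.

```python
# Pairwise ranking check: within each task, successful trajectories
# should outscore failing ones. Returns a fraction in [0, 1].
def reward_rank_accuracy(trajectories):
    """trajectories: list of dicts with task_id, success, reward."""
    by_task = {}
    for t in trajectories:
        by_task.setdefault(t["task_id"], []).append(t)
    correct, total = 0, 0
    for group in by_task.values():
        wins = [t for t in group if t["success"]]
        losses = [t for t in group if not t["success"]]
        for w in wins:
            for l in losses:
                total += 1
                correct += w["reward"] > l["reward"]
    return correct / total if total else 1.0
```

A value below 1.0 means shaping terms (turn penalties, invalid-action penalties) are sometimes large enough to rank a failure above a success, which is exactly the misalignment the metric is meant to catch.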

Training Commands

Evaluate the base policy:

python -m agentic_code_grpo.eval \
  --tasks benchmarks/mini_project_repair/test.jsonl \
  --policy base \
  --k 1 \
  --out trajectories/base_eval.jsonl

Train the simple GRPO baseline:

python -m agentic_code_grpo.grpo \
  --train benchmarks/mini_project_repair/train.jsonl \
  --test benchmarks/mini_project_repair/test.jsonl \
  --out runs/grpo_demo

Train the strict-online tool-call loop:

python -m agentic_code_grpo.online_grpo \
  --train benchmarks/mini_project_repair/train.jsonl \
  --test benchmarks/mini_project_repair/test.jsonl \
  --out runs/strict_online_grpo

Strict-online means every update uses fresh rollouts sampled from the current policy. Saved trajectories are useful for inspection, but their old advantages are not reused for later training.
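As loop structure, the strict-online discipline looks like this (a sketch with assumed callback names; online_grpo.py holds the real loop):

```python
# Strict-online training skeleton: rollouts are sampled fresh from
# the CURRENT policy inside every iteration, used for exactly one
# update, and then only logged. Nothing is replayed.
def strict_online_train(policy, tasks, steps, sample_group, update, log):
    for step in range(steps):
        for task in tasks:
            group = sample_group(policy, task)  # fresh same-task rollouts
            update(policy, group)               # one update from this group
            log(step, task, group)              # trajectories become logs only
```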

Learning Path

If you want to understand the system from zero, read in this order:

1. README.md
2. agent.py
3. examples/first_agentic_rl/
4. docs/agentic-rl-explained.md
5. docs/data-engineering.md
6. docs/project-data-synthesis.md
7. docs/data-method-variants/
8. data_runs/README.md
9. agentic_code_grpo/online_rollout.py
10. agentic_code_grpo/online_grpo.py
11. docs/strict-online-rl-roadmap.md

The examples/first_agentic_rl/ folder contains six standalone scripts:

1-make-data.py
2-sandbox-env.py
3-rollout.py
4-reward.py
5-strict-online-train-loop.py
6-stitch-rl-system.py

They show how the full package grows from small pieces.

Scaling Toward Real Models

The intended path is:

agent.py
  -> package-level toy policy
  -> Transformers model client
  -> LoRA GRPO update
  -> vLLM or SGLang rollout serving
  -> distributed RL framework

When replacing the toy policy, keep these invariants:

  1. Hidden tests are reward-only and never enter prompts.
  2. Tool calls are parsed through a strict schema.
  3. Invalid actions are recorded and penalized.
  4. GRPO advantages are computed within same-task rollout groups.
  5. Old trajectories are logs, not a replay buffer for GRPO.
  6. Loss is applied to assistant action tokens, not tool observations.
  7. Rollouts carry policy/version metadata once generation is distributed.
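
Invariants 2 and 3 amount to a strict parser that either returns a validated tool call or flags an invalid action for penalization. A sketch, using the tool names from env.py (the parsing rules themselves are illustrative):

```python
# Strict JSON tool-call parsing: anything malformed or outside the
# allowed tool set is rejected with a reason, never silently repaired.
import json

ALLOWED_TOOLS = {"list_files", "read_file", "write_file", "run_tests", "finish"}

def parse_tool_call(raw):
    """Return (call, None) on success or (None, error_string) on failure."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"malformed JSON: {e}"
    if not isinstance(call, dict):
        return None, "tool call must be a JSON object"
    if call.get("tool") not in ALLOWED_TOOLS:
        return None, f"forbidden tool: {call.get('tool')!r}"
    if not isinstance(call.get("args", {}), dict):
        return None, "args must be an object"
    return call, None
```

Recording the error string alongside the -0.20 penalty (invariant 3) keeps invalid actions inspectable in trajectory logs instead of disappearing into a scalar.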

See docs/frameworks.md for when TRL, OpenRLHF, verl, slime, NeMo RL, RLite, or RLinf become useful.

What This Repo Is Not

This is not a production sandbox for arbitrary untrusted code. The educational environment is useful for explaining interfaces, verifier rewards, and training freshness. A real code-agent RL system still needs stronger process isolation, timeouts, resource limits, network controls, artifact storage, and monitoring.

This is also not a SOTA claim. The mini benchmark is deliberately synthetic. Use it to validate the loop, metrics, and invariants before moving to harder external benchmarks.


License

MIT
