This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Minimal research scaffold for agentic RL on code repair. The full loop runs dependency-free with a toy policy; real models plug in later.
```bash
# Install (editable)
python -m pip install -e .

# Generate benchmark tasks
python -m agentic_code_grpo.tasks --out benchmarks/mini_code_repair

# Evaluate
python -m agentic_code_grpo.eval \
  --tasks benchmarks/mini_code_repair/test.jsonl \
  --policy base --k 1 \
  --out trajectories/base_eval.jsonl

# Train (original GRPO)
python -m agentic_code_grpo.grpo \
  --train benchmarks/mini_code_repair/train.jsonl \
  --test benchmarks/mini_code_repair/test.jsonl \
  --out runs/grpo_demo

# Train (strict-online GRPO)
python -m agentic_code_grpo.online_grpo \
  --train benchmarks/mini_code_repair/train.jsonl \
  --test benchmarks/mini_code_repair/test.jsonl \
  --out runs/strict_online_grpo

# Full smoke test (generate → eval → train → online train)
./scripts/smoke.sh

# Unit tests
python -m pytest

# Run a single example
python examples/first_agentic_rl/1-make-data.py
```

`tests/` covers the core invariants. `smoke.sh` remains the integration test.
Optional dev dependency: `pytest>=8`.
The control loop:

```
task → rollout actor → tool action
             ↓              ↓
        trajectory  ←  environment
             ↓
   hidden-test verifier
             ↓
 reward + group advantage
             ↓
      policy update
```
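In code, one update over this loop is small. A minimal sketch, assuming hypothetical names (`softmax`, `group_advantages`, a stand-in verifier); the real pieces live in `rollout.py`, `reward.py`, `policy.py`, and `grpo.py`:

```python
# Sketch of one GRPO-style step: sample a group of rollouts for a task,
# normalize rewards within the group, nudge the softmax policy by advantage.
import math
import random
import statistics

STRATEGIES = ["operator_fix", "parity_flip", "sort_direction"]  # illustrative subset

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def group_advantages(rewards):
    # Group-relative advantage: reward normalized against its rollout group.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard zero-variance groups
    return [(r - mean) / std for r in rewards]

logits = [0.0, 0.0, 0.0]
picks = random.choices(range(len(STRATEGIES)), weights=softmax(logits), k=4)
rewards = [1.0 if STRATEGIES[i] == "operator_fix" else 0.2 for i in picks]  # stand-in verifier

for i, adv in zip(picks, group_advantages(rewards)):
    logits[i] += 0.1 * adv  # illustrative learning rate

print(softmax(logits))  # mass shifts toward strategies that beat the group mean
```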
Each module is one concern:
- `tasks.py` — Deterministic benchmark generator (120 train / 40 dev / 40 test tasks). Entry point.
- `data.py` — `RepairTask` dataclass + JSONL read/write (JSONL sketch below).
- `env.py` — Tool environment: `read_file`, `write_file`, `run_tests`, `finish`. Each action returns an observation.
- `rollout.py` — Fixed-strategy rollout baseline (operator fixes, parity flips, sort direction, strip, array indexing).
- `online_rollout.py` — JSON tool-call rollout loop for richer strategies.
- `policy.py` — Softmax policy over repair strategies, with update-from-advantage.
- `reward.py` — Hidden-test scoring: pass=1.0, visible=0.2, finish=0.05, penalties for extra turns and invalid actions (shaping sketch below). Also computes `reward_rank_accuracy`.
- `metrics.py` — Shared evaluation aggregation for `pass@k`, invalid action rate, success turns, test-call efficiency, and reward ranking.
- `grpo.py` — Minimal grouped policy optimization training loop.
- `online_grpo.py` — Strict-online variant: fresh rollouts per update, no replay.
- `eval.py` — Evaluation runner: `pass@k`, reward ranking, invalid action rate.
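For the task format, a sketch assuming hypothetical `RepairTask` fields; `data.py` defines the real schema:

```python
# Hypothetical RepairTask fields for illustration only; data.py holds the
# actual dataclass. Files are JSONL: one JSON object per line.
import dataclasses
import json

@dataclasses.dataclass
class RepairTask:
    task_id: str
    buggy_code: str
    visible_tests: list[str]
    hidden_tests: list[str]  # reward-only; never enters model context

def write_jsonl(path: str, tasks: list[RepairTask]) -> None:
    with open(path, "w") as f:
        for task in tasks:
            f.write(json.dumps(dataclasses.asdict(task)) + "\n")

def read_jsonl(path: str) -> list[RepairTask]:
    with open(path) as f:
        return [RepairTask(**json.loads(line)) for line in f if line.strip()]
```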
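The reward tiers translate directly into code. A sketch using the stated constants; the penalty weights here are illustrative assumptions, not the repo's values:

```python
# Tier constants (1.0 / 0.2 / 0.05) come from reward.py's description above;
# the 0.01 and 0.05 penalty weights are assumptions for illustration.
def score(hidden_pass: bool, visible_pass: bool, finished: bool,
          extra_turns: int, invalid_actions: int) -> float:
    if hidden_pass:
        base = 1.0    # all hidden tests pass
    elif visible_pass:
        base = 0.2    # only the visible tests pass
    elif finished:
        base = 0.05   # finished cleanly without passing anything
    else:
        base = 0.0
    return base - 0.01 * extra_turns - 0.05 * invalid_actions

print(score(True, True, True, extra_turns=2, invalid_actions=0))  # 0.98
```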
`examples/first_agentic_rl/` holds six numbered scripts that build up from data generation to the full system. Each is self-contained and independently runnable.
- Hidden tests are reward-only — never appear in model context.
- GRPO updates use fresh rollouts from current policy (no stale replay).
- Tool calls parsed through a strict JSON schema; invalid actions penalized (see the parsing sketch below).
- Tool observations condition future actions but must be masked from policy loss.
- The toy policy is the reference; real models replace it at the rollout boundary.
- Pure Python, no runtime dependencies for the baseline.
- Each module is a standalone `__main__` runnable with `argparse`.
- Tasks are JSONL: one `RepairTask` per line.
- Output directories: `benchmarks/`, `trajectories/`, `runs/`.
- Bilingual docs: English (`README.md`, `docs/*.md`) and Chinese (`README_CN.md`, `docs/*_CN.md`).
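The strict JSON invariant above is cheap to enforce. A sketch assuming a hypothetical `{"tool": ..., "args": ...}` wire format; `env.py` defines the real schema and action handling:

```python
# Hypothetical wire format; the real schema lives in env.py. Malformed JSON,
# unknown tools, and wrong argument sets all count as invalid (penalized) actions.
import json

ALLOWED = {
    "read_file": {"path"},
    "write_file": {"path", "content"},
    "run_tests": set(),
    "finish": set(),
}

def parse_action(raw: str):
    """Return (tool, args) for a valid call, or None for an invalid action."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    tool = call.get("tool")
    args = call.get("args", {})
    if tool not in ALLOWED or not isinstance(args, dict) or set(args) != ALLOWED[tool]:
        return None
    return tool, args

print(parse_action('{"tool": "run_tests", "args": {}}'))  # ('run_tests', {})
print(parse_action('not json at all'))                    # None -> penalized
```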