This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Minimal research scaffold for agentic RL on code repair. The full loop runs dependency-free with a toy policy; real models plug in later.
```bash
# Install (editable)
python -m pip install -e .

# Generate benchmark tasks
python -m agentic_code_grpo.tasks --out benchmarks/mini_code_repair

# Evaluate
python -m agentic_code_grpo.eval \
  --tasks benchmarks/mini_code_repair/test.jsonl \
  --policy base --k 1 \
  --out trajectories/base_eval.jsonl

# Train (original GRPO)
python -m agentic_code_grpo.grpo \
  --train benchmarks/mini_code_repair/train.jsonl \
  --test benchmarks/mini_code_repair/test.jsonl \
  --out runs/grpo_demo

# Train (strict-online GRPO)
python -m agentic_code_grpo.online_grpo \
  --train benchmarks/mini_code_repair/train.jsonl \
  --test benchmarks/mini_code_repair/test.jsonl \
  --out runs/strict_online_grpo

# Full smoke test (generate → eval → train → online train)
./scripts/smoke.sh

# Unit tests
python -m pytest

# Run a single example
python examples/first_agentic_rl/1-make-data.py
```

`tests/` covers the core invariants. `smoke.sh` remains the integration test.
Optional dev dependency: `pytest>=8`.
The control loop:

```
task → rollout actor → tool action
             ↓              ↓
        trajectory  ←  environment
             ↓
   hidden-test verifier
             ↓
 reward + group advantage
             ↓
      policy update
```
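In code, one update over this loop is small. A minimal sketch, assuming hypothetical names (`softmax`, `group_advantages`, a stand-in verifier); the real pieces live in `rollout.py`, `reward.py`, `policy.py`, and `grpo.py`:

```python
# Sketch of one GRPO-style step: sample a group of rollouts for a task,
# normalize rewards within the group, nudge the softmax policy by advantage.
import math
import random
import statistics

STRATEGIES = ["operator_fix", "parity_flip", "sort_direction"]  # illustrative subset

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def group_advantages(rewards):
    # Group-relative advantage: reward normalized against its rollout group.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard zero-variance groups
    return [(r - mean) / std for r in rewards]

logits = [0.0, 0.0, 0.0]
picks = random.choices(range(len(STRATEGIES)), weights=softmax(logits), k=4)
rewards = [1.0 if STRATEGIES[i] == "operator_fix" else 0.2 for i in picks]  # stand-in verifier

for i, adv in zip(picks, group_advantages(rewards)):
    logits[i] += 0.1 * adv  # illustrative learning rate

print(softmax(logits))  # mass shifts toward strategies that beat the group mean
```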
Each module is one concern:
- `tasks.py` — Deterministic benchmark generator (120 train / 40 dev / 40 test tasks). Entry point.
- `data.py` — `RepairTask` dataclass + JSONL read/write (JSONL sketch below).
- `env.py` — Tool environment: `read_file`, `write_file`, `run_tests`, `finish`. Each action returns an observation.
- `rollout.py` — Fixed-strategy rollout baseline (operator fixes, parity flips, sort direction, strip, array indexing).
- `online_rollout.py` — JSON tool-call rollout loop for richer strategies.
- `policy.py` — Softmax policy over repair strategies, with update-from-advantage.
- `reward.py` — Hidden-test scoring: pass=1.0, visible=0.2, finish=0.05, penalties for extra turns and invalid actions (shaping sketch below). Also computes `reward_rank_accuracy`.
- `metrics.py` — Shared evaluation aggregation for `pass@k`, invalid action rate, success turns, test-call efficiency, and reward ranking.
- `grpo.py` — Minimal grouped policy optimization training loop.
- `online_grpo.py` — Strict-online variant: fresh rollouts per update, no replay.
- `eval.py` — Evaluation runner: `pass@k`, reward ranking, invalid action rate.
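For the task format, a sketch assuming hypothetical `RepairTask` fields; `data.py` defines the real schema:

```python
# Hypothetical RepairTask fields for illustration only; data.py holds the
# actual dataclass. Files are JSONL: one JSON object per line.
import dataclasses
import json

@dataclasses.dataclass
class RepairTask:
    task_id: str
    buggy_code: str
    visible_tests: list[str]
    hidden_tests: list[str]  # reward-only; never enters model context

def write_jsonl(path: str, tasks: list[RepairTask]) -> None:
    with open(path, "w") as f:
        for task in tasks:
            f.write(json.dumps(dataclasses.asdict(task)) + "\n")

def read_jsonl(path: str) -> list[RepairTask]:
    with open(path) as f:
        return [RepairTask(**json.loads(line)) for line in f if line.strip()]
```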
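The reward tiers translate directly into code. A sketch using the stated constants; the penalty weights here are illustrative assumptions, not the repo's values:

```python
# Tier constants (1.0 / 0.2 / 0.05) come from reward.py's description above;
# the 0.01 and 0.05 penalty weights are assumptions for illustration.
def score(hidden_pass: bool, visible_pass: bool, finished: bool,
          extra_turns: int, invalid_actions: int) -> float:
    if hidden_pass:
        base = 1.0    # all hidden tests pass
    elif visible_pass:
        base = 0.2    # only the visible tests pass
    elif finished:
        base = 0.05   # finished cleanly without passing anything
    else:
        base = 0.0
    return base - 0.01 * extra_turns - 0.05 * invalid_actions

print(score(True, True, True, extra_turns=2, invalid_actions=0))  # 0.98
```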
`examples/first_agentic_rl/` holds six numbered scripts that build up from data generation to the full system. Each is self-contained and independently runnable.
- Hidden tests are reward-only — never appear in model context.
- GRPO updates use fresh rollouts from current policy (no stale replay).
- Tool calls parsed through a strict JSON schema; invalid actions penalized (see the parsing sketch below).
- Tool observations condition future actions but must be masked from policy loss.
- The toy policy is the reference; real models replace it at the rollout boundary.
- Pure Python, no runtime dependencies for the baseline.
- Each module is a standalone `__main__` runnable with `argparse`.
- Tasks are JSONL: one `RepairTask` per line.
- Output directories: `benchmarks/`, `trajectories/`, `runs/`.
- Bilingual docs: English (`README.md`, `docs/*.md`) and Chinese (`README_CN.md`, `docs/*_CN.md`).
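The strict JSON invariant above is cheap to enforce. A sketch assuming a hypothetical `{"tool": ..., "args": ...}` wire format; `env.py` defines the real schema and action handling:

```python
# Hypothetical wire format; the real schema lives in env.py. Malformed JSON,
# unknown tools, and wrong argument sets all count as invalid (penalized) actions.
import json

ALLOWED = {
    "read_file": {"path"},
    "write_file": {"path", "content"},
    "run_tests": set(),
    "finish": set(),
}

def parse_action(raw: str):
    """Return (tool, args) for a valid call, or None for an invalid action."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    tool = call.get("tool")
    args = call.get("args", {})
    if tool not in ALLOWED or not isinstance(args, dict) or set(args) != ALLOWED[tool]:
        return None
    return tool, args

print(parse_action('{"tool": "run_tests", "args": {}}'))  # ('run_tests', {})
print(parse_action('not json at all'))                    # None -> penalized
```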