# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project

Minimal research scaffold for agentic RL on code repair. The full loop runs dependency-free with a toy policy; real models plug in later.

## Commands

```bash
# Install (editable)
python -m pip install -e .

# Generate benchmark tasks
python -m agentic_code_grpo.tasks --out benchmarks/mini_code_repair

# Evaluate
python -m agentic_code_grpo.eval \
  --tasks benchmarks/mini_code_repair/test.jsonl \
  --policy base --k 1 \
  --out trajectories/base_eval.jsonl

# Train (original GRPO)
python -m agentic_code_grpo.grpo \
  --train benchmarks/mini_code_repair/train.jsonl \
  --test benchmarks/mini_code_repair/test.jsonl \
  --out runs/grpo_demo

# Train (strict-online GRPO)
python -m agentic_code_grpo.online_grpo \
  --train benchmarks/mini_code_repair/train.jsonl \
  --test benchmarks/mini_code_repair/test.jsonl \
  --out runs/strict_online_grpo

# Full smoke test (generate → eval → train → online train)
./scripts/smoke.sh

# Unit tests
python -m pytest

# Run a single example
python examples/first_agentic_rl/1-make-data.py
```

`tests/` covers the core invariants; `scripts/smoke.sh` remains the integration test. Optional dev dependency: `pytest>=8`.
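A minimal sketch of the kind of invariant test that belongs in `tests/`; the helper `group_advantages` and the test name here are illustrative assumptions, not the suite's actual contents:

```python
# Illustrative only: a test in the spirit of tests/, not a copy of it.
# group_advantages is a hypothetical helper; the real suite's names may differ.

def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage: each reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def test_group_advantages_are_zero_centered():
    advs = group_advantages([1.0, 0.2, 0.05, 0.0])
    assert abs(sum(advs)) < 1e-9
```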

## Architecture

```
task → rollout actor → tool action
             ↓              ↓
        trajectory  ←  environment
             ↓
   hidden-test verifier
             ↓
 reward + group advantage
             ↓
       policy update
```
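Concretely, one training step looks roughly like the sketch below. All names and signatures (`grpo_step`, `rollout_fn`, `score_fn`, `policy.update`) are illustrative assumptions, not the repo's real API; the actual loops live in `grpo.py` and `online_grpo.py`.

```python
# Illustrative GRPO step; names and signatures are assumptions, not the repo's API.
def grpo_step(policy, task, rollout_fn, score_fn, group_size: int = 8):
    # Sample a group of fresh rollouts from the current policy (invariant 2).
    trajectories = [rollout_fn(policy, task) for _ in range(group_size)]
    # Score each trajectory against the hidden tests only (invariant 1).
    rewards = [score_fn(t) for t in trajectories]
    # Group-relative advantage: each reward minus the group mean.
    mean = sum(rewards) / len(rewards)
    advantages = [r - mean for r in rewards]
    # Update the policy from the advantages; tool observations condition
    # actions but are masked out of the loss (invariant 4).
    policy.update(trajectories, advantages)
```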

## Core loop (`agentic_code_grpo/`)

Each module handles one concern:

- `tasks.py` — Deterministic benchmark generator (120 train / 40 dev / 40 test tasks). Entry point.
- `data.py` — `RepairTask` dataclass + JSONL read/write.
- `env.py` — Tool environment: `read_file`, `write_file`, `run_tests`, `finish`. Each action returns an observation.
- `rollout.py` — Fixed-strategy rollout baseline (operator fixes, parity flips, sort direction, strip, array indexing).
- `online_rollout.py` — JSON tool-call rollout loop for richer strategies.
- `policy.py` — Softmax policy over repair strategies, with update-from-advantage.
- `reward.py` — Hidden-test scoring: pass=1.0, visible=0.2, finish=0.05, penalties for extra turns and invalid actions (see the sketch after this list). Also computes `reward_rank_accuracy`.
- `metrics.py` — Shared evaluation aggregation for pass@k, invalid-action rate, success turns, test-call efficiency, and reward ranking.
- `grpo.py` — Minimal grouped policy optimization training loop.
- `online_grpo.py` — Strict-online variant: fresh rollouts per update, no replay.
- `eval.py` — Evaluation runner: pass@k, reward ranking, invalid-action rate.
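A minimal sketch of that reward scheme. The tier values come from the list above, but the trajectory field names (`hidden_pass`, `visible_pass`, etc.) and the penalty weights are assumptions; `reward.py` defines the real structure and constants.

```python
# Sketch of reward.py's scoring; tier values come from the list above,
# field names and penalty weights are assumptions for illustration.
def score(traj) -> float:
    if traj.hidden_pass:        # all hidden tests pass
        reward = 1.0
    elif traj.visible_pass:     # only the visible tests pass
        reward = 0.2
    elif traj.finished:         # agent called finish without passing
        reward = 0.05
    else:
        reward = 0.0
    reward -= 0.01 * traj.extra_turns       # illustrative penalty weight
    reward -= 0.05 * traj.invalid_actions   # illustrative penalty weight
    return reward
```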

## Learning path (`examples/first_agentic_rl/`)

Six numbered scripts that build up from data generation to the full system. Each is self-contained and runnable on its own.

## Key Design Invariants

1. Hidden tests are reward-only — they never appear in model context.
2. GRPO updates use fresh rollouts from the current policy (no stale replay).
3. Tool calls are parsed against a strict JSON schema; invalid actions are penalized (see the sketch after this list).
4. Tool observations condition future actions but must be masked from the policy loss.
5. The toy policy is the reference; real models replace it at the rollout boundary.
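Invariant 3 in miniature: a sketch of strict tool-call parsing. The tool names come from `env.py`; the exact wire format and schema in `online_rollout.py` may differ.

```python
import json

# Tool names from env.py; the JSON shape here is an assumption.
VALID_TOOLS = {"read_file", "write_file", "run_tests", "finish"}

def parse_tool_call(raw: str):
    """Parse a model emission into (tool, args), or None if invalid.

    Anything that is not well-formed JSON with a known tool name and a
    dict of args counts as an invalid action, which the reward penalizes.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    tool, args = obj.get("tool"), obj.get("args", {})
    if tool not in VALID_TOOLS or not isinstance(args, dict):
        return None
    return tool, args
```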

## Code Conventions

- Pure Python; no runtime dependencies for the baseline.
- Each module is a standalone `__main__` runnable with argparse.
- Tasks are JSONL: one `RepairTask` per line (see the sketch below).
- Output directories: `benchmarks/`, `trajectories/`, `runs/`.
- Bilingual docs: English (`README.md`, `docs/*.md`) and Chinese (`README_CN.md`, `docs/*_CN.md`).
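A sketch of the JSONL convention, using a simplified stand-in for `RepairTask`; the real dataclass in `data.py` defines its own fields.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class RepairTask:
    # Simplified fields for illustration; data.py defines the real ones.
    task_id: str
    broken_code: str
    visible_tests: list[str]
    hidden_tests: list[str]   # reward-only, never shown to the model

def write_tasks(path: str, tasks: list[RepairTask]) -> None:
    """Write one JSON object per line, one RepairTask per line."""
    with open(path, "w") as f:
        for task in tasks:
            f.write(json.dumps(asdict(task)) + "\n")

def read_tasks(path: str) -> list[RepairTask]:
    """Read tasks back from a JSONL file."""
    with open(path) as f:
        return [RepairTask(**json.loads(line)) for line in f]
```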