This project should stay GRPO-only:
Level-2 Data Factory -> Base Agent -> on-policy grouped rollouts -> hidden-test reward -> GRPO Agent
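For reference, the group-relative step GRPO applies to those rollouts is just within-group reward normalization. A minimal sketch, assuming 0/1 rewards from the hidden-test runner (function and variable names are illustrative):

```python
# Group-relative advantage for one task's rollout group (GRPO).
# Rewards are assumed 0/1 from the hidden-test runner; names are illustrative.
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: shape (G,), one entry per on-policy rollout of the same task."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 grouped rollouts, 3 pass the hidden tests.
advantages = group_advantages(torch.tensor([1., 0., 0., 1., 0., 0., 1., 0.]))
# Passing rollouts get positive advantage, failing rollouts negative.
```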
The data factory is part of the experiment, not a preprocessing detail. Each training run should be able to point to the generated JSONL splits and the manifest that records quality gates.
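A hypothetical manifest entry along those lines (field names are assumptions, not an existing schema):

```python
# Hypothetical manifest entry for one generated task: which JSONL split it
# landed in and which quality gates it passed. Field names are illustrative.
manifest_entry = {
    "task_id": "task-0042",
    "split_path": "data/train.jsonl",
    "gates": {
        "buggy_code_fails_hidden_tests": True,
        "oracle_code_passes_hidden_tests": True,
        "hidden_tests_not_in_prompt": True,
    },
}
```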
Default data should be project-level: multiple files, list_files/read_file exploration, one source edit, visible tests for feedback, and hidden tests for the verifier.
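A hypothetical project-level task record in that shape (paths, contents, and field names are illustrative only):

```python
# Hypothetical JSONL record for one project-level task: several files to
# explore, a single file to repair, visible tests the agent may run, and
# hidden tests reserved for the verifier. All names are illustrative.
task_record = {
    "task_id": "task-0042",
    "files": {
        "calculator/__init__.py": "",
        "calculator/core.py": "# buggy implementation to repair",
        "tests/test_visible.py": "# imports calculator.core and asserts behaviour",
    },
    "repair_file": "calculator/core.py",
    "visible_tests": ["tests/test_visible.py"],
    "hidden_tests": "# never exposed to the policy",
}
```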
Do not use the reward model score alone as evidence. Report external verifier metrics:
- held-out hidden-test pass@1 and pass@k (see the computation sketch after this list)
- average turns among successful trajectories
- invalid tool-call rate
- test-call efficiency before success
- reward ranking accuracy: same-task successful trajectories should rank above failures
- data quality gates: buggy code fails, oracle code passes, and hidden tests are not leaked
- project-level data gates: multiple files, repair file exists, visible tests run through imports
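A sketch of how the pass@k and reward ranking accuracy numbers could be computed, assuming each rollout record carries a task id, a hidden-test pass flag, and a reward score (field names are assumptions):

```python
# Illustrative metric helpers; field names (task_id, passed, reward) are assumptions.
from itertools import product
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k given n samples with c successes (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def reward_ranking_accuracy(rollouts: list[dict]) -> float:
    """Fraction of same-task (success, failure) pairs where the successful
    trajectory receives the higher reward score."""
    by_task: dict[str, list[dict]] = {}
    for r in rollouts:
        by_task.setdefault(r["task_id"], []).append(r)
    correct = total = 0
    for group in by_task.values():
        successes = [r for r in group if r["passed"]]
        failures = [r for r in group if not r["passed"]]
        for s, f in product(successes, failures):
            total += 1
            correct += s["reward"] > f["reward"]
    return correct / total if total else float("nan")
```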
Keep the current environment API stable and replace SoftmaxPolicy with:
- Qwen/Qwen2.5-Coder-1.5B-Instruct or a Qwen3 small coder model
- LoRA adapters
- TRL GRPOTrainer
- masking of tool observations when computing the policy loss
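A minimal TRL sketch under those constraints. The model id, LoRA hyperparameters, dataset row, and reward stub are assumptions; masking tool-observation tokens in multi-turn rollouts is not handled by this sketch and would need a custom loss mask or a trainer subclass.

```python
# Minimal GRPO fine-tuning sketch with TRL + LoRA. Model id, LoRA settings,
# and the reward stub are assumptions; multi-turn tool-observation masking is
# NOT handled here and would need a custom loss mask / trainer subclass.
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def hidden_test_reward(completions, task_id=None, **kwargs):
    # Placeholder: the real reward calls the hidden-test runner (sketched
    # below) once per rollout and returns 1.0 for a passing hidden suite.
    return [0.0 for _ in completions]

train_dataset = Dataset.from_list([
    {"prompt": "Fix the bug in calculator/core.py so the visible tests pass.",
     "task_id": "task-0042"},
])

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-Coder-1.5B-Instruct",
    reward_funcs=hidden_test_reward,
    args=GRPOConfig(output_dir="grpo-agent", num_generations=8,
                    per_device_train_batch_size=8, max_completion_length=1024),
    train_dataset=train_dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32,
                           target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]),
)
trainer.train()
```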
The hidden-test runner should remain the reward source.
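A minimal sketch of such a runner, assuming pytest, a per-task temp directory, and a 0/1 reward (framework, timeout, and layout are assumptions):

```python
# Sketch of a hidden-test reward: materialize the patched project plus the
# hidden tests in a scratch directory and run pytest; reward 1.0 iff the
# hidden suite passes. Test framework, timeout, and layout are assumptions.
import subprocess, sys, tempfile
from pathlib import Path

def run_hidden_tests(files: dict[str, str], hidden_tests: str,
                     timeout_s: int = 60) -> float:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for rel_path, content in files.items():  # patched project files
            path = root / rel_path
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(content)
        (root / "test_hidden.py").write_text(hidden_tests)
        try:
            proc = subprocess.run(
                [sys.executable, "-m", "pytest", "-q", "test_hidden.py"],
                cwd=root, capture_output=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0
        return 1.0 if proc.returncode == 0 else 0.0
```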
Use the SWE-bench Verified easy 50-task subset only as an external smoke test until the mini eval has stable 3-seed results. Phrase results as early transfer, not SOTA.