This project should stay GRPO-only:
Level-2 Data Factory -> Base Agent -> on-policy grouped rollouts -> hidden-test reward -> GRPO Agent
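For reference, the group-relative step GRPO applies to those rollouts is just within-group reward normalization. A minimal sketch, assuming 0/1 rewards from the hidden-test runner (function and variable names are illustrative):

```python
# Group-relative advantage for one task's rollout group (GRPO).
# Rewards are assumed 0/1 from the hidden-test runner; names are illustrative.
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: shape (G,), one entry per on-policy rollout of the same task."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 grouped rollouts, 3 pass the hidden tests.
advantages = group_advantages(torch.tensor([1., 0., 0., 1., 0., 0., 1., 0.]))
# Passing rollouts get positive advantage, failing rollouts negative.
```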
The data factory is part of the experiment, not a preprocessing detail. Each training run should be able to point to the generated JSONL splits and the manifest that records quality gates.
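A hypothetical manifest entry along those lines (field names are assumptions, not an existing schema):

```python
# Hypothetical manifest entry for one generated task: which JSONL split it
# landed in and which quality gates it passed. Field names are illustrative.
manifest_entry = {
    "task_id": "task-0042",
    "split_path": "data/train.jsonl",
    "gates": {
        "buggy_code_fails_hidden_tests": True,
        "oracle_code_passes_hidden_tests": True,
        "hidden_tests_not_in_prompt": True,
    },
}
```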
Default data should be project-level: multiple files, list_files/read_file exploration, one source edit, visible tests for feedback, and hidden tests for the verifier.
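A hypothetical project-level task record in that shape (paths, contents, and field names are illustrative only):

```python
# Hypothetical JSONL record for one project-level task: several files to
# explore, a single file to repair, visible tests the agent may run, and
# hidden tests reserved for the verifier. All names are illustrative.
task_record = {
    "task_id": "task-0042",
    "files": {
        "calculator/__init__.py": "",
        "calculator/core.py": "# buggy implementation to repair",
        "tests/test_visible.py": "# imports calculator.core and asserts behaviour",
    },
    "repair_file": "calculator/core.py",
    "visible_tests": ["tests/test_visible.py"],
    "hidden_tests": "# never exposed to the policy",
}
```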
Do not use the reward model score alone as evidence. Report external verifier metrics:
- held-out hidden-test pass@1 and pass@k (see the computation sketch after this list)
- average turns among successful trajectories
- invalid tool-call rate
- test-call efficiency before success
- reward ranking accuracy: same-task successful trajectories should rank above failures
- data quality gates: buggy code fails, oracle code passes, and hidden tests are not leaked
- project-level data gates: multiple files, repair file exists, visible tests run through imports
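A sketch of how the pass@k and reward ranking accuracy numbers could be computed, assuming each rollout record carries a task id, a hidden-test pass flag, and a reward score (field names are assumptions):

```python
# Illustrative metric helpers; field names (task_id, passed, reward) are assumptions.
from itertools import product
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k given n samples with c successes (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def reward_ranking_accuracy(rollouts: list[dict]) -> float:
    """Fraction of same-task (success, failure) pairs where the successful
    trajectory receives the higher reward score."""
    by_task: dict[str, list[dict]] = {}
    for r in rollouts:
        by_task.setdefault(r["task_id"], []).append(r)
    correct = total = 0
    for group in by_task.values():
        successes = [r for r in group if r["passed"]]
        failures = [r for r in group if not r["passed"]]
        for s, f in product(successes, failures):
            total += 1
            correct += s["reward"] > f["reward"]
    return correct / total if total else float("nan")
```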
Keep the current environment API stable and replace SoftmaxPolicy with:
- Qwen/Qwen2.5-Coder-1.5B-Instruct or a Qwen3 small coder model
- LoRA adapters
- TRL GRPOTrainer
- masking of tool observations when computing the policy loss
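A minimal TRL sketch under those constraints. The model id, LoRA hyperparameters, dataset row, and reward stub are assumptions; masking tool-observation tokens in multi-turn rollouts is not handled by this sketch and would need a custom loss mask or a trainer subclass.

```python
# Minimal GRPO fine-tuning sketch with TRL + LoRA. Model id, LoRA settings,
# and the reward stub are assumptions; multi-turn tool-observation masking is
# NOT handled here and would need a custom loss mask / trainer subclass.
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def hidden_test_reward(completions, task_id=None, **kwargs):
    # Placeholder: the real reward calls the hidden-test runner (sketched
    # below) once per rollout and returns 1.0 for a passing hidden suite.
    return [0.0 for _ in completions]

train_dataset = Dataset.from_list([
    {"prompt": "Fix the bug in calculator/core.py so the visible tests pass.",
     "task_id": "task-0042"},
])

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-Coder-1.5B-Instruct",
    reward_funcs=hidden_test_reward,
    args=GRPOConfig(output_dir="grpo-agent", num_generations=8,
                    per_device_train_batch_size=8, max_completion_length=1024),
    train_dataset=train_dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32,
                           target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]),
)
trainer.train()
```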
The hidden-test runner should remain the reward source.
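A minimal sketch of such a runner, assuming pytest, a per-task temp directory, and a 0/1 reward (framework, timeout, and layout are assumptions):

```python
# Sketch of a hidden-test reward: materialize the patched project plus the
# hidden tests in a scratch directory and run pytest; reward 1.0 iff the
# hidden suite passes. Test framework, timeout, and layout are assumptions.
import subprocess, sys, tempfile
from pathlib import Path

def run_hidden_tests(files: dict[str, str], hidden_tests: str,
                     timeout_s: int = 60) -> float:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for rel_path, content in files.items():  # patched project files
            path = root / rel_path
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(content)
        (root / "test_hidden.py").write_text(hidden_tests)
        try:
            proc = subprocess.run(
                [sys.executable, "-m", "pytest", "-q", "test_hidden.py"],
                cwd=root, capture_output=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0
        return 1.0 if proc.returncode == 0 else 0.0
```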
Use the SWE-bench Verified easy 50-task subset only as an external smoke test until the mini eval has stable 3-seed results. Phrase results as early transfer, not SOTA.