
Experiment Plan

Focus

This project should stay GRPO-only:

Level-2 Data Factory -> Base Agent -> on-policy grouped rollouts -> hidden-test reward -> GRPO Agent

The data factory is part of the experiment, not a preprocessing detail. Each training run should be able to point to the generated JSONL splits and the manifest that records quality gates.
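A minimal sketch of checking the recorded quality gates from a run's manifest. The gate names mirror the data quality gates listed under Evidence Standard; the manifest schema (a mapping of gate name to pass/fail) is an assumption, not a fixed format.

```python
def check_quality_gates(gates):
    """Verify that a manifest's quality-gate record is complete and passing.

    `gates` is a hypothetical mapping of gate name -> bool, e.g. loaded
    from the run manifest JSON. Returns which required gates are missing
    or failed, plus an overall ok flag.
    """
    required = {"buggy_code_fails", "oracle_code_passes", "hidden_tests_not_leaked"}
    missing = sorted(required - gates.keys())
    failed = sorted(g for g in required & gates.keys() if not gates[g])
    return {"missing": missing, "failed": failed, "ok": not missing and not failed}
```

Keeping this check in the training entry point (rather than the generation script) makes each run self-auditing: a run cannot start from splits whose manifest does not record passing gates.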

Default data should be project-level: multiple files, list_files/read_file exploration, one source edit, visible tests for feedback, and hidden tests for the verifier.
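One possible shape for a project-level task record, plus the corresponding project-level gates. All field names here are illustrative assumptions about the JSONL schema, not a spec.

```python
# Hypothetical shape of one project-level task record in the generated JSONL.
task = {
    "task_id": "proj-0001",
    "files": {  # multiple files, so the agent must use list_files/read_file
        "pkg/__init__.py": "",
        "pkg/core.py": "def add(a, b):\n    return a - b\n",  # the one buggy source file
        "tests/test_visible.py": (
            "from pkg.core import add\n\ndef test_add():\n    assert add(1, 2) == 3\n"
        ),
    },
    "repair_file": "pkg/core.py",  # the single file the agent should edit
    "hidden_tests": (  # held back from the agent; used only by the verifier
        "from pkg.core import add\n\ndef test_add_neg():\n    assert add(-1, 1) == 0\n"
    ),
}

def check_project_gates(task):
    """Project-level data gates: multiple files, the repair file exists,
    and visible tests exercise the package through imports."""
    visible = [p for p in task["files"] if p.startswith("tests/")]
    return (
        len(task["files"]) > 1
        and task["repair_file"] in task["files"]
        and bool(visible)
        and all("import" in task["files"][p] for p in visible)
    )
```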

Evidence Standard

Do not use the reward model score alone as evidence. Report external verifier metrics:

  • held-out hidden-test pass@1
  • pass@k
  • average turns among successful trajectories
  • invalid tool-call rate
  • test-call efficiency before success
  • reward ranking accuracy: same-task successful trajectories should rank above failures
  • data quality gates: buggy code fails, oracle code passes, and hidden tests are not leaked
  • project-level data gates: multiple files, repair file exists, visible tests run through imports
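Two of the metrics above can be pinned down concretely. The sketch below uses the standard unbiased pass@k estimator (1 − C(n−c, k)/C(n, k) over n rollouts with c successes) and a pairwise definition of reward ranking accuracy; the exact aggregation across tasks is left open.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n rollouts (c of them successful) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def ranking_accuracy(success_rewards, failure_rewards):
    """Fraction of (success, failure) pairs from the same task where the
    successful trajectory's reward ranks strictly higher."""
    pairs = [(s, f) for s in success_rewards for f in failure_rewards]
    if not pairs:
        return float("nan")
    return sum(s > f for s, f in pairs) / len(pairs)
```

pass@1 is just `pass_at_k(n, c, 1)`, i.e. the empirical success rate c/n.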

First Real Model Upgrade

Keep the current environment API stable and replace SoftmaxPolicy with:

  • Qwen/Qwen2.5-Coder-1.5B-Instruct or a Qwen3 small coder model
  • LoRA adapters
  • TRL GRPOTrainer
  • masking of tool observations when computing policy loss
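The observation-masking point can be made concrete independently of TRL: only tokens the policy actually generated (assistant actions) should contribute to the policy loss, while prompt and tool-observation tokens are conditioned on but masked out. A minimal sketch, assuming the trajectory is already segmented into (role, token_count) spans — that segmentation interface is an assumption, not TRL's API:

```python
def build_loss_mask(segments):
    """Per-token loss mask for grouped rollouts: 1 for tokens the policy
    generated (role "assistant"), 0 for tokens it only conditioned on
    (prompt, tool observations).

    `segments` is a hypothetical list of (role, token_count) pairs in
    trajectory order.
    """
    mask = []
    for role, length in segments:
        mask.extend([1 if role == "assistant" else 0] * length)
    return mask
```

In a TRL GRPOTrainer setup this mask would be applied when computing the per-token log-prob loss, so tool outputs cannot dominate the gradient of long multi-turn trajectories.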

The hidden-test runner should remain the reward source.
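A sketch of the reward mapping, assuming the hidden-test runner reports pass counts; the all-or-nothing choice and the (passed, total) interface are assumptions — the plan only fixes the hidden-test runner as the reward source.

```python
def hidden_test_reward(passed, total):
    """Binary reward from the hidden-test runner: 1.0 only if every hidden
    test passes, 0.0 otherwise (including runner errors reported as
    total == 0). Keeping the reward binary avoids partial credit for
    edits that break some hidden behavior."""
    return 1.0 if total > 0 and passed == total else 0.0
```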

SWE-bench Positioning

Use the SWE-bench Verified easy 50-task subset only as an external smoke test until the mini eval shows stable results across 3 seeds. Phrase results as early transfer, not as SOTA.