Kaggle-first repo for verifier-driven assembly generation around Qwen/Qwen3.5-2B.
Main tracks:
- baseline: sample, verify, rerank, repair, evaluate
- warm-start + GRPO: old one-shot training path
- agentic repair GRPO: compiler-guided repair loop with grouped candidates per step
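To make "grouped candidates per step" concrete, here is a minimal sketch of the group-relative advantage that GRPO-style training is generally built on: each candidate's reward is normalized against the mean and standard deviation of its own group. The function name and reward values are illustrative assumptions, not code taken from this repo.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own candidate group (hypothetical helper)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every candidate gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Illustrative verifier rewards for four candidates sampled from one prompt.
advantages = group_relative_advantages([0.0, 0.5, 1.0, 0.5])
```

Candidates that beat their group's average get a positive advantage and the rest get a zero or negative one, so the signal comes from comparison within the group rather than from a separate value model.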
codeforgefinal/
  assembly_swe/
  configs/
    agentic_grpo.qwen35_2b.yaml
    base.yaml
    grpo_config.yaml
    grpo_config.qwen35_2b_phase1.yaml
    grpo_config.qwen35_2b_phase2.yaml
  data/
  docs/
  notebooks/
    qwen35_asm_agentic_blueprint.py
  prompts/
  scripts/
    bootstrap_kaggle.py
    generate_synthetic_tasks.py
    run_ranked_sampling.py
    run_agentic_grpo.py
    kaggle_agentic_qwen35_2b_pipeline.py
    build_sft_dataset.py
    eval.py
    smoke_test.py
    kaggle_autorun.py
    kaggle_qwen35_2b_pipeline.py
  src/
    agentic.py
    best_of_n.py
    data.py
    env.py
    modeling.py
    reward.py
    trainer.py
    utils.py
    verifier.py
  train.py
Add in Kaggle Secrets:
- HF_TOKEN
- WANDB_API_KEY
The Kaggle pipeline now loads these automatically at startup.
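For reference, a minimal sketch of how secrets can be exported to environment variables inside a Kaggle notebook. It uses Kaggle's `kaggle_secrets.UserSecretsClient`; the helper names `export_secrets` and `load_kaggle_secrets` are hypothetical, and the pipeline's own loader may work differently.

```python
import os

def export_secrets(names, get_secret):
    """Copy each named secret into os.environ; return the names that were found."""
    loaded = []
    for name in names:
        try:
            value = get_secret(name)
        except Exception:
            # Secret is missing or not attached to this notebook; skip it.
            continue
        if value:
            os.environ[name] = value
            loaded.append(name)
    return loaded

def load_kaggle_secrets():
    # kaggle_secrets is only importable inside a Kaggle notebook session.
    from kaggle_secrets import UserSecretsClient
    return export_secrets(["HF_TOKEN", "WANDB_API_KEY"], UserSecretsClient().get_secret)
```

Passing the getter in as an argument keeps the export logic testable outside Kaggle, where `kaggle_secrets` does not exist.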
!git -C /kaggle/working/codeforgefinal pull || git clone https://github.com/PAMF2/codeforgefinal.git /kaggle/working/codeforgefinal
%cd /kaggle/working/codeforgefinal
!python scripts/bootstrap_kaggle.py

Safe Kaggle preset:
!git -C /kaggle/working/codeforgefinal pull || git clone https://github.com/PAMF2/codeforgefinal.git /kaggle/working/codeforgefinal
%cd /kaggle/working/codeforgefinal
!python scripts/kaggle_agentic_qwen35_2b_pipeline.py --bootstrap-deps

This now defaults to a conservative first run:
- core_size=1200
- repair_size=400
- dev_size=120
- hard_size=120
- iterations=2
- prompts_per_iteration=3
- num_candidates=2
- repair_steps=1
- max_episode_steps=2
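If you want to reproduce this preset by hand, the sketch below turns the values into command-line flags. The flag spellings match the ones this README shows for scripts/generate_synthetic_tasks.py and scripts/run_agentic_grpo.py. The `SAFE_PRESET` dictionary and `to_flags` helper are hypothetical, and how the pipeline stores its defaults internally is not shown here.

```python
SAFE_PRESET = {
    "core_size": 1200, "repair_size": 400, "dev_size": 120, "hard_size": 120,
    "iterations": 2, "prompts_per_iteration": 3, "num_candidates": 2,
    "repair_steps": 1, "max_episode_steps": 2,
}

# Keys that size the synthetic dataset; the rest control the GRPO loop.
DATA_KEYS = ("core_size", "repair_size", "dev_size", "hard_size")

def to_flags(preset, keys):
    """Turn selected preset entries into ['--core-size', '1200', ...]."""
    flags = []
    for key in keys:
        flags += ["--" + key.replace("_", "-"), str(preset[key])]
    return flags

data_flags = to_flags(SAFE_PRESET, DATA_KEYS)
train_flags = to_flags(SAFE_PRESET, [k for k in SAFE_PRESET if k not in DATA_KEYS])
```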
If the first agentic pass fails once, the pipeline retries automatically with the same safe regime.
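The retry behaviour can be pictured with a small wrapper like the one below. This is a sketch under assumptions: `run_with_one_retry` and the `stage` callable are hypothetical names, and the real pipeline may catch narrower errors or log differently.

```python
def run_with_one_retry(stage, max_attempts=2):
    """Call stage() until it succeeds or max_attempts is exhausted."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return stage()
        except Exception as err:  # a real pipeline would likely narrow this
            last_error = err
            print(f"attempt {attempt} failed: {err}")
    raise RuntimeError("stage failed after retries") from last_error
```

With `max_attempts=2` a single transient failure (for example an out-of-memory error on the first pass) is absorbed, while a second failure still surfaces as an error.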
experiments/autoresearch_adapter/ is legacy exploratory material and is not the main GRPO
autoresearch path. The active path is scripts/run_autoresearch.py plus
experiments/autoresearch_grpo/target_config.yaml.
!python scripts/smoke_test.py --config configs/base.yaml

This is the first thing to run before longer training.
!python scripts/generate_synthetic_tasks.py \
--out-dir data/generated \
--core-size 5000 \
--repair-size 2000 \
--dev-size 500 \
--hard-size 800 \
--validate-sample 64

Outputs:
- data/generated/train.jsonl
- data/generated/dev.jsonl
- data/generated/hard.jsonl
- data/generated/private_eval.jsonl
- data/generated/generation_only.jsonl
- data/generated/repair_only.jsonl
- data/generated/manifest.json
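A quick sanity check after generation is to count the records in each split. The sketch below assumes only that the `.jsonl` files hold one JSON object per line; the helper names are hypothetical and the record schema is not inspected.

```python
import json
from pathlib import Path

def count_jsonl_records(path):
    """Count non-empty lines in a JSONL file, checking that each one parses."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises ValueError on a malformed record
                count += 1
    return count

def summarize_splits(out_dir, names=("train", "dev", "hard")):
    """Map split name to record count for the splits that exist on disk."""
    out_dir = Path(out_dir)
    return {
        name: count_jsonl_records(out_dir / f"{name}.jsonl")
        for name in names
        if (out_dir / f"{name}.jsonl").exists()
    }
```

For example, `summarize_splits("data/generated")` returns a dictionary of counts that can be compared against the sizes requested on the command line.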
!python scripts/run_ranked_sampling.py \
--config configs/base.yaml \
--tasks data/generated/dev.jsonl \
--out artifacts/predictions.jsonl \
--num-candidates 4 \
--repair-steps 1

!python scripts/build_sft_dataset.py \
--tasks data/generated/dev.jsonl \
--predictions artifacts/predictions.jsonl \
--out artifacts/sft_pairs.jsonl

!python scripts/eval.py \
--config configs/base.yaml \
--tasks data/generated/dev.jsonl \
--predictions artifacts/predictions.jsonl \
--ks 1,3,5

This is the new Kaggle path for compiler-guided repair RL.
One command:
!python scripts/kaggle_agentic_qwen35_2b_pipeline.py --bootstrap-deps --iterations 12

!python scripts/run_agentic_grpo.py \
--config configs/agentic_grpo.qwen35_2b.yaml \
--tasks data/generated/train.jsonl \
--iterations 12 \
--prompts-per-iteration 6 \
--num-candidates 4 \
--repair-steps 2 \
--max-episode-steps 3

Artifacts written under:
- artifacts/agentic_grpo/metrics.jsonl
- artifacts/agentic_grpo/trajectories/
- artifacts/agentic_grpo/sft/
- checkpoints/agentic_grpo/
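As a rough mental model of what --num-candidates and --max-episode-steps control, the sketch below runs one repair episode: at each step it samples a group of candidates, scores them with a verifier, and feeds the best candidate's feedback into the next step. Everything here is an assumption for illustration: `generate` and `verify` are hypothetical callables, the 1.0 "fully verified" threshold is invented, and the repo's actual loop may differ.

```python
def run_episode(task, generate, verify, num_candidates=4, max_episode_steps=3):
    """Sample a group per step, keep the best candidate, stop once one passes.

    generate(task, feedback) -> str and verify(task, code) -> (score, message)
    stand in for the model and the compiler-based verifier.
    """
    feedback = None
    trajectory = []
    for step in range(max_episode_steps):
        group = []
        for _ in range(num_candidates):
            code = generate(task, feedback)
            score, message = verify(task, code)
            group.append({"code": code, "score": score, "feedback": message})
        best = max(group, key=lambda c: c["score"])
        trajectory.append({"step": step, "group": group, "best": best})
        if best["score"] >= 1.0:  # assumed convention: 1.0 means fully verified
            break
        feedback = best["feedback"]  # the next step repairs using verifier output
    return trajectory
```

Each entry in the returned trajectory keeps the whole group, which is the shape of data a grouped policy-gradient update would need.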
Passing --run-autoresearch now makes autoresearch the main GRPO stage. Instead of one direct
run_agentic_grpo.py call, the pipeline launches scripts/run_autoresearch.py, which:
- runs a baseline GRPO experiment with the current config
- mutates the live GRPO config
- re-runs the real GRPO experiment
- marks it keep, discard, or crash
- advances only if the metrics improved
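The steps above amount to a simple hill-climbing search over configs. A minimal sketch, assuming a single scalar metric where higher is better; `run_experiment` and `mutate` are hypothetical callables, and the real script's selection rule may use several metrics.

```python
def autoresearch(config, run_experiment, mutate, rounds=3):
    """Hill-climb over configs: keep a mutation only if its metric improves.

    run_experiment(config) -> float and mutate(config) -> dict are hypothetical.
    """
    best_config, best_metric = config, run_experiment(config)  # baseline run
    log = [{"status": "baseline", "metric": best_metric}]
    for _ in range(rounds):
        candidate = mutate(best_config)
        try:
            metric = run_experiment(candidate)
        except Exception:
            # A failed run is recorded and the search continues from the best config.
            log.append({"status": "crash", "metric": None})
            continue
        if metric > best_metric:
            best_config, best_metric = candidate, metric
            log.append({"status": "keep", "metric": metric})
        else:
            log.append({"status": "discard", "metric": metric})
    return best_config, log
```

Because mutations always start from the current best config, a discarded or crashed experiment never changes the state that later experiments build on.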
Outputs are written to artifacts/autoresearch/grpo/<run_tag>/, including:
- results.tsv
- runs.json
- best_config.yaml
- best_artifacts/
- per-experiment run.log
--autoresearch-time-budget is the per-experiment timeout in minutes.
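A per-experiment timeout of this kind is commonly enforced with `subprocess.run(..., timeout=...)`, which takes seconds, so a budget in minutes has to be converted. The sketch below is illustrative only: `run_with_budget` is a hypothetical helper, and treating a timeout as a crash is an assumption about how the real script classifies it.

```python
import subprocess
import sys

def run_with_budget(cmd, budget_minutes):
    """Run cmd; report 'crash' on timeout or non-zero exit, else 'ok'."""
    try:
        # subprocess timeouts are in seconds, so convert from minutes.
        result = subprocess.run(cmd, timeout=budget_minutes * 60)
    except subprocess.TimeoutExpired:
        return "crash"
    return "ok" if result.returncode == 0 else "crash"

status = run_with_budget(
    [sys.executable, "-c", "print('experiment done')"], budget_minutes=1
)
```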
This preserves the older path from the previous repo.
!python scripts/kaggle_qwen35_2b_pipeline.py --bootstrap-deps --phase1-hours 8 --phase2-hours 10

Manual phase 1:
!python scripts/kaggle_autorun.py \
--root /kaggle/working/codeforgefinal \
--config configs/grpo_config.qwen35_2b_phase1.yaml \
--hours 8 \
--backend manual \
--safe-profile \
--batch-size 1 \
--generations-per-prompt 6 \
--prompts-per-iteration 8 \
--gradient-accumulation-steps 6 \
--max-new-tokens 128 \
--use-mcts-after-iteration 999

Manual phase 2:
!python scripts/kaggle_autorun.py \
--root /kaggle/working/codeforgefinal \
--config configs/grpo_config.qwen35_2b_phase2.yaml \
--hours 10 \
--backend trl \
--safe-profile \
--batch-size 1 \
--generations-per-prompt 4 \
--prompts-per-iteration 6 \
--gradient-accumulation-steps 8 \
--max-new-tokens 128 \
--use-mcts-after-iteration 999

!python assembly_swe/tools/eval_all_iters.py \
--repo-root . \
--tasks assembly_swe/datasets/dev_v1_30.jsonl \
--iter-start 1 \
--iter-end 30 \
--ks 1,3,5 \
--outdir assembly_swe/results/qwen35_2b_eval \
--load-in-4bit \
--hub-repo-id PAMF2/codeforgefinal-qwen35-2b \
--base-model Qwen/Qwen3.5-2B \
--max-new-tokens 128 \
--temperature 0.20 \
--top-p 0.80 \
--top-k 20 \
--repetition-penalty 1.05 \
--num-candidates 5 \
--verifier reward \
--verifier-timeout-sec 6 \
--repair-steps 1

If you want the shortest path that actually tests the new stack:
- python scripts/kaggle_agentic_qwen35_2b_pipeline.py --bootstrap-deps
- if that works, scale up the explicit flags gradually
Then scale iterations and dataset size once the small run completes.