Timeline
| Day | Task |
| --- | --- |
| Friday Apr 18 | File issue, write training config, launch training run |
| Sat-Sun Apr 19-20 | Training completes (~1-2 days), launch Harbor evals |
| Monday Apr 21 | Eval results, post analysis to this issue |
Data Preparation
The 100K subset should be stratified:
- Sample proportionally across languages (matching the SWE-rebench V2 distribution)
- Include both "Submitted" (8.9%) and "incomplete" (91.1%) rollouts
- Filter to MAX_TURNS=15 rollouts only (the ongoing production config) for consistency
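The stratification rules above could be sketched roughly as follows. This is a minimal illustration, not the actual data pipeline: the record fields `language`, `max_turns`, and `status`, and the `language_share` mapping (standing in for the SWE-rebench V2 language distribution), are hypothetical names chosen for the example.

```python
import random
from collections import defaultdict


def stratified_sample(rollouts, language_share, total, max_turns=15, seed=0):
    """Sample about `total` rollouts with per-language quotas from `language_share`.

    `rollouts` is a list of dicts with hypothetical keys "language",
    "max_turns", and "status". `language_share` maps language -> fraction.
    """
    rng = random.Random(seed)
    by_language = defaultdict(list)
    for rollout in rollouts:
        # Keep only rollouts generated under the MAX_TURNS=15 config.
        if rollout["max_turns"] == max_turns:
            by_language[rollout["language"]].append(rollout)

    sample = []
    for language, share in language_share.items():
        pool = by_language.get(language, [])
        # Cap the quota at the pool size so small languages do not raise.
        quota = min(round(total * share), len(pool))
        # Uniform sampling within a language keeps both "Submitted" and
        # "incomplete" rollouts, roughly at their natural ratio.
        sample.extend(rng.sample(pool, quota))
    rng.shuffle(sample)
    return sample
```

Because quotas are rounded and capped per language, the result can come out slightly under or over `total`; a real pipeline would likely redistribute the remainder.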
Objective
Validate the quality of the SWE-ZERO 140B trajectory dataset (#4719) by continue-training Marin-8B base on a representative subset and measuring before/after on SWE-bench Verified and SWE-bench Multilingual.
Dataset: AlienKevin/SWE-ZERO-12M-trajectories, 1.45M clean trajectories (20B checkpoint)
Deadline: Monday April 21 (results needed for go/no-go on scaling to 140B)
Experiment Design
Training
Estimated training time: ~1-2 days on v5p-32 (100K trajectories × 1 epoch × 32K context)
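As a sanity check on that estimate, the token budget can be worked out directly. The sketch below assumes "32K" means 32 × 1024 tokens and that every trajectory fills the full context, so it gives an upper bound of about 3.28B tokens; real trajectories are likely shorter. The resulting throughput figures are targets implied by the 1-2 day window, not measured numbers.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_tokens_per_second(num_trajectories, context_len, epochs, days):
    """Throughput needed to finish in `days` if every sequence fills the context."""
    total_tokens = num_trajectories * context_len * epochs
    return total_tokens / (days * SECONDS_PER_DAY)


# Assumption: "32K" context = 32 * 1024 = 32,768 tokens.
CONTEXT_LEN = 32 * 1024
upper_bound_tokens = 100_000 * CONTEXT_LEN * 1  # 100K trajectories, 1 epoch

for days in (1, 2):
    rate = required_tokens_per_second(100_000, CONTEXT_LEN, 1, days)
    print(f"{days} day(s): {rate:,.0f} tokens/s across the whole slice")
```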
Reference configs:
- exp3490b_sft_nemotron_terminal_corpus_qwen3_8b.py
- exp3896_sft_ota_32k_qwen3_8b.py
- exp4420_sft_marin_8b_instruct_terminal_corpus.py
Evaluation
Run Harbor evaluation on both the base model and the fine-tuned model:
- swebench-verified@1.0
- swebench-multilingual@1.0
Reference eval configs:
- exp4307_eval_released_nemotron_terminal_32b_tb2.py
Estimated eval time: ~12-24 hours per benchmark on v5p-8 with task sharding
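How Harbor shards tasks is not described in this issue. As a generic illustration only, a deterministic round-robin split of benchmark task IDs across workers might look like the sketch below; `shard_tasks` and the task ID strings are hypothetical and not part of Harbor's API.

```python
def shard_tasks(task_ids, num_shards):
    """Split task IDs into `num_shards` disjoint, near-equal shards (round-robin)."""
    # Sort first so every worker derives the same split from the same task set.
    ordered = sorted(task_ids)
    return [ordered[i::num_shards] for i in range(num_shards)]


# Hypothetical usage: 10 placeholder task IDs split across 4 workers.
shards = shard_tasks([f"task-{i:02d}" for i in range(10)], num_shards=4)
```

Shard sizes differ by at most one, so wall-clock time is dominated by the slowest tasks rather than by uneven shard sizes.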
Success Criteria
A negative result (no improvement) would indicate the trajectories need quality filtering beyond error/dedup removal — e.g., filtering to submitted-only rollouts, or increasing MAX_TURNS back to 30.
Related Issues