
[SWE-ZERO] Quality validation: midtrain Marin-8B on SWE-ZERO trajectories and evaluate on SWE-bench #4898

Description

@AlienKevin

Objective

Validate the quality of the SWE-ZERO 140B trajectory dataset (#4719) by continuing to train Marin-8B base on a representative subset and comparing resolve rates before and after training on SWE-bench Verified and SWE-bench Multilingual.

Dataset: AlienKevin/SWE-ZERO-12M-trajectories — 1.45M clean trajectories (20B checkpoint)

Deadline: Monday April 21 (results needed for go/no-go on scaling to 140B)

Experiment Design

Training

| Parameter | Value | Rationale |
|---|---|---|
| Base model | Marin-8B base (or Qwen3-8B) | Matches existing SFT baselines (#3490, #3896, #4420) |
| Dataset | 100K trajectory subset from SWE-ZERO-12M | Representative sample; full 1.45M would take ~10 days |
| Sequence length | 32K | Matches SWE-ZERO generation config |
| Epochs | 1 | Speed; a quality signal should be detectable in 1 epoch |
| Batch size | 16-128 (depending on TPU) | Follow #3490/#3896 configs |
| Learning rate | 2e-5 to 4e-5 | Standard for continued pre-training on agentic data |
| TPU | v5p-32 or v5p-64 | ~1-2 days training time |
| Format | mini-swe-agent v1 chat format | Same format the data was generated in |

Estimated training time: ~1-2 days on v5p-32 (100K trajectories × 1 epoch × 32K context)
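
The arithmetic behind this estimate can be sketched as below. This is a rough upper bound, not the actual training config: it assumes "32K" means 32,768 tokens, that every trajectory fills the full context, and that each trajectory occupies one sequence (no packing). The function names are hypothetical.

```python
import math

NUM_TRAJECTORIES = 100_000
SEQ_LEN = 32 * 1024  # assumption: "32K" means 32,768 tokens
EPOCHS = 1


def token_upper_bound(n=NUM_TRAJECTORIES, seq_len=SEQ_LEN, epochs=EPOCHS):
    """Upper bound on tokens seen if every trajectory fills the context."""
    return n * seq_len * epochs


def optimizer_steps(batch_size, n=NUM_TRAJECTORIES, epochs=EPOCHS):
    """Optimizer steps, assuming one trajectory per sequence (no packing)."""
    return math.ceil(n * epochs / batch_size)


if __name__ == "__main__":
    print(f"token upper bound: {token_upper_bound():,}")
    for bs in (16, 128):  # the two ends of the planned batch-size range
        print(f"batch size {bs}: {optimizer_steps(bs)} steps")
```

Real trajectories are likely shorter than the full context, so the actual token count should come in below this bound.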

Reference configs:

Evaluation

Run Harbor evaluation on both the base model and the fine-tuned model:

| Benchmark | Dataset | Agent | Expected baseline |
|---|---|---|---|
| SWE-bench Verified | swebench-verified@1.0 | mini-swe-agent v1 (terminus-2) | ~0% (untrained base) |
| SWE-bench Multilingual | swebench-multilingual@1.0 | mini-swe-agent v1 (terminus-2) | ~0% (untrained base) |

Reference eval configs:

Estimated eval time: ~12-24 hours per benchmark on v5p-8 with task sharding
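
Task sharding here could look something like the following sketch. It is a generic round-robin split, not Harbor's actual sharding mechanism; `shard_tasks` and the task ID format are hypothetical and only illustrate how a benchmark's tasks might be divided deterministically across workers.

```python
def shard_tasks(task_ids, num_shards):
    """Split task IDs into num_shards disjoint, near-equal shards.

    Sorting first makes the assignment deterministic, so a failed shard
    can be re-run on exactly the same tasks.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    ordered = sorted(task_ids)
    return [ordered[i::num_shards] for i in range(num_shards)]


if __name__ == "__main__":
    tasks = [f"task-{i:03d}" for i in range(10)]  # placeholder IDs
    for index, shard in enumerate(shard_tasks(tasks, 4)):
        print(index, shard)
```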

Success Criteria

  1. Minimum: Fine-tuned model shows any positive resolve rate on SWE-bench Verified (base model at ~0%)
  2. Good: resolve rate ≥ 5% on SWE-bench Verified (comparable to the ConTree pass@1 of 6.0% reported in #4666, "Experiment: SWE-ZERO scaling to 1B tokens (32k PRs × 3 rollouts)")
  3. Stretch: resolve rate ≥ 10% on SWE-bench Verified

A negative result (no improvement) would indicate that the trajectories need more than error and duplicate removal, for example filtering to submitted-only rollouts or increasing MAX_TURNS back to 30.
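
As a minimal sketch, the criteria above can be turned into a small classifier for the go/no-go call. The tier names and the `classify` function are hypothetical; thresholds are compared with integer arithmetic to avoid floating-point edge cases at exactly 5% and 10%.

```python
def classify(resolved, total):
    """Map a resolve count on SWE-bench Verified to a success tier."""
    if total <= 0:
        raise ValueError("total must be positive")
    if resolved <= 0:
        return "negative"  # no improvement over the ~0% base model
    if resolved * 100 >= 10 * total:
        return "stretch"   # resolve rate >= 10%
    if resolved * 100 >= 5 * total:
        return "good"      # resolve rate >= 5%
    return "minimum"       # any positive resolve rate


if __name__ == "__main__":
    for resolved in (0, 1, 25, 50):
        print(resolved, classify(resolved, 500))
```

With only a handful of resolved tasks, the "minimum" tier sits close to run-to-run noise, so it may be worth comparing against the base model's measured rate rather than an assumed 0%.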

Timeline

| Day | Task |
|---|---|
| Friday Apr 18 | File issue, write training config, launch training run |
| Sat-Sun Apr 19-20 | Training completes (~1-2 days), launch Harbor evals |
| Monday Apr 21 | Eval results, post analysis to this issue |

Data Preparation

The 100K subset should be stratified:

  • Sample proportionally across languages (matching SWE-rebench V2 distribution)
  • Include both "submitted" (8.9%) and "incomplete" (91.1%) rollouts
  • Filter to MAX_TURNS=15 rollouts only (the current production config) for consistency
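
One way to draw such a subset is sketched below. The record fields (`language`, `status`, `max_turns`) are hypothetical and may not match the dataset schema. The sketch samples each (language, status) stratum in proportion to its share of the filtered pool, which matches the SWE-rebench V2 language distribution only if the pool already follows it; that is an assumption, not something verified here.

```python
import random
from collections import defaultdict


def stratified_sample(records, target_size, max_turns=15, seed=0):
    """Proportionally sample records across (language, status) strata."""
    # Keep only rollouts generated with the requested MAX_TURNS setting.
    eligible = [r for r in records if r["max_turns"] == max_turns]
    if not eligible:
        return []

    strata = defaultdict(list)
    for record in eligible:
        strata[(record["language"], record["status"])].append(record)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    sample = []
    for key in sorted(strata):
        group = strata[key]
        # Each stratum's quota is proportional to its share of the pool.
        quota = round(target_size * len(group) / len(eligible))
        sample.extend(rng.sample(group, min(len(group), quota)))
    return sample
```

Because quotas are rounded per stratum, the result can differ from `target_size` by a few records; trimming or topping up at random afterwards would fix the size exactly if that matters.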

Related Issues
