[SWE-ZERO] Quality validation: midtrain Marin-8B on SWE-ZERO trajectories and evaluate on SWE-bench

## Objective

Validate the quality of the SWE-ZERO 140B trajectory dataset ([#4719](https://github.com/marin-community/marin/issues/4719)) by continuing to train Marin-8B base on a representative subset and comparing resolve rates before and after training on SWE-bench Verified and SWE-bench Multilingual.

**Dataset**: [`AlienKevin/SWE-ZERO-12M-trajectories`](https://huggingface.co/datasets/AlienKevin/SWE-ZERO-12M-trajectories) — 1.45M clean trajectories (20B checkpoint)

**Deadline**: Monday April 21 (results needed for go/no-go on scaling to 140B)

## Experiment Design

### Training

| Parameter | Value | Rationale |
|---|---|---|
| Base model | Marin-8B base (or Qwen3-8B) | Matches existing SFT baselines (#3490, #3896, #4420) |
| Dataset | 100K trajectory subset from SWE-ZERO-12M | Representative sample; full 1.45M would take ~10 days |
| Sequence length | 32K | Matches SWE-ZERO generation config |
| Epochs | 1 | Keeps the run short; a quality signal should be detectable after 1 epoch |
| Batch size | 16-128 (depending on TPU) | Follow #3490/#3896 configs |
| Learning rate | 2e-5 to 4e-5 | Standard for continued pre-training on agentic data |
| TPU | v5p-32 or v5p-64 | ~1-2 days training time |
| Format | mini-swe-agent v1 chat format | Same format the data was generated in |

**Estimated training time**: ~1-2 days on v5p-32 (100K trajectories × 1 epoch × 32K context)
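
As a rough sanity check on the size of the run, the sketch below computes the optimizer steps for one epoch at the two ends of the batch-size range, plus an upper bound on tokens processed. It assumes one trajectory per sequence (no packing), that 32K means 32,768 tokens, and that every trajectory fills the full context. None of these assumptions is confirmed by the referenced configs, and the function names are illustrative only.

```python
import math

NUM_TRAJECTORIES = 100_000   # subset size from the training table
SEQ_LEN = 32 * 1024          # assumption: "32K" means 32,768 tokens
EPOCHS = 1


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Optimizer steps for one pass, assuming one trajectory per sequence."""
    return math.ceil(num_examples / batch_size)


def max_tokens(num_examples: int, seq_len: int, epochs: int) -> int:
    """Upper bound on tokens seen: every trajectory fills the whole context."""
    return num_examples * seq_len * epochs


for batch_size in (16, 128):
    print(f"batch size {batch_size}: {steps_per_epoch(NUM_TRAJECTORIES, batch_size)} steps")
print(f"token upper bound: {max_tokens(NUM_TRAJECTORIES, SEQ_LEN, EPOCHS):,}")
```

Under these assumptions the run is between 782 and 6,250 steps and sees at most about 3.3B tokens. Trajectories shorter than the context would lower the token count.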

Reference configs:
- #3490 (NemotronTerminal-8B reproduction): `exp3490b_sft_nemotron_terminal_corpus_qwen3_8b.py`
- #3896 (OT-Agent 32K reproduction): `exp3896_sft_ota_32k_qwen3_8b.py`
- #4420 (Marin-8B on Terminal-Corpus): `exp4420_sft_marin_8b_instruct_terminal_corpus.py`
- #4510 (compute estimates): ~5 days on v5p-64 for 366K examples
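
The #4510 figure gives a way to cross-check the training-time estimate. The sketch below scales it linearly in example count and inversely in slice size. Both are simplifying assumptions, as is treating the 366K examples in #4510 as comparable in length to these trajectories. The function and constant names are illustrative.

```python
REF_EXAMPLES = 366_000  # from #4510
REF_DAYS = 5.0          # from #4510, measured on v5p-64
REF_SLICE = 64          # assumption: the number in the slice name is a relative size


def estimate_days(num_examples: int, slice_size: int) -> float:
    """Linear extrapolation from the #4510 reference point (an assumption)."""
    if num_examples < 0 or slice_size <= 0:
        raise ValueError("num_examples must be >= 0 and slice_size > 0")
    return REF_DAYS * (num_examples / REF_EXAMPLES) * (REF_SLICE / slice_size)


for slice_size in (32, 64):
    print(f"v5p-{slice_size}: {estimate_days(100_000, slice_size):.1f} days")
```

Under these assumptions the 100K subset comes out near 1.4 days on v5p-64 and about 2.7 days on v5p-32, so the v5p-32 option may run past the 1-2 day estimate. The gap could also reflect different example lengths or settings in #4510.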

### Evaluation

Run Harbor evaluation on both the base model and the fine-tuned model:

| Benchmark | Dataset | Agent | Expected baseline |
|---|---|---|---|
| SWE-bench Verified | `swebench-verified@1.0` | mini-swe-agent v1 (terminus-2) | ~0% (untrained base) |
| SWE-bench Multilingual | `swebench-multilingual@1.0` | mini-swe-agent v1 (terminus-2) | ~0% (untrained base) |

Reference eval configs:
- #4307 (Harbor eval template): `exp4307_eval_released_nemotron_terminal_32b_tb2.py`
- #3846 (Harbor speedup techniques): task sharding, vLLM tuning
- #4683 (ConTree execution-based eval): alternative evaluation path

**Estimated eval time**: ~12-24 hours per benchmark on v5p-8 with task sharding
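
Task sharding here means splitting a benchmark's task list into disjoint groups that can be evaluated in parallel. The sketch below shows one simple, deterministic way to do that. It is not Harbor's API; `shard_tasks` and the placeholder task IDs are hypothetical.

```python
from typing import Iterable, List


def shard_tasks(task_ids: Iterable[str], num_shards: int) -> List[List[str]]:
    """Round-robin split of sorted task IDs into disjoint shards."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # Sorting first makes the assignment reproducible across runs.
    ordered = sorted(task_ids)
    return [ordered[i::num_shards] for i in range(num_shards)]


# Placeholder IDs for illustration; real runs would use the benchmark's task IDs.
tasks = [f"task-{i:03d}" for i in range(10)]
shards = shard_tasks(tasks, 4)
for index, shard in enumerate(shards):
    print(index, shard)
```

Round-robin assignment keeps shard sizes within one task of each other, which helps the shards finish at similar times if task durations are roughly uniform.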

### Success Criteria

1. **Minimum**: Fine-tuned model shows any positive resolve rate on SWE-bench Verified (base model at ~0%)
2. **Good**: resolve rate ≥ 5% on SWE-bench Verified (comparable to ConTree pass@1 of 6.0% from #4666)
3. **Stretch**: resolve rate ≥ 10% on SWE-bench Verified

A negative result (no improvement) would suggest that the trajectories need quality filtering beyond error/dedup removal, for example filtering to submitted-only rollouts or increasing MAX_TURNS back to 30.
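
The tiers above can be written down as a small helper so the final analysis applies them consistently. This is a minimal sketch: the function names and the example counts are illustrative, and the thresholds are the ones listed in the criteria.

```python
def resolve_rate_pct(resolved: int, total: int) -> float:
    """Resolve rate as a percentage of evaluated tasks."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * resolved / total


def classify_outcome(rate_pct: float) -> str:
    """Map a SWE-bench Verified resolve rate to the success tiers."""
    if rate_pct >= 10.0:
        return "stretch"
    if rate_pct >= 5.0:
        return "good"
    if rate_pct > 0.0:
        return "minimum"
    return "negative"


# Illustrative counts only, not results.
print(classify_outcome(resolve_rate_pct(30, 500)))
```

Note that the "minimum" tier is met by a single resolved task, so it is worth reporting the raw counts alongside the tier.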

## Timeline

| Day | Task |
|-----|------|
| **Friday Apr 18** | File issue, write training config, launch training run |
| **Sat-Sun Apr 19-20** | Training completes (~1-2 days), launch Harbor evals |
| **Monday Apr 21** | Eval results, post analysis to this issue |

## Data Preparation

The 100K subset should be stratified:
- Sample proportionally across languages (matching SWE-rebench V2 distribution)
- Include both "Submitted" (8.9%) and "incomplete" (91.1%) rollouts
- Filter to MAX_TURNS=15 rollouts only (the current production config) for consistency
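
The stratification described above could be sketched as follows. The field names `language` and `max_turns` are hypothetical and may not match the dataset schema. The sketch also assumes the source pool already reflects the SWE-rebench V2 language distribution, so proportional sampling preserves it. Rollout status is not filtered, so submitted and incomplete rollouts are both kept at roughly their natural ratio.

```python
import random
from collections import defaultdict


def stratified_sample(rows, target_size, seed=0):
    """Sample `target_size` rows, proportional to each language's share."""
    rng = random.Random(seed)
    # Keep only rollouts generated with MAX_TURNS=15 (hypothetical field name).
    eligible = [r for r in rows if r["max_turns"] == 15]
    if target_size > len(eligible):
        raise ValueError("target_size exceeds the number of eligible rows")

    by_language = defaultdict(list)
    for row in eligible:
        by_language[row["language"]].append(row)

    # Largest-remainder allocation so the quotas sum exactly to target_size.
    total = len(eligible)
    exact = {lang: target_size * len(group) / total
             for lang, group in by_language.items()}
    quota = {lang: int(share) for lang, share in exact.items()}
    leftover = target_size - sum(quota.values())
    by_remainder = sorted(exact, key=lambda lang: exact[lang] - quota[lang],
                          reverse=True)
    for lang in by_remainder[:leftover]:
        quota[lang] += 1

    sample = []
    for lang, group in by_language.items():
        sample.extend(rng.sample(group, quota[lang]))
    rng.shuffle(sample)
    return sample
```

A fixed seed keeps the subset reproducible, which matters if the run has to be repeated or extended after the go/no-go decision.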

## Related Issues

- #4719 — SWE-ZERO 140B scaling run (the dataset being validated)
- #4666 — SWE-ZERO 1B scaling (ConTree eval: pass@1=6.0%, pass@3=11.3%)
- #4653 — SWE-ZERO multilang validation (300 rollouts, 20 languages)
- #3490 — NemotronTerminal-8B reproduction (SFT workflow template)
- #3896 — OT-Agent 32K reproduction (32K context SFT)
- #4420 — Marin-8B on Terminal-Corpus
- #4510 — Compute estimates for 8B/32B post-training
- #4307 — Harbor SWE-bench evaluation template
- #3846 — Harbor eval speedup techniques

[SWE-ZERO] Quality validation: midtrain Marin-8B on SWE-ZERO trajectories and evaluate on SWE-bench #4898
