Experiment: SWE-ZERO multi-language scaling (20 languages × 5 PRs × 3 runs) #4653

@AlienKevin

Description

Scale the execution-free SWE-ZERO recipe (#4561, now complete for Python) to all 20 programming languages in SWE-rebench V2.

Background

The SWE-ZERO MVP produced 1000 high-quality agentic rollouts on 100 Python PRs from SWE-rebench V2 in 38 minutes on a v6e-8 worker. Final recipe: ricdomolm/mini-coder-1.7b + mini-swe-agent v1 scaffold + PATH-whitelist bash sandbox + 32K context. Results at #4561 (closing comment) and HF dataset AlienKevin/SWE-ZERO-1k-trajectories-32k.

SWE-rebench V2 has 32,079 instances across 20 languages:

| language | PRs available | language | PRs available |
|---|---:|---|---:|
| python | 7,243 | swift | 362 |
| go | 6,144 | dart | 251 |
| ts (TypeScript) | 4,204 | c | 230 |
| js (JavaScript) | 4,138 | cpp | 182 |
| rust | 3,123 | csharp | 173 |
| java | 1,716 | r | 157 |
| php | 1,445 | clojure | 105 |
| kotlin | 889 | ocaml | 58 |
| julia | 793 | lua | 39 |
| elixir | 416 | | |
| scala | 411 | | |

Goal

Get a first look at how the Python recipe generalizes to other languages — specifically whether mini-coder-1.7b (trained primarily on Python data) still produces grounded multi-turn bash trajectories when pointed at Go, Rust, TypeScript, etc.

Sampling plan

  • 100 PRs total spanning all 20 languages with as much coverage as possible
  • 3 rollouts per PR (total = 300 rollouts) — chosen to keep the experiment cheap while still letting us compute a meaningful per-PR submission rate
  • 5 PRs per language for the 20 languages. This gives uniform coverage regardless of how many PRs each language has in SWE-rebench V2 (some languages have only 39–58 instances).
  • Use seed=7 for deterministic sampling per language, matching prior steps
  • Same filter: only include PRs where the base repo can be shallow-cloned and test_patch applies cleanly (if any don't, fall back to the next PR in that language's seeded sample)
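The seeded per-language sampling with filter fallback described above can be sketched as follows (the function and predicate names are illustrative, not the actual loader API):

```python
import random

SEED = 7          # matches the seed used in prior steps
N_PER_LANG = 5

def sample_prs(instances_by_lang, passes_filter, n=N_PER_LANG, seed=SEED):
    """For each language, shuffle its PRs with a fixed seed and keep the
    first n that pass the clone/test_patch filter, falling back to the
    next PR in the seeded order whenever a candidate fails."""
    sampled = {}
    for lang, instances in instances_by_lang.items():
        order = list(instances)
        random.Random(seed).shuffle(order)   # fresh RNG per language -> deterministic
        kept = []
        for pr in order:
            if len(kept) == n:
                break
            if passes_filter(pr):            # shallow-clone ok + test_patch applies
                kept.append(pr)
        sampled[lang] = kept
    return sampled
```

Seeding a fresh `random.Random(seed)` per language (rather than one shared RNG) keeps each language's sample independent of the iteration order of the others.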

Recipe (unchanged from the MVP final config)

```shell
uv run iris --cluster marin job run \
  --job-name swe-zero-multilang \
  --tpu v6e-4,v6e-8 --enable-extra-resources \
  --cpu 16 --memory 24GB --disk 60GB \
  --extra vllm \
  --env-vars VLLM_TPU_SKIP_PRECOMPILE 1 \
  --env-vars VLLM_ALLOW_LONG_MAX_MODEL_LEN 1 \
  --env-vars VLLM_TPU_DISABLE_TOPK_TOPP_OPTIMIZATION 1 \
  --env-vars MARIN_VLLM_MODE native \
  --env-vars HF_TOKEN $HF_TOKEN \
  -- python experiments/swe_zero/run_swe_zero_multilang.py --local \
       --model ricdomolm/mini-coder-1.7b \
       --n-prs 100 --n-rollouts 3 \
       --tensor-parallel-size 4 --max-num-seqs 256 \
       --max-model-len 32768 --max-total-tokens 32768 \
       --concurrency 64 \
       --output_dir gs://marin-us-central2/experiments/swe_zero_multilang
```

A new entry point run_swe_zero_multilang.py will be added that drops the language_filter="python" in SWERebenchV2Loader and samples 5 PRs per language instead of 10 rollouts × 10 PRs × 10 repos.

Estimated wall time: 300 rollouts × ~2.3 s/rollout (the MVP's effective throughput: 1000 rollouts in 38 min on v6e-8) ≈ 11–12 min; v6e-4 TP=4 may be somewhat slower.

Metrics to report

Per language (and aggregate):

  • Submission rate (fraction of rollouts exiting via echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT)
  • Estimated resolve rate at empirical pass@1 and pass@3. Replay each sampled rollout in a fresh worktree to recover the model's effective source-code diff, then LLM-judge it against the SWE-rebench V2 gold_patch for semantic equivalence (same conservative sub-agent judge as the #4561 resolve-rate sample). For each PR with s resolved rollouts out of n = 3, compute 1 − C(n−s, k)/C(n, k) and average across PRs to get pass@k. Report:
    • Resolved pass@1 (fraction of individual rollouts that produce a correct fix)
    • Resolved pass@3 (fraction of PRs with at least one correct fix in their 3 rollouts)
    • The Python MVP showed ~13% pass@1 and ~20% pass@10 — the multilang point is to see how those numbers degrade per language
  • Mean turns / rollout, mean completion tokens / rollout
  • command not found observations — does the model try language-specific interpreters (node, rustc, go, gradle, …) that aren't on the PATH whitelist?
  • Diversity: corpus-wide AND within-PR mean Jaccard (the within-PR breakdown is the honest one — the corpus-wide number is dominated by trivially-different cross-PR pairs)
  • Top-15 bash command first-words per language — does the exploration idiom (find | grep | cat) transfer, or does the model default to Python patterns?
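The per-PR pass@k estimator and the within-PR mean Jaccard described above can be sketched as follows (function names are illustrative, not the repo's API):

```python
from itertools import combinations
from math import comb

def pass_at_k(n: int, s: int, k: int) -> float:
    """Per-PR pass@k: probability that a uniformly random size-k subset
    of the n rollouts contains at least one of the s resolved rollouts."""
    if n - s < k:  # too few failures to fill a k-subset -> certain success
        return 1.0
    return 1.0 - comb(n - s, k) / comb(n, k)

def mean_pass_at_k(per_pr_counts, k: int) -> float:
    """per_pr_counts: iterable of (n, s) pairs, one per PR."""
    vals = [pass_at_k(n, s, k) for n, s in per_pr_counts]
    return sum(vals) / len(vals)

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def within_pr_mean_jaccard(token_sets_by_pr) -> float:
    """token_sets_by_pr: dict pr_id -> list of per-rollout token sets.
    Averages Jaccard similarity over all rollout pairs within each PR,
    never across PRs (avoiding the trivially-different cross-PR pairs)."""
    sims = []
    for sets_ in token_sets_by_pr.values():
        sims.extend(jaccard(a, b) for a, b in combinations(sets_, 2))
    return sum(sims) / len(sims)
```

With n = 3 and k = 3, pass@3 reduces to "at least one of the 3 rollouts resolved", matching the bullet above; pass@1 is simply s/n per PR.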

Hypotheses

  1. Submission rate will drop for non-Python languages because the system prompt's examples and the model's SFT data are Python-biased (find . -name "*.py").
  2. command not found rate will climb for compiled / JVM / JS languages — the model will reach for go build, cargo check, tsc, gradle based on Python-side intuitions about "verify the fix".
  3. Exploration idioms will transfer: grep/cat/find/sed are language-agnostic, so the first 2–3 turns should still be templated.
  4. Resolve rate will be very low (well under 13% Python pass@1) for languages with structural build systems where a simple sed -i patch is insufficient.

Deliverables

  • New script: experiments/swe_zero/run_swe_zero_multilang.py
  • Raw rollouts: gs://marin-us-central2/experiments/swe_zero_multilang/rollouts.json
  • Per-language report: gs://.../multilang_report.json
  • HF dataset: AlienKevin/SWE-ZERO-multilang-300-trajectories in the same chat-viewer format as the Python datasets
  • Follow-up comment on this issue with the per-language table, hypotheses-check, and recommendations for the next scale step (more languages? more PRs per language? Python-leaning model swap for non-Python langs?)

Part of #4435
