Experiment: SWE-ZERO multi-language scaling (20 languages × 5 PRs × 3 runs) #4653

@AlienKevin

Description

Scale the execution-free SWE-ZERO recipe (#4561, now complete for Python) to all 20 programming languages in SWE-rebench V2.

Background

The SWE-ZERO MVP produced 1000 high-quality agentic rollouts on 100 Python PRs from SWE-rebench V2 in 38 minutes on a v6e-8 worker. Final recipe: ricdomolm/mini-coder-1.7b + mini-swe-agent v1 scaffold + PATH-whitelist bash sandbox + 32K context. Results at #4561 (closing comment) and HF dataset AlienKevin/SWE-ZERO-1k-trajectories-32k.

SWE-rebench V2 has 32,079 instances across 20 languages:

| language | PRs available | language | PRs available |
|---|---:|---|---:|
| python | 7,243 | swift | 362 |
| go | 6,144 | dart | 251 |
| ts (TypeScript) | 4,204 | c | 230 |
| js (JavaScript) | 4,138 | cpp | 182 |
| rust | 3,123 | csharp | 173 |
| java | 1,716 | r | 157 |
| php | 1,445 | clojure | 105 |
| kotlin | 889 | ocaml | 58 |
| julia | 793 | lua | 39 |
| elixir | 416 | | |
| scala | 411 | | |

Goal

Get a first look at how the Python recipe generalizes to other languages — specifically whether mini-coder-1.7b (trained primarily on Python data) still produces grounded multi-turn bash trajectories when pointed at Go, Rust, TypeScript, etc.

Sampling plan

  • 100 PRs total spanning all 20 languages with as much coverage as possible
  • 3 rollouts per PR (total = 300 rollouts) — chosen to keep the experiment cheap while still letting us compute a meaningful per-PR submission rate
  • 5 PRs per language for the 20 languages. This gives uniform coverage regardless of how many PRs each language has in SWE-rebench V2 (some languages have only 39–58 instances).
  • Use seed=7 for deterministic sampling per language, matching prior steps
  • Same filter: only include PRs where the base repo can be shallow-cloned and test_patch applies cleanly (if any don't, fall back to the next PR in that language's seeded sample)
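The seeded per-language sampling with filter fallback described above can be sketched as follows (the function and predicate names are illustrative, not the actual loader API):

```python
import random

SEED = 7          # matches the seed used in prior steps
N_PER_LANG = 5

def sample_prs(instances_by_lang, passes_filter, n=N_PER_LANG, seed=SEED):
    """For each language, shuffle its PRs with a fixed seed and keep the
    first n that pass the clone/test_patch filter, falling back to the
    next PR in the seeded order whenever a candidate fails."""
    sampled = {}
    for lang, instances in instances_by_lang.items():
        order = list(instances)
        random.Random(seed).shuffle(order)   # fresh RNG per language -> deterministic
        kept = []
        for pr in order:
            if len(kept) == n:
                break
            if passes_filter(pr):            # shallow-clone ok + test_patch applies
                kept.append(pr)
        sampled[lang] = kept
    return sampled
```

Seeding a fresh `random.Random(seed)` per language (rather than one shared RNG) keeps each language's sample independent of the iteration order of the others.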

Recipe (unchanged from the MVP final config)

```shell
uv run iris --cluster marin job run \
  --job-name swe-zero-multilang \
  --tpu v6e-4,v6e-8 --enable-extra-resources \
  --cpu 16 --memory 24GB --disk 60GB \
  --extra vllm \
  --env-vars VLLM_TPU_SKIP_PRECOMPILE 1 \
  --env-vars VLLM_ALLOW_LONG_MAX_MODEL_LEN 1 \
  --env-vars VLLM_TPU_DISABLE_TOPK_TOPP_OPTIMIZATION 1 \
  --env-vars MARIN_VLLM_MODE native \
  --env-vars HF_TOKEN $HF_TOKEN \
  -- python experiments/swe_zero/run_swe_zero_multilang.py --local \
       --model ricdomolm/mini-coder-1.7b \
       --n-prs 100 --n-rollouts 3 \
       --tensor-parallel-size 4 --max-num-seqs 256 \
       --max-model-len 32768 --max-total-tokens 32768 \
       --concurrency 64 \
       --output_dir gs://marin-us-central2/experiments/swe_zero_multilang
```

A new entry point run_swe_zero_multilang.py will be added that drops the language_filter="python" in SWERebenchV2Loader and samples 5 PRs per language instead of 10 rollouts × 10 PRs × 10 repos.

Estimated wall time: 300 rollouts × ~2.3 s/rollout (the MVP's effective throughput: 1000 rollouts in 38 min on v6e-8) ≈ 11–12 min; v6e-4 TP=4 may be somewhat slower.

Metrics to report

Per language (and aggregate):

  • Submission rate (fraction of rollouts exiting via echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT)
  • Estimated resolve rate at empirical pass@1 and pass@3. Replay each sampled rollout in a fresh worktree to recover the model's effective source-code diff, then LLM-judge it against the SWE-rebench V2 gold_patch for semantic equivalence (same conservative sub-agent judge as the #4561 resolve-rate sample). For each PR with s resolved rollouts out of n = 3, compute 1 − C(n−s, k)/C(n, k) and average across PRs to get pass@k. Report:
    • Resolved pass@1 (fraction of individual rollouts that produce a correct fix)
    • Resolved pass@3 (fraction of PRs with at least one correct fix in their 3 rollouts)
    • The Python MVP showed ~13% pass@1 and ~20% pass@10 — the multilang point is to see how those numbers degrade per language
  • Mean turns / rollout, mean completion tokens / rollout
  • command not found observations — does the model try language-specific interpreters (node, rustc, go, gradle, …) that aren't on the PATH whitelist?
  • Diversity: corpus-wide AND within-PR mean Jaccard (the within-PR breakdown is the honest one — the corpus-wide number is dominated by trivially-different cross-PR pairs)
  • Top-15 bash command first-words per language — does the exploration idiom (find | grep | cat) transfer, or does the model default to Python patterns?
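The per-PR pass@k estimator and the within-PR mean Jaccard described above can be sketched as follows (function names are illustrative, not the repo's API):

```python
from itertools import combinations
from math import comb

def pass_at_k(n: int, s: int, k: int) -> float:
    """Per-PR pass@k: probability that a uniformly random size-k subset
    of the n rollouts contains at least one of the s resolved rollouts."""
    if n - s < k:  # too few failures to fill a k-subset -> certain success
        return 1.0
    return 1.0 - comb(n - s, k) / comb(n, k)

def mean_pass_at_k(per_pr_counts, k: int) -> float:
    """per_pr_counts: iterable of (n, s) pairs, one per PR."""
    vals = [pass_at_k(n, s, k) for n, s in per_pr_counts]
    return sum(vals) / len(vals)

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def within_pr_mean_jaccard(token_sets_by_pr) -> float:
    """token_sets_by_pr: dict pr_id -> list of per-rollout token sets.
    Averages Jaccard similarity over all rollout pairs within each PR,
    never across PRs (avoiding the trivially-different cross-PR pairs)."""
    sims = []
    for sets_ in token_sets_by_pr.values():
        sims.extend(jaccard(a, b) for a, b in combinations(sets_, 2))
    return sum(sims) / len(sims)
```

With n = 3 and k = 3, pass@3 reduces to "at least one of the 3 rollouts resolved", matching the bullet above; pass@1 is simply s/n per PR.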

Hypotheses

  1. Submission rate will drop for non-Python languages because the system prompt's examples and the model's SFT data are Python-biased (find . -name "*.py").
  2. command not found rate will climb for compiled / JVM / JS languages — the model will reach for go build, cargo check, tsc, gradle based on Python-side intuitions about "verify the fix".
  3. Exploration idioms will transfer: grep/cat/find/sed are language-agnostic, so the first 2–3 turns should still be templated.
  4. Resolve rate will be very low (well under 13% Python pass@1) for languages with structural build systems where a simple sed -i patch is insufficient.

Deliverables

  • New script: experiments/swe_zero/run_swe_zero_multilang.py
  • Raw rollouts: gs://marin-us-central2/experiments/swe_zero_multilang/rollouts.json
  • Per-language report: gs://.../multilang_report.json
  • HF dataset: AlienKevin/SWE-ZERO-multilang-300-trajectories in the same chat-viewer format as the Python datasets
  • Follow-up comment on this issue with the per-language table, hypotheses-check, and recommendations for the next scale step (more languages? more PRs per language? Python-leaning model swap for non-Python langs?)

Part of #4435
