Scale the execution-free SWE-ZERO recipe (#4561, now complete for Python) to all 20 programming languages in SWE-rebench V2.
Background
The SWE-ZERO MVP produced 1,000 high-quality agentic rollouts on 100 Python PRs from SWE-rebench V2 in 38 minutes on a v6e-8 worker. Final recipe: `ricdomolm/mini-coder-1.7b` + mini-swe-agent v1 scaffold + PATH-whitelist bash sandbox + 32K context. Results at #4561 (closing comment) and HF dataset `AlienKevin/SWE-ZERO-1k-trajectories-32k`.
SWE-rebench V2 has 32,079 instances across 20 languages:
| language | PRs available | language | PRs available |
|---|---|---|---|
| python | 7,243 | swift | 362 |
| go | 6,144 | dart | 251 |
| ts (TypeScript) | 4,204 | c | 230 |
| js (JavaScript) | 4,138 | cpp | 182 |
| rust | 3,123 | csharp | 173 |
| java | 1,716 | r | 157 |
| php | 1,445 | clojure | 105 |
| kotlin | 889 | ocaml | 58 |
| julia | 793 | lua | 39 |
| elixir | 416 | scala | 411 |
Goal
Get a first look at how the Python recipe generalizes to other languages — specifically whether `mini-coder-1.7b` (trained primarily on Python data) still produces grounded multi-turn bash trajectories when pointed at Go, Rust, TypeScript, etc.
Sampling plan
- 100 PRs total spanning all 20 languages, with as much coverage as possible
- 3 rollouts per PR (300 rollouts total) — chosen to keep the experiment cheap while still letting us compute a meaningful per-PR submission rate
- 5 PRs per language for each of the 20 languages. This gives uniform coverage regardless of how many instances each language has in SWE-rebench V2 (some languages have only 39–58)
- Use `seed=7` for deterministic sampling per language, matching prior steps
- Same filter: only include PRs where the base repo can be shallow-cloned and `test_patch` applies cleanly (if any don't, fall back to the next PR in that language's seeded sample)
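The sampling rules above can be sketched as follows. This is a minimal illustration, not the real loader: `clone_and_apply_ok` is a hypothetical stand-in for the shallow-clone + `test_patch` filter, and the actual implementation presumably lives behind `SWERebenchV2Loader`.

```python
import random

def clone_and_apply_ok(pr) -> bool:
    # Hypothetical stand-in for the real filter: the base repo shallow-clones
    # and test_patch applies cleanly. Here we accept everything.
    return True

def sample_prs(instances_by_lang, per_lang=5, seed=7):
    """Deterministic per-language sampling with fallback past unusable PRs."""
    picked = {}
    for lang, prs in instances_by_lang.items():
        # A seeded shuffle gives a deterministic per-language order; walking it
        # front to back implements the "fall back to the next PR" rule.
        order = random.Random(seed).sample(prs, len(prs))
        usable = [p for p in order if clone_and_apply_ok(p)]
        picked[lang] = usable[:per_lang]
    return picked
```

Because each language gets its own `random.Random(seed)`, re-running the script reproduces exactly the same 100-PR sample.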
A new entry point `run_swe_zero_multilang.py` will be added that drops the `language_filter="python"` in `SWERebenchV2Loader` and samples 5 PRs per language instead of 10 rollouts × 10 PRs × 10 repos.
Recipe (unchanged from the MVP final config)

```shell
uv run iris --cluster marin job run \
  --job-name swe-zero-multilang \
  --tpu v6e-4,v6e-8 --enable-extra-resources \
  --cpu 16 --memory 24GB --disk 60GB \
  --extra vllm \
  --env-vars VLLM_TPU_SKIP_PRECOMPILE 1 \
  --env-vars VLLM_ALLOW_LONG_MAX_MODEL_LEN 1 \
  --env-vars VLLM_TPU_DISABLE_TOPK_TOPP_OPTIMIZATION 1 \
  --env-vars MARIN_VLLM_MODE native \
  --env-vars HF_TOKEN $HF_TOKEN \
  -- python experiments/swe_zero/run_swe_zero_multilang.py --local \
    --model ricdomolm/mini-coder-1.7b \
    --n-prs 100 --n-rollouts 3 \
    --tensor-parallel-size 4 --max-num-seqs 256 \
    --max-model-len 32768 --max-total-tokens 32768 \
    --concurrency 64 \
    --output_dir gs://marin-us-central2/experiments/swe_zero_multilang
```

Estimated wall time: 300 rollouts × ~2.3 s/rollout (at 1.83 rps, v6e-4 TP=4) ≈ 7–10 min.

Metrics to report

Per language (and aggregate):
- Submission rate (fraction of rollouts exiting via `echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT`)
- Estimated resolve rate at empirical pass@1 and pass@3 — replay each sampled rollout in a fresh worktree to recover the model's effective source-code diff, then LLM-judge it against the SWE-rebench V2 `gold_patch` for semantic equivalence (same conservative sub-agent judge as #4561's resolve-rate sample). For each PR with `s` resolved rollouts out of `n = 3`, compute `1 - C(n-s, k) / C(n, k)` and average across PRs to get pass@k. Report:
  - Resolved pass@1 (fraction of individual rollouts that produce a correct fix)
  - Resolved pass@3 (fraction of PRs with at least one correct fix among their 3 rollouts)
  - The Python MVP showed ~13% pass@1 and ~20% pass@10 — the multilang point is to see how those numbers degrade per language
- Mean turns per rollout, mean completion tokens per rollout
- `command not found` observations — does the model try language-specific toolchain binaries (`node`, `rustc`, `go`, `gradle`, …) that aren't on the PATH whitelist?
- Diversity: corpus-wide AND within-PR mean Jaccard (the within-PR breakdown is the honest one — the corpus-wide number is dominated by trivially different cross-PR pairs)
- Top-15 bash command first-words per language — does the exploration idiom (`find` | `grep` | `cat`) transfer, or does the model default to Python patterns?
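The analysis helpers implied by these metrics can be sketched as below. The function names and data shapes are my own, not from the pipeline; only the pass@k formula itself comes from the plan above.

```python
from collections import Counter
from itertools import combinations
from math import comb

def pass_at_k(n: int, s: int, k: int) -> float:
    # Unbiased estimator: probability that k draws without replacement
    # from n rollouts include at least one of the s resolved ones.
    if n - s < k:
        return 1.0
    return 1.0 - comb(n - s, k) / comb(n, k)

def mean_pass_at_k(resolved_counts, n=3, k=1):
    # Average the per-PR estimator across PRs.
    return sum(pass_at_k(n, s, k) for s in resolved_counts) / len(resolved_counts)

def within_pr_jaccard(command_sets):
    # Mean pairwise Jaccard similarity among one PR's rollouts,
    # each represented as a set of bash commands.
    sims = [len(a & b) / len(a | b) if a | b else 1.0
            for a, b in combinations(command_sets, 2)]
    return sum(sims) / len(sims)

def top_first_words(commands, k=15):
    # Histogram of the first word of each bash command.
    return Counter(c.split()[0] for c in commands if c.strip()).most_common(k)
```

For example, with `n=3` and `s=1`, `pass_at_k` gives 1/3 at k=1 and 1.0 at k=3, matching the `1 - C(n-s, k)/C(n, k)` formula.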
Hypotheses
1. Submission rate will drop for non-Python languages, because the system prompt's examples and the model's SFT data are Python-biased (`find . -name "*.py"`).
2. `command not found` rate will climb for compiled / JVM / JS languages — the model will reach for `go build`, `cargo check`, `tsc`, `gradle` based on Python-side intuitions about "verify the fix".
3. Exploration idioms will transfer — `grep`/`cat`/`find`/`sed` are language-agnostic, so the first 2–3 turns should still be templated.
4. Resolve rate will be very low (well under the ~13% Python pass@1) for languages with structural build systems, where a simple `sed -i` patch is insufficient.
Deliverables
- New script: `experiments/swe_zero/run_swe_zero_multilang.py`
- Raw rollouts: `gs://marin-us-central2/experiments/swe_zero_multilang/rollouts.json`
- Aggregated metrics report: `gs://.../multilang_report.json`
- HF dataset: `AlienKevin/SWE-ZERO-multilang-300-trajectories` in the same chat-viewer format as the Python datasets
- Follow-up comment on this issue with the per-language table, a hypotheses check, and recommendations for the next scale step (more languages? more PRs per language? swapping the Python-leaning model for non-Python languages?)
Part of #4435