[rl] Add Delphi RL MATH-500 scaling probe launcher by nikil-ravi · Pull Request #6325 · marin-community/marin

nikil-ravi · 2026-06-11T01:29:26Z

Add exp6279_delphi_rl_math500.py, a launcher for the 100-step MATH-500 RL probe over the Delphi K=0.20 midtraining ladder. It codifies two checkpoint registries from the issue. The first holds the cold-start SFT winners on the 9e19 p33m67 anchor (laion/delphi-9e19-p33m67-coldstart-magpie_lr1e5 and -wc386k_lr1e5), at pinned revisions and fetched via download_model_step. The second holds the 27 best-endpoint midtrained checkpoints for RL-zero, stored as prefix-relative paths; the launcher fails fast if the resolved executor prefix is not the us-east5 bucket where those exports live.

The MATH-500 envelope (MathEnv prompt format, boxed-answer reward contract, sampling settings, RLOO loss) is imported from llama_3_8b_rl_math500.py so probes stay comparable across scales, mixes, and SFT recipes, per the issue protocol. The launcher defaults to 100 trainer steps and takes a per-checkpoint --checkpoint registry key.
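A minimal sketch of the fail-fast prefix check described above, for readers who have not opened the diff. Everything named here is an assumption for illustration: the function name, the bucket URI placeholder, and the example relative path are hypothetical and are not taken from the launcher itself.

```python
# Hypothetical sketch of a fail-fast region guard; not the launcher's actual code.
# EXPECTED_PREFIX is a placeholder URI, not the real us-east5 bucket name.
EXPECTED_PREFIX = "gs://example-us-east5-bucket"


def resolve_midtrained_checkpoint(executor_prefix: str, relative_path: str) -> str:
    """Join a prefix-relative checkpoint path onto the executor prefix.

    Raises ValueError before any download is attempted when the prefix is
    not the bucket that holds the midtrained exports.
    """
    prefix = executor_prefix.rstrip("/")
    if prefix != EXPECTED_PREFIX:
        raise ValueError(
            f"Midtrained checkpoints only exist under {EXPECTED_PREFIX}; "
            f"got executor prefix {prefix!r}"
        )
    return f"{prefix}/{relative_path.lstrip('/')}"


if __name__ == "__main__":
    # Hypothetical relative path, used only to show the join.
    print(resolve_midtrained_checkpoint(EXPECTED_PREFIX + "/", "checkpoints/example/hf"))
```

Raising at resolution time means a run launched in the wrong region stops with a clear message rather than failing later on a missing artifact path.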

Delphi checkpoints are Qwen3-architecture HF exports that use the marin/Llama-3 tokenizer and chat template, so the vLLM inference context needed two small hooks: delphi model names now route to the Llama3 renderer (chat format) and to the levanter qwen weight mappings (architecture) in vllm_utils. Covered by a new test in tests/rl/test_inference_ctx.py.
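The two hooks above decouple chat format from architecture. The sketch below illustrates that idea with plain substring matching; the function names, return values, and matching rules are assumptions made for illustration and do not reproduce the actual logic in vllm_utils.

```python
# Hypothetical sketch: chat format and weight layout are chosen independently.
# These helpers and their string return values are illustrative only.


def renderer_for(model_name: str) -> str:
    """Pick the chat renderer from the model name."""
    name = model_name.lower()
    # Delphi uses the marin/Llama-3 tokenizer and chat template.
    if "delphi" in name or "llama" in name:
        return "llama3"
    if "qwen" in name:
        return "qwen3"
    raise ValueError(f"No renderer known for {model_name!r}")


def weight_mapping_family(model_name: str) -> str:
    """Pick the weight-mapping family from the model name."""
    name = model_name.lower()
    # Delphi checkpoints are Qwen3-architecture exports.
    if "delphi" in name or "qwen" in name:
        return "qwen"
    if "llama" in name:
        return "llama"
    raise ValueError(f"No weight mapping known for {model_name!r}")


if __name__ == "__main__":
    model = "laion/delphi-9e19-p33m67-coldstart-magpie_lr1e5"
    print(renderer_for(model), weight_mapping_family(model))
```

Under these assumptions a Delphi name is the only case where the two lookups disagree: it renders as Llama 3 chat while loading weights with the Qwen mappings.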

Also renames the Llama launcher's private _default_rl_loss to default_math500_rl_loss so the shared loss recipe can be reused.

Part of #6279

Add exp6279_delphi_rl_math500.py: a 100-step MATH-500 RL probe launcher over the Delphi K=0.20 midtraining ladder, covering both the cold-start SFT checkpoints (laion/delphi-*-coldstart-*) and the raw best-endpoint midtrained checkpoints for RL-zero. The MATH-500 envelope (MathEnv, boxed-answer reward, sampling settings, RLOO loss) is shared with the Llama 3.1 8B launcher so probes stay comparable across scales and mixes.

Teach the vLLM inference context that Delphi checkpoints are Qwen3-architecture HF exports that chat with the marin/Llama-3 template: route delphi model names to the Llama3 renderer and the levanter qwen weight mappings.

Part of #6279

Co-authored-by: Nikil Ravi <nikil-ravi@users.noreply.github.com>

Midtrained Delphi endpoints only exist in the us-east5 bucket, so the launcher now rejects RL-zero runs whose resolved executor prefix is in a different region instead of resolving a missing artifact path. The transpose-keys assertion now checks the exact qwen/llama mapping.

Co-authored-by: Nikil Ravi <nikil-ravi@users.noreply.github.com>

cursoragent and others added 2 commits June 11, 2026 01:24

nikil-ravi added the agent-generated Created by automation/agent label Jun 11, 2026 — with Cursor

nikil-ravi wants to merge 2 commits into main from cursor/fix-6279-delphi-rl-math500-70ff
