
[rl] Add Delphi RL MATH-500 scaling probe launcher #6325

Draft
nikil-ravi wants to merge 2 commits into main from cursor/fix-6279-delphi-rl-math500-70ff



@nikil-ravi

Contributor

Add exp6279_delphi_rl_math500.py, a launcher for the 100-step MATH-500 RL probe over the Delphi K=0.20 midtraining ladder. It codifies two checkpoint registries from the issue. The first holds the cold-start SFT winners on the 9e19 p33m67 anchor (laion/delphi-9e19-p33m67-coldstart-magpie_lr1e5 and -wc386k_lr1e5), at pinned revisions and fetched via download_model_step. The second holds the 27 best-endpoint midtrained checkpoints for RL-zero as prefix-relative paths; the launcher fails fast if the resolved executor prefix is not the us-east5 bucket where those exports live.

The MATH-500 envelope (MathEnv prompt format, boxed-answer reward contract, sampling settings, RLOO loss) is imported from llama_3_8b_rl_math500.py so probes stay comparable across scales, mixes, and SFT recipes per the issue protocol. Defaults are 100 trainer steps, and each run selects its checkpoint with a --checkpoint registry key.
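As an illustration of the fail-fast region check described above, here is a minimal sketch. The names `REQUIRED_REGION` and `resolve_midtrained_path`, and the example bucket prefixes, are hypothetical and chosen for this example; the launcher's actual registry layout and bucket names are not shown in this description.

```python
# Hypothetical sketch of a fail-fast prefix check; not the launcher's actual code.
REQUIRED_REGION = "us-east5"  # region named in the PR description


def resolve_midtrained_path(executor_prefix: str, relative_path: str) -> str:
    """Join a prefix-relative checkpoint path onto the executor prefix.

    Raises ValueError up front when the prefix does not look like a
    us-east5 bucket, instead of returning a path to a missing artifact.
    """
    if REQUIRED_REGION not in executor_prefix:
        raise ValueError(
            f"Midtrained checkpoints only exist in {REQUIRED_REGION}; "
            f"got executor prefix {executor_prefix!r}"
        )
    # Normalize slashes so the join yields exactly one separator.
    return f"{executor_prefix.rstrip('/')}/{relative_path.lstrip('/')}"
```

Raising before any path is built means a run launched from the wrong region stops at configuration time rather than after workers try to load a nonexistent checkpoint.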

Delphi checkpoints are Qwen3-architecture HF exports that use the marin/Llama-3 tokenizer and chat template, so the vLLM inference context needed two small hooks: delphi model names now route to the Llama3 renderer (chat format) and to the levanter qwen weight mappings (architecture) in vllm_utils. Covered by a new test in tests/rl/test_inference_ctx.py.
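The two hooks amount to choosing the chat renderer and the weight mapping independently of each other. A rough sketch of that routing follows; `InferenceRouting`, `route_model`, and the string labels are hypothetical stand-ins, not the identifiers used in vllm_utils.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceRouting:
    renderer: str        # which chat template formats the prompts
    weight_mapping: str  # which architecture's weight-name mapping applies


def route_model(model_name: str) -> InferenceRouting:
    """Pick renderer and weight mapping from a model name (illustrative only)."""
    name = model_name.lower()
    # Check delphi first: it pairs Llama-3 chat formatting with Qwen weights,
    # so neither of the single-family branches below would be correct for it.
    if "delphi" in name:
        return InferenceRouting(renderer="llama3", weight_mapping="qwen")
    if "qwen" in name:
        return InferenceRouting(renderer="qwen3", weight_mapping="qwen")
    if "llama" in name:
        return InferenceRouting(renderer="llama3", weight_mapping="llama")
    raise ValueError(f"No routing known for model {model_name!r}")
```

Keeping the two fields separate is what lets a checkpoint mix one family's template with another family's architecture.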

Also renames the Llama launcher's private _default_rl_loss to default_math500_rl_loss so the shared loss recipe can be reused.
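For readers unfamiliar with the RLOO loss that the shared recipe refers to, the sketch below shows the generic leave-one-out baseline: each sample's advantage is its reward minus the mean reward of the other samples for the same prompt. This is the textbook form under that assumption, and `rloo_advantages` is a hypothetical helper; the actual contents of default_math500_rl_loss are not shown in this description.

```python
def rloo_advantages(rewards: list[float]) -> list[float]:
    """Leave-one-out advantages for a group of samples from one prompt."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # Baseline for sample i is the mean of the other n - 1 rewards.
    return [r - (total - r) / (n - 1) for r in rewards]
```

With binary boxed-answer rewards, correct samples receive positive advantages and incorrect ones negative, and the advantages within a group sum to zero.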

Part of #6279


cursoragent and others added 2 commits June 11, 2026 01:24
Add exp6279_delphi_rl_math500.py: a 100-step MATH-500 RL probe launcher
over the Delphi K=0.20 midtraining ladder, covering both the cold-start
SFT checkpoints (laion/delphi-*-coldstart-*) and the raw best-endpoint
midtrained checkpoints for RL-zero. The MATH-500 envelope (MathEnv,
boxed-answer reward, sampling settings, RLOO loss) is shared with the
Llama 3.1 8B launcher so probes stay comparable across scales and mixes.

Teach the vLLM inference context that Delphi checkpoints are
Qwen3-architecture HF exports that chat with the marin/Llama-3 template:
route delphi model names to the Llama3 renderer and the levanter qwen
weight mappings.

Part of #6279

Co-authored-by: Nikil Ravi <nikil-ravi@users.noreply.github.com>

Midtrained Delphi endpoints only exist in the us-east5 bucket, so the
launcher now rejects RL-zero runs whose resolved executor prefix is a
different region instead of resolving a missing artifact path. The
transpose-keys assertion now checks the exact qwen/llama mapping.

Co-authored-by: Nikil Ravi <nikil-ravi@users.noreply.github.com>
@nikil-ravi added the agent-generated label (Created by automation/agent) on Jun 11, 2026, with Cursor