[rl] Add Delphi RL MATH-500 scaling probe launcher #6325
Draft
nikil-ravi wants to merge 2 commits into
Add exp6279_delphi_rl_math500.py: a 100-step MATH-500 RL probe launcher over the Delphi K=0.20 midtraining ladder, covering both the cold-start SFT checkpoints (laion/delphi-*-coldstart-*) and the raw best-endpoint midtrained checkpoints for RL-zero. The MATH-500 envelope (MathEnv, boxed-answer reward, sampling settings, RLOO loss) is shared with the Llama 3.1 8B launcher so probes stay comparable across scales and mixes.

Teach the vLLM inference context that Delphi checkpoints are Qwen3-architecture HF exports that chat with the marin/Llama-3 template: route delphi model names to the Llama3 renderer and the levanter qwen weight mappings.

Part of #6279

Co-authored-by: Nikil Ravi <nikil-ravi@users.noreply.github.com>
Midtrained Delphi endpoints exist only in the us-east5 bucket, so the launcher now rejects RL-zero runs whose resolved executor prefix is in a different region, instead of resolving a path to a missing artifact. The transpose-keys assertion now checks the exact qwen/llama mapping.

Co-authored-by: Nikil Ravi <nikil-ravi@users.noreply.github.com>
Add exp6279_delphi_rl_math500.py, a launcher for the 100-step MATH-500 RL probe over the Delphi K=0.20 midtraining ladder. It codifies two checkpoint registries from the issue:

- the cold-start SFT winners on the 9e19 p33m67 anchor (laion/delphi-9e19-p33m67-coldstart-magpie_lr1e5 and -wc386k_lr1e5, pinned revisions, fetched via download_model_step);
- the 27 best-endpoint midtrained checkpoints for RL-zero (prefix-relative paths; the launcher fails fast if the resolved executor prefix is not the us-east5 bucket where those exports live).

The MATH-500 envelope (MathEnv prompt format, boxed-answer reward contract, sampling settings, RLOO loss) is imported from llama_3_8b_rl_math500.py so probes stay comparable across scales, mixes, and SFT recipes per the issue protocol. The launcher defaults to 100 trainer steps, and each run selects its checkpoint with a --checkpoint registry key.
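The fail-fast region check described above could look roughly like the following. This is a minimal sketch, not the launcher's actual code: the function name `resolve_midtrained_path`, the constant `EXPECTED_REGION`, and the example bucket names are hypothetical, and the real launcher may identify the region differently.

```python
# Hypothetical sketch of a fail-fast region guard for prefix-relative paths.
# None of these names are taken from exp6279_delphi_rl_math500.py.

EXPECTED_REGION = "us-east5"  # region named in the PR description


def resolve_midtrained_path(executor_prefix: str, relative_path: str) -> str:
    """Join a prefix-relative checkpoint path onto the executor prefix.

    Raises ValueError up front when the prefix does not look like the
    us-east5 bucket, rather than returning a path to a missing artifact.
    """
    if EXPECTED_REGION not in executor_prefix:
        raise ValueError(
            f"Midtrained checkpoints only exist in {EXPECTED_REGION}; "
            f"got executor prefix {executor_prefix!r}"
        )
    return f"{executor_prefix.rstrip('/')}/{relative_path.lstrip('/')}"
```

Raising before the join means a misconfigured region surfaces at launch time, not later as a missing-file error inside the RL job.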
Delphi checkpoints are Qwen3-architecture HF exports that use the marin/Llama-3 tokenizer and chat template, so the vLLM inference context needed two small hooks: delphi model names now route to the Llama3 renderer (chat format) and to the levanter qwen weight mappings (architecture) in vllm_utils. Covered by a new test in tests/rl/test_inference_ctx.py.
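To illustrate the two hooks, here is a hedged sketch of name-based routing. The function `pick_renderer_and_mapping` and its string labels are invented for illustration; the actual dispatch lives in the inference context and vllm_utils, and its names and matching rules may differ.

```python
# Hypothetical sketch: choose chat renderer and weight mapping from a model name.
# The labels "llama3", "qwen3", "qwen", and "llama" are illustrative placeholders.


def pick_renderer_and_mapping(model_name: str) -> tuple[str, str]:
    """Return (renderer, weight_mapping) labels for a model name."""
    name = model_name.lower()
    if "delphi" in name:
        # Delphi: Qwen3 architecture, but marin/Llama-3 tokenizer and chat template.
        return ("llama3", "qwen")
    if "qwen" in name:
        return ("qwen3", "qwen")
    if "llama" in name:
        return ("llama3", "llama")
    raise ValueError(f"No renderer/mapping rule for model name {model_name!r}")
```

The Delphi branch is checked first so that the chat format and the architecture are chosen independently instead of both being inferred from a single family name.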
Also renames the Llama launcher's private _default_rl_loss to default_math500_rl_loss so the shared loss recipe can be reused.
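For readers unfamiliar with RLOO, the sketch below shows the generic leave-one-out baseline that the name refers to: each sample's advantage is its reward minus the mean reward of the other samples for the same prompt. It is an assumption-level illustration of the general estimator, not the contents of default_math500_rl_loss, and the function name `rloo_advantages` is hypothetical.

```python
# Generic RLOO (REINFORCE leave-one-out) advantage computation.
# Illustrative only; not taken from the launcher or its loss recipe.


def rloo_advantages(rewards: list[float]) -> list[float]:
    """Compute leave-one-out advantages for k samples of one prompt."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # Baseline for sample i is the mean reward of the other k - 1 samples.
    return [r - (total - r) / (k - 1) for r in rewards]
```

With a binary boxed-answer reward, correct samples get positive advantages and incorrect ones negative, and the advantages for a prompt sum to zero.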
Part of #6279