Add Levanter TPU RL rollout handoff #6214
🤖 Handoff update: Config:
Corrected Levanter timing fields: |
|
🤖 PR description refreshed with the latest tracker decisions from #6227: #6231 keeps |
|
🤖 #6240 post-rebase CI is now terminal green/skipped at head |
|
🤖 Handoff PR #6214 advanced to docs-only head |
|
🤖 CI/status update: #6214 head |
|
Claude finished @dlwh's task in 5m 9s. Code review: no issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance. Notes from the multi-agent pass:
|
🤖 Review-status update: #6214 is now non-draft and ready for maintainer review at head |
|
🤖 Status update: #6214 head |
Refresh the durable Levanter TPU RL rollout handoff so it records the current evidence and next steps from the v6e/v5p follow-ups. The handoff links the rollout epic and child issues, records the original v6e prefill-heavy target failure, captures the corrected diagnostic attribution, tracks #6185 head `d63d1edf`, and clarifies the v5p state: the first mixed run failed before measurement, the TP=8 startup-diagnostics retry failed because only 4 devices were visible, and the bounded TP=4 retry succeeded but is a throughput target failure.

The handoff also records the latest post-split v6e prefill-heavy backend=both row `/dlwh/qwen3-v6e8-prefillpostsplit-20260607-0419`, launched from #6185 `d63d1edf`: vLLM measured `1264.03` decode tok/s and `21488.50` total tok/s, while Levanter measured `957.72` decode tok/s and `16281.24` total tok/s, ratio `0.758`, target fail. The new split fields show hot prefill chunks `4096,4096,4096,4096`, prefill drain `0.171s` for `6` tokens, generation `0.629s` for `1016` tokens, generation host `0.006s`, derived generation throughput `1614.816` tok/s, decode-iteration throughput `1277.749` tok/s, and decode-device throughput `1639.896` tok/s. #6229 now needs an optimize-vs-deprioritize decision rather than another evidence row.

The v5p TP=4 retry `/dlwh/qwen3-mixed-v5p-startupdiag-tp4-i512-o512-20260607-0724`, launched from #6240 `32505d1c` stacked on #6185 `d63d1edf`, succeeded and produced benchmark rows. vLLM measured `5215.67` decode tok/s and `10431.34` total tok/s; Levanter measured `3131.59` decode tok/s and `6263.19` total tok/s, ratio `0.600`, target fail. Levanter produced all `16384` completion tokens with four prefill admissions/chunks (`4096,4096,4096,4096`), first prefill drain `0.257s`, and `3553.034` generation tok/s. #6230 now tracks the v5p throughput gap rather than startup evidence.

The surrounding rollout stack remains ready for review: #6176 (`104bf901`) carries the dense Qwen3-8B matrix harness/config and is green/skipped; #6185 (`d63d1edf`) carries the multi-prefill admission, prefill-drain scheduling, corrected timing attribution, vLLM startup-log improvements, bounded request logging, prefill-drain/generation timing split, and docs-hygiene move into `.agents/ops/`, with all visible checks green/skipped after one external HF/cache rerun; #6186 (`c4ba37b`) carries the token-native rollout API/data-plane slice and is green/skipped; #6240 (`32505d1c`) is ready for review, stacks on #6185, and adds bounded vLLM startup runtime/package/env snapshots for future v5p failures, with post-rebase checks terminal green/skipped including `levanter-tpu-tests`.

Tracker decisions captured in the epic: #6231 documents that #6186 should not be blocked on converting `PrimeIntellectEnv` to token-native; keep it OpenAI/verifiers-compatible for this stack while preserving attached response token IDs and tokenizer/policy replay identity. #6228 gates dense-matrix reruns on #6185 landing or explicit maintainer authorization, prioritizes canonical v6e decode/mixed rows, and now records the v5p mixed TP=4 row as a measured target failure rather than a startup-evidence gap.

This handoff head is `ebbe9a69d` after a docs-only correction clarifying the review/landing order and checked-in current-state head. It is terminal green/skipped, including `marin-integration`, `marin-lint`, CodeQL, ReadTheDocs, and the relevant unit/change lanes. #6214 is non-draft and review-required.

Part of #6227
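
For reviewers who want to sanity-check the reported Levanter/vLLM ratios, here is a minimal sketch that recomputes them from the decode tok/s figures quoted in this description. It is not part of the benchmark harness: the function name `throughput_ratio`, the row labels, and the rounding to three decimals are assumptions made for illustration, and the target threshold is deliberately left out because this description does not state it.

```python
def throughput_ratio(levanter_tok_s: float, vllm_tok_s: float) -> float:
    """Return Levanter throughput as a fraction of vLLM throughput.

    Hypothetical helper; rounding to 3 places is an assumption chosen to
    match the precision of the ratios quoted in the PR description.
    """
    if vllm_tok_s <= 0:
        raise ValueError("vLLM throughput must be positive")
    return round(levanter_tok_s / vllm_tok_s, 3)


# Decode tok/s values copied from the two benchmark rows in the description.
rows = {
    "v6e prefill-heavy": {"levanter_decode": 957.72, "vllm_decode": 1264.03},
    "v5p mixed TP=4": {"levanter_decode": 3131.59, "vllm_decode": 5215.67},
}

ratios = {
    name: throughput_ratio(row["levanter_decode"], row["vllm_decode"])
    for name, row in rows.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

Under these assumptions the recomputed values agree with the reported `0.758` (v6e) and `0.600` (v5p) ratios; the total tok/s figures in each row give the same ratios to three decimals.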