Explain and reduce the v6e prefill-heavy gap for prefill_b8_i2048_o128_n1.
Historical comparison row: Levanter 1006.83 decode tok/s and 17116.08 total tok/s versus vLLM 1262.90 decode tok/s and 21469.31 total tok/s, ratio 0.797. That measured Levanter row had prefill chunks 4096,4096,4096,4096 and one decode iteration with 0.802s total, 0.623s device, 0.178s host, and 0.174s submit.
The corrected Levanter-only diagnostic completed at /dlwh/qwen3-v6e8-prefilldiag-drain-prefill-b8-i2048-o128-n1-20260606-1532 from PR #6185 commit 82bb6dbfb. It validated diagnostic prefill drain and produced 1022 decode iteration tokens for normal, no_lm_head, and lm_head_no_sampling rows.
Important interpretation correction: production decode_submit_seconds was previously measured from the start of the outer iteration before prefill drain, so the 0.174s/0.184s submit field included prefill admission work. PR #6185 commit dd993c03c moved the submit timer to immediately before _run_generation_loop(...), commit 6d74fc7ea added decode iteration/device throughput fields, and commit f55913d1 now separates prefill drain from the generation loop with prefill_drain_*, generation_*, and derived generation_tokens_per_second benchmark fields.
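The effect of the timer placement can be illustrated with a minimal sketch. The functions `drain_prefill` and `submit_generation` are hypothetical stand-ins simulated with `time.sleep`; they are not Levanter's code, and the sleep durations are arbitrary. The sketch only shows that starting the timer before prefill drain folds the drain time into the "submit" measurement.

```python
import time


def drain_prefill() -> None:
    # Hypothetical stand-in for prefill admission/drain work.
    time.sleep(0.05)


def submit_generation() -> None:
    # Hypothetical stand-in for the host-side submit of the generation loop.
    time.sleep(0.005)


def timed_iteration(timer_before_drain: bool) -> float:
    """Return the measured 'submit' seconds for one outer iteration."""
    if timer_before_drain:
        start = time.perf_counter()  # old placement: measurement includes prefill drain
        drain_prefill()
    else:
        drain_prefill()
        start = time.perf_counter()  # corrected placement: just before the generation loop
    submit_generation()
    return time.perf_counter() - start


old_submit = timed_iteration(timer_before_drain=True)
new_submit = timed_iteration(timer_before_drain=False)
print(f"old placement: {old_submit:.3f}s, corrected placement: {new_submit:.3f}s")
```

With the old placement, the reported submit time is dominated by the simulated drain, which mirrors how the 0.174s submit field shrank to 0.002s once the timer moved.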
Corrected backend=both row: /dlwh/qwen3-v6e8-prefillcorr-20260607-0023 completed successfully from #6185 head 91f6ec06a. vLLM measured 1212.15 decode tok/s and 20606.54 total tok/s. Levanter measured 898.94 decode tok/s and 15282.00 total tok/s, ratio 0.742, target fail. The corrected Levanter fields available in that run show prefill chunks 4096,4096,4096,4096, prefill admission wall times 0.060,0.055,0.055,0.055s, decode iteration 0.852s total / 0.677s device / 0.174s host / 0.002s submit / 0.002s extract, 1022 decode iteration tokens, 1200.223 decode-iteration tok/s, and 1509.350 device tok/s. Because that row predates f55913d1, it does not yet include the new generation-only token/time split.
Post-f55913d1 backend=both row: /dlwh/qwen3-v6e8-prefillpostsplit-20260607-0419 completed successfully from #6185 head d63d1edf. vLLM measured 1264.03 decode tok/s and 21488.50 total tok/s. Levanter measured 957.72 decode tok/s and 16281.24 total tok/s, ratio 0.758, target fail. The new split shows hot prefill chunks 4096,4096,4096,4096 with prefill wall times 0.060,0.056,0.056,0.056s, decode iteration 0.800s total / 0.623s device / 0.177s host / 0.002s submit / 0.002s extract, 1022 decode iteration tokens, 0.171s prefill drain for 6 tokens, and 0.629s generation for 1016 tokens. Derived throughput: 1277.749 decode-iteration tok/s, 1639.896 decode-device tok/s, and 1614.816 generation tok/s.
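As a sanity check, the derived throughput values can be approximately reproduced from the rounded timings quoted above. This is a sketch that assumes each throughput is simply tokens divided by seconds; the small differences from the reported numbers are expected because the inputs here are rounded to three decimals. The dictionary keys are illustrative labels, not benchmark field names.

```python
# Rounded per-iteration values reported for the post-f55913d1 Levanter row.
decode_iteration_tokens = 1022
decode_iteration_seconds = 0.800
decode_device_seconds = 0.623
prefill_drain_seconds, prefill_drain_tokens = 0.171, 6
generation_seconds, generation_tokens = 0.629, 1016

# The prefill-drain/generation split should account for the whole decode iteration.
assert prefill_drain_tokens + generation_tokens == decode_iteration_tokens
assert abs(prefill_drain_seconds + generation_seconds - decode_iteration_seconds) < 1e-9

# Assumption: throughput = tokens / seconds for each reported quantity.
derived = {
    "decode_iteration_tok_s": decode_iteration_tokens / decode_iteration_seconds,
    "decode_device_tok_s": decode_iteration_tokens / decode_device_seconds,
    "generation_tok_s": generation_tokens / generation_seconds,
}
reported = {
    "decode_iteration_tok_s": 1277.749,
    "decode_device_tok_s": 1639.896,
    "generation_tok_s": 1614.816,
}

for name, value in derived.items():
    rel_err = abs(value - reported[name]) / reported[name]
    print(f"{name}: derived {value:.1f}, reported {reported[name]:.3f}, rel. diff {rel_err:.4%}")
```

The split is internally consistent: 6 + 1016 tokens and 0.171s + 0.629s add up to the 1022-token, 0.800s decode iteration.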
Follow-up code changes:
PR #6185 ([levanter] Add multi-prefill admission for serving) commit f7275e5a stops logging full InferenceRequest dataclasses at INFO and logs bounded request metadata instead. In the prefill-heavy measured row, the old path serialized eight 2048-token prompt lists before the first prefill admission, which was a concrete source of host-side overhead on the measured path. Validation passed with inference-server pytest (19 passed, 1 skipped) plus focused pre-commit/Pyrefly on the touched files.
PR #6185 commit f55913d1 adds per-iteration prefill_drain_seconds_per_iteration, prefill_drain_tokens_per_iteration, generation_seconds_per_iteration, generation_host_seconds_per_iteration, generation_tokens_per_iteration, and generation_tokens_per_second, so that future rows for #6229 ([levanter] Optimize v6e prefill-heavy decode gap) do not conflate prefill-drain work with generation-loop throughput. Validation passed with test_engine.py and test_qwen3_tpu_inference_parity_bench.py (63 passed) plus focused pre-commit/Pyrefly.
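The cost difference between logging a full request and logging bounded metadata can be sketched as follows. `SketchRequest` and `bounded_summary` are hypothetical names invented for this illustration; they do not reflect the fields of Levanter's `InferenceRequest` or the exact metadata chosen in f7275e5a.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("serving_sketch")


@dataclass
class SketchRequest:
    """Hypothetical stand-in for a serving request; not Levanter's InferenceRequest."""

    request_id: int
    prompt_tokens: list[int] = field(default_factory=list)
    max_new_tokens: int = 128


def bounded_summary(req: SketchRequest) -> str:
    # The summary size does not grow with prompt length.
    return (
        f"request_id={req.request_id} "
        f"prompt_len={len(req.prompt_tokens)} "
        f"max_new_tokens={req.max_new_tokens}"
    )


# Eight 2048-token prompts, matching the shape of the prefill-heavy row.
requests = [SketchRequest(i, list(range(2048))) for i in range(8)]

# Formatting the whole dataclass serializes every prompt token.
full_chars = sum(len(repr(r)) for r in requests)
bounded_chars = sum(len(bounded_summary(r)) for r in requests)

for r in requests:
    logger.info("admitting %s", bounded_summary(r))

print(f"full repr: {full_chars} chars, bounded metadata: {bounded_chars} chars")
```

The full `repr` scales with the total number of prompt tokens, while the bounded summary stays at a few dozen characters per request regardless of prompt length.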
Current interpretation: the remaining failure is attributed to end-to-end prefill-heavy wall-clock overhead rather than a proven decode-device regression. The issue remains open as a performance target because the post-f55913d1 row still fails the end-to-end parity target at 0.758, even though generation-only throughput is above the vLLM decode row. The remaining question is whether to optimize hot prefill/admission/serving wall-clock for this weak prefill-heavy regime, or deprioritize it relative to the core RL rollout rows.
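The two comparisons behind this interpretation can be checked directly from the reported numbers. This sketch assumes the parity ratio is Levanter decode tok/s divided by vLLM decode tok/s, which reproduces the reported ratios; the row labels are illustrative, and no target threshold is assumed because none is stated here.

```python
# (Levanter decode tok/s, vLLM decode tok/s, reported ratio) for each backend comparison row.
rows = {
    "historical": (1006.83, 1262.90, 0.797),
    "corrected": (898.94, 1212.15, 0.742),
    "post_split": (957.72, 1264.03, 0.758),
}

# Assumption: ratio = Levanter decode tok/s / vLLM decode tok/s.
ratios = {name: lev / vllm for name, (lev, vllm, _) in rows.items()}
for name, ratio in ratios.items():
    print(f"{name}: ratio {ratio:.3f} (reported {rows[name][2]})")

# Generation-only throughput from the post-split row versus the vLLM decode row.
generation_tok_s = 1614.816
vllm_decode_tok_s = rows["post_split"][1]
generation_exceeds_vllm = generation_tok_s > vllm_decode_tok_s
print(f"generation-only above vLLM decode: {generation_exceeds_vllm}")
```

The end-to-end ratio stays below 0.8 in all three rows, while the generation-only figure exceeds the vLLM decode figure, which is why the gap is attributed to wall-clock overhead around the generation loop rather than to the loop itself.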
Definition of Done
Rerun one corrected Levanter-only diagnostic on the fixed harness.
Attribute the original row well enough to rule out a false decode-kernel conclusion: the old submit timing was contaminated by prefill-drain work, and the new benchmark fields separate prefill-drain, generation-loop, device, host, submit, and extraction timing.
Collect a post-f55913d1 row for prefill_b8_i2048_o128_n1 if this regime remains a priority, so the result includes generation_* and prefill_drain_* fields after bounded request logging.
Decide the next optimization or close/deprioritize this issue based on that post-f55913d1 row: prefill/host accounting, page/admission settings, or no further work if the row is acceptable for RL rollout priorities.
Update the epic and handoff with terminal job results from the completed diagnostics/reruns.
Current Next Step
Do not launch another TPU run automatically. The next step is a decision, not more evidence collection: either optimize the remaining hot prefill/admission/serving wall-clock overhead for this regime, or close/deprioritize it if core RL rollout rows matter more than prefill-heavy parity. The post-split row should be interpreted using generation_tokens_per_second=1614.816, prefill_drain_seconds_per_iteration=0.171, and generation_host_seconds_per_iteration=0.006, not only the end-to-end decode_tokens_per_second=957.72.