[levanter] Optimize v6e prefill-heavy decode gap #6229

@dlwh

Description

Explain and reduce the v6e prefill-heavy gap for prefill_b8_i2048_o128_n1.

Historical comparison row: Levanter measured 1006.83 decode tok/s and 17116.08 total tok/s versus vLLM at 1262.90 decode tok/s and 21469.31 total tok/s, a Levanter/vLLM ratio of 0.797. That measured Levanter row had prefill chunks 4096,4096,4096,4096 and one decode iteration with 0.802s total, 0.623s device, 0.178s host, and 0.174s submit.

The corrected Levanter-only diagnostic completed at /dlwh/qwen3-v6e8-prefilldiag-drain-prefill-b8-i2048-o128-n1-20260606-1532 from PR #6185 commit 82bb6dbfb. It validated diagnostic prefill drain and produced 1022 decode iteration tokens for normal, no_lm_head, and lm_head_no_sampling rows.

Important interpretation correction: production decode_submit_seconds was previously measured from the start of the outer iteration before prefill drain, so the 0.174s/0.184s submit field included prefill admission work. PR #6185 commit dd993c03c moved the submit timer to immediately before _run_generation_loop(...), commit 6d74fc7ea added decode iteration/device throughput fields, and commit f55913d1 now separates prefill drain from the generation loop with prefill_drain_*, generation_*, and derived generation_tokens_per_second benchmark fields.
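The timer move described above can be sketched as follows. This is a minimal illustration, not the Levanter code: `time_iteration`, `drain_prefill`, `submit_generation`, and `FakeClock` are hypothetical names, and the 0.171s / 0.002s durations are only borrowed from the post-split row as plausible inputs. It shows why a timer started at the top of the outer iteration reports prefill-drain time as submit time.

```python
import time
from typing import Callable, Dict


def time_iteration(
    drain_prefill: Callable[[], None],
    submit_generation: Callable[[], None],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time one outer iteration with the old and the corrected submit timers."""
    iteration_start = clock()   # old submit timer started here
    drain_prefill()             # prefill admission / drain work
    submit_start = clock()      # corrected timer: immediately before the generation loop
    submit_generation()         # stand-in for submitting the generation loop
    submit_end = clock()
    return {
        "contaminated_submit_seconds": submit_end - iteration_start,
        "prefill_drain_seconds": submit_start - iteration_start,
        "submit_seconds": submit_end - submit_start,
    }


class FakeClock:
    """Deterministic clock so the sketch does not depend on real timing."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


clock = FakeClock()
timings = time_iteration(
    drain_prefill=lambda: clock.advance(0.171),      # illustrative drain duration
    submit_generation=lambda: clock.advance(0.002),  # illustrative submit duration
    clock=clock,
)
print(timings)
```

With these inputs the old-style timer reports nearly all of the drain time as submit time, while the corrected timer reports only the submit itself.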

Corrected backend=both row: /dlwh/qwen3-v6e8-prefillcorr-20260607-0023 completed successfully from #6185 head 91f6ec06a. vLLM measured 1212.15 decode tok/s and 20606.54 total tok/s. Levanter measured 898.94 decode tok/s and 15282.00 total tok/s, ratio 0.742, target fail. The corrected Levanter fields available in that run show prefill chunks 4096,4096,4096,4096, prefill admission wall times 0.060,0.055,0.055,0.055s, decode iteration 0.852s total / 0.677s device / 0.174s host / 0.002s submit / 0.002s extract, 1022 decode iteration tokens, 1200.223 decode-iteration tok/s, and 1509.350 device tok/s. Because that row predates f55913d1, it does not yet include the new generation-only token/time split.

Post-f55913d1 backend=both row: /dlwh/qwen3-v6e8-prefillpostsplit-20260607-0419 completed successfully from #6185 head d63d1edf. vLLM measured 1264.03 decode tok/s and 21488.50 total tok/s. Levanter measured 957.72 decode tok/s and 16281.24 total tok/s, ratio 0.758, target fail. The new split shows hot prefill chunks 4096,4096,4096,4096 with prefill wall times 0.060,0.056,0.056,0.056s, decode iteration 0.800s total / 0.623s device / 0.177s host / 0.002s submit / 0.002s extract, 1022 decode iteration tokens, 0.171s prefill drain for 6 tokens, and 0.629s generation for 1016 tokens. Derived throughput: 1277.749 decode-iteration tok/s, 1639.896 decode-device tok/s, and 1614.816 generation tok/s.
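The derived throughput figures in the post-split row are tokens divided by seconds for each timing window. The sketch below recomputes them from the rounded values quoted in this issue; the dictionary keys and the `tokens_per_second` helper are hypothetical, not benchmark field names. Because the inputs are rounded to three decimals, the results agree with the reported values only to within about one tok/s.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Throughput for one timing window."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


# Rounded values copied from the post-f55913d1 row above.
row = {
    "decode_iteration_tokens": 1022,
    "decode_iteration_seconds": 0.800,
    "decode_device_seconds": 0.623,
    "prefill_drain_tokens": 6,
    "prefill_drain_seconds": 0.171,
    "generation_tokens": 1016,
    "generation_seconds": 0.629,
}

# The split should account for the whole decode iteration.
split_tokens = row["prefill_drain_tokens"] + row["generation_tokens"]
split_seconds = row["prefill_drain_seconds"] + row["generation_seconds"]

derived = {
    "decode_iteration_tps": tokens_per_second(
        row["decode_iteration_tokens"], row["decode_iteration_seconds"]
    ),
    "decode_device_tps": tokens_per_second(
        row["decode_iteration_tokens"], row["decode_device_seconds"]
    ),
    "generation_tps": tokens_per_second(
        row["generation_tokens"], row["generation_seconds"]
    ),
}

for name, value in derived.items():
    print(f"{name}: {value:.1f}")
```

The split check confirms that 6 + 1016 tokens and 0.171s + 0.629s add up to the 1022-token, 0.800s decode iteration, so the generation-only figure excludes exactly the prefill-drain window.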

Follow-up code changes:

  • PR #6185 ([levanter] Add multi-prefill admission for serving), commit f7275e5a, stops logging full InferenceRequest dataclasses at INFO and logs bounded request metadata instead. In the prefill-heavy measured row, the old path serialized eight 2048-token prompt lists before the first prefill admission, which was a concrete source of host-side overhead on the measured path. Validation passed with inference-server pytest (19 passed, 1 skipped) plus focused pre-commit/Pyrefly on the touched files.
  • PR #6185, commit f55913d1, adds the per-iteration fields prefill_drain_seconds_per_iteration, prefill_drain_tokens_per_iteration, generation_seconds_per_iteration, generation_host_seconds_per_iteration, generation_tokens_per_iteration, and generation_tokens_per_second, so that future #6229 rows do not conflate prefill-drain work with generation-loop throughput. Validation passed with test_engine.py and test_qwen3_tpu_inference_parity_bench.py (63 passed) plus focused pre-commit/Pyrefly.
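The bounded-logging change in the first bullet can be illustrated with a small sketch. The real InferenceRequest fields are not shown in this issue, so `SketchRequest`, its fields, and `bounded_metadata` are hypothetical stand-ins; the point is only that logging a short summary avoids formatting the full 2048-token prompt list for each request.

```python
import logging
from dataclasses import dataclass
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("bounded_request_logging_sketch")


@dataclass
class SketchRequest:
    """Hypothetical stand-in for a request; not the Levanter dataclass."""

    request_id: int
    prompt_tokens: List[int]
    max_new_tokens: int


def bounded_metadata(req: SketchRequest) -> Dict[str, int]:
    """Summarize a request in constant size, independent of prompt length."""
    return {
        "request_id": req.request_id,
        "prompt_len": len(req.prompt_tokens),
        "max_new_tokens": req.max_new_tokens,
    }


# Eight 2048-token prompts, matching the shape of the prefill-heavy row.
requests = [SketchRequest(i, list(range(2048)), 128) for i in range(8)]

for req in requests:
    # Logging the dataclass itself would format the whole prompt list at INFO.
    logger.info("admitting request %s", bounded_metadata(req))

full_repr_chars = len(repr(requests[0]))
bounded_repr_chars = len(repr(bounded_metadata(requests[0])))
print(full_repr_chars, bounded_repr_chars)
```

The two printed lengths compare the size of the full dataclass repr with the bounded summary for one request.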

Current interpretation: the remaining failure is attributed to end-to-end prefill-heavy wall-clock overhead rather than a proven decode-device regression. The issue remains open as a performance target because the post-f55913d1 row still fails the end-to-end parity target at 0.758, even though generation-only throughput is above the vLLM decode row. The remaining question is whether to optimize hot prefill/admission/serving wall-clock for this weak prefill-heavy regime, or deprioritize it relative to the core RL rollout rows.
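A rough wall-clock budget for the post-f55913d1 row makes the end-to-end framing concrete. This sketch rests on assumptions that the issue does not state: that prefill_b8_i2048_o128_n1 means 8 requests with 2048 input and 128 output tokens each, that both reported tok/s figures divide by the same wall-clock time (the total/decode ratio of 17 is consistent with this), and that the prefill wall times and the decode iteration do not overlap. All variable names are hypothetical.

```python
# Assumed reading of the benchmark name prefill_b8_i2048_o128_n1.
BATCH, INPUT_LEN, OUTPUT_LEN = 8, 2048, 128

output_tokens = BATCH * OUTPUT_LEN               # assumed decode-token numerator
total_tokens = BATCH * (INPUT_LEN + OUTPUT_LEN)  # assumed total-token numerator


def implied_wall_seconds(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time implied by a token count and a reported throughput."""
    return tokens / tokens_per_second


# Reported end-to-end throughputs from the post-f55913d1 row.
levanter_wall = implied_wall_seconds(output_tokens, 957.72)
levanter_wall_from_total = implied_wall_seconds(total_tokens, 16281.24)
vllm_wall = implied_wall_seconds(output_tokens, 1264.03)

# Measured Levanter components from the same row (rounded values).
prefill_wall = sum([0.060, 0.056, 0.056, 0.056])
decode_iteration_wall = 0.800
attributed = prefill_wall + decode_iteration_wall
unattributed = levanter_wall - attributed

print(f"levanter implied wall: {levanter_wall:.3f}s")
print(f"vllm implied wall:     {vllm_wall:.3f}s")
print(f"attributed:            {attributed:.3f}s")
print(f"unattributed:          {unattributed:.3f}s")
```

Under these assumptions, most of the implied Levanter wall-clock time is covered by the prefill wall times plus the decode iteration, which is in line with treating the gap as end-to-end prefill/admission/serving overhead rather than a decode-device regression.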

Definition of Done

  • Rerun one corrected Levanter-only diagnostic on the fixed harness.
  • Attribute the original row enough to avoid a false decode-kernel conclusion: the old submit timing was contaminated by prefill-drain work, and new benchmark fields separate prefill-drain, generation-loop, device, host, submit, and extraction timing.
  • Collect a post-f55913d1 row for prefill_b8_i2048_o128_n1 if this regime remains a priority, so the result includes generation_* and prefill_drain_* fields after bounded request logging.
  • Decide the next optimization or close/deprioritize this issue based on that post-f55913d1 row: prefill/host accounting, page/admission settings, or no further work if the row is acceptable for RL rollout priorities.
  • Update the epic and handoff with terminal job results from the completed diagnostics/reruns.

Current Next Step

Do not launch another TPU run automatically. The next step is a decision, not more evidence collection: either optimize the remaining hot prefill/admission/serving wall-clock overhead for this regime, or close/deprioritize it if core RL rollout rows matter more than prefill-heavy parity. The post-split row should be interpreted using generation_tokens_per_second=1614.816, prefill_drain_seconds_per_iteration=0.171, and generation_host_seconds_per_iteration=0.006, not only the end-to-end decode_tokens_per_second=957.72.

Metadata

Labels

  • agent-generated: Created by automation/agent
  • levanter: Issues related to Levanter library
  • tpu: Used for dispatching the TPU tests in CI
