[levanter] Optimize v6e prefill-heavy decode gap

## Description
Explain and reduce the Levanter-versus-vLLM throughput gap on v6e for the prefill-heavy row `prefill_b8_i2048_o128_n1`.

Historical comparison row: Levanter `1006.83` decode tok/s and `17116.08` total tok/s versus vLLM `1262.90` decode tok/s and `21469.31` total tok/s, ratio `0.797`. That measured Levanter row had prefill chunks `4096,4096,4096,4096` and one decode iteration with `0.802s` total, `0.623s` device, `0.178s` host, and `0.174s` submit.

The corrected Levanter-only diagnostic completed at `/dlwh/qwen3-v6e8-prefilldiag-drain-prefill-b8-i2048-o128-n1-20260606-1532` from PR #6185 commit `82bb6dbfb`. It validated diagnostic prefill drain and produced `1022` decode iteration tokens for normal, `no_lm_head`, and `lm_head_no_sampling` rows.

Important interpretation correction: production `decode_submit_seconds` was previously measured from the start of the outer iteration, before prefill drain, so the `0.174s`/`0.184s` submit field included prefill admission work. Three commits in PR #6185 fix the accounting: `dd993c03c` moved the submit timer to immediately before `_run_generation_loop(...)`, `6d74fc7ea` added decode iteration/device throughput fields, and `f55913d1` separates prefill drain from the generation loop with `prefill_drain_*`, `generation_*`, and derived `generation_tokens_per_second` benchmark fields.
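The timer-placement issue can be illustrated with a minimal sketch. Everything here is hypothetical and is not Levanter's actual code: `FakeClock`, `measure_submit_seconds`, and the two callables are stand-ins, and the `0.171`/`0.002` costs are borrowed from the post-split row purely as illustrative inputs.

```python
class FakeClock:
    """Deterministic clock so the sketch does not depend on real timing."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def measure_submit_seconds(clock, drain_prefill, dispatch_generation, *, timer_before_drain):
    """Return the submit time recorded for one outer iteration.

    timer_before_drain=True mimics the old placement (timer starts at the top
    of the iteration); False mimics starting it just before the generation loop.
    """
    iteration_start = clock()
    drain_prefill()  # prefill admission work
    submit_start = iteration_start if timer_before_drain else clock()
    dispatch_generation()  # stand-in for submitting the generation loop
    return clock() - submit_start


def simulate(timer_before_drain: bool) -> float:
    clock = FakeClock()
    return measure_submit_seconds(
        clock,
        drain_prefill=lambda: clock.advance(0.171),        # illustrative drain cost
        dispatch_generation=lambda: clock.advance(0.002),  # illustrative submit cost
        timer_before_drain=timer_before_drain,
    )


old_submit = simulate(timer_before_drain=True)
new_submit = simulate(timer_before_drain=False)
print(f"old placement: {old_submit:.3f}s, new placement: {new_submit:.3f}s")
```

Under these assumed costs, the old placement reports the drain plus the submit, while the new placement reports only the submit. That is the kind of contamination the correction describes.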

Corrected backend=both row: `/dlwh/qwen3-v6e8-prefillcorr-20260607-0023` completed successfully from #6185 head `91f6ec06a`. vLLM measured `1212.15` decode tok/s and `20606.54` total tok/s. Levanter measured `898.94` decode tok/s and `15282.00` total tok/s, ratio `0.742`, target fail. The corrected Levanter fields available in that run show prefill chunks `4096,4096,4096,4096`, prefill admission wall times `0.060,0.055,0.055,0.055s`, decode iteration `0.852s` total / `0.677s` device / `0.174s` host / `0.002s` submit / `0.002s` extract, `1022` decode iteration tokens, `1200.223` decode-iteration tok/s, and `1509.350` device tok/s. Because that row predates `f55913d1`, it does not yet include the new generation-only token/time split.

Post-`f55913d1` backend=both row: `/dlwh/qwen3-v6e8-prefillpostsplit-20260607-0419` completed successfully from #6185 head `d63d1edf`. vLLM measured `1264.03` decode tok/s and `21488.50` total tok/s. Levanter measured `957.72` decode tok/s and `16281.24` total tok/s, ratio `0.758`, target fail. The new split shows hot prefill chunks `4096,4096,4096,4096` with prefill wall times `0.060,0.056,0.056,0.056s`, decode iteration `0.800s` total / `0.623s` device / `0.177s` host / `0.002s` submit / `0.002s` extract, `1022` decode iteration tokens, `0.171s` prefill drain for `6` tokens, and `0.629s` generation for `1016` tokens. Derived throughput: `1277.749` decode-iteration tok/s, `1639.896` decode-device tok/s, and `1614.816` generation tok/s.
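As a sanity check on the split, the derived throughputs can be approximately reproduced from the rounded values quoted above. This sketch assumes each derived field is simply tokens divided by seconds. The dictionary keys are illustrative names, not the benchmark's schema, and because the inputs are rounded to three decimals, the results only approximate the reported figures.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Throughput as tokens / seconds (assumed definition of the derived fields)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


# Rounded values copied from the post-split row.
row = {
    "decode_iteration_tokens": 1022,
    "decode_iteration_seconds": 0.800,
    "decode_device_seconds": 0.623,
    "prefill_drain_tokens": 6,
    "prefill_drain_seconds": 0.171,
    "generation_tokens": 1016,
    "generation_seconds": 0.629,
}

derived = {
    "decode_iteration": tokens_per_second(row["decode_iteration_tokens"], row["decode_iteration_seconds"]),
    "decode_device": tokens_per_second(row["decode_iteration_tokens"], row["decode_device_seconds"]),
    "generation": tokens_per_second(row["generation_tokens"], row["generation_seconds"]),
}

# The split should account for the whole decode iteration.
split_tokens = row["prefill_drain_tokens"] + row["generation_tokens"]
split_seconds = row["prefill_drain_seconds"] + row["generation_seconds"]

for name, value in derived.items():
    print(f"{name}: {value:.1f} tok/s")
```

The drain and generation parts sum to the `1022` tokens and `0.800s` of the full decode iteration, so the split is internally consistent at this precision.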

Follow-up code changes:
- #6185 commit `f7275e5a` stops logging full `InferenceRequest` dataclasses at INFO and instead logs bounded request metadata. In the prefill-heavy measured row, the old path serialized eight 2048-token prompt lists before the first prefill admission, a concrete source of host-side overhead on the measured path. Validation passed with inference-server pytest (`19 passed, 1 skipped`) plus focused pre-commit/Pyrefly on the touched files.
- #6185 commit `f55913d1` adds per-iteration `prefill_drain_seconds_per_iteration`, `prefill_drain_tokens_per_iteration`, `generation_seconds_per_iteration`, `generation_host_seconds_per_iteration`, `generation_tokens_per_iteration`, and `generation_tokens_per_second` to avoid conflating prefill-drain work with generation-loop throughput in future #6229 rows. Validation passed with `test_engine.py` and `test_qwen3_tpu_inference_parity_bench.py` (`63 passed`) plus focused pre-commit/Pyrefly.
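The bounded request logging in the first item above can be sketched as follows. The dataclass here is a hypothetical stand-in: its fields, and the `request_log_metadata` helper, are assumptions for illustration and do not reflect the real `InferenceRequest` definition or the actual change in `f7275e5a`.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("inference_sketch")


@dataclass
class InferenceRequest:
    """Hypothetical request shape; the real dataclass may differ."""

    request_id: str
    prompt_tokens: list[int]
    max_new_tokens: int
    n_generations: int = 1


def request_log_metadata(request: InferenceRequest) -> dict:
    """Return a small, fixed-size summary instead of the full token list."""
    return {
        "request_id": request.request_id,
        "prompt_len": len(request.prompt_tokens),
        "max_new_tokens": request.max_new_tokens,
        "n_generations": request.n_generations,
    }


def log_admission(requests: list[InferenceRequest]) -> None:
    for request in requests:
        # Formatting cost is independent of prompt length.
        logger.info("admitting request %s", request_log_metadata(request))


example = InferenceRequest("req-0", list(range(2048)), max_new_tokens=128)
full_repr_chars = len(repr(example))
bounded_chars = len(str(request_log_metadata(example)))
log_admission([example])
```

With a 2048-token prompt, `repr(example)` grows with the prompt while the metadata string stays short, which is why the bounded form keeps logging off the measured path's critical cost.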

Current interpretation: the remaining failure is attributed to end-to-end prefill-heavy wall-clock overhead rather than a proven decode-device regression. The issue remains open as a performance target because the post-`f55913d1` row still fails the end-to-end parity target at `0.758`, even though generation-only throughput (`1614.816` tok/s) is above the vLLM decode row (`1264.03` tok/s). The remaining question is whether to optimize hot prefill/admission/serving wall-clock for this prefill-heavy regime, where Levanter is currently weak, or deprioritize it relative to the core RL rollout rows.
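The parity ratios quoted across the three backend comparisons can be recomputed from the decode tok/s figures in this issue. This is a minimal sketch that assumes the ratio is Levanter decode tok/s divided by vLLM decode tok/s. The row labels are informal names, and the pass threshold is deliberately omitted because it is not stated here.

```python
def parity_ratio(levanter_tok_s: float, vllm_tok_s: float) -> float:
    """Levanter throughput relative to vLLM (assumed definition of the ratio)."""
    return levanter_tok_s / vllm_tok_s


# (Levanter decode tok/s, vLLM decode tok/s) copied from the rows above.
decode_rows = {
    "historical": (1006.83, 1262.90),
    "corrected": (898.94, 1212.15),
    "post_split": (957.72, 1264.03),
}

ratios = {name: round(parity_ratio(lev, vllm), 3) for name, (lev, vllm) in decode_rows.items()}

# Generation-only throughput versus the vLLM end-to-end decode figure.
# These measure different spans, so this is context, not a like-for-like parity number.
generation_vs_vllm = parity_ratio(1614.816, 1264.03)

print(ratios)
print(f"generation-only / vLLM decode: {generation_vs_vllm:.3f}")
```

The end-to-end ratios stay below 1 in every row, while the generation-only figure exceeds the vLLM decode figure, which matches the reading that the gap sits outside the generation loop.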

### Definition of Done
- [x] Rerun one corrected Levanter-only diagnostic on the fixed harness.
- [x] Attribute the original row enough to avoid a false decode-kernel conclusion: the old submit timing was contaminated by prefill-drain work, and new benchmark fields separate prefill-drain, generation-loop, device, host, submit, and extraction timing.
- [x] Collect a post-`f55913d1` row for `prefill_b8_i2048_o128_n1` if this regime remains a priority, so the result includes `generation_*` and `prefill_drain_*` fields after bounded request logging.
- [ ] Decide the next optimization or close/deprioritize this issue based on that post-`f55913d1` row: prefill/host accounting, page/admission settings, or no further work if the row is acceptable for RL rollout priorities.
- [x] Update the epic and handoff with terminal job results from the completed diagnostics/reruns.

### Current Next Step
Do not launch another TPU run automatically. The next step is a decision, not more evidence collection: either optimize the remaining hot prefill/admission/serving wall-clock overhead for this regime, or close/deprioritize it if core RL rollout rows matter more than prefill-heavy parity. The post-split row should be interpreted using `generation_tokens_per_second=1614.816`, `prefill_drain_seconds_per_iteration=0.171`, and `generation_host_seconds_per_iteration=0.006`, not only the end-to-end `decode_tokens_per_second=957.72`.

### Parent
- #6227


