Split long prompt prefill into scheduler-managed chunks.
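A long prompt that is prefilled in a single pass can block short requests queued behind it. Chunked prefill improves tail latency by spreading that prefill across scheduler steps, so other requests get a share of each step.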
- prefill token budget `max_prefill_tokens_per_step`
- partial prefill state
- decode-first or mixed prefill/decode policy
- advanced fairness tuning
- full production scheduler policy matrix
- Long prompts no longer monopolize the engine loop.
- Short request latency improves in a mixed workload benchmark.
- The scheduler policy is documented.
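
To make the latency criteria concrete, here is a minimal toy model rather than the project's benchmark. It assumes that time is measured in prefill tokens processed, that decode cost is ignored, and that a request's first token becomes available at the end of the step in which its prefill completes. The function names `ttft_monolithic` and `ttft_chunked` are hypothetical.

```python
from typing import List


def ttft_monolithic(prompt_lens: List[int]) -> List[int]:
    """Arrival-order baseline: each prompt is prefilled in one pass."""
    clock, out = 0, []
    for n in prompt_lens:
        clock += n          # the whole prompt occupies the engine
        out.append(clock)   # first token is ready once prefill ends
    return out


def ttft_chunked(prompt_lens: List[int],
                 max_prefill_tokens_per_step: int) -> List[int]:
    """Chunked prefill with a per-step budget, shortest remaining first."""
    if max_prefill_tokens_per_step < 1:
        raise ValueError("prefill budget must be at least 1 token")
    remaining = list(prompt_lens)
    ttft = [0] * len(prompt_lens)
    clock = 0
    while any(remaining):
        budget = max_prefill_tokens_per_step
        finished = []
        # Visit requests from shortest to longest remaining prefill.
        for i in sorted(range(len(remaining)), key=lambda i: remaining[i]):
            if remaining[i] == 0 or budget == 0:
                continue
            chunk = min(remaining[i], budget)
            remaining[i] -= chunk
            budget -= chunk
            clock += chunk
            if remaining[i] == 0:
                finished.append(i)
        # Assumption: first tokens surface at the end of the step.
        for i in finished:
            ttft[i] = clock
    return ttft


if __name__ == "__main__":
    prompts = [1000, 10]  # a long prompt arrives just before a short one
    print(ttft_monolithic(prompts))    # [1000, 1010]
    print(ttft_chunked(prompts, 128))  # [1010, 128]
```

Under these assumptions the short request's time to first token drops from 1010 to 128 token units, while the long request's rises from 1000 to 1010. That small penalty on the long request is the tradeoff chunked prefill makes.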
Current implementation statement:
- `generate_chunked_prefill_batch` accepts `ContinuousBatchRequest` rows and processes prompt prefill in bounded chunks via `max_prefill_tokens_per_step`.
- The teaching scheduler uses a decode-first, shortest-prefill-first policy: ready decode work runs before more prefill, and short remaining prompts can cut ahead of long prompts while respecting the per-step prefill budget.
- The implementation uses real Hugging Face `past_key_values` across prompt chunks and decode steps. It is still single-process and sequential inside each scheduler step rather than a production fused prefill/decode kernel.
- `benchmark_chunked_prefill.py` compares a monolithic arrival-order baseline with chunked prefill and reports short-request time-to-first-token behavior.
- Add partial prefill request state.
- Add prefill token budget.
- Implement chunked prefill loop.
- Integrate with continuous batching.
- Add mixed short/long prompt benchmark.
- Document the latency tradeoff.
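
The partial prefill state, the token budget, and the chunked loop can be sketched together with the decode-first, shortest-prefill-first policy. This is a model-free illustration, not the code behind `generate_chunked_prefill_batch`. The `Request` class, `schedule_step`, and `run` are hypothetical names. The sketch assumes that the final prefill chunk yields a request's first token and that each later decode step yields one more.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    """Hypothetical request row (not the project's ContinuousBatchRequest)."""
    name: str
    prompt_len: int
    max_new_tokens: int
    prefilled: int = 0   # partial prefill state: prompt tokens done so far
    generated: int = 0   # tokens produced so far

    @property
    def remaining_prefill(self) -> int:
        return self.prompt_len - self.prefilled

    @property
    def done(self) -> bool:
        return (self.remaining_prefill == 0
                and self.generated >= self.max_new_tokens)


def schedule_step(requests: List[Request], max_prefill_tokens_per_step: int):
    """Plan one step: ready decodes first, then budgeted prefill chunks."""
    decode = [r for r in requests if r.remaining_prefill == 0 and not r.done]
    budget = max_prefill_tokens_per_step
    prefill: List[Tuple[Request, int]] = []
    waiting = [r for r in requests if r.remaining_prefill > 0]
    # Shortest remaining prefill goes first, so short prompts cut ahead.
    for r in sorted(waiting, key=lambda r: r.remaining_prefill):
        if budget == 0:
            break
        chunk = min(r.remaining_prefill, budget)
        prefill.append((r, chunk))
        budget -= chunk
    return decode, prefill


def run(requests: List[Request], max_prefill_tokens_per_step: int):
    """Run steps until every request is done; return a per-step trace."""
    if max_prefill_tokens_per_step < 1:
        raise ValueError("prefill budget must be at least 1 token")
    trace = []
    while not all(r.done for r in requests):
        decode, prefill = schedule_step(requests, max_prefill_tokens_per_step)
        for r in decode:              # decode work runs before prefill
            r.generated += 1
        for r, chunk in prefill:
            r.prefilled += chunk
            if r.remaining_prefill == 0:
                r.generated = 1       # assumed: last chunk yields token 1
        trace.append(([r.name for r in decode],
                      [(r.name, c) for r, c in prefill]))
    return trace


if __name__ == "__main__":
    reqs = [Request("long", 10, 2), Request("short", 3, 2)]
    for step in run(reqs, max_prefill_tokens_per_step=4):
        print(step)
```

With a budget of 4, the short prompt is fully prefilled in the first step and finishes decoding in the second, while the long prompt advances by at most 4 tokens per step. A real engine would replace the counters with model forward passes that carry the KV cache between chunks.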