v0.7 Chunked Prefill

Goal

Split long prompt prefill into scheduler-managed chunks.

Why

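A long prompt that is prefilled in one monolithic step blocks every other request until it finishes. Chunked prefill caps the prefill work done per scheduler step, so short requests and ready decode work can interleave with the long prompt. This improves tail latency for short requests.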

Scope

  • prefill token budget
  • max_prefill_tokens_per_step
  • partial prefill state
  • decode-first or mixed prefill/decode policy
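
The scope items above can be illustrated with a small sketch of per-request partial prefill state and a per-step token budget. This is not the repository's code: `PrefillState` and `take_chunk` are hypothetical names chosen for illustration, and only `max_prefill_tokens_per_step` comes from this document.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PrefillState:
    """Hypothetical record of how much of one prompt has been prefilled."""
    request_id: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens already pushed through the model

    @property
    def remaining(self) -> int:
        return self.prompt_len - self.prefilled

    @property
    def done(self) -> bool:
        return self.remaining == 0


def take_chunk(state: PrefillState, budget: int) -> Tuple[int, int]:
    """Reserve the next [start, end) prompt slice without exceeding the budget."""
    size = min(state.remaining, budget)
    start = state.prefilled
    state.prefilled += size
    return start, start + size


max_prefill_tokens_per_step = 4
state = PrefillState("long", prompt_len=10)
chunks = []
while not state.done:
    chunks.append(take_chunk(state, max_prefill_tokens_per_step))
print(chunks)  # [(0, 4), (4, 8), (8, 10)]
```

A 10-token prompt with a budget of 4 is consumed in three steps, and the state object is what lets the scheduler pause the request between steps.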

Out Of Scope

  • advanced fairness tuning
  • full production scheduler policy matrix

Acceptance Criteria

  • Long prompts no longer monopolize the engine loop.
  • Short request latency improves in a mixed workload benchmark.
  • The scheduler policy is documented.
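
To make the second criterion concrete, here is a toy cost model, not a measurement. It counts how many prompt tokens the engine must process before a given request finishes its own prefill, which is a rough proxy for time-to-first-token. The function names are hypothetical, and the model ignores decode work and the growth of attention cost with context length.

```python
from typing import List


def prefill_tokens_monolithic(prompt_lens: List[int], target: int) -> int:
    """Arrival-order baseline: every earlier prompt is prefilled in full first."""
    return sum(prompt_lens[: target + 1])


def prefill_tokens_chunked(prompt_lens: List[int], target: int, budget: int) -> int:
    """Shortest-remaining-first with a per-step prefill budget."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    remaining = list(prompt_lens)
    processed = 0
    while remaining[target] > 0:
        step_budget = budget
        # sorted() is stable, so ties keep arrival order.
        order = sorted(
            (i for i, r in enumerate(remaining) if r > 0),
            key=lambda i: remaining[i],
        )
        for i in order:
            n = min(remaining[i], step_budget)
            remaining[i] -= n
            step_budget -= n
            processed += n
            if remaining[target] == 0:
                return processed
            if step_budget == 0:
                break
    return processed


# A 2048-token prompt arrives first, then a 16-token prompt (index 1).
prompts = [2048, 16]
baseline = prefill_tokens_monolithic(prompts, target=1)
chunked = prefill_tokens_chunked(prompts, target=1, budget=256)
print(baseline, chunked)  # 2064 16
```

Under these assumptions the short request waits behind 2064 prompt tokens in the baseline but only its own 16 tokens with chunking. A real benchmark should report wall-clock time-to-first-token instead of token counts.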

Current implementation statement:

  • generate_chunked_prefill_batch accepts ContinuousBatchRequest rows and processes prompt prefill in bounded chunks via max_prefill_tokens_per_step.
  • The teaching scheduler uses a decode-first, shortest-prefill-first policy: ready decode work runs before more prefill, and short remaining prompts can cut ahead of long prompts while respecting the per-step prefill budget.
  • The implementation uses real Hugging Face past_key_values across prompt chunks and decode steps. It is still single-process and sequential inside each scheduler step rather than a production fused prefill/decode kernel.
  • benchmark_chunked_prefill.py compares a monolithic arrival-order baseline with chunked prefill and reports short-request time-to-first-token behavior.
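
The decode-first, shortest-prefill-first policy described above might be planned per step roughly as follows. This is a minimal sketch under stated assumptions, not the code behind `generate_chunked_prefill_batch`: `Req` and `plan_step` are hypothetical names, the model call is omitted, and decode termination is not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Req:
    """Hypothetical request row: only the fields the planner needs."""
    name: str
    prompt_len: int
    prefilled: int = 0

    @property
    def remaining(self) -> int:
        return self.prompt_len - self.prefilled


def plan_step(reqs: List[Req], budget: int) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Return (decode_names, prefill_chunks) for one scheduler step."""
    # Decode-first: requests whose prompt is fully prefilled are listed first.
    # A request that finishes prefill in this step starts decoding next step.
    decode = [r.name for r in reqs if r.remaining == 0]

    # Shortest-prefill-first: hand out the budget by smallest remaining prompt.
    prefill: List[Tuple[str, int]] = []
    pending = sorted((r for r in reqs if r.remaining > 0), key=lambda r: r.remaining)
    for r in pending:
        if budget == 0:
            break
        n = min(r.remaining, budget)
        r.prefilled += n
        budget -= n
        prefill.append((r.name, n))
    return decode, prefill


reqs = [Req("long", prompt_len=10), Req("short", prompt_len=3)]
step1 = plan_step(reqs, budget=4)
step2 = plan_step(reqs, budget=4)
print(step1)  # ([], [('short', 3), ('long', 1)])
print(step2)  # (['short'], [('long', 4)])
```

In step 1 the short prompt cuts ahead of the long one and the leftover budget goes to the long prompt. In step 2 the short request is already decoding while the long prompt keeps advancing in bounded chunks.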

Progress

  • Add partial prefill request state.
  • Add prefill token budget.
  • Implement chunked prefill loop.
  • Integrate with continuous batching.
  • Add mixed short/long prompt benchmark.
  • Document the latency tradeoff.