
v0.7 Chunked Prefill

Goal

Split long prompt prefill into scheduler-managed chunks.

Why

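Long prompts can block short requests: when a long prompt is prefilled in a single pass, every request queued behind it waits for that pass to finish. Chunked prefill caps the prefill work done per scheduler step, so short requests and ready decode work can interleave with the long prompt, which improves tail latency.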

Scope

prefill token budget
max_prefill_tokens_per_step
partial prefill state
decode-first or mixed prefill/decode policy
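
The first three scope items could fit together roughly as follows. This is a minimal sketch under stated assumptions, not the repository's code: `PrefillState` and `next_chunk` are hypothetical names introduced for illustration, and only `max_prefill_tokens_per_step` comes from the list above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PrefillState:
    """Hypothetical per-request partial prefill state (illustrative only)."""

    prompt_ids: list[int]
    prefilled: int = 0  # prompt tokens already run through the model

    @property
    def remaining(self) -> int:
        return len(self.prompt_ids) - self.prefilled

    @property
    def done(self) -> bool:
        return self.remaining == 0


def next_chunk(state: PrefillState, budget: int) -> list[int]:
    """Take at most `budget` prompt tokens and advance the prefill cursor."""
    if budget <= 0:
        return []
    end = min(state.prefilled + budget, len(state.prompt_ids))
    chunk = state.prompt_ids[state.prefilled:end]
    state.prefilled = end
    return chunk


max_prefill_tokens_per_step = 4  # assumed value, chosen only for the example
state = PrefillState(prompt_ids=list(range(10)))
chunks = []
while not state.done:
    chunks.append(next_chunk(state, max_prefill_tokens_per_step))
# A 10-token prompt is consumed as chunks of 4, 4, and 2 tokens.
```

Keeping the cursor on the request rather than in the scheduler means a request can be paused after any chunk and resumed in a later step without recomputing earlier tokens, provided the KV cache for the already-prefilled tokens is retained.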

Out Of Scope

advanced fairness tuning
full production scheduler policy matrix

Acceptance Criteria

Long prompts no longer monopolize the engine loop.
Short-request latency improves in a mixed-workload benchmark.
The scheduler policy is documented.

Current implementation statement:

generate_chunked_prefill_batch accepts ContinuousBatchRequest rows and processes prompt prefill in chunks bounded by the per-step budget max_prefill_tokens_per_step.
The teaching scheduler uses a decode-first, shortest-prefill-first policy: ready decode work runs before more prefill, and short remaining prompts can cut ahead of long prompts while respecting the per-step prefill budget.
The implementation uses real Hugging Face past_key_values across prompt chunks and decode steps. It is still single-process and sequential inside each scheduler step rather than a production fused prefill/decode kernel.
benchmark_chunked_prefill.py compares a monolithic arrival-order baseline with chunked prefill and reports short-request time-to-first-token behavior.
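
The policy and the baseline comparison described above can be illustrated with a toy simulation. It is a sketch only and does not reproduce `generate_chunked_prefill_batch` or `benchmark_chunked_prefill.py`: `Req`, `run`, and `make_workload` are hypothetical names, no model is involved, and the clock assumes that one prefilled token and one decoded token each cost one unit of work, which real hardware does not guarantee. It also assumes that a step runs ready decode work first and then spends the prefill budget, and that the final prefill chunk yields the first generated token.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Req:
    """Hypothetical request record for the toy simulation."""

    name: str
    prompt_len: int
    max_new_tokens: int
    prefilled: int = 0
    generated: int = 0
    ttft: int | None = None  # work units elapsed when the first token appeared

    @property
    def remaining_prefill(self) -> int:
        return self.prompt_len - self.prefilled

    @property
    def finished(self) -> bool:
        return self.generated >= self.max_new_tokens


def run(requests: list[Req], budget: int, shortest_first: bool) -> dict[str, int | None]:
    """Simulate scheduler steps and return time-to-first-token per request."""
    clock = 0
    while any(not r.finished for r in requests):
        # Decode first: one token for every request whose prefill is complete.
        for r in requests:
            if r.remaining_prefill == 0 and not r.finished:
                r.generated += 1
                clock += 1
        # Then spend the per-step prefill budget on pending prompts.
        pending = [r for r in requests if r.remaining_prefill > 0]
        if shortest_first:
            pending.sort(key=lambda r: r.remaining_prefill)
        left = budget
        for r in pending:
            if left <= 0:
                break
            take = min(left, r.remaining_prefill)
            r.prefilled += take
            left -= take
            clock += take
            if r.remaining_prefill == 0:
                r.generated = 1  # assumed: last prefill chunk emits token one
                r.ttft = clock
    return {r.name: r.ttft for r in requests}


def make_workload() -> list[Req]:
    # The long prompt arrives first; the short one is queued behind it.
    return [Req("long", 1000, 4), Req("short", 8, 4)]


# Monolithic arrival-order baseline: an effectively unlimited budget.
baseline = run(make_workload(), budget=10**9, shortest_first=False)
# Chunked prefill with a 64-token per-step budget, shortest prompt first.
chunked = run(make_workload(), budget=64, shortest_first=True)
print("baseline:", baseline)
print("chunked: ", chunked)
```

Under these assumptions the short request's time-to-first-token drops from 1008 work units to 8, while the long request's rises from 1000 to 1011. That is the latency tradeoff in miniature: the long prompt pays a small delay so short requests do not wait behind it.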

Progress

Add partial prefill request state.
Add prefill token budget.
Implement chunked prefill loop.
Integrate with continuous batching.
Add mixed short/long prompt benchmark.
Document the latency tradeoff.