3. [Scaling up](#from-uniprocexecutor-to-multiprocexecutor): from single-GPU to multi-GPU execution
4. [Serving layer](#distributed-system-serving-vllm): distributed / concurrent web scaffolding
5. [Benchmarks and auto-tuning](#benchmarks-and-auto-tuning---latency-vs-throughput): measuring latency and throughput
!!! note
There are two main types of workloads an inference engine handles: prefill (processing the whole prompt in one pass) and decode (generating output tokens one at a time).
!!! tip
    In the [benchmarking section](#benchmarks-and-auto-tuning---latency-vs-throughput) we'll analyze the so-called roofline model of GPU perf; that will go into more detail on prefill/decode perf profiles.
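To make the compute-bound vs. memory-bandwidth-bound distinction concrete before we get there, here's a back-of-the-envelope sketch of the roofline idea. The hardware numbers (roughly A100-class peak FLOPs and HBM bandwidth) and the matmul shapes are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope roofline check (illustrative numbers, not measurements).
# A kernel is compute-bound when its arithmetic intensity (FLOPs per byte moved)
# exceeds the GPU's ops:byte ratio; otherwise it is memory-bandwidth-bound.

PEAK_FLOPS = 312e12                   # assumed BF16 peak for an A100-class GPU, FLOP/s
PEAK_BANDWIDTH = 2.0e12               # assumed HBM bandwidth, bytes/s
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH   # ~156 FLOPs per byte

def regime(flops: float, bytes_moved: float) -> str:
    return "compute-bound" if flops / bytes_moved > RIDGE else "memory-bandwidth-bound"

# Toy matmul against a 4096x4096 BF16 weight matrix (2 bytes per element):
weight_bytes = 2 * 4096 * 4096
# Prefill-like: 2048 prompt tokens reuse every loaded weight many times.
print(regime(flops=2 * 4096 * 4096 * 2048, bytes_moved=weight_bytes))  # compute-bound
# Decode-like: a single new token barely reuses the weights it streams in.
print(regime(flops=2 * 4096 * 4096 * 1, bytes_moved=weight_bytes))     # memory-bandwidth-bound
```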
The V1 scheduler can mix both types of requests in the same step, thanks to smarter design choices. In contrast, the V0 engine could only process either prefill or decode at once.
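As a rough mental model (this is a toy sketch, not vLLM's actual scheduler logic), you can think of each step as packing work from all runnable requests into a fixed token budget, so decode tokens and prefill chunks end up in the same batch:

```python
# Toy illustration of mixing prefill and decode work in one scheduler step.
# This is NOT vLLM's scheduler; it only shows the core idea: each request asks
# for some number of tokens this step (1 for decode, a chunk for prefill),
# and the step is packed up to a fixed token budget.
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    remaining_prompt: int   # > 0 means the request is still prefilling

def schedule_step(requests: list[Request], token_budget: int = 8192,
                  prefill_chunk: int = 2048) -> dict[str, int]:
    batch: dict[str, int] = {}
    for req in requests:
        if token_budget == 0:
            break
        if req.remaining_prompt > 0:                       # prefill: take a chunk
            n = min(req.remaining_prompt, prefill_chunk, token_budget)
        else:                                              # decode: one new token
            n = 1
        batch[req.rid] = n
        token_budget -= n
    return batch

# One step can hold decodes and a (chunked) prefill at the same time:
reqs = [Request("A", 0), Request("B", 0), Request("C", 5000)]
print(schedule_step(reqs))   # {'A': 1, 'B': 1, 'C': 2048}
```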
The best way to internalize this is to fire up your debugger and step through the code.
I've already hinted at the motivation behind disaggregated P/D (prefill/decode).
Prefill and decode have very different performance profiles (compute-bound vs. memory-bandwidth-bound), so separating their execution is a sensible design. It gives tighter control over latency — both `TTFT` (time-to-first-token) and `ITL` (inter-token latency) — more on this in the [benchmarking](#benchmarks-and-auto-tuning---latency-vs-throughput) section.
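For concreteness, here is how those two metrics fall out of per-token arrival timestamps (a minimal sketch of the definitions, not any particular benchmark harness):

```python
# Minimal sketch of the two latency metrics from per-token timestamps.
# t_request: when the request was sent; t_tokens: arrival time of each output token.
def ttft(t_request: float, t_tokens: list[float]) -> float:
    """Time-to-first-token: how long until the first output token arrives."""
    return t_tokens[0] - t_request

def itl(t_tokens: list[float]) -> float:
    """Mean inter-token latency: average gap between consecutive output tokens."""
    gaps = [b - a for a, b in zip(t_tokens, t_tokens[1:])]
    return sum(gaps) / len(gaps)

ts = [0.31, 0.35, 0.39, 0.44]      # example arrival times (seconds)
print(ttft(0.0, ts), itl(ts))      # 0.31  ~0.043
```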
In practice, we run `N` vLLM prefill instances and `M` vLLM decode instances, autoscaling them based on the live request mix. Prefill workers write KV to a dedicated KV-cache service; decode workers read from it. This isolates long, bursty prefill from steady, latency-sensitive decode.
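Here's a minimal sketch of that topology. The `Router`, `KVCacheService`, and instance names are hypothetical placeholders to show the flow, not real vLLM interfaces:

```python
# Hypothetical sketch of disaggregated P/D: N prefill instances, M decode
# instances, and a shared KV-cache service between them. None of these classes
# correspond to real vLLM APIs; they only illustrate the flow described above.
import itertools

class KVCacheService:
    """Stands in for the dedicated KV store that prefill writes and decode reads."""
    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}
    def put(self, request_id: str, kv_blocks: bytes) -> None:
        self._store[request_id] = kv_blocks
    def get(self, request_id: str) -> bytes:
        return self._store.pop(request_id)

class Router:
    def __init__(self, prefill_workers: list[str], decode_workers: list[str], kv: KVCacheService):
        self._prefill = itertools.cycle(prefill_workers)  # N prefill instances
        self._decode = itertools.cycle(decode_workers)    # M decode instances
        self._kv = kv

    def handle(self, request_id: str, prompt: str) -> str:
        prefill_worker = next(self._prefill)
        # 1) A prefill worker runs the whole prompt and writes the KV cache out.
        self._kv.put(request_id, kv_blocks=f"kv[{prefill_worker}]({prompt})".encode())
        decode_worker = next(self._decode)
        # 2) A decode worker pulls that KV and streams output tokens from there.
        _ = self._kv.get(request_id)
        return f"prefill on {prefill_worker}, decode on {decode_worker}"

router = Router(["prefill-0", "prefill-1"], ["decode-0", "decode-1", "decode-2"], KVCacheService())
print(router.handle("req-1", "Hello, world"))
```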