
Commit b3ef447

Merge pull request #450 from windsonsea/fixor
fix anchor in inside-vllm.md
2 parents f567208 + f533f7b

File tree

1 file changed (+5 −5 lines)

docs/zh/docs/en/blogs/2025/inside-vllm.md

Lines changed: 5 additions & 5 deletions
@@ -14,11 +14,11 @@ Later posts will dive into specific subsystems.
 
 This post is structured into five parts:
 
-1. [LLM engine & engine core](#llm-engine--engine-core): fundamentals of vLLM (scheduling, paged attention, continuous batching, etc.)
-2. [Advanced features](#advanced-features--extending-the-core-engine-logic): chunked prefill, prefix caching, guided & speculative decoding, disaggregated P/D
+1. [LLM engine & engine core](#llm-engine-engine-core): fundamentals of vLLM (scheduling, paged attention, continuous batching, etc.)
+2. [Advanced features](#advanced-features-extending-the-core-engine-logic): chunked prefill, prefix caching, guided & speculative decoding, disaggregated P/D
 3. [Scaling up](#from-uniprocexecutor-to-multiprocexecutor): from single-GPU to multi-GPU execution
 4. [Serving layer](#distributed-system-serving-vllm): distributed / concurrent web scaffolding
-5. [Benchmarks and auto-tuning](#benchmarks-and-auto-tuning---latency-vs-throughput): measuring latency and throughput
+5. [Benchmarks and auto-tuning](#benchmarks-and-auto-tuning-latency-vs-throughput): measuring latency and throughput
 
 !!! note
 
@@ -188,7 +188,7 @@ There are two main types of workloads an inference engine handles:
 
 !!! tip
 
-    In the [benchmarking section](#benchmarks-and-auto-tuning---latency-vs-throughput) we'll analyze the so-called roofline model of GPU perf. That will go into more detail behind prefill/decode perf profiles.
+    In the [benchmarking section](#benchmarks-and-auto-tuning-latency-vs-throughput) we'll analyze the so-called roofline model of GPU perf. That will go into more detail behind prefill/decode perf profiles.
 
 The V1 scheduler can mix both types of requests in the same step, thanks to smarter design choices. In contrast, the V0 engine could only process either prefill or decode at once.
 
@@ -518,7 +518,7 @@ The best way to internalize this is to fire up your debugger and step through th
 
 I've already previously hinted at the motivation behind disaggregated P/D (prefill/decode).
 
-Prefill and decode have very different performance profiles (compute-bound vs. memory-bandwidth-bound), so separating their execution is a sensible design. It gives tighter control over latency — both `TFTT` (time-to-first-token) and `ITL` (inter-token latency) — more on this in the [benchmarking](#benchmarks-and-auto-tuning---latency-vs-throughput) section.
+Prefill and decode have very different performance profiles (compute-bound vs. memory-bandwidth-bound), so separating their execution is a sensible design. It gives tighter control over latency — both `TFTT` (time-to-first-token) and `ITL` (inter-token latency) — more on this in the [benchmarking](#benchmarks-and-auto-tuning-latency-vs-throughput) section.
 
 In practice, we run `N` vLLM prefill instances and `M` vLLM decode instances, autoscaling them based on the live request mix. Prefill workers write KV to a dedicated KV-cache service; decode workers read from it. This isolates long, bursty prefill from steady, latency-sensitive decode.
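Why the anchors change shape: GitHub-flavored anchors drop punctuation such as "&" from a heading but turn every space into a hyphen, so "LLM engine & engine core" yields `#llm-engine--engine-core` with a double hyphen. Python-Markdown's toc slugifier (used by MkDocs-style doc sites) instead drops the punctuation first and then collapses the whole run of whitespace and hyphens into a single separator, which is what the new links in this diff encode. A minimal sketch of that difference, assuming the docs site is built with Python-Markdown's default slugifier and that the headings read roughly as reconstructed below from the anchors (neither assumption is confirmed by this commit):

```python
# Hypothetical check, not part of this commit: reproduce the single-hyphen anchors
# that the docs-site slugifier generates for these headings.
from markdown.extensions.toc import slugify  # Python-Markdown's default slugifier

# Heading texts reconstructed (assumed) from the anchors in the diff.
headings = [
    "LLM engine & engine core",
    "Advanced features & extending the core engine logic",
    "Benchmarks and auto-tuning - latency vs throughput",
]

for h in headings:
    # slugify() strips punctuation like "&" and "-" separators first, then collapses
    # the remaining whitespace/hyphen runs into a single "-", so no double hyphens survive.
    print(f"#{slugify(h, '-')}")

# Expected output, matching the "+" lines of the diff:
# #llm-engine-engine-core
# #advanced-features-extending-the-core-engine-logic
# #benchmarks-and-auto-tuning-latency-vs-throughput
```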