Commit 5d454d0

Merge pull request #448 from windsonsea/enanch
Fix en anchors in inside-vllm.md
2 parents: d98e374 + 126e6a4

2 files changed: +27 -33 lines

docs/zh/docs/blogs/2025/inside-vllm.md

Lines changed: 5 additions & 5 deletions
```diff
@@ -19,7 +19,7 @@
 1. [大语言模型引擎和引擎核心](#_1):vLLM 基础知识(调度、分页注意力、连续批处理等)
 2. [高级特性](#_5):分块预填充、前缀缓存、引导解码与投机解码、P/D 分离
 3. [扩容](#uniprocexecutor-multiprocexecutor):从单 GPU 到多 GPU
-4. [分层部署](#vllm_1):分布式/并发 Web 框架
+4. [分层部署](#vllm_1):分布式/并发式 Web 框架
 5. [基准测试与自动调优](#vs):测量延迟和吞吐量

 !!! note
```

```diff
@@ -194,7 +194,7 @@ KV-cache 管理器维护一个 `free_block_queue`。这是所有可用 KV-cache

 !!! tip

-在 [基准测试章节](https://www.aleksagordic.com/blog/vllm#cpt5) 中,我们将分析 GPU 性能的所谓 roofline 模型,这将详细说明预填充/解码的性能特征。
+在 [基准测试章节](#vs) 中,我们将分析 GPU 性能的所谓 roofline 模型,这将详细说明预填充/解码的性能特征。

 V1 调度器可以在同一步中混合处理两类请求,这得益于更智能的设计选择。相比之下,V0 引擎一次只能处理预填充或解码请求。

```

```diff
@@ -526,7 +526,7 @@ if __name__ == "__main__":

 预填充和解码的性能特性非常不同(计算受限 vs. 内存带宽受限),因此将它们分离执行是合理的设计。这能更紧密地控制延迟,
 包括 `TFTT`(time-to-first-token,第一个 Token 的时间)和 `ITL`(inter-token latency,即 Token 间延迟)。
-更多内容见[基准测试](https://www.aleksagordic.com/blog/vllm#cpt5) 章节。
+更多内容见[基准测试](#vs) 章节。

 实际操作中,我们运行 `N` 个 vLLM预填充实例和 `M` 个 vLLM 解码实例,根据实时请求负载自动伸缩。预填充工作线程将 KV 写入专用 KV-cache 服务;解码工作线程从中读取。这将长时间、突发的预填充与稳定、延迟敏感的解码隔离开来。

```

```diff
@@ -738,7 +738,7 @@ vllm serve <model-name>

 vLLM 中的实现方式:

-### 在 headless 服务器节点
+### 在 headless 服务器节点上

 在 headless 节点上,`CoreEngineProcManager` 启动 2 个进程(根据 `--data-parallel-size-local`),每个进程运行 `EngineCoreProc.run_engine_core`。每个函数会创建一个 `DPEngineCoreProc`(引擎核心),然后进入其忙循环。

```

```diff
@@ -780,7 +780,7 @@ TL;DR:最终我们有 4 个子进程(每个 DP 副本一个),每个子

 接下来,我们来看第二部分:API 服务器节点会发生什么?

-### 在 API 服务器节点
+### 在 API 服务器节点上

 我们实例化一个 `AsyncLLM` 对象(LLM 引擎的 asyncio 包装器)。内部会创建一个 `DPLBAsyncMPClient`(数据并行、负载均衡、异步、多进程客户端)。

```

docs/zh/docs/en/blogs/2025/inside-vllm.md

Lines changed: 22 additions & 28 deletions
```diff
@@ -6,7 +6,7 @@

 August 29, 2025

-In this post, I'll gradually introduce all of the core system components and advanced features that make up a modern high-throughput LLM inference system. In particular I'll be doing a breakdown of how vLLM [[1]](https://www.aleksagordic.com/blog/vllm#ref-1) works.
+In this post, I'll gradually introduce all of the core system components and advanced features that make up a modern high-throughput LLM inference system. In particular I'll be doing a breakdown of how [vLLM](https://github.com/vllm-project/vllm) works.

 This post is the first in a series. It starts broad and then layers in detail (following an inverse-pyramid approach) so you can form an accurate high-level mental model of the complete system without drowning in minutiae.

```

```diff
@@ -62,7 +62,7 @@ This configuration is:
 - offline (no web/distributed system scaffolding)
 - synchronous (all execution happens in a single blocking process)
 - single-GPU (no data/model/pipeline/expert parallelism; DP/TP/PP/EP = 1)
-- using standard transformer [[2]](https://www.aleksagordic.com/blog/vllm#ref-2) (supporting hybrid models like Jamba requires a more complex hybrid KV-cache memory allocator)
+- using [standard transformer](https://arxiv.org/abs/1706.03762) (supporting hybrid models like Jamba requires a more complex hybrid KV-cache memory allocator)

 From here, we'll gradually build up to an online, async, multi-GPU, multi-node inference system - but still serving a standard transformer.

```

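For reference, the offline, synchronous, single-GPU configuration described in this hunk corresponds to roughly the following minimal script (a sketch; the model name and prompt are placeholders, not taken from the post):

```python
from vllm import LLM, SamplingParams

# Offline, synchronous, single-GPU: DP/TP/PP/EP all default to 1.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model name
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Explain paged attention in one sentence."], params)
print(outputs[0].outputs[0].text)
```
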
```diff
@@ -73,7 +73,7 @@ In this example we do two things, we:

 Let's start analyzing the constructor.

-## LLM Engine constructor
+### LLM Engine constructor

 The main components of the engine are:

```

```diff
@@ -94,7 +94,7 @@ Engine core itself is made up of several sub components:

 1. policy setting - it can be either **FCFS** (first come first served) or **priority** (higher priority requests are served first)
 2. `waiting` and `running` queues
-3. KV cache manager - the heart of paged attention [[3]](https://www.aleksagordic.com/blog/vllm#ref-3)
+3. KV cache manager - [the heart of paged attention](https://arxiv.org/abs/2309.06180)

 The KV-cache manager maintains a `free_block_queue` - a pool of available KV-cache blocks (often on the order of hundreds of thousands, depending on VRAM size and block size). During paged attention, the blocks serve as the indexing structure that map tokens to their computed KV cache blocks.

```

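To make the `free_block_queue` and the FCFS `waiting` queue concrete, here is a toy sketch of the bookkeeping involved (simplified illustration; the class and variable names are made up, not vLLM's actual ones):

```python
from collections import deque

BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)

class ToyBlockPool:
    """Hands out fixed-size KV-cache block ids from a free queue."""

    def __init__(self, num_blocks: int):
        self.free_block_queue = deque(range(num_blocks))

    def allocate(self, num_tokens: int) -> list[int]:
        needed = -(-num_tokens // BLOCK_SIZE)  # ceil division
        if needed > len(self.free_block_queue):
            raise RuntimeError("no free KV-cache blocks; request has to wait")
        return [self.free_block_queue.popleft() for _ in range(needed)]

    def free(self, block_ids: list[int]) -> None:
        self.free_block_queue.extend(block_ids)  # blocks become reusable

pool = ToyBlockPool(num_blocks=1_000)
waiting = deque(["req-0", "req-1"])  # FCFS: append on the right, pop from the left

req = waiting.popleft()                # oldest request is scheduled first
blocks = pool.allocate(num_tokens=50)  # 50 prompt tokens -> 4 blocks of 16
print(req, "mapped to blocks", blocks) # these ids index into the paged KV cache
```
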
```diff
@@ -106,7 +106,7 @@ Figure 1. Core components described in this section and their relationships

 !!! tip

-Block size for a standard transformer layer (non-MLA [[4]](https://www.aleksagordic.com/blog/vllm#ref-4)) is computed as follows:
+Block size for a standard transformer layer ([non-MLA](https://arxiv.org/abs/2405.04434)) is computed as follows:

 2 (key/value) * `block_size` (default=16) * `num_kv_heads` * `head_size` * `dtype_num_bytes` (e.g. 2 for bf16)

```

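As a quick sanity check of the formula quoted in this hunk, a worked example with illustrative shapes (8 KV heads, head size 128, 32 layers, bf16; these numbers are for illustration and are not taken from the diff):

```python
# Bytes of KV cache that one block holds, per layer (formula from the tip above).
block_size = 16       # tokens per block, vLLM default
num_kv_heads = 8      # illustrative (Llama-3-8B-like)
head_size = 128
dtype_num_bytes = 2   # bf16
num_layers = 32

per_layer = 2 * block_size * num_kv_heads * head_size * dtype_num_bytes
per_block = per_layer * num_layers

print(per_layer)   # 65536 bytes   = 64 KiB per layer
print(per_block)   # 2097152 bytes = 2 MiB per block across all layers

# If, say, 40 GiB of VRAM is left for KV cache after loading weights,
# that budget translates into roughly this many blocks:
print((40 * 2**30) // per_block)  # 20480 blocks
```
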
```diff
@@ -129,7 +129,7 @@ During model executor construction, a `Worker` object is created, and three key

 3. Initialize KV cache

-- Get per-layer KV-cache spec. Historically this was always `FullAttentionSpec` (homogeneous transformer), but with hybrid models (sliding window, Transformer/SSM like Jamba) it became more complex (see Jenga [[5]](https://www.aleksagordic.com/blog/vllm#ref-5))
+- Get per-layer KV-cache spec. Historically this was always `FullAttentionSpec` (homogeneous transformer), but with hybrid models (sliding window, Transformer/SSM like Jamba) it became more complex (see [Jenga](https://arxiv.org/abs/2503.18292))
 - Run a dummy/profiling forward pass and take a GPU memory snapshot to compute how many KV cache blocks fit in available VRAM
 - Allocate, reshape and bind KV cache tensors to attention layers
 - Prepare attention metadata (e.g. set the backend to FlashAttention) later consumed by kernels during the fwd pass
```

```diff
@@ -139,7 +139,7 @@ I've abstracted away many low-level details here — but these are the core piec

 Now that we have the engine initialized let's proceed to the `generate` function.

-## Generate function
+### Generate function

 The first step is to validate and feed requests into the engine. For each prompt we:

```

```diff
@@ -148,7 +148,7 @@ The first step is to validate and feed requests into the engine. For each prompt
 3. Pack this info into an `EngineCoreRequest`, adding priority, sampling params, and other metadata
 4. Pass the request into the engine core, which wraps it in a `Request` object and sets its status to `WAITING`. This request is then added to the scheduler's `waiting` queue (append if FCFS, or heap-push if priority)

-At this point the engine has been fed and execution can begin. In the synchronous engine example, these initial prompts are the only ones we'll process — there's no mechanism to inject new requests mid-run. In contrast, the asynchronous engine supports this (aka **continuous batching** [[6]](https://www.aleksagordic.com/blog/vllm#ref-6)): after each step, both new and old requests are considered.
+At this point the engine has been fed and execution can begin. In the synchronous engine example, these initial prompts are the only ones we'll process — there's no mechanism to inject new requests mid-run. In contrast, the asynchronous engine supports this (aka [continuous batching](https://www.usenix.org/conference/osdi22/presentation/yu)): after each step, both new and old requests are considered.

 !!! tip

```

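The continuous-batching behaviour referenced in this hunk can be pictured with a toy loop like the one below (a simplified sketch, not the actual engine-core code; the batch-size cap and token counting are made up for illustration):

```python
from collections import deque

waiting = deque()  # new requests land here (append if FCFS, heap-push if priority)
running = []       # requests currently being decoded
MAX_BATCH = 8      # toy batch-size cap

def step():
    """One engine step: admit new work, 'decode' one token each, retire finished requests."""
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    for req in running:
        req["generated"] += 1  # stand-in for one decoded token
    done = [r for r in running if r["generated"] >= r["max_tokens"]]
    for r in done:
        running.remove(r)
    return done

waiting.append({"id": 0, "generated": 0, "max_tokens": 3})
waiting.append({"id": 1, "generated": 0, "max_tokens": 1})
print([r["id"] for r in step()])  # -> [1]

# Unlike the synchronous example, a new request can arrive between steps:
waiting.append({"id": 2, "generated": 0, "max_tokens": 1})
print([r["id"] for r in step()])  # -> [2]
print([r["id"] for r in step()])  # -> [0]
```
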
```diff
@@ -179,7 +179,7 @@ Figure 2. Engine loop

 Next, we'll examine scheduling in more detail.

-## Scheduler
+### Scheduler

 There are two main types of workloads an inference engine handles:

```

```diff
@@ -219,7 +219,7 @@ Figure 3. list of KV cache blocks

 We're finally ready to do a forward pass!

-## Run forward pass
+### Run forward pass

 We call model executor's `execute_model`, which delegates to the `Worker`, which in turn delegates to the model runner.

```

```diff
@@ -258,7 +258,7 @@ Next, we'll dive into:
 4. Speculative decoding
 5. Disaggregated P/D (prefill/decoding)

-## Chunked prefill
+### Chunked prefill

 Chunked prefill is a technique for handling long prompts by splitting their prefill step into smaller chunks. Without it, we could end up with a single very long request monopolizing one engine step disallowing other prefill requests to run. That would postpone all other requests and increase their latency.

```

272272

273273
In vLLM V1, you enable chunked prefill by setting `long_prefill_token_threshold` to a positive integer. (Technically, it can happen irrespective of this, if the prompt length exceeds the token budget we truncate it and run a chunked prefill.)
274274

275-
## Prefix Caching
275+
### Prefix Caching
276276

277277
To explain how prefix caching works, let's take the original code example and tweak it a bit:
278278

@@ -357,7 +357,7 @@ And that's the gist of prefix caching: don't recompute prefixes you've already s
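For the `long_prefill_token_threshold` knob named in this hunk, the engine-level configuration looks roughly like this (a sketch; whether the argument is accepted directly by `LLM(...)` or only via the scheduler config has varied across vLLM releases, so treat the keyword plumbing as an assumption):

```python
from vllm import LLM

# Sketch: cap how many prompt tokens a single long request may prefill per step,
# so it gets chunked instead of monopolizing an engine step.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model name
    enable_chunked_prefill=True,         # chunked prefill itself
    long_prefill_token_threshold=2048,   # knob from the hunk above; plumbing may differ by version
)
```
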
357357

358358
Prefix caching is enabled by default. To disable it: `enable_prefix_caching = False`.
359359

360-
## Guided Decoding (FSM)
360+
### Guided Decoding (FSM)
361361

362362
Guided decoding is a technique where, at each decoding step, the logits are constrained by a grammar-based finite state machine. This ensures that only tokens allowed by the grammar can be sampled.
363363

@@ -397,7 +397,7 @@ Figure 5. Toy example FSM
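To put the `enable_prefix_caching` toggle from this hunk in context, a minimal sketch (placeholder model; prefix caching is on by default and is only spelled out here for clarity):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model name
    enable_prefix_caching=True,          # default; set to False to disable
)

shared_prefix = "You are a helpful assistant. Answer concisely.\n\n"
params = SamplingParams(max_tokens=32)

# The second call can reuse the cached KV blocks of the shared prefix
# instead of recomputing them.
llm.generate([shared_prefix + "What is a KV cache?"], params)
llm.generate([shared_prefix + "What is paged attention?"], params)
```
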
397397
How this works in vLLM:
398398

399399
1. At LLM engine construction, a `StructuredOutputManager` is created; it has access to the tokenizer and maintains a `_grammar_bitmask` tensor.
400-
2. When adding a request, its status is set to `WAITING_FOR_FSM` and `grammar_init` selects the backend compiler (e.g., `xgrammar` [[7]](https://www.aleksagordic.com/blog/vllm#ref-7); note that backends are 3rd party code).
400+
2. When adding a request, its status is set to `WAITING_FOR_FSM` and `grammar_init` selects the backend compiler (e.g., [`xgrammar`](https://arxiv.org/abs/2411.15100); note that backends are 3rd party code).
401401
3. The grammar for this request is compiled asynchronously.
402402
4. During scheduling, if the async compile has completed, the status switches to `WAITING` and `request_id` is added to `structured_output_request_ids`; otherwise it's placed in `skipped_waiting_requests` to retry on next engine step.
403403
5. After the scheduling loop (still inside scheduling), if there are FSM requests, the `StructuredOutputManager` asks the backend to prepare/update `_grammar_bitmask`.
@@ -422,11 +422,11 @@ Figure 6. Toy example
422422

423423
You can enable this in vLLM by passing in a desired `guided_decoding` config.
424424

425-
## Speculative Decoding
425+
### Speculative Decoding
426426

427427
In autoregressive generation, each new token requires a forward pass of the large LM. This is expensive — every step reloads and applies all model weights just to compute a single token! (assuming batch size == 1, in general it's `B`)
428428

429-
Speculative decoding [[8]](https://www.aleksagordic.com/blog/vllm#ref-8) speeds this up by introducing a smaller draft LM. The draft proposes `k` tokens cheaply. But we don't ultimately want to sample from the smaller model — it's only there to guess candidate continuations. The large model still decides what's valid.
429+
[Speculative decoding](https://arxiv.org/abs/2302.01318) speeds this up by introducing a smaller draft LM. The draft proposes `k` tokens cheaply. But we don't ultimately want to sample from the smaller model — it's only there to guess candidate continuations. The large model still decides what's valid.
430430

431431
Here are the steps:
432432

@@ -449,7 +449,7 @@ Here are the steps:
449449

450450
I recommend looking at [gpt-fast](https://github.com/meta-pytorch/gpt-fast) for a simple implementation, and the [original paper](https://arxiv.org/abs/2302.01318) for the math details and the proof of equivalence to sampling from the full model.
451451

452-
vLLM V1 does not support the LLM draft model method, instead it implements faster—but less accurate—proposal schemes: n-gram, EAGLE [[9]](https://www.aleksagordic.com/blog/vllm#ref-9), and Medusa [[10]](https://www.aleksagordic.com/blog/vllm#ref-10).
452+
vLLM V1 does not support the LLM draft model method, instead it implements faster—but less accurate—proposal schemes: n-gram, [EAGLE](https://arxiv.org/abs/2401.15077), and [Medusa](https://arxiv.org/abs/2401.10774).
453453

454454
One-liners on each:
455455

@@ -514,7 +514,7 @@ The best way to internalize this is to fire up your debugger and step through th
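For the n-gram proposer mentioned in this hunk, engine configuration looks roughly like the following (a sketch; the exact keys of `speculative_config` have shifted across vLLM releases, so the dictionary below is an assumption about recent versions):

```python
from vllm import LLM, SamplingParams

# Sketch: n-gram speculative decoding - the draft step proposes k tokens by
# prompt lookup, and the target model then verifies/rejects them.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder target model
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 4,     # k proposed tokens per step
        "prompt_lookup_max": 4,
    },
)

out = llm.generate(["The quick brown fox jumps over"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```
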
514514

515515
![Verify stage & rejection sampling stage](https://www.aleksagordic.com/blog/vllm/specdec_pt2.png)
516516

517-
## Disaggregated P/D
517+
### Disaggregated P/D
518518

519519
I've already previously hinted at the motivation behind disaggregated P/D (prefill/decode).
520520

@@ -602,7 +602,7 @@ if __name__ == "__main__":
602602

603603
!!! note
604604

605-
I've also experimented with `LMCache` [[11]](https://www.aleksagordic.com/blog/vllm#ref-11), the fastest production-ready connector (uses NVIDIA's NIXL as the backend), but it's still at the bleeding edge and I ran into some bugs. Since much of its complexity lives in an external repo, `SharedStorageConnector` is a better choice for explanation.
605+
I've also experimented with [`LMCache`](https://github.com/LMCache/LMCache), the fastest production-ready connector (uses NVIDIA's NIXL as the backend), but it's still at the bleeding edge and I ran into some bugs. Since much of its complexity lives in an external repo, `SharedStorageConnector` is a better choice for explanation.
606606

607607
These are the steps in vLLM:
608608

@@ -730,7 +730,7 @@ vllm serve <model-name>
730730

731731
How does this work in VLLM?
732732

733-
## On the headless server node
733+
### On the headless server node
734734

735735
On the headless node, a `CoreEngineProcManager` launches 2 processes (per `--data-parallel-size-local`) each running `EngineCoreProc.run_engine_core`. Each of these functions creates a `DPEngineCoreProc` (the engine core) and then enters its busy loop.
736736

@@ -772,7 +772,7 @@ Figure 10. distributed system with 4 DP replicas running 4 DPEngineCoreProc
772772

773773
Now for the second part, what happens on the API server node?
774774

775-
## On the API server node
775+
### On the API server node
776776

777777
We instantiate an `AsyncLLM` object (an asyncio wrapper around the LLM engine). Internally this creates a `DPLBAsyncMPClient` (data-parallel, load-balancing, asynchronous, multiprocessing client).
778778

@@ -894,7 +894,7 @@ Figure 12. roofline perf model
894894

895895
For a more rigorous treatment, we have to account for kernel auto-tuning: as `B` grows, the runtime may switch to more efficient kernels for that shape, changing the achieved performance `P_kernel`. Step latency is `t = FLOPs_step / P_kernel`, where `FLOPs_step` is the work in the step. You can see that as `P_kernel` hits `P_peak` more compute per step will directly lead to an increase in latency.
896896

897-
## How to benchmark in vLLM
897+
### How to benchmark in vLLM
898898

899899
vLLM provides a `vllm bench {serve,latency,throughput}` CLI that wraps vllm / benchmarks / {server,latency,throughput}.py.
900900

@@ -940,12 +940,6 @@ I love understanding systems. Having said that, the resolution definitely suffer
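A tiny numeric illustration of the roofline reasoning in this hunk, using illustrative H100-class numbers (rough bf16 peak and HBM bandwidth; not measurements):

```python
# Roofline sketch: achieved perf P_kernel is capped by min(P_peak, intensity * mem_bw),
# and step latency is t = FLOPs_step / P_kernel.
P_peak = 990e12   # FLOP/s, illustrative dense bf16 peak of an H100-class GPU
mem_bw = 3.35e12  # bytes/s, illustrative HBM bandwidth

def step_latency(flops_step: float, bytes_step: float) -> float:
    intensity = flops_step / bytes_step         # FLOPs per byte moved
    p_kernel = min(P_peak, intensity * mem_bw)  # roofline-achievable performance
    return flops_step / p_kernel

params = 8e9  # illustrative parameter count; ~2*params FLOPs per generated token,
              # and the weights (~2*params bytes in bf16) are read once per step

# Decode-like step (1 token): intensity ~1 FLOP/byte -> memory-bandwidth bound.
print(step_latency(flops_step=2 * params, bytes_step=2 * params))         # ~0.005 s
# Prefill-like step (4096 tokens): same bytes, ~4096x the FLOPs -> compute bound.
print(step_latency(flops_step=2 * params * 4096, bytes_step=2 * params))  # ~0.066 s
```
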
940940

941941
If you spot any errors in the post, please DM me - feel free to drop me a message on [X](https://x.com/gordic_aleksa) or [LinkedIn](https://www.linkedin.com/in/aleksagordic/) or via [anon feedback](https://docs.google.com/forms/d/1z1fEirrN2xtGxAsJvptpM7yV4ByT5SF25S-XiMPrXNA/edit).
942942

943-
## Acknowledgements
944-
945-
A huge thank you to [Hyperstack](https://www.hyperstack.cloud/) for providing me with H100s for my experiments over the past year!
946-
947-
Thanks to [Nick Hill](https://www.linkedin.com/in/nickhillprofile/) (core vLLM contributor, RedHat), [Mark Saroufim](https://x.com/marksaroufim) (PyTorch), [Kyle Krannen](https://www.linkedin.com/in/kyle-kranen/) (NVIDIA, Dynamo), and [Ashish Vaswani](https://www.linkedin.com/in/ashish-vaswani-99892181/) for reading pre-release version of this blog post and providing feedback!
948-
949943
## References
950944

951945
1. [vLLM](https://github.com/vllm-project/vllm)
