August 29, 2025

In this post, I'll gradually introduce all of the core system components and advanced features that make up a modern high-throughput LLM inference system. In particular, I'll be doing a breakdown of how [vLLM](https://github.com/vllm-project/vllm) works.

This post is the first in a series. It starts broad and then layers in detail (following an inverse-pyramid approach) so you can form an accurate high-level mental model of the complete system without drowning in minutiae.

This configuration is:

- offline (no web/distributed system scaffolding)
- synchronous (all execution happens in a single blocking process)
- single-GPU (no data/model/pipeline/expert parallelism; DP/TP/PP/EP = 1)
- using a [standard transformer](https://arxiv.org/abs/1706.03762) (supporting hybrid models like Jamba requires a more complex hybrid KV-cache memory allocator)

From here, we'll gradually build up to an online, async, multi-GPU, multi-node inference system - but still serving a standard transformer.
In this example we do two things, we:

1. instantiate an LLM engine
2. call `generate` on our prompts

Let's start analyzing the constructor.

### LLM Engine constructor

The main components of the engine are:
Engine core itself is made up of several sub components:

1. policy setting - it can be either **FCFS** (first come first served) or **priority** (higher priority requests are served first)
2. `waiting` and `running` queues
3. KV cache manager - [the heart of paged attention](https://arxiv.org/abs/2309.06180)

The KV-cache manager maintains a `free_block_queue` - a pool of available KV-cache blocks (often on the order of hundreds of thousands, depending on VRAM size and block size). During paged attention, the blocks serve as the indexing structure that maps tokens to their computed KV cache blocks.
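As a mental model (toy code with hypothetical names, not vLLM's actual classes), you can picture it as a free list of fixed-size blocks plus a per-request block table:

```python
from collections import deque

class KVBlockPool:
    """Toy free list of fixed-size KV-cache blocks (illustration only)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size                  # tokens per block
        self.free_block_queue = deque(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # request_id -> block ids

    def allocate(self, request_id: str, num_tokens: int) -> list[int]:
        """Reserve enough blocks to hold `num_tokens` tokens."""
        needed = -(-num_tokens // self.block_size)    # ceil division
        if needed > len(self.free_block_queue):
            raise RuntimeError("out of KV-cache blocks; request must wait or preempt")
        blocks = [self.free_block_queue.popleft() for _ in range(needed)]
        self.block_tables.setdefault(request_id, []).extend(blocks)
        return blocks

    def free(self, request_id: str) -> None:
        """Return a finished request's blocks to the pool."""
        for block in self.block_tables.pop(request_id, []):
            self.free_block_queue.append(block)

pool = KVBlockPool(num_blocks=8, block_size=16)
print(pool.allocate("req-0", num_tokens=40))  # 3 blocks for 40 tokens: [0, 1, 2]
pool.free("req-0")
```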
Figure 1. Core components described in this section and their relationships

!!! tip

    Block size for a standard transformer layer ([non-MLA](https://arxiv.org/abs/2405.04434)) is computed as follows:
During model executor construction, a `Worker` object is created, and three key steps are performed:

3. Initialize KV cache

- Get per-layer KV-cache spec. Historically this was always `FullAttentionSpec` (homogeneous transformer), but with hybrid models (sliding window, Transformer/SSM like Jamba) it became more complex (see [Jenga](https://arxiv.org/abs/2503.18292))
- Run a dummy/profiling forward pass and take a GPU memory snapshot to compute how many KV cache blocks fit in available VRAM (a rough sketch of this arithmetic follows the list)
- Allocate, reshape and bind KV cache tensors to attention layers
- Prepare attention metadata (e.g. set the backend to FlashAttention) later consumed by kernels during the fwd pass
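A rough sketch of the arithmetic behind the profiling/allocation step above (all numbers made up; this is not vLLM's profiling code):

```python
def num_kv_blocks(total_vram_bytes: float, weights_bytes: float,
                  peak_activation_bytes: float, block_bytes: float,
                  gpu_memory_utilization: float = 0.9) -> int:
    """How many KV-cache blocks fit after weights + worst-case activations."""
    budget = total_vram_bytes * gpu_memory_utilization
    free_for_kv = budget - weights_bytes - peak_activation_bytes
    return max(int(free_for_kv // block_bytes), 0)

# e.g. 80 GB GPU, 16 GB of weights, 8 GB activation peak, 2 MiB blocks
print(num_kv_blocks(80e9, 16e9, 8e9, 2 * 1024**2))  # 22888 blocks
```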
I've abstracted away many low-level details here — but these are the core pieces.

Now that we have the engine initialized, let's proceed to the `generate` function.

### Generate function

The first step is to validate and feed requests into the engine. For each prompt we:
3. Pack this info into an `EngineCoreRequest`, adding priority, sampling params, and other metadata
4. Pass the request into the engine core, which wraps it in a `Request` object and sets its status to `WAITING`. This request is then added to the scheduler's `waiting` queue (append if FCFS, or heap-push if priority)

At this point the engine has been fed and execution can begin. In the synchronous engine example, these initial prompts are the only ones we'll process — there's no mechanism to inject new requests mid-run. In contrast, the asynchronous engine supports this (aka [continuous batching](https://www.usenix.org/conference/osdi22/presentation/yu)): after each step, both new and old requests are considered.
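A toy sketch of that loop shape (made-up request/step logic, not vLLM's scheduler):

```python
from collections import deque

def engine_loop(initial_prompts, incoming, step):
    """Toy continuous-batching loop: unlike the synchronous example, requests
    from `incoming` can join the running batch between engine steps."""
    waiting = deque(initial_prompts)
    running, outputs = [], []
    while waiting or running or incoming:
        if incoming:                       # continuous batching: admit new work
            waiting.append(incoming.popleft())
        while waiting:                     # schedule whatever fits (no budget here)
            running.append(waiting.popleft())
        finished, running = step(running)  # one forward pass over the whole batch
        outputs.extend(finished)
    return outputs

# toy "model": a request finishes after its counter reaches zero
counters = {"a": 2, "b": 1, "c": 3}
def step(batch):
    finished, still_running = [], []
    for req in batch:
        counters[req] -= 1
        (finished if counters[req] == 0 else still_running).append(req)
    return finished, still_running

print(engine_loop(["a", "b"], deque(["c"]), step))  # ['b', 'a', 'c']
```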
!!! tip

Figure 2. Engine loop

Next, we'll examine scheduling in more detail.

### Scheduler

There are two main types of workloads an inference engine handles:

- **prefill** - processing all of a prompt's tokens in a single pass (compute-bound)
- **decode** - generating one new token per request per step, which repeatedly reads the whole KV cache (memory-bandwidth-bound)
Figure 3. List of KV cache blocks

We're finally ready to do a forward pass!

### Run forward pass

We call model executor's `execute_model`, which delegates to the `Worker`, which in turn delegates to the model runner.
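Conceptually, the runner flattens all scheduled requests into one batch, runs a single forward pass, and samples one token per request. A toy sketch (not vLLM's code; the real runner also builds attention metadata and KV-cache slot mappings):

```python
from dataclasses import dataclass, field

@dataclass
class Req:
    prompt: list[int]
    generated: list[int] = field(default_factory=list)

def execute_model(scheduled: list[Req], model, sampler) -> list[int]:
    """Toy version of one step: flatten every scheduled request's new tokens
    into a single batch, run one forward pass, sample one token per request."""
    flat, last_idx = [], []
    for req in scheduled:
        new_tokens = req.prompt if not req.generated else req.generated[-1:]
        flat.extend(new_tokens)             # prefill: whole prompt; decode: 1 token
        last_idx.append(len(flat) - 1)      # position whose logits we sample from
    logits = model(flat)                    # one batched forward pass
    sampled = [sampler(logits[i]) for i in last_idx]
    for req, tok in zip(scheduled, sampled):
        req.generated.append(tok)
    return sampled

# toy model/sampler: "logits" are just the inputs, "sampling" adds one
print(execute_model([Req([1, 2, 3]), Req([7])],
                    model=lambda xs: xs, sampler=lambda x: x + 1))  # [4, 8]
```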
Next, we'll dive into:

1. Chunked prefill
2. Prefix caching
3. Guided decoding (FSM)
4. Speculative decoding
5. Disaggregated P/D (prefill/decoding)
### Chunked prefill

Chunked prefill is a technique for handling long prompts by splitting their prefill step into smaller chunks. Without it, we could end up with a single very long request monopolizing one engine step and preventing other prefill requests from running. That would postpone all other requests and increase their latency.
264
264
@@ -272,7 +272,7 @@ Implementation is straightforward: cap the number of new tokens per step. If the
272
272
273
273
In vLLM V1, you enable chunked prefill by setting `long_prefill_token_threshold` to a positive integer. (Technically, it can happen irrespective of this, if the prompt length exceeds the token budget we truncate it and run a chunked prefill.)
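A toy illustration of the chunking arithmetic (hypothetical budget numbers, not the scheduler's real accounting):

```python
def chunk_prefill(prompt_len: int, token_budget: int) -> list[int]:
    """Split a long prompt's prefill into per-step chunks capped by the budget."""
    chunks, remaining = [], prompt_len
    while remaining > 0:
        chunk = min(remaining, token_budget)
        chunks.append(chunk)
        remaining -= chunk
    return chunks

# an 8k-token prompt with a 2k per-step cap is prefilled over 4 engine steps,
# leaving room in each step for other requests' prefill/decode tokens
print(chunk_prefill(8192, 2048))  # [2048, 2048, 2048, 2048]
```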
### Prefix Caching

To explain how prefix caching works, let's take the original code example and tweak it a bit:
And that's the gist of prefix caching: don't recompute prefixes you've already seen.
Prefix caching is enabled by default. To disable it: `enable_prefix_caching = False`.
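A sketch of the core idea (a toy hashing scheme, not vLLM's actual implementation): hash each full block of prompt tokens together with its prefix, and reuse any leading blocks whose hashes are already cached.

```python
import hashlib

BLOCK = 4  # tokens per KV block (vLLM's default is larger)

def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each full block of tokens chained with the previous block's hash,
    so a hash identifies a *prefix*, not just the block's own tokens."""
    hashes, prev = [], ""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h = hashlib.sha256((prev + str(tokens[i:i + BLOCK])).encode()).hexdigest()
        hashes.append(h)
        prev = h
    return hashes

cache: dict[str, int] = {}   # prefix hash -> id of the cached KV block

def cached_prefix_len(tokens: list[int]) -> int:
    """How many leading tokens already have their KV cache computed."""
    hits = 0
    for h in block_hashes(tokens):
        if h not in cache:
            break
        hits += 1
    return hits * BLOCK

first = [1, 2, 3, 4, 5, 6, 7, 8, 9]
for blk_id, h in enumerate(block_hashes(first)):
    cache[h] = blk_id                         # pretend we just prefilled this request
second = [1, 2, 3, 4, 5, 6, 7, 8, 42, 43]     # shares an 8-token prefix
print(cached_prefix_len(second))              # 8 -> only the last two tokens need prefill
```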
### Guided Decoding (FSM)

Guided decoding is a technique where, at each decoding step, the logits are constrained by a grammar-based finite state machine. This ensures that only tokens allowed by the grammar can be sampled.
Figure 5. Toy example FSM

How this works in vLLM:

1. At LLM engine construction, a `StructuredOutputManager` is created; it has access to the tokenizer and maintains a `_grammar_bitmask` tensor.
2. When adding a request, its status is set to `WAITING_FOR_FSM` and `grammar_init` selects the backend compiler (e.g., [`xgrammar`](https://arxiv.org/abs/2411.15100); note that backends are 3rd party code).
3. The grammar for this request is compiled asynchronously.
4. During scheduling, if the async compile has completed, the status switches to `WAITING` and `request_id` is added to `structured_output_request_ids`; otherwise it's placed in `skipped_waiting_requests` to retry on next engine step.
5. After the scheduling loop (still inside scheduling), if there are FSM requests, the `StructuredOutputManager` asks the backend to prepare/update `_grammar_bitmask`.
Figure 6. Toy example

You can enable this in vLLM by passing in a desired `guided_decoding` config.
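To make the bitmask step concrete, here's a minimal sketch in plain Python (not the actual xgrammar/vLLM kernels): logits of disallowed tokens are pushed to minus infinity before sampling, so only grammar-legal tokens can win.

```python
import math

def apply_grammar_bitmask(logits: list[float], allowed: set[int]) -> list[float]:
    """Mask out every token the FSM's current state does not allow."""
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

def greedy_sample(logits: list[float]) -> int:
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.5, 0.1, 3.9, 1.2]   # raw model scores over a 4-token vocab
allowed_now = {0, 3}            # tokens the grammar permits in this state
masked = apply_grammar_bitmask(logits, allowed_now)
print(greedy_sample(logits), greedy_sample(masked))  # 2 unconstrained, 0 constrained
```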
### Speculative Decoding

In autoregressive generation, each new token requires a forward pass of the large LM. This is expensive — every step reloads and applies all model weights just to compute a single token! (assuming batch size == 1; in general it's `B` tokens per step)
[Speculative decoding](https://arxiv.org/abs/2302.01318) speeds this up by introducing a smaller draft LM. The draft proposes `k` tokens cheaply. But we don't ultimately want to sample from the smaller model — it's only there to guess candidate continuations. The large model still decides what's valid.
Here are the steps:

1. The draft model autoregressively proposes `k` tokens (cheap forward passes).
2. The large model then scores all of the proposed tokens in a single forward pass.
3. Going left to right, each proposal is accepted or rejected using a rejection-sampling rule that preserves the large model's output distribution; on the first rejection, a corrected token is sampled from the large model and the remaining draft tokens are discarded.
4. If every draft token is accepted, one extra token is sampled from the large model for free.
I recommend looking at [gpt-fast](https://github.com/meta-pytorch/gpt-fast) for a simple implementation, and the [original paper](https://arxiv.org/abs/2302.01318) for the math details and the proof of equivalence to sampling from the full model.
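For intuition only, here's a stripped-down greedy variant (acceptance by exact match with the target's greedy choice, not the paper's rejection-sampling rule; toy models, not vLLM code):

```python
def speculative_step(target, draft, prefix: list[int], k: int) -> list[int]:
    """Draft proposes k tokens; the target keeps the agreeing prefix plus one
    corrected (or bonus) token. Greedy acceptance for simplicity."""
    proposal, ctx = [], list(prefix)
    for _ in range(k):                        # k cheap draft forward passes
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)
    accepted = []
    for tok in proposal:                      # real impl: one batched target pass
        expected = target(prefix + accepted)
        if tok == expected:
            accepted.append(tok)              # draft guessed right -> keep it
        else:
            accepted.append(expected)         # first mismatch -> correct and stop
            break
    else:
        accepted.append(target(prefix + accepted))  # all k accepted -> bonus token
    return accepted

# toy LMs: target counts up from the last token, draft is right except every 3rd call
target = lambda ctx: ctx[-1] + 1
calls = {"n": 0}
def draft(ctx):
    calls["n"] += 1
    return ctx[-1] + (2 if calls["n"] % 3 == 0 else 1)

print(speculative_step(target, draft, prefix=[0], k=4))  # [1, 2, 3]: 2 accepted + 1 correction
```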
vLLM V1 does not support the LLM draft-model method; instead it implements faster—but less accurate—proposal schemes: n-gram, [EAGLE](https://arxiv.org/abs/2401.15077), and [Medusa](https://arxiv.org/abs/2401.10774).
One-liners on each:

- **n-gram**: propose a continuation by matching the most recent n-gram of the context against earlier text and copying the tokens that followed it.
- **EAGLE**: a lightweight head on top of the target model's hidden states that extrapolates features to draft the next tokens.
- **Medusa**: extra decoding heads trained to predict tokens several positions ahead, all proposed in parallel.
The best way to internalize this is to fire up your debugger and step through the code.
I've already previously hinted at the motivation behind disaggregated P/D (prefill/decode).
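As a rough mental model (hypothetical helper names, not vLLM's connector API): a prefill instance computes the prompt's KV cache and exports it, a decode instance imports it and only runs decode steps, so the two very different workloads no longer share a GPU.

```python
kv_store: dict[str, list] = {}   # stand-in for shared storage / a NIXL-style transfer

def prefill_instance(request_id: str, prompt: list[int]) -> None:
    """Compute-heavy side: run prefill once, export the resulting KV cache."""
    kv_cache = [tok * 0.5 for tok in prompt]     # pretend these are K/V tensors
    kv_store[request_id] = kv_cache

def decode_instance(request_id: str, steps: int) -> list[int]:
    """Bandwidth-heavy side: import the KV cache and only run decode steps."""
    kv_cache = kv_store[request_id]              # no prefill work here
    out = []
    for _ in range(steps):
        out.append(len(kv_cache))                # toy "next token"
        kv_cache.append(0.0)
    return out

prefill_instance("req-0", prompt=[3, 1, 4, 1, 5])
print(decode_instance("req-0", steps=3))  # [5, 6, 7]
```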
!!! note

    I've also experimented with [`LMCache`](https://github.com/LMCache/LMCache), the fastest production-ready connector (uses NVIDIA's NIXL as the backend), but it's still at the bleeding edge and I ran into some bugs. Since much of its complexity lives in an external repo, `SharedStorageConnector` is a better choice for explanation.
These are the steps in vLLM:

`vllm serve <model-name>`

How does this work in vLLM?
### On the headless server node

On the headless node, a `CoreEngineProcManager` launches 2 processes (per `--data-parallel-size-local`), each running `EngineCoreProc.run_engine_core`. Each of these functions creates a `DPEngineCoreProc` (the engine core) and then enters its busy loop.
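Conceptually (a toy `multiprocessing` sketch, not the actual `CoreEngineProcManager` code):

```python
import multiprocessing as mp
import time

def run_engine_core(dp_rank: int) -> None:
    """Stand-in for EngineCoreProc.run_engine_core: build the engine core,
    then sit in a busy loop waiting for work (here we just sleep and exit)."""
    print(f"engine core DP rank {dp_rank}: entering busy loop")
    time.sleep(0.1)   # real code: poll input queue, schedule, run forward passes, ...

if __name__ == "__main__":
    data_parallel_size_local = 2
    procs = [mp.Process(target=run_engine_core, args=(rank,))
             for rank in range(data_parallel_size_local)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```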
Figure 10. Distributed system with 4 DP replicas running 4 DPEngineCoreProc

Now for the second part, what happens on the API server node?
### On the API server node

We instantiate an `AsyncLLM` object (an asyncio wrapper around the LLM engine). Internally this creates a `DPLBAsyncMPClient` (data-parallel, load-balancing, asynchronous, multiprocessing client).
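A toy sketch of the load-balancing idea (made-up policy and names, not `DPLBAsyncMPClient`'s actual logic): route each request to the replica with the fewest in-flight requests.

```python
import asyncio
import random

class ToyLBClient:
    """Toy load-balancing client over several engine replicas (illustration only)."""

    def __init__(self, num_engines: int):
        self.in_flight = [0] * num_engines

    async def generate(self, prompt: str) -> str:
        engine = min(range(len(self.in_flight)), key=lambda i: self.in_flight[i])
        self.in_flight[engine] += 1
        try:
            await asyncio.sleep(random.uniform(0.01, 0.05))  # pretend engine work
            return f"engine {engine} handled: {prompt!r}"
        finally:
            self.in_flight[engine] -= 1

async def main():
    client = ToyLBClient(num_engines=4)
    results = await asyncio.gather(*(client.generate(f"p{i}") for i in range(8)))
    print("\n".join(results))

asyncio.run(main())
```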
Figure 12. Roofline perf model

For a more rigorous treatment, we have to account for kernel auto-tuning: as `B` grows, the runtime may switch to more efficient kernels for that shape, changing the achieved performance `P_kernel`. Step latency is `t = FLOPs_step / P_kernel`, where `FLOPs_step` is the work in the step. You can see that once `P_kernel` hits `P_peak`, additional compute per step directly increases latency.
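A back-of-the-envelope sketch of that reasoning (all numbers made up or approximate; it ignores KV-cache traffic and assumes weight reads dominate memory movement):

```python
def step_latency_s(batch_size: int,
                   flops_per_token: float = 2 * 7e9,   # ~2 FLOPs/param/token, 7B model
                   bytes_per_step: float = 14e9,       # fp16 weights read once per step
                   p_peak: float = 1e15,               # made-up dense fp16 peak, FLOP/s
                   mem_bw: float = 3.35e12) -> float:  # ~3.35 TB/s HBM bandwidth
    """Toy roofline: achieved perf is capped by either compute or memory traffic."""
    flops_step = batch_size * flops_per_token
    intensity = flops_step / bytes_per_step             # FLOPs per byte moved
    p_kernel = min(p_peak, mem_bw * intensity)          # roofline
    return flops_step / p_kernel

for b in (1, 8, 64, 512):
    print(b, f"{step_latency_s(b) * 1e3:.2f} ms")
# small batches: latency stays ~flat (memory-bound, weight reads dominate)
# large batches: P_kernel == P_peak, so latency grows roughly linearly with B
```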
### How to benchmark in vLLM

vLLM provides a `vllm bench {serve,latency,throughput}` CLI that wraps `vllm/benchmarks/{server,latency,throughput}.py`.
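For example (exact flags vary across versions; check `vllm bench latency --help`):

```bash
vllm bench latency --model Qwen/Qwen2.5-1.5B-Instruct
```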
I love understanding systems. Having said that, the resolution definitely suffers.

If you spot any errors in the post, please DM me - feel free to drop me a message on [X](https://x.com/gordic_aleksa) or [LinkedIn](https://www.linkedin.com/in/aleksagordic/) or via [anon feedback](https://docs.google.com/forms/d/1z1fEirrN2xtGxAsJvptpM7yV4ByT5SF25S-XiMPrXNA/edit).
## Acknowledgements

A huge thank you to [Hyperstack](https://www.hyperstack.cloud/) for providing me with H100s for my experiments over the past year!

Thanks to [Nick Hill](https://www.linkedin.com/in/nickhillprofile/) (core vLLM contributor, RedHat), [Mark Saroufim](https://x.com/marksaroufim) (PyTorch), [Kyle Krannen](https://www.linkedin.com/in/kyle-kranen/) (NVIDIA, Dynamo), and [Ashish Vaswani](https://www.linkedin.com/in/ashish-vaswani-99892181/) for reading a pre-release version of this blog post and providing feedback!