Conversation
Code Review
This pull request introduces a 'compile-only' mode to vLLM, allowing users to pre-populate the torch.compile cache using fake tensors without allocating significant GPU memory. It adds a new CLI subcommand, a specialized fake model loader, and logic to trigger compilation during a dummy profile run. Feedback focuses on ensuring compatibility with multi-module models by avoiding early termination via exceptions, refining path filtering for the compilation cache to prevent prefix collisions, and preserving parameter metadata during the transition from meta-tensors to fake tensors.
vllm/compilation/backends.py
Outdated
```python
forward_code_files = list(sorted(self.compilation_config.traced_files))
# Filter out PyTorch internal files — they are already covered
# by the torch version in env_factors.
torch_root = os.path.dirname(torch.__file__)
```
The current logic for filtering PyTorch internal files can incorrectly exclude files from other packages whose paths start with the same prefix (e.g., torch_tensorrt if torch is in the same parent directory). Adding a trailing path separator ensures that only files within the torch package directory are filtered.
```diff
-torch_root = os.path.dirname(torch.__file__)
+torch_root = os.path.join(os.path.dirname(torch.__file__), "")
```
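The prefix collision the suggestion guards against can be demonstrated without torch; the paths below are illustrative stand-ins for a site-packages layout:

```python
import os

# Hypothetical sibling packages installed in the same site-packages dir.
torch_file = "/site-packages/torch/__init__.py"
other_file = "/site-packages/torch_tensorrt/runtime.py"

# Without a trailing separator, the prefix check matches too much:
bare_root = os.path.dirname(torch_file)   # "/site-packages/torch"
print(other_file.startswith(bare_root))   # True — false positive

# os.path.join(..., "") appends the path separator, so only files
# actually inside the torch package directory match:
safe_root = os.path.join(bare_root, "")   # "/site-packages/torch/"
print(other_file.startswith(safe_root))   # False
print(torch_file.startswith(safe_root))   # True
```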
```python
parent.register_parameter(
    attr,
    nn.Parameter(fake_data, requires_grad=param.requires_grad),
)
```
Replacing Parameter subclasses with plain nn.Parameter loses critical metadata and type information that vLLM models often rely on during execution (e.g., quantization scales or specific logic in forward). This can lead to incorrect graph tracing if the model's behavior changes based on the parameter type. Consider preserving the original class and copying custom attributes.
```diff
-parent.register_parameter(
-    attr,
-    nn.Parameter(fake_data, requires_grad=param.requires_grad),
-)
+new_param = param.__class__(fake_data, requires_grad=param.requires_grad)
+for k, v in param.__dict__.items():
+    if not k.startswith("_"):
+        setattr(new_param, k, v)
+parent.register_parameter(attr, new_param)
```
This is intentional; the model behavior does not change based on the parameter type.
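For reference, the attribute-copy pattern the reviewer suggested can be sketched torch-free; `QuantParam`, `scale`, and `clone_with_new_data` are hypothetical stand-ins for a Parameter subclass carrying metadata:

```python
# Torch-free sketch of "rebuild with the original class, then copy
# public attributes" — a plain class stands in for nn.Parameter.
class BaseParam:
    def __init__(self, data):
        self.data = data

class QuantParam(BaseParam):
    """Hypothetical Parameter subclass carrying extra metadata."""
    def __init__(self, data, scale=1.0):
        super().__init__(data)
        self.scale = scale   # public metadata worth preserving
        self._cache = None   # private state, intentionally skipped

def clone_with_new_data(param, new_data):
    # Rebuild with the original class so isinstance checks still pass,
    # then copy public attributes (e.g. quantization scales) across.
    new_param = param.__class__(new_data)
    for k, v in param.__dict__.items():
        if not k.startswith("_") and k != "data":
            setattr(new_param, k, v)
    return new_param

old = QuantParam([1, 2], scale=0.5)
new = clone_with_new_data(old, [0, 0])
print(type(new).__name__, new.scale, new.data)  # QuantParam 0.5 [0, 0]
```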
Summary
=======

This PR is on the way to overlapping torch.compile and weight loading. See the [design doc](https://docs.google.com/document/d/1hssZeQv_lJlKqOr0vfpoqEUNn4ASGHVRB9a49yhcY6E/edit?tab=t.0), the [proof-of-concept PR](#36072), and the [RFC](#34956).

- Adds a `vllm compile <model> [options]` CLI command and a `vllm.compile_model()` Python API.
- These APIs populate vLLM's torch.compile cache and do nothing else. A subsequent `vllm serve` call can read from the cache and perform a warm start.
- They also use minimal GPU memory. This is accomplished through a combination of FakeTensors (tensors with no storage that report their device correctly) and meta tensors (tensors with no storage that report `device="meta"`). Note that there is still some minimal GPU memory allocation (< 10 MB, from GPUModelRunner runtime buffers); I did not track down all of it, but I'm also not sure it matters.
- In the future we can extend `vllm compile` beyond just torch.compile; for example, if vLLM uses Triton kernels or JIT'ed FlashInfer kernels, `vllm compile` could also compile those and save the compiled artifacts somewhere.

How it works
============

- `FakeModelLoader` wraps the real model loader. It initializes weights on the `meta` device and runs weight post-processing on meta tensors.
- Before torch.compile tracing, `swap_meta_params_to_fake()` converts meta params to FakeTensors. This is required so that torch.compile sees tensors with the correct devices.
  - In theory we should also get the FakeModelLoader to give us FakeTensors directly, but FakeTensors do not yet support the kind of tensor subclasses that vLLM uses.
- We raise the `CompilationDone` exception after cache artifacts are saved, to avoid executing with fake tensors. (Calling torch.compile performs both the compilation and an initial run that invokes the compiled artifact with the inputs.)
- `EngineCore` early-returns after `compile_or_warm_up_model`, skipping KV cache allocation, scheduler, and sampler setup.

Test plan
=========

- Added tests for a compile-only cold start followed by a warm start. The tests verify that the compile-only cold start uses no GPU memory, and that the warm start does end up reading from the cache.

Future work
===========

In the following order:

- Add an option to overlap torch.compile and weight loading. The main process will do weight loading while spawning a new process to do the compile-only work (which does not use GPU memory).
- Extend this design to more weight-loading schemes. For example, we currently support no weight processing. This will involve getting the additional weight processing to support meta tensors.

Signed-off-by: Richard Zou <zou3519@gmail.com>
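The `CompilationDone` control flow described above can be sketched torch-free; `save_artifacts` and `run_compiled` are hypothetical placeholders for the real cache-write and execution steps:

```python
# Sketch: raising an exception right after the cache artifacts are
# saved unwinds the stack, so the compiled artifact is never invoked
# with fake-tensor inputs.
class CompilationDone(Exception):
    """Raised once torch.compile artifacts have been written to the cache."""

def compile_and_run(save_artifacts, run_compiled):
    save_artifacts()       # populate the torch.compile cache
    raise CompilationDone  # compile-only mode stops here
    run_compiled()         # unreachable: would need real storage

saved = []
try:
    # run_compiled would fail loudly if it were ever reached
    compile_and_run(lambda: saved.append("artifacts"), lambda: 1 / 0)
except CompilationDone:
    pass
print(saved)  # ['artifacts']
```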