[torch.compile] Add compile-only mode#38675

Draft
zou3519 wants to merge 1 commit into main from compile-only-pr1
Conversation

Collaborator

@zou3519 zou3519 commented Apr 1, 2026

Summary

This PR is on the way to overlapping torch.compile and weight loading. See the [design doc](https://docs.google.com/document/d/1hssZeQv_lJlKqOr0vfpoqEUNn4ASGHVRB9a49yhcY6E/edit?tab=t.0), the [proof-of-concept PR](#36072), and the [RFC](#34956).

  • Adds a `vllm compile <model> [options]` CLI command and a `vllm.compile_model()` Python API.
  • These APIs populate vLLM's torch.compile cache and do nothing else. A subsequent `vllm serve` call can read from the cache and perform a warm start.
  • They also use minimal GPU memory. This is accomplished through a combination of FakeTensors (tensors with no storage that report their device correctly) and meta tensors (tensors with no storage that report device="meta"). Note that there is still some minimal GPU memory allocation (< 10 MB, from GPUModelRunner runtime buffers); I did not track all of it down, but I'm also not sure it matters.
  • In the future we can extend `vllm compile` to more than just torch.compile; for example, if vLLM uses Triton kernels or JIT'ed FlashInfer kernels, `vllm compile` could also compile those and save the compiled artifacts somewhere.
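The two storage-free tensor flavors mentioned above can be illustrated in plain PyTorch (a minimal sketch, independent of this PR; note `FakeTensorMode` lives in a torch-internal namespace):

```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

# A meta tensor carries only shape/dtype metadata, has no storage,
# and reports device="meta".
meta_t = torch.empty(1024, 1024, device="meta")

# A FakeTensor is also storage-free, but it reports a real device,
# which is what torch.compile needs to see while tracing.
with FakeTensorMode():
    fake_t = torch.empty(1024, 1024, device="cpu")

print(meta_t.device.type, fake_t.device.type)  # meta cpu
```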

How it works

  • `FakeModelLoader` initializes weights on the meta device and runs weight post-processing on meta tensors. In the future it will probably need to wrap a real model loader, but we don't need that for the current test cases.
  • Before torch.compile tracing, swap_meta_params_to_fake() converts meta params to FakeTensors. This is required so that torch.compile sees Tensors with the correct devices.
  • In theory we should also get the FakeModelLoader to give us FakeTensors, but FakeTensors do not yet support the type of Tensor subclasses that vLLM uses.
  • We raise the CompilationDone exception after cache artifacts are saved to avoid executing with fake tensors. (Calling torch.compile both compiles and immediately invokes the compiled artifact with the inputs.)
  • EngineCore early-returns after compile_or_warm_up_model, skipping KV cache allocation, scheduler, and sampler setup.
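A minimal sketch of the meta-to-fake swap described above (the function name matches the PR; the body is a simplified assumption that ignores the Parameter-subclass handling discussed in the review below):

```python
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensorMode

def swap_meta_params_to_fake(module: nn.Module, device: str = "cpu") -> None:
    """Replace meta-device parameters with FakeTensors that report `device`,
    so torch.compile traces tensors on the correct device. Sketch only."""
    fake_mode = FakeTensorMode()
    for submodule in module.modules():
        for name, param in list(submodule.named_parameters(recurse=False)):
            if param.device.type != "meta":
                continue
            with fake_mode:
                # Storage-free tensor that nevertheless reports `device`.
                fake_data = torch.empty(
                    param.shape, dtype=param.dtype, device=device
                )
            submodule.register_parameter(
                name, nn.Parameter(fake_data, requires_grad=param.requires_grad)
            )

# Initialize a model on the meta device (no real allocation), then swap.
model = nn.Linear(8, 8, device="meta")
swap_meta_params_to_fake(model)
print(model.weight.device.type)  # cpu
```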

Test plan

  • Added tests for a compile_only cold start followed by a warm start. The tests verify that the compile_only cold start uses no significant GPU memory and that the warm start actually reads from the cache.

Future work

In the following order:

  • Add an option to overlap torch.compile and weight loading. The main process will load weights while spawning a new process to do the compile-only work (which does not use GPU memory).
  • Extend this design to more weight loading schemes. For example, we currently support no weight processing; extending this will involve getting the additional weight processing to support meta tensors.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a 'compile-only' mode to vLLM, allowing users to pre-populate the torch.compile cache using fake tensors without allocating significant GPU memory. It adds a new CLI subcommand, a specialized fake model loader, and logic to trigger compilation during a dummy profile run. Feedback focuses on ensuring compatibility with multi-module models by avoiding early termination via exceptions, refining path filtering for the compilation cache to prevent prefix collisions, and preserving parameter metadata during the transition from meta-tensors to fake tensors.

forward_code_files = list(sorted(self.compilation_config.traced_files))
# Filter out PyTorch internal files — they are already covered
# by the torch version in env_factors.
torch_root = os.path.dirname(torch.__file__)

Severity: high

The current logic for filtering PyTorch internal files can incorrectly exclude files from other packages whose paths start with the same prefix (e.g., torch_tensorrt if torch is in the same parent directory). Adding a trailing path separator ensures that only files within the torch package directory are filtered.

Suggested change
torch_root = os.path.dirname(torch.__file__)
torch_root = os.path.join(os.path.dirname(torch.__file__), "")

Comment on lines +119 to +122
parent.register_parameter(
attr,
nn.Parameter(fake_data, requires_grad=param.requires_grad),
)

Severity: high

Replacing Parameter subclasses with plain nn.Parameter loses critical metadata and type information that vLLM models often rely on during execution (e.g., quantization scales or specific logic in forward). This can lead to incorrect graph tracing if the model's behavior changes based on the parameter type. Consider preserving the original class and copying custom attributes.

Suggested change
parent.register_parameter(
attr,
nn.Parameter(fake_data, requires_grad=param.requires_grad),
)
new_param = param.__class__(fake_data, requires_grad=param.requires_grad)
for k, v in param.__dict__.items():
    if not k.startswith("_"):
        setattr(new_param, k, v)
parent.register_parameter(attr, new_param)

Collaborator Author


This is intentional; the model behavior does not change based on the parameter type.

@zou3519 zou3519 force-pushed the compile-only-pr1 branch 2 times, most recently from b079ca5 to 9e6e4e5 on April 1, 2026 02:17
@zou3519 zou3519 force-pushed the compile-only-pr1 branch from 9e6e4e5 to 8f4f842 on April 1, 2026 02:24