Conversation
Code Review
This pull request introduces a 'compile-only' mode to vLLM, allowing users to pre-populate the torch.compile cache using fake tensors without allocating significant GPU memory. It adds a new CLI subcommand, a specialized fake model loader, and logic to trigger compilation during a dummy profile run. Feedback focuses on ensuring compatibility with multi-module models by avoiding early termination via exceptions, refining path filtering for the compilation cache to prevent prefix collisions, and preserving parameter metadata during the transition from meta-tensors to fake tensors.
vllm/compilation/backends.py
Outdated
```python
forward_code_files = list(sorted(self.compilation_config.traced_files))
# Filter out PyTorch internal files — they are already covered
# by the torch version in env_factors.
torch_root = os.path.dirname(torch.__file__)
```
The current logic for filtering PyTorch internal files can incorrectly exclude files from other packages whose paths start with the same prefix (e.g., torch_tensorrt if torch is in the same parent directory). Adding a trailing path separator ensures that only files within the torch package directory are filtered.
```diff
-torch_root = os.path.dirname(torch.__file__)
+torch_root = os.path.join(os.path.dirname(torch.__file__), "")
```
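The prefix collision the suggestion guards against can be demonstrated without torch; the paths below are illustrative stand-ins for a site-packages layout:

```python
import os

# Hypothetical sibling packages installed in the same site-packages dir.
torch_file = "/site-packages/torch/__init__.py"
other_file = "/site-packages/torch_tensorrt/runtime.py"

# Without a trailing separator, the prefix check matches too much:
bare_root = os.path.dirname(torch_file)   # "/site-packages/torch"
print(other_file.startswith(bare_root))   # True — false positive

# os.path.join(..., "") appends the path separator, so only files
# actually inside the torch package directory match:
safe_root = os.path.join(bare_root, "")   # "/site-packages/torch/"
print(other_file.startswith(safe_root))   # False
print(torch_file.startswith(safe_root))   # True
```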
```python
parent.register_parameter(
    attr,
    nn.Parameter(fake_data, requires_grad=param.requires_grad),
)
```
Replacing Parameter subclasses with plain nn.Parameter loses critical metadata and type information that vLLM models often rely on during execution (e.g., quantization scales or specific logic in forward). This can lead to incorrect graph tracing if the model's behavior changes based on the parameter type. Consider preserving the original class and copying custom attributes.
```diff
-parent.register_parameter(
-    attr,
-    nn.Parameter(fake_data, requires_grad=param.requires_grad),
-)
+new_param = param.__class__(fake_data, requires_grad=param.requires_grad)
+for k, v in param.__dict__.items():
+    if not k.startswith("_"):
+        setattr(new_param, k, v)
+parent.register_parameter(attr, new_param)
```
This is intentional; the model behavior does not change based on the parameter type.
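For reference, the attribute-copy pattern the reviewer suggested can be sketched torch-free; `QuantParam`, `scale`, and `clone_with_new_data` are hypothetical stand-ins for a Parameter subclass carrying metadata:

```python
# Torch-free sketch of "rebuild with the original class, then copy
# public attributes" — a plain class stands in for nn.Parameter.
class BaseParam:
    def __init__(self, data):
        self.data = data

class QuantParam(BaseParam):
    """Hypothetical Parameter subclass carrying extra metadata."""
    def __init__(self, data, scale=1.0):
        super().__init__(data)
        self.scale = scale   # public metadata worth preserving
        self._cache = None   # private state, intentionally skipped

def clone_with_new_data(param, new_data):
    # Rebuild with the original class so isinstance checks still pass,
    # then copy public attributes (e.g. quantization scales) across.
    new_param = param.__class__(new_data)
    for k, v in param.__dict__.items():
        if not k.startswith("_") and k != "data":
            setattr(new_param, k, v)
    return new_param

old = QuantParam([1, 2], scale=0.5)
new = clone_with_new_data(old, [0, 0])
print(type(new).__name__, new.scale, new.data)  # QuantParam 0.5 [0, 0]
```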
Summary
=======

This PR is on the way to overlapping torch.compile and weight loading. See the [design doc](https://docs.google.com/document/d/1hssZeQv_lJlKqOr0vfpoqEUNn4ASGHVRB9a49yhcY6E/edit?tab=t.0), the [proof-of-concept PR](#36072), and the [RFC](#34956).

- Adds a `vllm compile <model> [options]` CLI command and a `vllm.compile_model()` Python API.
- These APIs populate vLLM's torch.compile cache and do nothing else. A subsequent `vllm serve` call can read from the cache and perform a warm start.
- They also use minimal GPU memory. This is accomplished through a combination of FakeTensors (tensors with no storage that report their device correctly) and meta tensors (tensors with no storage that report `device="meta"`). Note that there is still some minimal GPU memory allocation (< 10 MB, from GPUModelRunner runtime buffers); I did not track down all of it, but I'm also not sure it matters.
- In the future we can extend `vllm compile` beyond just torch.compile; for example, if vLLM uses Triton kernels or JIT'ed FlashInfer kernels, `vllm compile` could also compile those and save the compiled artifacts somewhere.

How it works
============

- `FakeModelLoader` wraps the real model loader. It initializes weights on the `meta` device and runs weight post-processing on meta tensors.
- Before torch.compile tracing, `swap_meta_params_to_fake()` converts meta params to FakeTensors. This is required so that torch.compile sees tensors with the correct devices.
  - In theory we should also get the FakeModelLoader to give us FakeTensors directly, but FakeTensors do not yet support the kind of tensor subclasses that vLLM uses.
- We raise the `CompilationDone` exception after cache artifacts are saved, to avoid executing with fake tensors. (Calling torch.compile performs both the compilation and an initial run that invokes the compiled artifact with the inputs.)
- `EngineCore` early-returns after `compile_or_warm_up_model`, skipping KV cache allocation, scheduler, and sampler setup.

Test plan
=========

- Added tests for a compile-only cold start followed by a warm start. The tests verify that the compile-only cold start uses no GPU memory, and that the warm start does end up reading from the cache.

Future work
===========

In the following order:

- Add an option to overlap torch.compile and weight loading. The main process will do weight loading while spawning a new process to do the compile-only work (which does not use GPU memory).
- Extend this design to more weight-loading schemes. For example, we currently support no weight processing. This will involve getting the additional weight processing to support meta tensors.

Signed-off-by: Richard Zou <zou3519@gmail.com>
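The `CompilationDone` control flow described above can be sketched torch-free; `save_artifacts` and `run_compiled` are hypothetical placeholders for the real cache-write and execution steps:

```python
# Sketch: raising an exception right after the cache artifacts are
# saved unwinds the stack, so the compiled artifact is never invoked
# with fake-tensor inputs.
class CompilationDone(Exception):
    """Raised once torch.compile artifacts have been written to the cache."""

def compile_and_run(save_artifacts, run_compiled):
    save_artifacts()       # populate the torch.compile cache
    raise CompilationDone  # compile-only mode stops here
    run_compiled()         # unreachable: would need real storage

saved = []
try:
    # run_compiled would fail loudly if it were ever reached
    compile_and_run(lambda: saved.append("artifacts"), lambda: 1 / 0)
except CompilationDone:
    pass
print(saved)  # ['artifacts']
```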