# [model_free_ptq] Multi-gpu support, validate on meta model (#2448)
Conversation
Signed-off-by: Kyle Sayers <[email protected]>
👋 Hi! Thank you for contributing to llm-compressor. Please add the `ready` label when the PR is ready for review. Note: this is required to complete the testing suite, so please only add the label once the PR is code complete and local testing has been performed.
Code Review
This pull request introduces multi-GPU support for `model_free_ptq` by adding a `DeviceLoadBalancer` class. This is a solid approach to parallelize the quantization process. The changes also include a performance optimization in the validation step by loading tensors on the meta device. My main feedback is around the `inject_device` decorator, which, while functional, could be replaced with a more explicit pattern to improve code clarity and maintainability.
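The `DeviceLoadBalancer` class itself is not quoted in this thread. Below is a minimal sketch of what such a class could look like, inferred only from the `get_device()`/`release_device()` calls in the diff that follows; the constructor, locking, and least-loaded policy are assumptions:

```python
import threading
from collections import Counter

import torch


class DeviceLoadBalancer:
    """Hands out the least-loaded device and tracks active jobs per device."""

    def __init__(self, devices: list[torch.device]):
        self._devices = devices
        self._active = Counter({device: 0 for device in devices})
        self._lock = threading.Lock()

    def get_device(self) -> torch.device:
        # pick the device currently running the fewest jobs
        with self._lock:
            device = min(self._devices, key=lambda d: self._active[d])
            self._active[device] += 1
            return device

    def release_device(self, device: torch.device) -> None:
        with self._lock:
            self._active[device] -= 1
```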
```python
@staticmethod
def inject_device(func):
    """
    Decorator that manages device lifecycle for functions.

    The decorated function should have a 'device' parameter. When calling
    the wrapped function, pass a DeviceLoadBalancer instance in place of
    the device parameter. The decorator will automatically:
    1. Get a device from the load balancer
    2. Call the function with that device
    3. Release the device when complete (even if an exception occurs)

    :param func: Function to decorate (must have a 'device' parameter)
    :return: Wrapped function that accepts load_balancer instead of device
    """

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        signature = inspect.signature(func)
        bound_args = signature.bind(*args, **kwargs)
        bound_args.apply_defaults()
        kwargs = dict(bound_args.arguments)

        load_balancer: DeviceLoadBalancer = kwargs.pop("device")
        device = load_balancer.get_device()
        kwargs["device"] = device

        try:
            return func(**kwargs)
        finally:
            load_balancer.release_device(device)

    return wrapper
```
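For illustration, a call site for a function wrapped with this decorator might look like the sketch below. The function name and arguments are hypothetical, and `inject_device` is assumed to live on `DeviceLoadBalancer` (the diff shows only the `@staticmethod` marker); note that the caller passes the balancer through the `device` parameter, which is exactly the ambiguity discussed in the review comment that follows:

```python
import torch

# hypothetical worker function; its `device` parameter is declared as a
# torch.device, but the call site passes a DeviceLoadBalancer instead
@DeviceLoadBalancer.inject_device
def process_file(file_path: str, device: torch.device):
    # inside the function body, `device` is a concrete torch.device
    ...

balancer = DeviceLoadBalancer([torch.device("cuda:0"), torch.device("cuda:1")])
# the balancer is passed *as* the device argument; the decorator swaps it
# for a real device before the function body runs
process_file("model-00001-of-00002.safetensors", device=balancer)
```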
While using a decorator to manage the device lifecycle is a clever approach, the current implementation of `inject_device` introduces ambiguity. It requires the decorated function's `device` parameter to accept a `DeviceLoadBalancer` instance at the call site, which is then replaced by a `torch.device` object within the function. This name-based argument type override can be confusing for developers and static analysis tools.
A more explicit and less magical pattern would be to remove the decorator and use a `try...finally` block directly in the functions that require a device. This would improve readability and maintainability.
For example, `process_file` in `src/llmcompressor/entrypoints/model_free/process.py` could be refactored as follows:

```python
# No decorator here
def process_file(
    ...,
    load_balancer: "DeviceLoadBalancer",
):
    device = load_balancer.get_device()
    try:
        # original function body using `device`
        ...
    finally:
        load_balancer.release_device(device)
```

This approach is clearer and aligns with how `validate_file` handles the `load_balancer` argument, promoting consistency across the codebase.
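Separately, the review summary notes that validation now loads tensors on the meta device. As a hedged sketch of that general technique (the model path and classes here are illustrative, not taken from this PR): constructing a model under the `meta` device gives every parameter its shape and dtype without allocating real storage, so structural validation is nearly free.

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# materialize only the model *structure* on the meta device
config = AutoConfig.from_pretrained("path/to/model")
with torch.device("meta"):
    meta_model = AutoModelForCausalLM.from_config(config)

# module names, shapes, and dtypes can be validated without loading weights
for name, param in meta_model.named_parameters():
    assert param.device.type == "meta"
```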
Signed-off-by: Kyle Sayers <[email protected]>
The quality checks have failed. Please run `make style && make quality` locally.
This pull request has merge conflicts that must be resolved before it can be merged.
…ined parallelized partial reads (#2498)

## Purpose
Eliminates the `reindex_fused_weights` preprocessing step for microscale schemes (NVFP4, MXFP4) by enabling each shard to be processed independently with full parallelism, even when fused weight sets (q/k/v, gate/up) span multiple shards.

## Approach
Instead of grouping shards together (which reduces parallelism), each shard process fetches only the specific fused partner tensors it needs from other shards via targeted partial safetensors reads, computes the fused global scale locally, and writes only its own output shard. No cross-process coordination or file locking required.

## Changes

### `helpers.py`
Added `build_tensor_file_index()` — reads `index.json` once at startup and builds a flat mapping of `tensor_name → resolved_file_path`. This gives each worker process an O(1) lookup to find which file contains any fused partner tensor, without re-scanning headers at runtime.

### `process.py`
Updated `process_file_microscale_scheme()` with an optional `tensor_file_index` parameter. When provided:
- `_fetch_fused_partners()` is called to identify any fused set members missing from the current shard, then fetches only those specific tensors via partial safetensors reads (headers + target tensors only, not full files)
- Fused global scale is computed locally using all members of the fused set
- `_belongs_to_shard()` ensures only native tensors are written to the output shard — fetched partner tensors are used for scale computation only and never written to the wrong shard

### `__init__.py`
Simplified back to one job per shard — full parallelism restored. For microscale schemes, builds the `tensor_file_index` once from `index.json` and passes it to each job. No union-find, no grouping logic needed.

### `validate.py`
Removed `NotImplementedError` for cross-shard fused weights — the case is now handled natively. Replaced with `logger.debug` noting that partner tensors will be resolved via partial reads.

## Latest Updates: Eliminate reindexing step via inverse_weights_map with unified job signatures

## Approach
Each shard job receives a precomputed `inverse_weights_map` specifying exactly which tensors to load from which files. For cross-shard fused weights, only the shard owning the **primary** tensor (q_proj, gate_proj) fetches its partners — preventing double reads. All jobs share a unified signature for both standard and microscale schemes.

## Changes

### `microscale.py`
- Refactor `DEFAULT_FUSED_MAPPINGS` from a list of lists to `{primary_pattern: [partner_templates]}` — only the primary-owning shard fetches its partners, preventing double reads for cross-shard fused weights
- Move `build_inverse_weights_map()` here — uses regex match on primary patterns to construct partner names and locate them in other shards

### `process.py`
- **Unified signature** for `validate_file`, `process_file`, and `process_file_microscale_scheme`: `(inverse_weights_map, save_path, scheme, ignore, device, converter)`
- All functions use `safe_open` + `f.get_tensor()` for true partial reads
- Partner tensors re-saved into requesting shard's output; caller updates the safetensors index to reflect new locations

### `__init__.py`
- Single `_get_weights_map()` helper handles both single-file and multi-file models (reads `safetensors.index.json` or scans file headers via `safe_open`)
- Single `_build_quantization_jobs()` replaces separate standard/microscale builders — one job per shard with identical tuple structure for both
- Validate jobs use `*job[1:]` for full future-proofing

### `helpers.py`
- Removed `build_weights_map` and `build_inverse_weights_map` (moved to `microscale.py`)

### `validate.py`
- Removed `NotImplementedError` for cross-shard fused weights — handled natively
- Updated to reflect the `inverse_weights_map`-based approach

## Testing
- `pytest tests/llmcompressor/entrypoints/model_free/` — all passing locally
- `make style && make quality` — all checks pass

Signed-off-by: David Zheng <[email protected]>
Closes #2497
Related to #2448
Signed-off-by: David Zheng <[email protected]>
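For concreteness, here is a minimal sketch of the two building blocks described above: constructing a tensor-to-file index from the safetensors `index.json`, and fetching a single partner tensor via a partial read with `safe_open`. The actual signatures in `helpers.py` and `process.py` are not quoted in this thread, so the names and return types here are assumptions:

```python
import json
import os

import torch
from safetensors import safe_open


def build_tensor_file_index(model_dir: str) -> dict[str, str]:
    """Map tensor_name -> resolved file path using model.safetensors.index.json."""
    index_path = os.path.join(model_dir, "model.safetensors.index.json")
    with open(index_path) as f:
        # "weight_map" maps tensor names to shard file names
        weight_map = json.load(f)["weight_map"]
    return {
        name: os.path.join(model_dir, file_name)
        for name, file_name in weight_map.items()
    }


def fetch_tensor(tensor_file_index: dict[str, str], name: str) -> torch.Tensor:
    """Partial read: only the file header and the requested tensor are read."""
    with safe_open(tensor_file_index[name], framework="pt") as f:
        return f.get_tensor(name)
```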
## Purpose
Add multi-GPU support to `model_free_ptq` and speed up validation by loading tensors on the meta device.

## Changes
- Added multi-GPU support to `model_free_ptq` via a new `DeviceLoadBalancer`, which attempts to spread jobs across devices as evenly as possible (see the dispatch sketch below)

## Testing
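As referenced in the changes above, the `DeviceLoadBalancer` spreads jobs across devices. Purely for illustration, dispatching one job per shard with such a balancer might look like the sketch below; the executor choice, the shard names, and the `process_file` job function are assumptions, not this PR's exact orchestration:

```python
from concurrent.futures import ThreadPoolExecutor

import torch

# enumerate available GPUs and hand them to the balancer (hypothetical setup)
devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
balancer = DeviceLoadBalancer(devices)

shards = [
    "model-00001-of-00003.safetensors",
    "model-00002-of-00003.safetensors",
    "model-00003-of-00003.safetensors",
]

with ThreadPoolExecutor(max_workers=len(devices)) as pool:
    # each job acquires the least-loaded device on entry and releases it on exit
    futures = [pool.submit(process_file, shard, device=balancer) for shard in shards]
    for future in futures:
        future.result()
```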