[model_free_ptq] Multi-gpu support, validate on meta model#2448

Draft
kylesayrs wants to merge 4 commits into main from kylesayrs/model_free_multi-gpu

Conversation

@kylesayrs
Collaborator

@kylesayrs kylesayrs commented Mar 6, 2026

Purpose

  • Allow for increased job parallelism using model_free_ptq
  • Reduce runtime of validation step
  • Multiprocessing

Changes

  • Support passing multiple devices to model_free_ptq
  • Device allocation is handled by DeviceLoadBalancer, which attempts to spread jobs across devices as evenly as possible
  • Load tensors on the meta device during the validation step, rather than on the cpu device
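
The even-spread allocation described above can be sketched roughly as follows. This is a hypothetical minimal version for illustration only; the actual DeviceLoadBalancer in this PR may track load differently:

```python
import threading

class SimpleLoadBalancer:
    """Hypothetical minimal load balancer: hands out the least-loaded
    device and tracks the number of outstanding jobs per device."""

    def __init__(self, devices):
        self._lock = threading.Lock()
        # number of jobs currently assigned to each device
        self._jobs = {device: 0 for device in devices}

    def get_device(self):
        # pick the device with the fewest active jobs (thread-safe)
        with self._lock:
            device = min(self._jobs, key=self._jobs.get)
            self._jobs[device] += 1
            return device

    def release_device(self, device):
        # mark one job on this device as finished
        with self._lock:
            self._jobs[device] -= 1
```

Two consecutive get_device calls land on different devices, and releasing both returns the balancer to an idle state.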

Testing


Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@github-actions

github-actions bot commented Mar 6, 2026

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the model_free_ptq functionality by introducing multi-GPU support and an intelligent device load balancing mechanism. These changes aim to improve the parallelism and efficiency of the quantization process, particularly for large models, while also optimizing the validation step by leveraging meta device loading to reduce memory overhead.

Highlights

  • Multi-GPU Support: The model_free_ptq function now accepts a list of devices, enabling parallel processing across multiple GPUs for quantization jobs.
  • Device Load Balancing: A new DeviceLoadBalancer class has been introduced to efficiently distribute quantization tasks among available devices, optimizing resource utilization and job parallelism.
  • Meta Device Validation: The validation step for safetensors files now loads tensors onto a "meta" device, reducing memory footprint and improving performance by avoiding unnecessary CPU loading.
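
The meta-device trick behind that last highlight can be illustrated with plain torch; the function name and size budget here are illustrative, not the PR's actual validation code:

```python
import torch

def validate_tensor_shape(shape, dtype=torch.float32, max_bytes=16 * 2**30):
    """Illustrative check: constructing a tensor on the meta device
    records its shape and dtype without allocating any real memory,
    so metadata can be validated even for very large tensors."""
    t = torch.empty(shape, device="meta", dtype=dtype)
    # compute the size the tensor *would* occupy, without paying for it
    return t.numel() * t.element_size() <= max_bytes
```

Because nothing is materialized, this works even for tensors far larger than available RAM, which is the point of validating on meta rather than cpu.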


Changelog
  • src/llmcompressor/entrypoints/model_free/__init__.py
    • Updated type hints to support lists and unions for device specification.
    • Replaced direct gpu_if_available usage with DeviceLoadBalancer instantiation.
    • Modified model_free_ptq to accept a list of devices and pass the DeviceLoadBalancer instance to processing jobs.
    • Exported DeviceLoadBalancer in __all__.
  • src/llmcompressor/entrypoints/model_free/device_balancer.py
    • Added DeviceLoadBalancer class to manage and distribute tasks across multiple GPU devices.
    • Implemented get_device and release_device methods for thread-safe device allocation.
    • Provided an inject_device decorator to automatically handle device lifecycle for decorated functions.
  • src/llmcompressor/entrypoints/model_free/process.py
    • Imported DeviceLoadBalancer and TYPE_CHECKING.
    • Modified validate_file to load safetensors onto the "meta" device.
    • Decorated process_file and process_file_microscale_scheme with @DeviceLoadBalancer.inject_device to integrate with the new load balancing system.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces multi-GPU support for model_free_ptq by adding a DeviceLoadBalancer class. This is a solid approach to parallelize the quantization process. The changes also include a performance optimization in the validation step by loading tensors on the meta device. My main feedback is around the inject_device decorator, which, while functional, could be replaced with a more explicit pattern to improve code clarity and maintainability.

Comment on lines +70 to +102
@staticmethod
def inject_device(func):
    """
    Decorator that manages device lifecycle for functions.

    The decorated function should have a 'device' parameter. When calling
    the wrapped function, pass a DeviceLoadBalancer instance in place of
    the device parameter. The decorator will automatically:
    1. Get a device from the load balancer
    2. Call the function with that device
    3. Release the device when complete (even if an exception occurs)

    :param func: Function to decorate (must have a 'device' parameter)
    :return: Wrapped function that accepts load_balancer instead of device
    """

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        signature = inspect.signature(func)
        bound_args = signature.bind(*args, **kwargs)
        bound_args.apply_defaults()
        kwargs = dict(bound_args.arguments)

        load_balancer: DeviceLoadBalancer = kwargs.pop("device")
        device = load_balancer.get_device()
        kwargs["device"] = device

        try:
            return func(**kwargs)
        finally:
            load_balancer.release_device(device)

    return wrapper
Contributor


medium

While using a decorator to manage the device lifecycle is a clever approach, the current implementation of inject_device introduces ambiguity. It requires the decorated function's device parameter to accept a DeviceLoadBalancer instance at the call site, which is then replaced by a torch.device object within the function. This name-based argument type override can be confusing for developers and static analysis tools.

A more explicit and less magical pattern would be to remove the decorator and use a try...finally block directly in the functions that require a device. This would improve readability and maintainability.

For example, process_file in src/llmcompressor/entrypoints/model_free/process.py could be refactored as follows:

# No decorator here
def process_file(
    ...,
    load_balancer: "DeviceLoadBalancer",
):
    device = load_balancer.get_device()
    try:
        # original function body using `device`
        ...
    finally:
        load_balancer.release_device(device)

This approach is clearer and aligns with how validate_file handles the load_balancer argument, promoting consistency across the codebase.
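
The argument-swapping behavior under discussion can be seen in a runnable toy. All names here are hypothetical stand-ins, not the PR's code:

```python
import functools
import inspect

class FakeBalancer:
    """Stand-in for DeviceLoadBalancer; always hands out one device."""
    def get_device(self):
        return "cuda:0"
    def release_device(self, device):
        self.released = device

def inject_device(func):
    """Toy version of the decorator pattern: the caller passes a
    balancer where the function expects a device; the wrapper swaps
    in a real device and releases it afterwards, even on error."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = inspect.signature(func).bind(*args, **kwargs)
        bound.apply_defaults()
        call_kwargs = dict(bound.arguments)
        balancer = call_kwargs.pop("device")   # balancer arrives as 'device'
        device = balancer.get_device()
        call_kwargs["device"] = device         # real device goes to func
        try:
            return func(**call_kwargs)
        finally:
            balancer.release_device(device)
    return wrapper

@inject_device
def quantize_shard(name, device):
    return f"{name} on {device}"
```

The call site `quantize_shard("shard-0", device=balancer)` is exactly the name-based type override the review flags: the `device` argument is a balancer outside the function and a device string inside it.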

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@kylesayrs kylesayrs marked this pull request as draft March 6, 2026 07:19
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@mergify
Copy link
Contributor

mergify bot commented Mar 6, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional dependencies to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@mergify
Copy link
Contributor

mergify bot commented Mar 10, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kylesayrs.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 10, 2026