[model_free_ptq] Multi-gpu support, validate on meta model#2448
Conversation
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Code Review
This pull request introduces multi-GPU support for model_free_ptq by adding a DeviceLoadBalancer class. This is a solid approach to parallelize the quantization process. The changes also include a performance optimization in the validation step by loading tensors on the meta device. My main feedback is around the inject_device decorator, which, while functional, could be replaced with a more explicit pattern to improve code clarity and maintainability.
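The meta-device optimization mentioned above can be sketched as follows. This is an illustrative snippet assuming PyTorch, not the PR's actual validation code: a tensor on the meta device carries shape and dtype but allocates no storage, so structural checks can run without materializing weights.

```python
import torch

# A tensor on the "meta" device records shape and dtype but allocates
# no backing storage, so validating structure is essentially free.
weight = torch.empty(4096, 4096, dtype=torch.float16, device="meta")

assert weight.device.type == "meta"
assert weight.shape == (4096, 4096)
assert weight.dtype == torch.float16

# Reading or copying values fails, since no data exists, e.g.:
# weight.to("cpu")  # raises an error on meta tensors
```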
```python
@staticmethod
def inject_device(func):
    """
    Decorator that manages device lifecycle for functions.

    The decorated function should have a 'device' parameter. When calling
    the wrapped function, pass a DeviceLoadBalancer instance in place of
    the device parameter. The decorator will automatically:

    1. Get a device from the load balancer
    2. Call the function with that device
    3. Release the device when complete (even if an exception occurs)

    :param func: Function to decorate (must have a 'device' parameter)
    :return: Wrapped function that accepts load_balancer instead of device
    """

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        signature = inspect.signature(func)
        bound_args = signature.bind(*args, **kwargs)
        bound_args.apply_defaults()
        kwargs = dict(bound_args.arguments)

        load_balancer: DeviceLoadBalancer = kwargs.pop("device")
        device = load_balancer.get_device()
        kwargs["device"] = device

        try:
            return func(**kwargs)
        finally:
            load_balancer.release_device(device)

    return wrapper
```
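To make the decorator's contract concrete, here is a self-contained sketch pairing it with a minimal, hypothetical `DeviceLoadBalancer` (least-busy selection guarded by a lock). The balancer's internals are illustrative, not the PR's implementation:

```python
import functools
import inspect
import threading


class DeviceLoadBalancer:
    """Minimal illustrative balancer: hands out the least-busy device."""

    def __init__(self, devices):
        self._lock = threading.Lock()
        self._jobs = {device: 0 for device in devices}

    def get_device(self):
        with self._lock:
            # Pick the device currently running the fewest jobs
            device = min(self._jobs, key=self._jobs.get)
            self._jobs[device] += 1
            return device

    def release_device(self, device):
        with self._lock:
            self._jobs[device] -= 1


def inject_device(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = inspect.signature(func).bind(*args, **kwargs)
        bound.apply_defaults()
        kwargs = dict(bound.arguments)

        # Caller passes a balancer where 'device' is expected; swap in
        # a concrete device before invoking the wrapped function.
        load_balancer = kwargs.pop("device")
        device = load_balancer.get_device()
        kwargs["device"] = device
        try:
            return func(**kwargs)
        finally:
            load_balancer.release_device(device)

    return wrapper


@inject_device
def process_file(name, device):
    return f"{name} on {device}"


balancer = DeviceLoadBalancer(["cuda:0", "cuda:1"])
result = process_file("model.safetensors", device=balancer)
print(result)  # the least-busy device is injected, e.g. "model.safetensors on cuda:0"
```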
While using a decorator to manage the device lifecycle is a clever approach, the current implementation of inject_device introduces ambiguity. It requires the decorated function's device parameter to accept a DeviceLoadBalancer instance at the call site, which is then replaced by a torch.device object within the function. This name-based argument type override can be confusing for developers and static analysis tools.
A more explicit and less magical pattern would be to remove the decorator and use a try...finally block directly in the functions that require a device. This would improve readability and maintainability.
For example, `process_file` in `src/llmcompressor/entrypoints/model_free/process.py` could be refactored as follows:

```python
# No decorator here
def process_file(
    ...,
    load_balancer: "DeviceLoadBalancer",
):
    device = load_balancer.get_device()
    try:
        # original function body using `device`
        ...
    finally:
        load_balancer.release_device(device)
```

This approach is clearer and aligns with how `validate_file` handles the `load_balancer` argument, promoting consistency across the codebase.
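The suggested try/finally pattern can be checked with a small stub balancer (a hypothetical test double, not code from this PR) to confirm the device is released on both the success path and the exception path:

```python
class StubBalancer:
    """Counts outstanding devices so we can verify release behavior."""

    def __init__(self):
        self.active = 0

    def get_device(self):
        self.active += 1
        return "cuda:0"

    def release_device(self, device):
        self.active -= 1


def process_file(path, load_balancer):
    device = load_balancer.get_device()
    try:
        if path is None:
            raise ValueError("no file")
        return f"processed {path} on {device}"
    finally:
        # Runs whether the body returned or raised
        load_balancer.release_device(device)


balancer = StubBalancer()
process_file("weights.safetensors", balancer)
assert balancer.active == 0  # released on success

try:
    process_file(None, balancer)
except ValueError:
    pass
assert balancer.active == 0  # released even after the exception
```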
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
The quality checks have failed. Please run
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Changes

- `model_free_ptq`: add `DeviceLoadBalancer`, which attempts to spread jobs across devices as evenly as possible

Testing