# [model_free_ptq] Multi-gpu support, validate on meta model (#2448)
Conversation
Signed-off-by: Kyle Sayers <[email protected]>
👋 Hi! Thank you for contributing to llm-compressor. Please add the `ready` label when the PR is ready for review. Note: this is required to complete the testing suite, so please only add the label once the PR is code complete and local testing has been performed.
Code Review
This pull request introduces multi-GPU support for `model_free_ptq` by adding a `DeviceLoadBalancer` class. This is a solid approach to parallelize the quantization process. The changes also include a performance optimization in the validation step by loading tensors on the meta device. My main feedback is around the `inject_device` decorator, which, while functional, could be replaced with a more explicit pattern to improve code clarity and maintainability.
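The `DeviceLoadBalancer` class itself is not quoted in this thread. Below is a minimal sketch of what such a class could look like, inferred only from the `get_device()`/`release_device()` calls in the diff that follows; the constructor, locking, and least-loaded policy are assumptions:

```python
import threading
from collections import Counter

import torch


class DeviceLoadBalancer:
    """Hands out the least-loaded device and tracks active jobs per device."""

    def __init__(self, devices: list[torch.device]):
        self._devices = devices
        self._active = Counter({device: 0 for device in devices})
        self._lock = threading.Lock()

    def get_device(self) -> torch.device:
        # pick the device currently running the fewest jobs
        with self._lock:
            device = min(self._devices, key=lambda d: self._active[d])
            self._active[device] += 1
            return device

    def release_device(self, device: torch.device) -> None:
        with self._lock:
            self._active[device] -= 1
```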
```python
@staticmethod
def inject_device(func):
    """
    Decorator that manages device lifecycle for functions.

    The decorated function should have a 'device' parameter. When calling
    the wrapped function, pass a DeviceLoadBalancer instance in place of
    the device parameter. The decorator will automatically:
    1. Get a device from the load balancer
    2. Call the function with that device
    3. Release the device when complete (even if an exception occurs)

    :param func: Function to decorate (must have a 'device' parameter)
    :return: Wrapped function that accepts load_balancer instead of device
    """

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        signature = inspect.signature(func)
        bound_args = signature.bind(*args, **kwargs)
        bound_args.apply_defaults()
        kwargs = dict(bound_args.arguments)

        load_balancer: DeviceLoadBalancer = kwargs.pop("device")
        device = load_balancer.get_device()
        kwargs["device"] = device

        try:
            return func(**kwargs)
        finally:
            load_balancer.release_device(device)

    return wrapper
```
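For illustration, a call site for a function wrapped with this decorator might look like the sketch below. The function name and arguments are hypothetical, and `inject_device` is assumed to live on `DeviceLoadBalancer` (the diff shows only the `@staticmethod` marker); note that the caller passes the balancer through the `device` parameter, which is exactly the ambiguity discussed in the review comment that follows:

```python
import torch

# hypothetical worker function; its `device` parameter is declared as a
# torch.device, but the call site passes a DeviceLoadBalancer instead
@DeviceLoadBalancer.inject_device
def process_file(file_path: str, device: torch.device):
    # inside the function body, `device` is a concrete torch.device
    ...

balancer = DeviceLoadBalancer([torch.device("cuda:0"), torch.device("cuda:1")])
# the balancer is passed *as* the device argument; the decorator swaps it
# for a real device before the function body runs
process_file("model-00001-of-00002.safetensors", device=balancer)
```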
While using a decorator to manage the device lifecycle is a clever approach, the current implementation of `inject_device` introduces ambiguity. It requires the decorated function's `device` parameter to accept a `DeviceLoadBalancer` instance at the call site, which is then replaced by a `torch.device` object within the function. This name-based argument type override can be confusing for developers and static analysis tools.
A more explicit and less magical pattern would be to remove the decorator and use a `try...finally` block directly in the functions that require a device. This would improve readability and maintainability.
For example, `process_file` in `src/llmcompressor/entrypoints/model_free/process.py` could be refactored as follows:

```python
# No decorator here
def process_file(
    ...,
    load_balancer: "DeviceLoadBalancer",
):
    device = load_balancer.get_device()
    try:
        # original function body using `device`
        ...
    finally:
        load_balancer.release_device(device)
```

This approach is clearer and aligns with how `validate_file` handles the `load_balancer` argument, promoting consistency across the codebase.
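Separately, the review summary notes that validation now loads tensors on the meta device. As a hedged sketch of that general technique (the model path and classes here are illustrative, not taken from this PR): constructing a model under the `meta` device gives every parameter its shape and dtype without allocating real storage, so structural validation is nearly free.

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# materialize only the model *structure* on the meta device
config = AutoConfig.from_pretrained("path/to/model")
with torch.device("meta"):
    meta_model = AutoModelForCausalLM.from_config(config)

# module names, shapes, and dtypes can be validated without loading weights
for name, param in meta_model.named_parameters():
    assert param.device.type == "meta"
```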
Signed-off-by: Kyle Sayers <[email protected]>
The quality checks have failed. Please run `make style && make quality` locally.
This pull request has merge conflicts that must be resolved before it can be merged.
…ined parallelized partial reads (#2498)

## Purpose
Eliminates the `reindex_fused_weights` preprocessing step for microscale schemes (NVFP4, MXFP4) by enabling each shard to be processed independently with full parallelism, even when fused weight sets (q/k/v, gate/up) span multiple shards.

## Approach
Instead of grouping shards together (which reduces parallelism), each shard process fetches only the specific fused partner tensors it needs from other shards via targeted partial safetensors reads, computes the fused global scale locally, and writes only its own output shard. No cross-process coordination or file locking required.

## Changes

### `helpers.py`
Added `build_tensor_file_index()` — reads `index.json` once at startup and builds a flat mapping of `tensor_name → resolved_file_path`. This gives each worker process an O(1) lookup to find which file contains any fused partner tensor, without re-scanning headers at runtime.

### `process.py`
Updated `process_file_microscale_scheme()` with an optional `tensor_file_index` parameter. When provided:
- `_fetch_fused_partners()` is called to identify any fused set members missing from the current shard, then fetches only those specific tensors via partial safetensors reads (headers + target tensors only, not full files)
- Fused global scale is computed locally using all members of the fused set
- `_belongs_to_shard()` ensures only native tensors are written to the output shard — fetched partner tensors are used for scale computation only and never written to the wrong shard

### `__init__.py`
Simplified back to one job per shard — full parallelism restored. For microscale schemes, builds the `tensor_file_index` once from `index.json` and passes it to each job. No union-find, no grouping logic needed.

### `validate.py`
Removed `NotImplementedError` for cross-shard fused weights — the case is now handled natively. Replaced with `logger.debug` noting that partner tensors will be resolved via partial reads.

## Latest Updates: Eliminate reindexing step via inverse_weights_map with unified job signatures

## Approach
Each shard job receives a precomputed `inverse_weights_map` specifying exactly which tensors to load from which files. For cross-shard fused weights, only the shard owning the **primary** tensor (q_proj, gate_proj) fetches its partners — preventing double reads. All jobs share a unified signature for both standard and microscale schemes.

## Changes

### `microscale.py`
- Refactor `DEFAULT_FUSED_MAPPINGS` from a list of lists to `{primary_pattern: [partner_templates]}` — only the primary-owning shard fetches its partners, preventing double reads for cross-shard fused weights
- Move `build_inverse_weights_map()` here — uses regex match on primary patterns to construct partner names and locate them in other shards

### `process.py`
- **Unified signature** for `validate_file`, `process_file`, and `process_file_microscale_scheme`: `(inverse_weights_map, save_path, scheme, ignore, device, converter)`
- All functions use `safe_open` + `f.get_tensor()` for true partial reads
- Partner tensors re-saved into requesting shard's output; caller updates the safetensors index to reflect new locations

### `__init__.py`
- Single `_get_weights_map()` helper handles both single-file and multi-file models (reads `safetensors.index.json` or scans file headers via `safe_open`)
- Single `_build_quantization_jobs()` replaces separate standard/microscale builders — one job per shard with identical tuple structure for both
- Validate jobs use `*job[1:]` for full future-proofing

### `helpers.py`
- Removed `build_weights_map` and `build_inverse_weights_map` (moved to `microscale.py`)

### `validate.py`
- Removed `NotImplementedError` for cross-shard fused weights — handled natively
- Updated to reflect the `inverse_weights_map`-based approach

## Testing
- `pytest tests/llmcompressor/entrypoints/model_free/` — all passing locally
- `make style && make quality` — all checks pass

Signed-off-by: David Zheng <[email protected]>
Closes #2497
Related to #2448
Signed-off-by: David Zheng <[email protected]>
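For concreteness, here is a minimal sketch of the two building blocks described above: constructing a tensor-to-file index from the safetensors `index.json`, and fetching a single partner tensor via a partial read with `safe_open`. The actual signatures in `helpers.py` and `process.py` are not quoted in this thread, so the names and return types here are assumptions:

```python
import json
import os

import torch
from safetensors import safe_open


def build_tensor_file_index(model_dir: str) -> dict[str, str]:
    """Map tensor_name -> resolved file path using model.safetensors.index.json."""
    index_path = os.path.join(model_dir, "model.safetensors.index.json")
    with open(index_path) as f:
        # "weight_map" maps tensor names to shard file names
        weight_map = json.load(f)["weight_map"]
    return {
        name: os.path.join(model_dir, file_name)
        for name, file_name in weight_map.items()
    }


def fetch_tensor(tensor_file_index: dict[str, str], name: str) -> torch.Tensor:
    """Partial read: only the file header and the requested tensor are read."""
    with safe_open(tensor_file_index[name], framework="pt") as f:
        return f.get_tensor(name)
```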
## Purpose
Add multi-GPU support to `model_free_ptq` and speed up validation by loading tensors on the meta device.

## Changes
- Added multi-GPU support to `model_free_ptq` via a new `DeviceLoadBalancer`, which attempts to spread jobs across devices as evenly as possible (see the dispatch sketch below)

## Testing
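As referenced in the changes above, the `DeviceLoadBalancer` spreads jobs across devices. Purely for illustration, dispatching one job per shard with such a balancer might look like the sketch below; the executor choice, the shard names, and the `process_file` job function are assumptions, not this PR's exact orchestration:

```python
from concurrent.futures import ThreadPoolExecutor

import torch

# enumerate available GPUs and hand them to the balancer (hypothetical setup)
devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
balancer = DeviceLoadBalancer(devices)

shards = [
    "model-00001-of-00003.safetensors",
    "model-00002-of-00003.safetensors",
    "model-00003-of-00003.safetensors",
]

with ThreadPoolExecutor(max_workers=len(devices)) as pool:
    # each job acquires the least-loaded device on entry and releases it on exit
    futures = [pool.submit(process_file, shard, device=balancer) for shard in shards]
    for future in futures:
        future.result()
```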