Add actorder support for GPTQ block quantization #2616
rk119 wants to merge 16 commits into vllm-project:main
Conversation
Signed-off-by: Riffat Khan <riffatk342@gmail.com>
Important: Review skipped. Auto incremental reviews are disabled on this repository; please check the settings in the CodeRabbit UI or the ⚙️ Run configuration. Configuration used: Path: .coderabbit.yaml | Review profile: CHILL | Plan: Pro
Walkthrough: The PR extends GPTQ's block quantization path with activation-ordering (actorder) support.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Code Review
This pull request introduces support for the BLOCK quantization strategy within the GPTQ modifier. Key changes include updating the quantize_weight function to handle block-based divisors for group indices, refactoring the activation ordering logic to accommodate block structures, and ensuring group indices are correctly saved when required. Additionally, new test cases were added to verify the block quantization variant with different activation orderings. I have no feedback to provide as there are no review comments to evaluate.
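To make the divisor-based group indices concrete, here is a small illustrative sketch (not the PR's exact code; the helper name `make_g_idx` is hypothetical) of how a single divisor maps weight columns to group indices for both the GROUP and BLOCK strategies:

```python
import torch

# Hypothetical helper: assign each weight column a group index using one
# divisor, which would be the group size for GROUP or the block width for BLOCK.
def make_g_idx(num_columns: int, divisor: int) -> torch.Tensor:
    return torch.arange(num_columns) // divisor

# With a (128, 128) block structure, columns 0-127 map to block column 0,
# columns 128-255 to block column 1, and so on.
print(make_g_idx(512, 128))  # 512 entries: 128 zeros, 128 ones, 128 twos, 128 threes
```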
The quality checks have failed. Please run …
```python
if not has_gidx:
    g_idx = None
if actorder == ActivationOrdering.GROUP:
```
We can simplify this: only group act order saves g_idx, so we can just check for that.
No need to check twice; just do this line on line 287 and remove g_idx_to_save (see the sketch below).
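A minimal sketch of the suggested simplification, assuming the names from the diff hunk above (the enum here is a stand-in for compressed-tensors' `ActivationOrdering`, and `g_idx_to_persist` is a hypothetical helper):

```python
from enum import Enum
from typing import Optional

class ActivationOrdering(str, Enum):  # stand-in for the compressed-tensors enum
    GROUP = "group"
    WEIGHT = "weight"

def g_idx_to_persist(actorder: Optional[ActivationOrdering], g_idx):
    # One check replaces the has_gidx flag plus the second actorder test:
    # only group activation ordering ever persists g_idx.
    return g_idx if actorder == ActivationOrdering.GROUP else None
```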
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/llmcompressor/transformers/gptq/test_gptq_oneshot.py (1)
180-188: ⚠️ Potential issue | 🟡 Minor. The assertions don't pin the new BLOCK behavior yet.
Relaxing this to `4 or 8` means a deserialization fallback to any other 8-bit scheme would still pass. For the BLOCK recipes, assert `weight_args.strategy` and `weight_args.block_structure` too, so this test actually verifies the path added by the PR. As per coding guidelines: "tests/**/*.py: Ensure PyTest tests are clear, comprehensive, and cover edge cases for quantization scenarios."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/llmcompressor/transformers/gptq/test_gptq_oneshot.py` around lines 180-188: the test currently only checks num_bits for weight_args, allowing 4 or 8, which doesn't ensure the new BLOCK recipe path. Update the assertions after obtaining weight_args (from quantization_config.config_groups["group_0"]) to also assert that weight_args.strategy equals the BLOCK strategy name used in the codebase (e.g., "BLOCK") and that weight_args.block_structure matches the expected structure (e.g., a tuple/list describing block dims), so the test verifies strategy and block_structure in addition to num_bits. Locate the symbols quantization_config, quant_scheme, weight_args, and QuantizationArgs, and add assertions for weight_args.strategy and weight_args.block_structure accordingly.
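A minimal sketch of what the strengthened assertions could look like; the `WeightArgs` stand-in, the strategy string, and the block dims here are assumptions for illustration, not the test's real types or values:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WeightArgs:  # hypothetical stand-in for the deserialized weight args
    num_bits: int
    strategy: str
    block_structure: List[int]

def check_block_recipe(weight_args: WeightArgs) -> None:
    # Pin the BLOCK path itself, not just the bit width, so a fallback to a
    # different 8-bit scheme cannot silently pass.
    assert weight_args.num_bits in (4, 8)
    assert weight_args.strategy == "block"            # assumed strategy name
    assert weight_args.block_structure == [128, 128]  # assumed block dims

check_block_recipe(WeightArgs(num_bits=8, strategy="block", block_structure=[128, 128]))
```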
src/llmcompressor/modifiers/gptq/gptq_quantize.py (1)
263-287: ⚠️ Potential issue | 🟠 Major. BLOCK strategy with activation ordering creates an untested, unsupported code path with cross-repository incompatibility.
When `strategy == QuantizationStrategy.BLOCK` and `actorder == ActivationOrdering.GROUP`, the code saves `weight_g_idx` (line 275). However:
- Incompatibility with compressed-tensors initialize.py: g_idx is only registered as a parameter for GROUP/TENSOR_GROUP strategies; BLOCK never registers `{base_name}_g_idx`, so the saved g_idx cannot be restored to a module parameter when loading.
- Missing g_idx handling in calibration flatten: `_flatten_weight()` in src/llmcompressor/observers/helpers.py:75-80 only applies g_idx reordering for GROUP/TENSOR_GROUP strategies. For BLOCK, g_idx is ignored during flattening, even though it is passed from imatrix.py:165.
- No test coverage: there are no tests combining the BLOCK strategy with any ActivationOrdering, making this a completely untested path.
This indicates an incomplete design decision: either BLOCK should not save g_idx when using activation ordering, or initialize.py and _flatten_weight should both be updated to support g_idx for BLOCK (with proper divisor-based grouping).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/llmcompressor/modifiers/gptq/gptq_quantize.py` around lines 263 - 287, The BLOCK strategy currently writes weight_g_idx when actorder == ActivationOrdering.GROUP (in gptq_quantize.py) which is incompatible because compressed-tensors initialize.py does not register a {base_name}_g_idx for BLOCK and _flatten_weight (helpers._flatten_weight) ignores g_idx for BLOCK; either stop emitting weight_g_idx for QuantizationStrategy.BLOCK when using ActivationOrdering.GROUP (remove the g_idx_to_save assignment and related q_param_dict entry) or fully add BLOCK support: update initialize.py to register {base_name}_g_idx for BLOCK with the correct divisor/grouping logic, extend helpers._flatten_weight to apply g_idx reordering for BLOCK like GROUP/TENSOR_GROUP, ensure imatrix.py passes g_idx consistently, and add tests covering BLOCK combined with ActivationOrdering variations to validate serialization and calibration paths.
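If the first option (stop emitting g_idx for BLOCK) were taken, the gating could look roughly like this sketch; the enum and function names are stand-ins for illustration, not the repo's actual code:

```python
from enum import Enum

class QuantizationStrategy(str, Enum):  # stand-in for the compressed-tensors enum
    GROUP = "group"
    TENSOR_GROUP = "tensor_group"
    BLOCK = "block"

# Per the comment above, only these strategies register a {base_name}_g_idx
# parameter on load, so only they should serialize one.
G_IDX_STRATEGIES = {QuantizationStrategy.GROUP, QuantizationStrategy.TENSOR_GROUP}

def should_save_g_idx(strategy: QuantizationStrategy, actorder_is_group: bool) -> bool:
    return actorder_is_group and strategy in G_IDX_STRATEGIES
```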
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/llmcompressor/transformers/gptq/test_gptq_oneshot.py`:
- Around line 99-127: The two test recipes recipe_modifier_full_block and
recipe_modifier_block_actorder_weight currently construct QuantizationArgs
without specifying type, so they default to QuantizationType.INT; update both
QuantizationArgs in the GPTQModifier config_groups to include
type=QuantizationType.FLOAT so the tests exercise FP8 block quantization
behavior (locate the QuantizationArgs instances inside GPTQModifier for group_0
targeting "re:.*model.layers.2.self_attn.q_proj$" and add the type field).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5cf4ada2-ab75-44d8-b52d-d5b43b6187e9
📒 Files selected for processing (2)
src/llmcompressor/modifiers/gptq/gptq_quantize.py
tests/llmcompressor/transformers/gptq/test_gptq_oneshot.py
See the quality checks instructions and bot comments; otherwise this looks pretty good, just some minor fixes. We should also run some evals for this; once you have the changes in, I can help you with that. You can reach out to me on vLLM Slack for easier coordination.
Signed-off-by: Riffat Khan <riffatk342@gmail.com>
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviews: this rule is failing. PRs labelled "two-reviews" must have at least two approving reviews before merging.
SUMMARY:
Depends on: Restrict group activation ordering to group quantization strategies compressed-tensors#682
Closes: Add weight activation ordering for fp8 block #2587
Replace the separate `block_column_idx` calculation in block quantization by reusing `g_idx`, first getting the divisor from either the group size value or the block width value. This should automatically ensure the use of actorder (see the sketch below).
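A rough sketch of the divisor selection this describes, using the names from the summary rather than the PR's exact code:

```python
# Hypothetical: pick the single column divisor that both GROUP and BLOCK
# grouping share, so the same g_idx path (and therefore actorder) serves both.
def get_divisor(quant_args) -> int:
    if quant_args.strategy == "block":
        # block_structure is (rows, cols); columns are grouped by block width.
        return quant_args.block_structure[1]
    return quant_args.group_size
```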
Remove `has_gidx` and simplify the logic to save `g_idx` when `ActivationOrdering.GROUP` is opted for.
TEST PLAN: