
Commit 391fc92

Merge branch 'main' into pr-1831-plus-fixes
2 parents: 201d06c + 45b399c · commit 391fc92

File tree

17 files changed: +360 −196 lines changed


src/llmcompressor/modifiers/autoround/base.py

Lines changed: 31 additions & 27 deletions

@@ -60,35 +60,39 @@ class AutoRoundModifier(Modifier, QuantizationMixin):
     This modifier leverages signed gradient descent (SignSGD) optimizer and
     block-wise loss to optimize rounding values and weight clipping in a few steps.
-        | Sample yaml:
-        | test_stage:
-        |    modifiers:
-        |        AutoRoundModifier:
-        |            iters: 200
-        |            config_groups:
-        |                group_0:
-        |                    targets:
-        |                        - "Linear"
-        |                    input_activations: null
-        |                    output_activations: null
-        |                    weights:
-        |                        num_bits: 4
-        |                        type: "int"
-        |                        symmetric: true
-        |                        strategy: group
-        |                        group_size: 128
+    Sample yaml:
+
+    ```yaml
+    test_stage:
+      modifiers:
+        AutoRoundModifier:
+          iters: 200
+          config_groups:
+            group_0:
+              targets:
+                - "Linear"
+              input_activations: null
+              output_activations: null
+              weights:
+                num_bits: 4
+                type: "int"
+                symmetric: true
+                strategy: group
+                group_size: 128
+    ```

     Lifecycle:
-        - on_initialize
-            - apply config to model
-        - on_start
-            - add input capture hooks to decoding layers
-        - on_sequential_epoch_end
-            - apply_autoround
-            - post_autoround_cleanup
-        - on_finalize
-            - remove_hooks()
-            - model.apply(freeze_module_quantization)
+
+    - on_initialize
+        - apply config to model
+    - on_start
+        - add input capture hooks to decoding layers
+    - on_sequential_epoch_end
+        - apply_autoround
+        - post_autoround_cleanup
+    - on_finalize
+        - remove_hooks()
+        - model.apply(freeze_module_quantization)

     :param config_groups: dictionary specifying quantization schemes to apply to target
         modules. Modules not matching a scheme target will NOT be quantized.
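For readers landing on this hunk, the fenced sample above is a complete recipe that can be handed to llm-compressor's one-shot entrypoint. A minimal sketch, assuming the public `oneshot` API; the model id and calibration dataset are placeholders, not anything referenced by this commit:

```python
from llmcompressor import oneshot

# Recipe string taken from the docstring sample above.
recipe = """
test_stage:
  modifiers:
    AutoRoundModifier:
      iters: 200
      config_groups:
        group_0:
          targets: ["Linear"]
          input_activations: null
          output_activations: null
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: group
            group_size: 128
"""

oneshot(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id
    dataset="open_platypus",             # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=128,
)
```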

src/llmcompressor/modifiers/awq/base.py

Lines changed: 20 additions & 20 deletions

@@ -58,7 +58,6 @@ class AWQModifier(Modifier, QuantizationMixin):
         balance_layers: ["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"]
       - smooth_layer: "re:.*final_layer_norm"
         balance_layers: ["re:.*fc1"]
-    ]
     ignore: ["lm_head"]
     config_groups:
       group_0:

@@ -75,25 +74,26 @@ class AWQModifier(Modifier, QuantizationMixin):
     ```

     Lifecycle:
-        - on_initialize
-            - resolve mappings
-            - capture kwargs needed for forward passes into modules
-        - on_start
-            - set up activation cache hooks to capture input activations
-              to balance layers
-        - on sequential epoch end
-            - apply smoothing to each smoothing layer
-            - consume cached activations across all batches
-            - clear cached activations as they are used
-            - find best smoothing scale for each smoothing layer
-            - apply to model weights
-            - raise error if any unused activations remain
-        - on_end
-            - re-run logic of sequential epoch end (in case of basic pipeline)
-            - set scales and zero points
-            - remove activation hooks
-        - on_finalize
-            - clear resolved mappings and captured activations
+
+    - on_initialize
+        - resolve mappings
+        - capture kwargs needed for forward passes into modules
+    - on_start
+        - set up activation cache hooks to capture input activations
+          to balance layers
+    - on sequential epoch end
+        - apply smoothing to each smoothing layer
+        - consume cached activations across all batches
+        - clear cached activations as they are used
+        - find best smoothing scale for each smoothing layer
+        - apply to model weights
+        - raise error if any unused activations remain
+    - on_end
+        - re-run logic of sequential epoch end (in case of basic pipeline)
+        - set scales and zero points
+        - remove activation hooks
+    - on_finalize
+        - clear resolved mappings and captured activations

     :param sequential_targets: list of module names to compress in
         the same calibration pass
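The same kind of configuration can be expressed programmatically. A hedged sketch, assuming `AWQModifier` accepts the mixin's `targets`/`scheme`/`ignore` keywords and that `oneshot` is the public entrypoint; the preset scheme name and the model/dataset ids are placeholders, not taken from this diff:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.awq.base import AWQModifier

# Illustrative: 4-bit weight-only AWQ on Linear layers, skipping lm_head.
# Smooth/balance mappings are left to the modifier's architecture defaults here.
recipe = AWQModifier(
    targets="Linear",
    scheme="W4A16",      # assumed preset shorthand
    ignore=["lm_head"],
)

oneshot(
    model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder
    dataset="open_platypus",                   # placeholder
    recipe=recipe,
    num_calibration_samples=256,
)
```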

src/llmcompressor/modifiers/pruning/sparsegpt/base.py

Lines changed: 21 additions & 17 deletions

@@ -26,24 +26,28 @@ class SparseGPTModifier(SparsityModifierBase):
     """
     Modifier for applying the one-shot SparseGPT algorithm to a model

-        | Sample yaml:
-        | test_stage:
-        |    obcq_modifiers:
-        |        SparseGPTModifier:
-        |            sparsity: 0.5
-        |            mask_structure: "2:4"
-        |            dampening_frac: 0.001
-        |            block_size: 128
-        |            targets: ['Linear']
-        |            ignore: ['re:.*lm_head']
+    Sample yaml:
+
+    ```yaml
+    test_stage:
+      obcq_modifiers:
+        SparseGPTModifier:
+          sparsity: 0.5
+          mask_structure: "2:4"
+          dampening_frac: 0.001
+          block_size: 128
+          targets: ['Linear']
+          ignore: ['re:.*lm_head']
+    ```

     Lifecycle:
-        - on_initialize
-            - register_hook(module, calibrate_module, "forward")
-        - on_sequential_batch_end
-            - sparsify_weight
-        - on_finalize
-            - remove_hooks()
+
+    - on_initialize
+        - register_hook(module, calibrate_module, "forward")
+    - on_sequential_batch_end
+        - sparsify_weight
+    - on_finalize
+        - remove_hooks()

     :param sparsity: Sparsity to compress model to
     :param sparsity_profile: Can be set to 'owl' to use Outlier Weighed

@@ -92,7 +96,7 @@ def calibrate_module(

     :param module: module being calibrated
     :param args: inputs to the module, the first element of which is the
-        cannonical input
+        canonical input
     :param _output: uncompressed module output, unused
     """
     # Assume that the first argument is the input
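A Python equivalent of the sample above, sketched under the assumption that `SparseGPTModifier` exposes the same fields as the yaml and that `oneshot` is the public entrypoint; model and dataset ids are placeholders:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.pruning.sparsegpt.base import SparseGPTModifier

# Programmatic equivalent of the docstring's yaml sample: 2:4 semi-structured
# sparsity on every Linear layer except the lm_head.
recipe = SparseGPTModifier(
    sparsity=0.5,
    mask_structure="2:4",
    dampening_frac=0.001,
    block_size=128,
    targets=["Linear"],
    ignore=["re:.*lm_head"],
)

oneshot(
    model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder
    dataset="open_platypus",                   # placeholder
    recipe=recipe,
    num_calibration_samples=512,
)
```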

src/llmcompressor/modifiers/pruning/wanda/base.py

Lines changed: 20 additions & 16 deletions

@@ -26,23 +26,27 @@ class WandaPruningModifier(SparsityModifierBase):
     Modifier for applying the one-shot WANDA algorithm to a model
     from the paper: https://arxiv.org/abs/2306.11695

-        | Sample yaml:
-        | test_stage:
-        |    sparsity_modifiers:
-        |        WandaPruningModifier:
-        |            sparsity: 0.5
-        |            mask_structure: "2:4"
+    Sample yaml:
+
+    ```yaml
+    test_stage:
+      sparsity_modifiers:
+        WandaPruningModifier:
+          sparsity: 0.5
+          mask_structure: "2:4"
+    ```

     Lifecycle:
-        - on_initialize
-            - register_hook(module, calibrate_module, "forward")
-        - run_sequential / run_basic
-            - make_empty_row_scalars
-            - accumulate_row_scalars
-        - on_sequential_batch_end
-            - sparsify_weight
-        - on_finalize
-            - remove_hooks()
+
+    - on_initialize
+        - register_hook(module, calibrate_module, "forward")
+    - run_sequential / run_basic
+        - make_empty_row_scalars
+        - accumulate_row_scalars
+    - on_sequential_batch_end
+        - sparsify_weight
+    - on_finalize
+        - remove_hooks()

     :param sparsity: Sparsity to compress model to
     :param sparsity_profile: Can be set to 'owl' to use Outlier Weighed

@@ -78,7 +82,7 @@ def calibrate_module(

     :param module: module being calibrated
     :param args: inputs to the module, the first element of which is the
-        cannonical input
+        canonical input
     :param _output: uncompressed module output, unused
     """
     # Assume that the first argument is the input
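To make the lifecycle above concrete, here is a sketch of the WANDA scoring-plus-2:4-masking step it describes, reconstructed from the linked paper rather than taken from this file; the function name is hypothetical:

```python
import torch

def wanda_2_4_prune_sketch(weight: torch.Tensor, act_sq_norms: torch.Tensor) -> torch.Tensor:
    """
    Illustrative sketch: score each weight by |W_ij| * ||X_j||_2, where act_sq_norms
    holds accumulated squared activation norms per input channel (the "row scalars"
    in the lifecycle above), then keep the top 2 of every 4 consecutive input
    channels (mask_structure "2:4"). Assumes in_features is divisible by 4.
    """
    scores = weight.abs() * act_sq_norms.sqrt()    # (out_features, in_features)
    grouped = scores.view(weight.shape[0], -1, 4)  # groups of 4 along the input dim
    keep = torch.zeros_like(grouped, dtype=torch.bool)
    keep.scatter_(-1, grouped.topk(2, dim=-1).indices, True)
    return weight * keep.view_as(weight)
```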

src/llmcompressor/modifiers/quantization/gptq/base.py

Lines changed: 38 additions & 34 deletions

@@ -36,40 +36,44 @@ class GPTQModifier(Modifier, QuantizationMixin):
     """
     Implements the GPTQ algorithm from https://arxiv.org/abs/2210.17323. This modifier
     uses activations to calibrate a hessian matrix, which is then used to determine
-    optimal quantizion values and orderings for the model weights.
-
-        | Sample yaml:
-        | test_stage:
-        |    obcq_modifiers:
-        |        GPTQModifier:
-        |            block_size: 128
-        |            dampening_frac: 0.001
-        |            offload_hessians: False
-        |            actorder: static
-        |            config_groups:
-        |                group_0:
-        |                    targets:
-        |                        - "Linear"
-        |                    input_activations: null
-        |                    output_activations: null
-        |                    weights:
-        |                        num_bits: 8
-        |                        type: "int"
-        |                        symmetric: true
-        |                        strategy: group
-        |                        group_size: 128
+    optimal quantization values and orderings for the model weights.
+
+    Sample yaml:
+
+    ```yaml
+    test_stage:
+      obcq_modifiers:
+        GPTQModifier:
+          block_size: 128
+          dampening_frac: 0.001
+          offload_hessians: False
+          actorder: static
+          config_groups:
+            group_0:
+              targets:
+                - "Linear"
+              input_activations: null
+              output_activations: null
+              weights:
+                num_bits: 8
+                type: "int"
+                symmetric: true
+                strategy: group
+                group_size: 128
+    ```

     Lifecycle:
-        - on_initialize
-            - apply config to model
-        - on_start
-            - add activation calibration hooks
-            - add gptq weight calibration hooks
-        - on_sequential_epoch_end
-            - quantize_weight
-        - on_finalize
-            - remove_hooks()
-            - model.apply(freeze_module_quantization)
+
+    - on_initialize
+        - apply config to model
+    - on_start
+        - add activation calibration hooks
+        - add gptq weight calibration hooks
+    - on_sequential_epoch_end
+        - quantize_weight
+    - on_finalize
+        - remove_hooks()
+        - model.apply(freeze_module_quantization)

     :param sequential_targets: list of layer names to compress during GPTQ, or
         '__ALL__' to compress every layer in the model

@@ -99,7 +103,7 @@ class GPTQModifier(Modifier, QuantizationMixin):
     the kv_cache_scheme gets converted into a QuantizationScheme that:
         - targets the `q_proj` and `k_proj` modules of the model. The outputs
           of those modules are the keys and values that might be cached
-        - quantizes the outputs of the aformentioned layers, so that
+        - quantizes the outputs of the aforementioned layers, so that
           keys and values are compressed before storing them in the cache
     There is an explicit assumption that the model contains modules with
     `k_proj` and `v_proj` in their names. If this is not the case

@@ -220,7 +224,7 @@ def calibrate_module(

     :param module: module being calibrated
     :param args: inputs to the module, the first element of which is the
-        cannonical input
+        canonical input
     :param _output: uncompressed module output, unused
     """
     # Assume that first argument is the input
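As a usage illustration, a GPTQ modifier with options like those in the sample can be built and applied as below. The preset `scheme` shorthand and the `oneshot` signature are assumptions about the library's public API; model and dataset ids are placeholders:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Illustrative: GPTQ with static activation ordering and group-wise int weights,
# in the spirit of the docstring sample (which spells out an int8 group-128 scheme).
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",          # assumed preset shorthand
    ignore=["lm_head"],
    actorder="static",
    dampening_frac=0.001,
)

oneshot(
    model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder
    dataset="open_platypus",                   # placeholder
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```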

src/llmcompressor/modifiers/quantization/gptq/gptq_quantize.py

Lines changed: 1 addition & 1 deletion

@@ -286,7 +286,7 @@ def _apply_activation_ordering(
     W: torch.Tensor, H: torch.Tensor
 ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
     """
-    Permute weight and hessian in order of greatest outupt activations
+    Permute weight and hessian in order of greatest output activations

     :param W: weight to permute
     :param H: hessian used to determine activation ordering
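The corrected docstring refers to activation ordering; a minimal sketch of that idea follows, assuming the usual GPTQ act-order heuristic of sorting columns by the hessian diagonal. This is not the implementation in `gptq_quantize.py`, and the function name is hypothetical:

```python
import torch

def activation_ordering_sketch(W: torch.Tensor, H: torch.Tensor):
    # Columns with larger hessian diagonal entries correspond to inputs with larger
    # accumulated activations; quantizing them first tends to reduce GPTQ error.
    perm = torch.argsort(torch.diag(H), descending=True)
    W_perm = W[:, perm]          # permute weight columns
    H_perm = H[perm][:, perm]    # permute hessian rows and columns consistently
    return W_perm, H_perm, perm  # keep perm so the ordering can be inverted later
```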

src/llmcompressor/modifiers/quantization/quantization/base.py

Lines changed: 1 addition & 1 deletion

@@ -37,7 +37,7 @@ class QuantizationModifier(Modifier, QuantizationMixin):
     the kv_cache_scheme gets converted into a QuantizationScheme that:
         - targets the `q_proj` and `k_proj` modules of the model. The outputs
           of those modules are the keys and values that might be cached
-        - quantizes the outputs of the aformentioned layers, so that
+        - quantizes the outputs of the aforementioned layers, so that
          keys and values are compressed before storing them in the cache
     There is an explicit assumption that the model contains modules with
     `k_proj` and `v_proj` in their names. If this is not the case
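Since this docstring explains how `kv_cache_scheme` is consumed, a hedged configuration sketch follows. The scheme fields mirror the quantization args used in the samples above; the `oneshot` call and model id are assumptions, not part of this diff:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Illustrative: FP8 weights/activations plus an 8-bit float KV-cache scheme.
# Per the docstring above, the kv_cache_scheme becomes an output-activation scheme
# on the attention projections whose outputs feed the cache.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8",
    ignore=["lm_head"],
    kv_cache_scheme={
        "num_bits": 8,
        "type": "float",
        "strategy": "tensor",
        "dynamic": False,
        "symmetric": True,
    },
)

oneshot(
    model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder
    recipe=recipe,
)
```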

src/llmcompressor/modifiers/quantization/quantization/mixin.py

Lines changed: 20 additions & 18 deletions

@@ -41,26 +41,28 @@

 class QuantizationMixin(HooksMixin):
     """
-    Mixin which enables a Modifier to act as a quantization config, attching observers,
+    Mixin which enables a Modifier to act as a quantization config, attaching observers,
     calibration hooks, and compression wrappers to modifiers

     Lifecycle:
-        - on_initialize: QuantizationMixin.initialize_quantization
-            - Attach schemes to modules
-            - Attach observers to modules
-            - Disable quantization until calibration starts/finishes
-        - on_start: QuantizationMixin.start_calibration
-            - Attach calibration hooks
-            - Apply calibration status
-            - Enable quantization during calibration
-        - on_end: QuantizationMixin.end_calibration
-            - Remove calibration hooks
-            - Apply freeze status
-            - Keep quantization enabled for future steps
-    NOTE: QuantizationMixin does not update scales and zero-points on its own,
-    as this is not desired for all Modifiers inheriting from it. Modifier must
-    explicitly call `update_weight_zp_scale`.
-    See QuantizationModifier.on_start method for example
+
+    - on_initialize: QuantizationMixin.initialize_quantization
+        - Attach schemes to modules
+        - Attach observers to modules
+        - Disable quantization until calibration starts/finishes
+    - on_start: QuantizationMixin.start_calibration
+        - Attach calibration hooks
+        - Apply calibration status
+        - Enable quantization during calibration
+    - on_end: QuantizationMixin.end_calibration
+        - Remove calibration hooks
+        - Apply freeze status
+        - Keep quantization enabled for future steps
+
+    NOTE: QuantizationMixin does not update scales and zero-points on its own,
+    as this is not desired for all Modifiers inheriting from it. Modifier must
+    explicitly call `update_weight_zp_scale`.
+    See QuantizationModifier.on_start method for example

     :param config_groups: dictionary specifying quantization schemes to apply to target
         modules. Modules not matching a scheme target will NOT be quantized.

@@ -83,7 +85,7 @@ class QuantizationMixin(HooksMixin):
     the kv_cache_scheme gets converted into a QuantizationScheme that:
         - targets the `q_proj` and `k_proj` modules of the model. The outputs
           of those modules are the keys and values that might be cached
-        - quantizes the outputs of the aformentioned layers, so that
+        - quantizes the outputs of the aforementioned layers, so that
          keys and values are compressed before storing them in the cache
     There is an explicit assumption that the model contains modules with
     `k_proj` and `v_proj` in their names. If this is not the case
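The NOTE kept in this hunk says inheriting modifiers must call `update_weight_zp_scale` themselves. A hypothetical sketch of what that can look like, with assumed import paths for `Modifier` and `update_weight_zp_scale` (only the mixin's path comes from this diff):

```python
# Hypothetical subclass illustrating the NOTE above; not code from this commit.
from llmcompressor.modifiers import Modifier
from llmcompressor.modifiers.quantization.calibration import update_weight_zp_scale
from llmcompressor.modifiers.quantization.quantization.mixin import QuantizationMixin


class ExampleQuantizingModifier(Modifier, QuantizationMixin):
    def on_start(self, state, event, **kwargs):
        # The mixin attaches observers and calibration hooks...
        QuantizationMixin.start_calibration(self, state.model)
        # ...but weight scales/zero-points must be updated explicitly by the modifier.
        for module in state.model.modules():
            update_weight_zp_scale(module)
```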
