
Feat (llm/awq): activation-aware weight scaling #1213

Open · @pablomlago wants to merge 16 commits into dev from feat-llm-awq
Conversation

@pablomlago (Collaborator) commented Mar 7, 2025

Reason for this PR

Implementation of AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.

Using weight-only quantization and the configuration:

weight_bit_width: 3
weight_group_size: 128
weight_quant_granularity: per_group
weight_quant_type: asym
scaling_min_val: 0.00001
quantize_weight_zero_point: true
| Method         | OPT-125M | Llama3 1B |
|----------------|----------|-----------|
| Float16        | 23.77    | 8.77      |
| RTN            | 45.72    | 34.38     |
| AWQ repo*      | 31.53    | 15.16     |
| AWQ scale      | 33.97    | 19.39     |
| AWQ clip       | 34.22    | 20.80     |
| AWQ scale+clip | 31.53    | 15.77     |

*Minor differences in perplexity between the original repository and Brevitas are due to differences in the order of operations and in the quantizers.
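
For context, the scale search at the core of AWQ is a small grid search over per-input-channel activation magnitudes. The sketch below is a minimal, hedged illustration of that idea only; it is not the Brevitas implementation, and `quantize_fn` stands in for whatever weight quantizer is in use (e.g. a Brevitas proxy):

```python
import torch


def search_awq_scale(weight, x, quantize_fn, n_grid=20):
    """Grid-search a per-input-channel scale that minimizes the output error
    of the fake-quantized layer."""
    x_absmean = x.abs().mean(dim=0)          # per-input-channel activation magnitude
    fp_out = x @ weight.t()                  # float reference output
    best_err, best_scale = float("inf"), torch.ones_like(x_absmean)
    for i in range(n_grid):
        ratio = i / n_grid
        scale = x_absmean.clamp(min=1e-4) ** ratio
        scale = scale / (scale.max() * scale.min()).sqrt()   # normalize around 1
        # Fold the scale into the weights, fake-quantize, then fold it back out;
        # at runtime the activations would be divided by the same scale.
        w_q = quantize_fn(weight * scale) / scale
        err = (x @ w_q.t() - fp_out).pow(2).mean().item()
        if err < best_err:
            best_err, best_scale = err, scale
    return best_scale
```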

Changes Made in this PR

  • Created a dataclass RegionAWQ, inheriting from Region, to aggregate the information about the modules on which AWQ optimizes the scale (a hypothetical sketch follows below).
  • Adapted auto_scale and auto_clip to rely on Brevitas quantizers.
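
A rough, hypothetical sketch of what such a dataclass could look like; the field names below are illustrative and not the actual Brevitas API, and the sketch is standalone rather than inheriting from Region:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

import torch.nn as nn


@dataclass
class RegionAWQ:
    # Modules producing the activation that is shared by the sinks
    srcs: Dict[str, nn.Module] = field(default_factory=dict)
    # Modules whose weights share the AWQ scale along their input channels
    sinks: Dict[str, nn.Module] = field(default_factory=dict)
    # Enclosing block, used to measure the output error during the scale search
    block: Optional[nn.Module] = None
```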

Testing Summary

Testing apply_awq against the author's repository.

Risk Highlight

  • This PR includes code from another work (please detail).
  • This PR contains API-breaking changes.
  • This PR depends on work in another PR (please provide links/details).
  • This PR introduces new dependencies (please detail).
  • There are coverage gaps not covered by tests.
  • Documentation updates required in subsequent PR.

Checklist

  • Code comments added to any hard-to-understand areas, if applicable.
  • Changes generate no new warnings.
  • Updated any relevant tests, if applicable.
  • No conflicts with destination dev branch.
  • I reviewed my own code changes.
  • Initial CI/CD passing.
  • 1+ reviews given, and any review issues addressed and approved.
  • Post-review full CI/CD passing.

@pablomlago marked this pull request as ready for review March 10, 2025 12:09
@pablomlago requested a review from Giuseppe5 March 10, 2025 12:09
@pablomlago changed the title from "[DRAFT] Feat (llm/awq): activation-aware weight scaling" to "Feat (llm/awq): activation-aware weight scaling" Mar 10, 2025
@@ -780,9 +781,11 @@ def _no_equalize():
for module in chain(src_axes.values(), sink_axes.values()):
rewriters.extend(module.instantiate_rewriters(rewriter_class, scaling_factors))

# Apply rewriters before offloading
# Apply rewriters before offloading, if parametrize_inplace is True. Note that parametrizations
# are not applied immediately, to prevent potential errors if the model is offloaded.
Collaborator

Can you elaborate a bit more on the issue here?

raise ValueError # early exit to break later inference

# patch layer 0 to catch input and kwargs
layers[0] = Catcher(layers[0])
blocks[0] = Catcher(blocks[0])
Collaborator

I don't think we need this part of the codebase; why can't we do what we do in GPTQ to catch the input to the first block?
We can also move that piece of code to some utils in examples/common/generative.
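
For reference, the Catcher pattern referred to in the snippet above usually looks roughly like the sketch below (assumed behaviour inferred from the comments in the diff, not the exact code in this PR): it records the inputs and keyword arguments of the first block and then raises to abort the rest of the forward pass.

```python
import torch.nn as nn


class Catcher(nn.Module):
    """Wraps the first transformer block to capture its inputs during a
    single calibration forward pass."""

    def __init__(self, module, inps, cache):
        super().__init__()
        self.module = module   # original block, restored after calibration
        self.inps = inps       # list collecting the positional inputs
        self.cache = cache     # dict collecting the forward kwargs

    def forward(self, x, **kwargs):
        self.inps.append(x)
        self.cache.update(kwargs)
        raise ValueError  # early exit to break later inference
```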

@@ -64,3 +65,30 @@ def run(*args, **kwargs):
return function(*args, **kwargs)

return run


def longest_common_prefix(strings: List[str]):
Collaborator

This seems overly specific to AWQ; not sure if this should live here.
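
For reference, a minimal version of such a helper could look like the sketch below, assuming it returns the longest shared leading substring of a list of module names (presumably used to derive a common name for the modules in an AWQ region):

```python
from typing import List


def longest_common_prefix(strings: List[str]) -> str:
    if not strings:
        return ""
    prefix = strings[0]
    for s in strings[1:]:
        # Shrink the candidate prefix until the current string starts with it
        while not s.startswith(prefix):
            prefix = prefix[:-1]
            if not prefix:
                return ""
    return prefix
```

For example, longest_common_prefix(["model.layers.0.mlp.up_proj", "model.layers.0.mlp.gate_proj"]) would return "model.layers.0.mlp.".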

"ffn.act": block.ffn.act,
"ffn.down_proj": block.ffn.down_proj,},
))
elif "falcon" in str(block.__class__).lower():
Collaborator

Only Llama for now

@pablomlago requested a review from Giuseppe5 March 12, 2025 09:44
@nickfraser added the "next release" label (PRs which should be merged for the next release) Mar 20, 2025
@@ -370,6 +370,18 @@ def create_llm_args_parser():
default=[],
nargs='*',
help='A list of module names to expand with hadamard rotation. Default: %(default)s')
parser.add_argument(
Collaborator

Readme

@@ -251,6 +287,65 @@ def apply(self, model, is_training, quantization_enabled):
self.enable_param_quantization(model, is_training)


class disable_enable_quantization:
Collaborator

We have another class that does this as well, not in a context manager fashion.

I think we might consider just switching to this new class everywhere?
The main consideration is that we need to handle disabling quantization for activation calibration.

Collaborator Author

I'll handle it in a separate PR.
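
As a point of reference for this discussion, the context-manager pattern being described generally looks like the sketch below; the names and the disable/enable mechanism are illustrative only, not the actual Brevitas class, which operates on its own quantization proxies:

```python
class disable_enable_quantization:
    """Temporarily disables quantization on `model` for the duration of a
    `with` block, then restores it on exit."""

    def __init__(self, model, disable_fn, enable_fn):
        self.model = model
        self.disable_fn = disable_fn  # e.g. disables weight/activation quant proxies
        self.enable_fn = enable_fn    # restores the previous quantization state

    def __enter__(self):
        self.disable_fn(self.model)
        return self.model

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.enable_fn(self.model)
        return False  # do not suppress exceptions
```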

for r in rewriters:
model = r.apply(model)
if parametrize_inplace or not isinstance(r, ModuleInstanceRegisterParametrization):
Collaborator

I don't understand this. The comment above doesn't address the parametrize_inplace flag or how the two combine.

Collaborator Author

It was a leftover. I've removed it.

model=model,
tokenizer=tokenizer,
args=args,
n_samples=128,
Collaborator

Should these be n_samples=args.n_samples and seqlen=args.seqlen instead of hard-coded to 128 and 512, respectively?

@pablomlago force-pushed the feat-llm-awq branch 2 times, most recently from 170b873 to 8d65b34 on April 23, 2025 at 16:21