
Feat (llm/awq): activation-aware weight scaling #1213

Open · @pablomlago wants to merge 16 commits into dev from feat-llm-awq
Conversation

@pablomlago (Collaborator) commented Mar 7, 2025

Reason for this PR

Implementation of AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.

Using weight-only quantization and the configuration:

weight_bit_width: 3
weight_group_size: 128
weight_quant_granularity: per_group
weight_quant_type: asym
scaling_min_val: 0.00001
quantize_weight_zero_point: true
| Method         | OPT-125M | Llama3 1B |
|----------------|----------|-----------|
| Float16        | 23.77    | 8.77      |
| RTN            | 45.72    | 34.38     |
| AWQ repo*      | 31.53    | 15.16     |
| AWQ scale      | 33.97    | 19.39     |
| AWQ clip       | 34.22    | 20.80     |
| AWQ scale+clip | 31.53    | 15.77     |

*Minor differences in perplexity between the original repository and Brevitas are due to differences in the order of operations and in the quantizers.
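
For context, the scale search at the core of AWQ is a small grid search over per-input-channel activation magnitudes. The sketch below is a minimal, hedged illustration of that idea only; it is not the Brevitas implementation, and `quantize_fn` stands in for whatever weight quantizer is in use (e.g. a Brevitas proxy):

```python
import torch


def search_awq_scale(weight, x, quantize_fn, n_grid=20):
    """Grid-search a per-input-channel scale that minimizes the output error
    of the fake-quantized layer."""
    x_absmean = x.abs().mean(dim=0)          # per-input-channel activation magnitude
    fp_out = x @ weight.t()                  # float reference output
    best_err, best_scale = float("inf"), torch.ones_like(x_absmean)
    for i in range(n_grid):
        ratio = i / n_grid
        scale = x_absmean.clamp(min=1e-4) ** ratio
        scale = scale / (scale.max() * scale.min()).sqrt()   # normalize around 1
        # Fold the scale into the weights, fake-quantize, then fold it back out;
        # at runtime the activations would be divided by the same scale.
        w_q = quantize_fn(weight * scale) / scale
        err = (x @ w_q.t() - fp_out).pow(2).mean().item()
        if err < best_err:
            best_err, best_scale = err, scale
    return best_scale
```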

Changes Made in this PR

  • Created a dataclass RegionAWQ, inheriting from Region, to aggregate the information about the modules on which AWQ optimizes the scale (a hypothetical sketch follows below).
  • Adapted auto_scale and auto_clip to rely on Brevitas quantizers.
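
A rough, hypothetical sketch of what such a dataclass could look like; the field names below are illustrative and not the actual Brevitas API, and the sketch is standalone rather than inheriting from Region:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

import torch.nn as nn


@dataclass
class RegionAWQ:
    # Modules producing the activation that is shared by the sinks
    srcs: Dict[str, nn.Module] = field(default_factory=dict)
    # Modules whose weights share the AWQ scale along their input channels
    sinks: Dict[str, nn.Module] = field(default_factory=dict)
    # Enclosing block, used to measure the output error during the scale search
    block: Optional[nn.Module] = None
```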

Testing Summary

Testing apply_awq against the author's repository.

Risk Highlight

  • This PR includes code from another work (please detail).
  • This PR contains API-breaking changes.
  • This PR depends on work in another PR (please provide links/details).
  • This PR introduces new dependencies (please detail).
  • There are coverage gaps not covered by tests.
  • Documentation updates required in subsequent PR.

Checklist

  • Code comments added to any hard-to-understand areas, if applicable.
  • Changes generate no new warnings.
  • Updated any relevant tests, if applicable.
  • No conflicts with destination dev branch.
  • I reviewed my own code changes.
  • Initial CI/CD passing.
  • 1+ reviews given, and any review issues addressed and approved.
  • Post-review full CI/CD passing.

@pablomlago marked this pull request as ready for review March 10, 2025 12:09
@pablomlago requested a review from Giuseppe5 March 10, 2025 12:09
@pablomlago changed the title from "[DRAFT] Feat (llm/awq): activation-aware weight scaling" to "Feat (llm/awq): activation-aware weight scaling" Mar 10, 2025
@@ -780,9 +781,11 @@ def _no_equalize():
for module in chain(src_axes.values(), sink_axes.values()):
rewriters.extend(module.instantiate_rewriters(rewriter_class, scaling_factors))

# Apply rewriters before offloading
# Apply rewriters before offloading, if parametrize_inplace is True. Note that parametrizations
# are not applied immediately, to prevent potential errors if the model is offloaded.
Collaborator

Can you elaborate a bit more on the issue here?

raise ValueError # early exit to break later inference

# patch layer 0 to catch input and kwargs
layers[0] = Catcher(layers[0])
blocks[0] = Catcher(blocks[0])
Collaborator

I don't think we need this part of the codebase; why can't we do what we do in GPTQ to catch the input to the first block?
We can also move that piece of code to some utils in examples/common/generative.
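
For reference, the Catcher pattern referred to in the snippet above usually looks roughly like the sketch below (assumed behaviour inferred from the comments in the diff, not the exact code in this PR): it records the inputs and keyword arguments of the first block and then raises to abort the rest of the forward pass.

```python
import torch.nn as nn


class Catcher(nn.Module):
    """Wraps the first transformer block to capture its inputs during a
    single calibration forward pass."""

    def __init__(self, module, inps, cache):
        super().__init__()
        self.module = module   # original block, restored after calibration
        self.inps = inps       # list collecting the positional inputs
        self.cache = cache     # dict collecting the forward kwargs

    def forward(self, x, **kwargs):
        self.inps.append(x)
        self.cache.update(kwargs)
        raise ValueError  # early exit to break later inference
```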

@@ -64,3 +65,30 @@ def run(*args, **kwargs):
return function(*args, **kwargs)

return run


def longest_common_prefix(strings: List[str]):
Collaborator

This seems overly specific to AWQ; not sure if this should live here.
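
For reference, a minimal version of such a helper could look like the sketch below, assuming it returns the longest shared leading substring of a list of module names (presumably used to derive a common name for the modules in an AWQ region):

```python
from typing import List


def longest_common_prefix(strings: List[str]) -> str:
    if not strings:
        return ""
    prefix = strings[0]
    for s in strings[1:]:
        # Shrink the candidate prefix until the current string starts with it
        while not s.startswith(prefix):
            prefix = prefix[:-1]
            if not prefix:
                return ""
    return prefix
```

For example, longest_common_prefix(["model.layers.0.mlp.up_proj", "model.layers.0.mlp.gate_proj"]) would return "model.layers.0.mlp.".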

"ffn.act": block.ffn.act,
"ffn.down_proj": block.ffn.down_proj,},
))
elif "falcon" in str(block.__class__).lower():
Collaborator

Only Llama for now

@pablomlago requested a review from Giuseppe5 March 12, 2025 09:44
@nickfraser added the "next release" label (PRs which should be merged for the next release) Mar 20, 2025
@@ -370,6 +370,18 @@ def create_llm_args_parser():
default=[],
nargs='*',
help='A list of module names to expand with hadamard rotation. Default: %(default)s')
parser.add_argument(
Collaborator

Readme

@@ -251,6 +287,65 @@ def apply(self, model, is_training, quantization_enabled):
self.enable_param_quantization(model, is_training)


class disable_enable_quantization:
Collaborator

We have another class that does this as well, not in a context manager fashion.

I think we might consider just switching to this new class everywhere?
The main consideration is that we need to handle disabling quantization for activation calibration.

Collaborator Author

I'll handle it in a separate PR.
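
As a point of reference for this discussion, the context-manager pattern being described generally looks like the sketch below; the names and the disable/enable mechanism are illustrative only, not the actual Brevitas class, which operates on its own quantization proxies:

```python
class disable_enable_quantization:
    """Temporarily disables quantization on `model` for the duration of a
    `with` block, then restores it on exit."""

    def __init__(self, model, disable_fn, enable_fn):
        self.model = model
        self.disable_fn = disable_fn  # e.g. disables weight/activation quant proxies
        self.enable_fn = enable_fn    # restores the previous quantization state

    def __enter__(self):
        self.disable_fn(self.model)
        return self.model

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.enable_fn(self.model)
        return False  # do not suppress exceptions
```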

for r in rewriters:
model = r.apply(model)
if parametrize_inplace or not isinstance(r, ModuleInstanceRegisterParametrization):
Collaborator

I don't understand this. The comment above doesn't address the parametrize_inplace flag or how the two combine.

Collaborator Author

It was a leftover. I've removed it.

model=model,
tokenizer=tokenizer,
args=args,
n_samples=128,
Collaborator

Should these be n_samples=args.n_samples and seqlen=args.seqlen instead of hard-coded to 128 and 512, respectively?

@pablomlago force-pushed the feat-llm-awq branch 2 times, most recently from 170b873 to 8d65b34 on April 23, 2025 at 16:21