
GPTQ updates #2235


Merged

merged 7 commits into main from 098_gptq on Jun 2, 2025

Conversation

@HDCharles (Contributor) commented May 21, 2025

Summary:

  1. reorganized GPTQ
    a) got rid of old GPTQ and renamed GPTQ_MT to GPTQ
    b) moved new GPTQ to prototype
    c) moved the quantized linear modules from GPTQ.py to linear_quant_modules.py
  2. removed dependence on lm_eval for input_recorder
    a) created new input recorder that doesn't depend on lm_eval
    b) made lm_eval input recorder depend on new generic input_recorder
    c) made TransformerEvalWrapper the base class and made LMEvalInputRecorder inherit from it (rather than vice versa like before)
    d) updated APIs generally to work with the new input recorder (inputs have to be passed in as though they were being passed into the model, so input_recorder(*args) rather than input_recorder(args)); a sketch of the updated flow follows this list
  3. reorganized GPTQ tests
    a) moved tests from test_quant_api.py to test_gptq.py
    b) added a new test that can run in CI and doesn't depend on
    lm_eval/llama weights
    c) removed all the 8da4w tests that we never got working (is this fine?)
    d) got rid of test_gptq_mt.py (consolidated into test_gptq.py where relevant)
  4. added new documentation for lm_eval
    a) new readme and eval benchmarks for GPTQ
    b) comments in GPTQ.py
  5. GPTQ improvements
    a) tested compilation of the hessian calculation and parts of faster quant;
    generally they were slower or buggy. A speedup is possible but it was inconsistent, so it was removed.
    b) reimplemented faster quant while trying to compile parts of it (improved speed by 2-5%, and the code is clearer)
    c) moved helper functions out of the class. They're largely generic and
    this is less cluttered. May need to revisit how generic they are if new GPTQQuantizers are made.
    d) made some improvements to duplication checking and copying so they are
    faster when possible (previously MultiTensor.unpad had checks, but by the point it's called we've already done equality checks, so they aren't needed in unpad)
    e) fixed some bugs caused by this code not being in CI while things changed for the
    int4wo tensor subclass.
  6. BC
    a) got rid of Int8DynActInt4WeightGPTQQuantizer since it was unused. Can re-add if desired.
    b) for other imports, maintained BC; previous imports from quantization/GPTQ.py now go through quantization/GPTQ/__init__.py
    c) renamed InputRecorder -> LMEvalInputRecorder but left the BC import in as an option.
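
To make the updated flow concrete, here is a minimal sketch of the new calibration + GPTQ API described in items 2 and 5. The import path, the `get_next_input` helper, and the exact `quantize` signature are assumptions based on this summary rather than verbatim from the diff:

```python
from torchao.quantization.GPTQ import (  # BC: old quantization/GPTQ.py imports resolve via GPTQ/__init__.py
    MultiTensorInputRecorder,
    Int4WeightOnlyGPTQQuantizer,
)

calibration_limit = 10  # number of calibration samples to record

# 1) gather calibration inputs; the recorder is called exactly like the model
input_recorder = MultiTensorInputRecorder()
for _ in range(calibration_limit):
    args = get_next_input()  # hypothetical user-provided function returning one tuple of model inputs
    input_recorder(*args)    # new convention: input_recorder(*args), not input_recorder(args)

# 2) run GPTQ over the recorded inputs (`quantize` signature assumed here);
#    `model` is the float model being quantized, defined elsewhere
quantizer = Int4WeightOnlyGPTQQuantizer()  # quantization parameters like group_size can be set here
model = quantizer.quantize(model, *input_recorder.get_recorded_inputs())
```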

Test Plan:

  1. `python test_gptq.py`

note 1: the normally skipped test test_gptq_quantizer_int4_weight_only was also run.
note 2: we now have a CI-ready test in test_gptq.py that uses the generic input recorder.

  2. I verified that all activations match between the old GPTQ (non-MT) and the current
    GPTQ. This can be seen from the passing test_gptq_quantizer_int4_weight_only mentioned above, and was also verified by comparing debug outputs and printing activation values for the first 3 MultiTensors (a rough sketch of this kind of check follows the list).

  3. eval benchmarks:

```shell
export CHECKPOINT_PATH=../../../checkpoints # path to checkpoints folder

export MODEL_REPO=meta-llama/Llama-2-7b-chat-hf
python eval.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth
python eval.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --quantization int4wo-64
python eval.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --quantization int4wo-gptq-64 --calibration_limit 10

export MODEL_REPO=meta-llama/Meta-Llama-3-8B
python eval.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth
python eval.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --quantization int4wo-64
python eval.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --quantization int4wo-gptq-64 --calibration_limit 10
```

See README.md for results; they show GPTQ is working.
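
Returning to the activation check in item 2, it was of this general shape (a minimal sketch; `old_activations`/`new_activations` are hypothetical captures, the PR itself compared debug outputs and printed values):

```python
import torch

# compare activations captured from the old (non-MT) and current GPTQ runs
for i, (old_act, new_act) in enumerate(zip(old_activations[:3], new_activations[:3])):
    assert torch.equal(old_act, new_act), f"activation {i} mismatch"
```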

Reviewers:

Subscribers:

Tasks:

Tags:


pytorch-bot bot commented May 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2235

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5bab3ab with merge base c4250a4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label May 21, 2025
@HDCharles added the topic: improvement label May 21, 2025
@HDCharles force-pushed the 098_gptq branch 11 times, most recently from 956e16c to 4b44d67 on May 23, 2025
HDCharles added 2 commits May 30, 2025 14:13
Comment on lines +169 to +173
```python
def get_recorded_inputs(self):
    return self.base_input_recorder.get_recorded_inputs()

def get_recorded_args_and_kwargs(self):
    return self.base_input_recorder.get_recorded_args_and_kwargs()
```
Contributor
nit: get_recorded_args(self) might be more consistent with get_recorded_args_and_kwargs?

does recorded inputs support recording multiple args?

Contributor Author

I wanted to maintain the old API, which is used in a few places just to get data.

```python
# first gather inputs
input_recorder = MultiTensorInputRecorder()
for i in range(calibration_limit):
    args = get_next_input()  # user provided function
    input_recorder(*args)    # record the inputs, passed as if calling the model
```
Contributor
optional nit: might be useful to have some real data example (e.g. `dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation")`), but I'm also planning to take a look at this soon

@jerryzh168 (Contributor) left a comment

Thanks, I think the updated API looks good

```python
# note: can do input_recorder(*args, **kwargs) if needed

# then perform GPTQ
quantizer = Int4WeightOnlyGPTQQuantizer()  # quantization parameters like group_size can be set here
```
Contributor

also, maybe adding an example of how people can adapt this to their own quantization would be useful as well, I think

"Int4WeightOnlyQuantizer",
"Int8DynActInt4WeightGPTQQuantizer",
Contributor
this is still used in ET (code): https://github.com/pytorch/executorch/blob/66dfc46686ebcb317efdf13580e869e7e4e2f0cc/examples/models/llama/source_transformation/quantize.py#L184, so it's probably better to remove this in ET before landing the PR

HDCharles added 4 commits May 30, 2025 17:44
HDCharles added a commit to HDCharles/executorch that referenced this pull request May 31, 2025
This is being deprecated in torchao pytorch/ao#2235
@HDCharles merged commit f0f1f6c into main Jun 2, 2025
19 checks passed
Labels
CLA Signed · topic: improvement