GPTQ updates #2235
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2235
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 5bab3ab with merge base c4250a4.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 956e16c to 4b44d67 (Compare)
Summary:
1) reorganized GPTQ
   a) got rid of old GPTQ and renamed GPTQ_MT to GPTQ
   b) moved new GPTQ to prototype
   c) moved quantized linear modules in GPTQ.py to linear_quant_modules.py
2) removed dependence on lm_eval for input_recorder
   a) created new input recorder that doesn't depend on lm_eval
   b) made lm_eval input recorder depend on new generic input_recorder
   c) made TransformerEvalWrapper the base class and made LMEvalInputRecorder inherit from it instead of vice versa
   d) updated APIs generally to work with new input recorder
3) reorganized GPTQ tests
   a) moved tests from test_quant_api.py to test_gptq.py
   b) added new test that can be run in CI that doesn't depend on lm_eval/llama weights
   c) got rid of test_gptq_mt.py
4) added new documentation for lm_eval
5) GPTQ improvements
   a) reimplemented faster quant
   b) tested compilation of hessian calculation and parts of faster quant; generally they were slower
   c) moved helper functions out of the class. They're largely generic and this is less cluttered.
   d) some improvements to the duplication checking and copying to be faster when possible
   e) fixed some bugs due to this not being in CI and things changing for the int4wo tensor subclass

Test Plan:
1) `python test_gptq.py`
   note: the test test_gptq_quantizer_int4_weight_only, which is skipped by default, was also run.
2) I verified that all activations match between old GPTQ and current GPTQ
3)
```shell
export CHECKPOINT_PATH=../../../checkpoints # path to checkpoints folder
export MODEL_REPO=meta-llama/Llama-2-7b-chat-hf
python eval.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth
python eval.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --quantization int4wo-64
python eval.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --quantization int4wo-gptq-64 --calibration_limit 10
export MODEL_REPO=meta-llama/Meta-Llama-3-8B
python eval.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth
python eval.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --quantization int4wo-64
python eval.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --quantization int4wo-gptq-64 --calibration_limit 10
```
See README.md for results; they show GPTQ is working.
```python
def get_recorded_inputs(self):
    return self.base_input_recorder.get_recorded_inputs()

def get_recorded_args_and_kwargs(self):
    return self.base_input_recorder.get_recorded_args_and_kwargs()
```
nit: get_recorded_args(self) might be more consistent with get_recorded_args_and_kwargs?
does the input recorder support recording multiple args?
I wanted to maintain the old API, which is used in a few places just to get data.
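For context, a minimal sketch of the two accessors under discussion (the import path, dummy input, and return shapes are assumptions, not taken from the diff):

```python
import torch

# import path assumed from this PR's reorganization
from torchao.quantization.GPTQ import MultiTensorInputRecorder

recorder = MultiTensorInputRecorder()
input_ids = torch.randint(0, 32000, (1, 128))  # dummy calibration input
recorder(input_ids)  # record positional args, as if calling the model

inputs = recorder.get_recorded_inputs()                 # older accessor: args only
args, kwargs = recorder.get_recorded_args_and_kwargs()  # newer accessor: args + kwargs
```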
```python
# first gather inputs
input_recorder = MultiTensorInputRecorder()
for i in range(calibration_limit):
    args = get_next_input()  # user provided function
```
optional nit: might be useful to have some real data example, e.g. ao/torchao/prototype/awq/example.py (line 21 in c4250a4):

```python
dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation")
```
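For concreteness, a hedged sketch of a real-data version using the dataset from the linked AWQ example (the tokenizer choice and recorder import path are assumptions):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

from torchao.quantization.GPTQ import MultiTensorInputRecorder  # path assumed

# same dataset the linked AWQ example uses
dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

calibration_limit = 10
input_recorder = MultiTensorInputRecorder()
for sample in dataset.select(range(calibration_limit)):
    input_ids = tokenizer(sample["text"], return_tensors="pt").input_ids
    input_recorder(input_ids)
```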
Thanks, I think the updated API looks good
The snippet continues:

```python
    input_recorder(*args)
    # note: can do input_recorder(*args, **kwargs) if needed

# then perform GPTQ
quantizer = Int4WeightOnlyGPTQQuantizer()  # quantization parameters like group_size can be set here
```
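To round out the snippet, a hedged continuation; the exact quantize() signature here is an assumption, not confirmed by the diff:

```python
# hedged continuation: feed the recorded calibration data to the quantizer
args = input_recorder.get_recorded_inputs()
model = quantizer.quantize(model, *args)
```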
Also, maybe adding some example of how people can adapt this to their own quantization scheme would be useful as well.
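As a rough, hypothetical illustration of that suggestion (MyQuantizer and its interface are invented for this sketch, not part of the PR):

```python
# Hypothetical sketch: the recorder is quantizer-agnostic, so a custom
# scheme can consume the same recorded calibration data.
class MyQuantizer:
    def quantize(self, model, *calibration_args):
        # run the model on the recorded data, gather whatever statistics
        # the custom scheme needs, then swap in quantized modules
        ...
        return model

model = MyQuantizer().quantize(model, *input_recorder.get_recorded_inputs())
```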
"Int4WeightOnlyQuantizer", | ||
"Int8DynActInt4WeightGPTQQuantizer", |
this is still used in ET (code): https://github.com/pytorch/executorch/blob/66dfc46686ebcb317efdf13580e869e7e4e2f0cc/examples/models/llama/source_transformation/quantize.py#L184 so it's probably better to remove this in ET before landing the PR
Referenced from pytorch/executorch: "This is being deprecated in torchao pytorch/ao#2235"
Summary:
1) reorganized GPTQ
   a) got rid of old GPTQ and renamed GPTQ_MT to GPTQ
   b) moved new GPTQ to prototype
   c) moved quantized linear modules in GPTQ.py to linear_quant_modules.py
2) removed dependence on lm_eval for input_recorder
   a) created new input recorder that doesn't depend on lm_eval
   b) made lm_eval input recorder depend on new generic input_recorder
   c) made TransformerEvalWrapper the base class and made LMEvalInputRecorder inherit from it (rather than vice versa like before)
   d) updated APIs generally to work with new input recorder (inputs have to be passed in as though they were being passed into the model, so input_recorder(*args) rather than input_recorder(args)); see the sketch below
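A minimal sketch of the calling-convention change in d) (illustrative only):

```python
# inputs are recorded exactly as they would be passed to the model
input_recorder(*args)     # new convention
# rather than handing over a bundled container:
# input_recorder(args)    # old convention
```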
3) reorganized GPTQ tests
   a) moved tests from test_quant_api.py to test_gptq.py
   b) added new test that can be run in CI that doesn't depend on lm_eval/llama weights
   c) removed all the 8da4w tests that we never got working (is this fine?)
   d) got rid of test_gptq_mt.py (consolidated into test_gptq where relevant)
4) added new documentation
   a) new readme and eval benchmarks for GPTQ
   b) comments in GPTQ.py
5) GPTQ improvements
   a) tested compilation of the hessian calculation and parts of faster quant; generally they were slower or buggy. A speedup was possible but inconsistent, so it was removed.
   b) reimplemented faster quant while trying to compile it (improved speed by 2-5%, and the code is clearer from trying to compile parts of it)
   c) moved helper functions out of the class. They're largely generic and this is less cluttered. May need to revisit how generic they are if new GPTQQuantizers are made.
   d) some improvements to the duplication checking and copying to be faster when possible (previously MultiTensor.unpad had checks, but by the point it's called we have already done equality checks, so they aren't needed in unpad); see the hypothetical sketch after this list
   e) fixed some bugs due to this not being in CI and things changing for the int4wo tensor subclass
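A loose, hypothetical illustration of d) above; this is invented pseudocode, not the real MultiTensor.unpad:

```python
# hypothetical shape of the change; not the real MultiTensor code
def unpad(padded_values, orig_count):
    # removed: a duplicate-equality check over padded_values[orig_count:],
    # since callers have already verified equality before calling unpad
    return padded_values[:orig_count]
```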
6) BC
   a) got rid of Int8DynActInt4WeightGPTQQuantizer since it was unused. Can re-add if desired.
   b) for other imports, maintained BC; previous imports from quantization/GPTQ.py now go to quantization/GPTQ/__init__.py
   c) InputRecorder -> LMEvalInputRecorder, but left a BC import in as an option
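A hedged sketch of what the BC points above mean for user code (module paths inferred from b) and c), not verified):

```python
# unchanged import path: quantization/GPTQ.py became quantization/GPTQ/__init__.py,
# so existing imports keep working
from torchao.quantization.GPTQ import Int4WeightOnlyGPTQQuantizer

# BC alias assumed from c): the old name still imports, now naming the lm_eval recorder
from torchao.quantization.GPTQ import InputRecorder  # == LMEvalInputRecorder
```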
Test Plan:
1) `python test_gptq.py`
   note 1: the test test_gptq_quantizer_int4_weight_only, which is skipped in CI, was also run.
   note 2: we now have a CI-ready test in test_gptq using the generic input recorder.
2) I verified that all activations match between old GPTQ (non-MT) and current GPTQ. This is covered by test_gptq_quantizer_int4_weight_only passing, as mentioned above, and was also verified by comparing debug outputs and printing activation values for the first 3 multi tensors.
3) eval benchmarks (commands as in the commit message above): see README.md for results; they show GPTQ is working.