Description
Dear @ChenMnZ,

Thank you for the paper and for sharing the code. I'm very interested in the idea. Especially encouraging is the prospect of reaching great accuracy using static per-channel / per-tensor quantization!
I evaluated PrefixQuant with a few HF models and was able to reproduce the good results of your paper with `Llama-2-7B`, `Llama-3-8B`, and `Mistral-7B` (and a couple of other models). [I'm writing this to indicate that my settings are probably OK and are not the cause of the issues below.]
However, with smaller models (0.5B - 3B), the results I'm getting are catastrophic (multi-digit perplexity numbers on WikiText2; a sketch of my evaluation loop is included after the list below). The models I tested that resulted in PPL disasters are:

- `Qwen2-0.5B`
- `Llama-3.2-1B`
- `Qwen2-1.5B-Instruct`
- `Llama-3.2-3B-Instruct`
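
For reference, this is roughly how I measure WikiText2 perplexity (a minimal sketch using standard HF APIs; the non-overlapping 2048-token chunks and the `wikitext-2-raw-v1` split are my own choices and may not match your evaluation code exactly; `model` is the already-quantized model, assumed to be on `device`):

```python
import torch
from datasets import load_dataset

@torch.no_grad()
def wikitext2_ppl(model, tokenizer, seqlen=2048, device="cuda"):
    # Tokenize the whole WikiText-2 test split as one long stream.
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    input_ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids.to(device)

    # Score non-overlapping chunks of `seqlen` tokens and average the NLL.
    nsamples = input_ids.shape[1] // seqlen
    nlls = []
    for i in range(nsamples):
        batch = input_ids[:, i * seqlen : (i + 1) * seqlen]
        # With labels=batch, HF causal LMs return the mean token cross-entropy.
        loss = model(batch, labels=batch).loss
        nlls.append(loss.float() * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (nsamples * seqlen)).item()
```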
A couple of remarks:

- `Llama-3.2-xx` models are not supported by the older `transformers` version required by your repo (4.40.1). I encountered minor compatibility issues when running your repo with a more recent `transformers` version, but solved them by explicitly configuring `use_cache=True` (see the first sketch after this list). Even after solving those, however, the models would not quantize properly.
- For the `Qwen2-0.5B` model, there was an issue with creating the online Hadamard matrix for the `down_proj` input: the `get_hadK()` function does not support the intermediate feature size (see the second sketch after this list). I worked around it by disabling the `down_online_had` configuration option, but that did not help reach reasonable quantization accuracy.
- As a sanity check, I tried to quantize one of the problematic models listed above with additional fine-tuning; this did not help either.
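
To make the first remark concrete, this is roughly the workaround I applied when running with a newer `transformers` version (a sketch of my own patch; `meta-llama/Llama-3.2-1B` is just one of the problematic models, and the exact place where `use_cache` has to be forced may differ in your pipeline):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # one of the problematic small models
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Newer transformers releases changed the KV-cache handling, and some code paths in
# the repo expect past_key_values to be returned, so I force caching on explicitly
# before running the quantization pipeline.
model.config.use_cache = True
```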
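On the second remark, the following check reproduces the `get_hadK()` failure mode I hit, assuming the function follows the usual QuaRot-style logic where a dimension must factor as a power of two times one of a fixed table of Hadamard block sizes (the table below is my assumption, not copied from your code):

```python
# Assumed table of non-power-of-two Hadamard block sizes (QuaRot-style); may differ
# from the actual set supported by get_hadK() in this repo.
KNOWN_HAD_SIZES = (172, 156, 140, 108, 60, 52, 44, 40, 36, 28, 20, 12)

def is_pow2(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0

def supports_online_hadamard(dim: int) -> bool:
    """True if dim is a power of two, or K * 2^m for some known Hadamard block size K."""
    return is_pow2(dim) or any(dim % k == 0 and is_pow2(dim // k) for k in KNOWN_HAD_SIZES)

print(supports_online_hadamard(14336))  # Llama-3-8B down_proj input (2^9 * 28)  -> True
print(supports_online_hadamard(4864))   # Qwen2-0.5B down_proj input (2^8 * 19) -> False
```

If that is indeed the constraint, the 4864-wide `down_proj` input of `Qwen2-0.5B` has no matching factorization, which would explain why only that model triggers the error.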
Do you have any suggestions on what could help such small language models quantize well?
Thanks in advance!