This repository was archived by the owner on Aug 30, 2024. It is now read-only.

Commit 150e752

[GPTQ Enhance] Refined Doc & Fixed GPTQ & AWQ issues. (#140)
1 parent 5293ffa commit 150e752

5 files changed, +64 -1 lines changed


docs/gptq_and_awq.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+GPTQ & AWQ
+=======
+
+Neural Speed supports multiple weight-only quantization algorithms, such as GPTQ and AWQ.
+
+For more algorithm details, please check [GPTQ](https://arxiv.org/abs/2210.17323) and [AWQ](https://arxiv.org/abs/2306.00978).
+
+Validated GPTQ & AWQ models directly from Hugging Face:
+* [Llama-2-7B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ) & [Llama-2-13B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ)
+* [CodeLlama-7B-Instruct-GPTQ](https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GPTQ) & [CodeLlama-13B-Instruct-GPTQ](https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GPTQ)
+* [SOLAR-10.7B-v1.0-GPTQ](https://huggingface.co/TheBloke/SOLAR-10.7B-v1.0-GPTQ)
+* [Llama-2-7B-AWQ](https://huggingface.co/TheBloke/Llama-2-7B-AWQ) & [Llama-2-13B-chat-AWQ](https://huggingface.co/TheBloke/Llama-2-13B-chat-AWQ)
+* [CodeLlama-7B-AWQ](https://huggingface.co/TheBloke/CodeLlama-7B-AWQ) & [CodeLlama-13B-AWQ](https://huggingface.co/TheBloke/CodeLlama-13B-AWQ)
+
+Please find more validated GPTQ & AWQ models in the [supported models](./supported_models.md) list.
+
+## Examples
+
+How to run GPTQ or AWQ models in Neural Speed:
+```python
+import sys
+from transformers import AutoTokenizer, TextStreamer
+from neural_speed import Model
+
+if len(sys.argv) != 2:
+    sys.exit("Usage: python python_api_example.py model_path")
+model_name = sys.argv[1]
+
+prompt = "Once upon a time, a little girl"
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+inputs = tokenizer(prompt, return_tensors="pt").input_ids
+streamer = TextStreamer(tokenizer)
+
+model = Model()
+# Run inference on GPTQ models.
+model.init(model_name, weight_dtype="int4", compute_dtype="int8", use_gptq=True)
+# Run inference on AWQ models.
+# model.init(model_name, weight_dtype="int4", compute_dtype="int8", use_awq=True)
+
+outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300, do_sample=True)
+```
+
+Note: we provide a [script](../scripts/python_api_example.py) to run these models.
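As a side note for readers of this doc (not part of the commit): the sketch below shows one way to decide between `use_gptq` and `use_awq` automatically by reading the checkpoint's quantization metadata. The `quantization_config`/`quant_method` fields are an assumption about how these Hugging Face repos describe themselves, so treat this as illustrative only.

```python
# Illustrative sketch, not from this commit: choose use_gptq / use_awq from the
# checkpoint's config.json. Assumes the repo exposes a "quantization_config"
# block with a "quant_method" entry ("gptq" or "awq"), as TheBloke's repos do.
from transformers import AutoConfig

model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"  # any model from the list above
cfg = AutoConfig.from_pretrained(model_name, trust_remote_code=True).to_dict()

quant_cfg = cfg.get("quantization_config") or {}
quant_method = quant_cfg.get("quant_method", "")

use_gptq = quant_method == "gptq"
use_awq = quant_method == "awq"
print(f"use_gptq={use_gptq}, use_awq={use_awq}")
```

The resulting flags could then be passed to `model.init(...)` as in the example above.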

docs/supported_models.md

Lines changed: 11 additions & 1 deletion
@@ -43,6 +43,16 @@ Neural Speed supports the following models:
 <td>✅</td>
 <td>✅</td>
 <td>Latest</td>
+</tr>
+<td><a href="https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf" target="_blank" rel="noopener noreferrer">CodeLlama-7b</a></td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>Latest</td>
+</tr>
 </tr>
 <td><a href="https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0" target="_blank" rel="noopener noreferrer">Solar-10.7B</a></td>
 <td>✅</td>
@@ -56,7 +66,7 @@ Neural Speed supports the following models:
 <tr>
 <td><a href="https://huggingface.co/EleutherAI/gpt-j-6b" target="_blank" rel="noopener noreferrer">GPT-J-6B</a></td>
 <td>✅</td>
-<td> </td>
+<td></td>
 <td> </td>
 <td>✅</td>
 <td> </td>

neural_speed/convert/convert_quantized_gptj.py

Lines changed: 5 additions & 0 deletions
@@ -146,6 +146,11 @@ def main(args_in: Optional[List[str]] = None) -> None:
 "rms_norm_eps", 1e-6))) # rms norm eps
 fout.write(struct.pack("f", 10000.0)) # freq_base
 fout.write(struct.pack("f", 1.0)) # rope_factor
+
+fout.write(struct.pack("f", 0.0)) # config.json "rope_scaling.factor", not enabled
+fout.write(struct.pack("i", 0)) # rope_scaling.original_max_position_embeddings
+fout.write(struct.pack("i", 0)) # 1 if params["rope_scaling"]["type"] == "yarn" else 0
+
 fout.write(struct.pack("i", tokenizer.bos_token_id if tokenizer.bos_token_id is not None else 1))
 fout.write(struct.pack("i", tokenizer.eos_token_id if tokenizer.eos_token_id is not None else 2))
 fout.write(struct.pack("i", tokenizer.pad_token_id if tokenizer.pad_token_id is not None else -1))
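For readers unfamiliar with the binary header that the converter writes, here is a small standalone sketch (not from the commit) of how the three fields added above are packed and read back with `struct`; the field meanings are taken from the comments in the diff.

```python
# Standalone sketch: pack and unpack the three header fields appended above.
import struct

rope_scaling_factor = 0.0   # config.json "rope_scaling.factor", 0.0 when not enabled
original_max_pos = 0        # rope_scaling.original_max_position_embeddings
is_yarn = 0                 # 1 if rope_scaling type is "yarn", else 0

blob = (struct.pack("f", rope_scaling_factor)
        + struct.pack("i", original_max_pos)
        + struct.pack("i", is_yarn))

# A reader of the converted file would unpack the same 12 bytes in order.
factor, max_pos, yarn = struct.unpack("fii", blob)
print(factor, max_pos, yarn)  # 0.0 0 0
```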

neural_speed/convert/convert_quantized_llama.py

Lines changed: 4 additions & 0 deletions
@@ -151,6 +151,10 @@ def main(args_in: Optional[List[str]] = None) -> None:
 f.write(struct.pack("f", config["rope_theta"] if "rope_theta" in config else 10000))
 f.write(struct.pack("f", rope_scale))

+f.write(struct.pack("f", 0.0)) # config.json "rope_scaling.factor", not enabled
+f.write(struct.pack("i", 0)) # rope_scaling.original_max_position_embeddings
+f.write(struct.pack("i", 0)) # 1 if params["rope_scaling"]["type"] == "yarn" else 0
+
 # TODO, bos_token_id = 0 in https://huggingface.co/decapoda-research/llama-7b-hf/blob/main/config.json
 # but bos_token_id = 1 in llama.cpp
 f.write(struct.pack("i", 1))
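The comments on the new lines suggest these values are placeholders that would eventually come from the model's `rope_scaling` config. A hedged sketch of how they might be derived from a config dict (field names assumed from the diff comments, not verified against the converter):

```python
# Hedged sketch: derive the three rope_scaling header values from a config dict.
# The "rope_scaling" layout here is an assumption based on the diff comments.
import struct

config = {"rope_scaling": None}  # e.g. json.load(open("config.json"))

rope_scaling = config.get("rope_scaling") or {}
factor = float(rope_scaling.get("factor", 0.0))  # 0.0 when scaling is not enabled
original_max_pos = int(rope_scaling.get("original_max_position_embeddings", 0))
is_yarn = 1 if rope_scaling.get("type") == "yarn" else 0

header = (struct.pack("f", factor)
          + struct.pack("i", original_max_pos)
          + struct.pack("i", is_yarn))
print(len(header), factor, original_max_pos, is_yarn)  # 12 0.0 0 0
```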

scripts/python_api_example.py

Lines changed: 1 addition & 0 deletions
@@ -28,5 +28,6 @@
 streamer = TextStreamer(tokenizer)

 model = Model()
+# To run GPTQ or AWQ models, set use_gptq=True or use_awq=True in the init() call below.
 model.init(model_name, weight_dtype="int4", compute_dtype="int8")
 outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300, do_sample=True)
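A hypothetical extension of this script (not part of the commit) that picks the quantization path from an optional second CLI argument, assuming `Model.init` accepts the `use_gptq` and `use_awq` keyword arguments shown in the doc above and that both default to False:

```python
# Hypothetical variant of python_api_example.py: choose GPTQ/AWQ via a CLI flag.
import sys
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

if len(sys.argv) < 2:
    sys.exit("Usage: python python_api_example.py model_path [gptq|awq]")
model_name = sys.argv[1]
quant = sys.argv[2].lower() if len(sys.argv) > 2 else ""  # "", "gptq", or "awq"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time, a little girl", return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = Model()
# Assumed behavior: passing False for both flags loads a plain FP checkpoint.
model.init(model_name, weight_dtype="int4", compute_dtype="int8",
           use_gptq=(quant == "gptq"), use_awq=(quant == "awq"))
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300, do_sample=True)
```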
