
Commit 34bc370

bbhattar authored
IPEX LLM serving example (#3068)
* adding the files for ipex int8 serving of llms
* Update README.md: fixed some markdowns
* Fix handler name
* Adding default PyTorch support
* Fixing some issues with handler, added test to verify smooth-quant
* adding auto_mixed_precision flag to config
* Removing min_new_tokens from generation config
* fix lint
* lint
* lint
* Fixing unit tests with different model that doesn't require license
* Fix lint error
* Fix lint error in test
* Adding requirements.txt
* adding datasets to the requirements
* upgrading the ipex version to 2.3.0 to match that of pytorch
* Skipping ipex llm tests if accelerate is not present

---------

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: lxning <[email protected]>
Co-authored-by: lxning <[email protected]>
Co-authored-by: Matthias Reso <[email protected]>
1 parent d9fbb19 commit 34bc370

11 files changed: +1202 -1 lines changed
@@ -0,0 +1,53 @@
# Serving IPEX Optimized Models

This example demonstrates serving IPEX-optimized LLMs, e.g. ```meta-llama/llama2-7b-hf``` from Hugging Face. To set up the Python environment for this example, please refer to: https://github.com/intel/intel-extension-for-pytorch/blob/main/examples/cpu/inference/python/llm/README.md#3-environment-setup

1. Run the model archiver
```
torch-model-archiver --model-name llama2-7b --version 1.0 --handler llm_handler.py --config-file llama2-7b-int8-woq-config.yaml --archive-format no-archive
```

2. Move the model inside model_store
```
mkdir model_store
mv llama2-7b ./model_store
```

3. Start TorchServe
```
torchserve --ncs --start --model-store model_store --models llama2-7b
```

4. Test the model status
```
curl http://localhost:8081/models/llama2-7b
```

5. Send an inference request
```
curl http://localhost:8080/predictions/llama2-7b -T ./sample_text_0.txt
```

## Model Config

In addition to the usual TorchServe configuration, you need to enable IPEX-specific optimization arguments.
To enable IPEX, set ```ipex_enable=true``` in the ```config.properties``` file. If IPEX is not enabled, the example runs with stock PyTorch, with ```auto_mixed_precision``` applied if enabled. To enable ```auto_mixed_precision```, set ```auto_mixed_precision: true``` in the model-config file.
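
For illustration only, a minimal model-config snippet with mixed precision enabled might look like the sketch below. ```minWorkers```, ```maxWorkers```, and ```responseTimeout``` are standard TorchServe fields; whether ```auto_mixed_precision``` lives at the top level or under a ```handler``` section is an assumption here, so check the shipped example configs for the exact layout.
```
# Hedged sketch of a model-config file; key placement is assumed, see the
# example YAML files in this directory for the authoritative layout.
minWorkers: 1
maxWorkers: 1
responseTimeout: 1200

handler:
    # run generation with bfloat16 autocast when IPEX quantization is not used
    auto_mixed_precision: true
```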

You can choose either the Weight-only Quantization or the SmoothQuant path for quantizing the model to ```INT8```. If the ```quant_with_amp``` flag is set to ```true```, a mix of ```INT8``` and ```bfloat16``` precisions is used; otherwise, ```INT8``` is combined with ```FP32```. If neither quantization approach is enabled, the model runs in ```bfloat16``` precision by default as long as ```quant_with_amp``` or ```auto_mixed_precision``` is set to ```true```.

There are 3 example config files: ```model-config-llama2-7b-int8-sq.yaml``` for quantizing with SmoothQuant, ```model-config-llama2-7b-int8-woq.yaml``` for quantizing with weight-only quantization, and ```model-config-llama2-7b-bf16.yaml``` for running text generation in bfloat16 precision.
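
For example, to archive the model against the SmoothQuant config instead, point ```--config-file``` at that YAML; the remaining flags are the same as in step 1 above.
```
torch-model-archiver --model-name llama2-7b --version 1.0 --handler llm_handler.py --config-file model-config-llama2-7b-int8-sq.yaml --archive-format no-archive
```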

### IPEX Weight Only Quantization
<ul>
<li> weight_type: weight data type for weight only quantization. Options: INT8 or INT4.
<li> lowp_mode: low precision mode for weight only quantization. It indicates the data type used for computation.
</ul>
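
As a rough, hedged sketch (key names other than ```weight_type```, ```lowp_mode```, and ```quant_with_amp``` are assumptions; ```model-config-llama2-7b-int8-woq.yaml``` is the authoritative reference), a weight-only quantization config might look like:
```
handler:
    quant_with_amp: true                 # INT8 mixed with bfloat16; false -> INT8 + FP32
    ipex_weight_only_quantization: true  # assumed name of the flag that selects the WoQ path
    weight_type: INT8                    # or INT4
    lowp_mode: BF16                      # data type used for computation
```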
### IPEX Smooth Quantization

<ul>
<li> calibration_dataset and calibration_split: dataset and split to be used for calibrating the model quantization
<li> num_calibration_iters: number of calibration iterations
<li> alpha: a floating point number between 0.0 and 1.0. For more complex SmoothQuant configs, explore the IPEX quantization recipes: https://github.com/intel/intel-extension-for-pytorch/blob/main/examples/cpu/inference/python/llm/single_instance/run_quantization.py
</ul>
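
Again as a hedged sketch (the enable flag and the dataset shown are placeholders/assumptions; see ```model-config-llama2-7b-int8-sq.yaml``` for the real keys and values), a SmoothQuant config could look like:
```
handler:
    quant_with_amp: true              # INT8 mixed with bfloat16 during inference
    ipex_smooth_quantization: true    # assumed name of the flag that selects the SmoothQuant path
    calibration_dataset: wikitext     # placeholder dataset name
    calibration_split: train
    num_calibration_iters: 32         # number of calibration batches to run
    alpha: 0.9                        # smoothing factor between 0.0 and 1.0
```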

Set ```greedy``` to ```true``` if you want to use greedy decoding. If it is set to ```false```, beam search with 4 beams is performed by default.
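
The decoding choice is just another field in the same model-config file (placement under a ```handler``` section is assumed, as above):
```
handler:
    greedy: false   # false -> beam search with 4 beams; true -> greedy decoding
```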
@@ -0,0 +1,3 @@
ipex_enable=true
cpu_launcher_enable=true
cpu_launcher_args=--node_id 0
