# Serving IPEX Optimized Models
This example demonstrates serving IPEX-optimized LLMs, e.g. ```meta-llama/Llama-2-7b-hf``` from Hugging Face. To set up the Python environment for this example, see: https://github.com/intel/intel-extension-for-pytorch/blob/main/examples/cpu/inference/python/llm/README.md#3-environment-setup

1. Run the model archiver
```
torch-model-archiver --model-name llama2-7b --version 1.0 --handler llm_handler.py --config-file llama2-7b-int8-woq-config.yaml --archive-format no-archive
```

2. Create a model store directory and move the model into it
```
mkdir model_store
mv llama2-7b ./model_store
```

3. Start TorchServe
```
torchserve --ncs --start --model-store model_store --models llama2-7b
```

4. Check the model status
```
curl http://localhost:8081/models/llama2-7b
```

5. Send an inference request
```
curl http://localhost:8080/predictions/llama2-7b -T ./sample_text_0.txt
```
## Model Config
In addition to the usual TorchServe configuration, you need to set IPEX-specific optimization arguments.

To enable IPEX, set ```ipex_enable=true``` in the ```config.properties``` file. If IPEX is not enabled, the model runs on stock PyTorch, with ```auto_mixed_precision``` applied when that option is turned on. To enable ```auto_mixed_precision```, set ```auto_mixed_precision: true``` in the model-config file.
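
For reference, here is a minimal sketch of these two settings in isolation; the real ```config.properties``` and model-config files contain other keys, and the exact placement of ```auto_mixed_precision``` in the YAML may differ:
```
# config.properties (TorchServe server configuration): enable the IPEX code path
ipex_enable=true
```

```
# model-config YAML (per-model options): run with automatic mixed precision when not quantizing
auto_mixed_precision: true
```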

You can choose either the Weight-Only Quantization or the SmoothQuant path to quantize the model to ```INT8```. If the ```quant_with_amp``` flag is set to ```true```, a mix of ```INT8``` and ```bfloat16``` precision is used; otherwise the combination is ```INT8``` and ```FP32```. If neither quantization approach is enabled, the model runs in ```bfloat16``` precision, as long as ```quant_with_amp``` or ```auto_mixed_precision``` is set to ```true```.

There are three example config files: ```model-config-llama2-7b-int8-sq.yaml``` for quantizing with SmoothQuant, ```model-config-llama2-7b-int8-woq.yaml``` for quantizing with weight-only quantization, and ```model-config-llama2-7b-bf16.yaml``` for running text generation in ```bfloat16``` precision.
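
For example, to package the ```bfloat16``` variant instead of the weight-only-quantized one, pass that file to the archiver command from step 1:
```
torch-model-archiver --model-name llama2-7b --version 1.0 --handler llm_handler.py --config-file model-config-llama2-7b-bf16.yaml --archive-format no-archive
```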

### IPEX Weight Only Quantization
- ```weight_type```: weight data type for weight-only quantization. Options: ```INT8``` or ```INT4```.
- ```lowp_mode```: low-precision mode for weight-only quantization; it selects the data type used for computation (see the sketch below).
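
A sketch of how these options might appear in ```model-config-llama2-7b-int8-woq.yaml```, using only the keys discussed in this README; the exact key nesting and the ```lowp_mode``` value shown are assumptions:
```
# weight-only quantization options (sketch; key names from this section)
quant_with_amp: true   # combine INT8 with bfloat16 rather than FP32
weight_type: INT8      # or INT4
lowp_mode: BF16        # computation dtype; assumed value, check the IPEX docs for supported modes
```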

### IPEX Smooth Quantization

- ```calibration_dataset```, ```calibration_split```: dataset and split used to calibrate the model for quantization.
- ```num_calibration_iters```: number of calibration iterations.
- ```alpha```: a floating-point number between 0.0 and 1.0. For more complex SmoothQuant configurations, explore the IPEX quantization recipes: https://github.com/intel/intel-extension-for-pytorch/blob/main/examples/cpu/inference/python/llm/single_instance/run_quantization.py (a configuration sketch follows this list).
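
A sketch of the corresponding section of ```model-config-llama2-7b-int8-sq.yaml```; the key names come from the list above, while the dataset, iteration count, and alpha values are placeholders rather than values taken from the example files:
```
# smooth-quant calibration options (sketch; placeholder values)
calibration_dataset: NeelNanda/pile-10k   # hypothetical calibration dataset
calibration_split: train
num_calibration_iters: 32
alpha: 0.9                                # between 0.0 and 1.0
```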

Set ```greedy``` to ```true``` to use greedy search decoding. If it is set to ```false```, beam search with a beam size of 4 is used by default.
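
For example, in the model-config file (a sketch, assuming the key sits alongside the other generation options):
```
# decoding strategy (key name from this section)
greedy: false   # false -> beam search with beam size 4 (default); true -> greedy decoding
```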