This recipe shows how to use Olive to apply mixed precision (INT4/INT4) quantization to a model, export it to ONNX, and evaluate it with lm-evaluation-harness. For more details, see "Exploring Optimal Quantization Settings for Small Language Models with Olive".

Install Olive and other dependencies:

    pip install -r requirements.txt

To run the mixed precision quantization recipe, execute the following command:

    olive run --config mixed.json

Note: Evaluation requires a machine with a CUDA-enabled GPU. If you don't have a GPU, you can skip the evaluation step by removing the `"evaluator": "evaluator"` line from the `mixed.json` file.

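
If you prefer not to edit the file by hand, the removal described in the note can be scripted. The following is a minimal sketch, not part of the recipe itself: it assumes `mixed.json` is plain JSON, it makes no assumption about where the `"evaluator": "evaluator"` entry sits (it searches the whole file), and the helper names and the output file name `mixed_no_eval.json` are hypothetical.

```python
import json
from pathlib import Path


def strip_evaluator(node):
    """Recursively drop every `"evaluator": "evaluator"` entry from parsed JSON.

    Other keys are left untouched, including any section that merely
    defines an evaluator (its value is an object, not the string "evaluator").
    """
    if isinstance(node, dict):
        return {
            key: strip_evaluator(value)
            for key, value in node.items()
            if not (key == "evaluator" and value == "evaluator")
        }
    if isinstance(node, list):
        return [strip_evaluator(item) for item in node]
    return node


def write_config_without_evaluation(src="mixed.json", dst="mixed_no_eval.json"):
    """Write a copy of the config with the evaluator reference removed.

    `dst` is a hypothetical file name chosen for this sketch; the original
    config is not modified.
    """
    config = json.loads(Path(src).read_text(encoding="utf-8"))
    cleaned = strip_evaluator(config)
    Path(dst).write_text(json.dumps(cleaned, indent=4) + "\n", encoding="utf-8")
    return dst
```

Calling `write_config_without_evaluation()` from the recipe directory writes the copy, which can then be passed to `olive run --config` in place of `mixed.json`. Writing a separate file keeps the original config available for machines that do have a GPU.
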
| model | arc_challenge | arc_easy | mmlu | hellaswag | mmlu_stem | openbookqa | model_size_gb |
|---|---|---|---|---|---|---|---|
| Original (fp16) | 0.465 | 0.760 | 0.601 | 0.683 | 0.539 | 0.404 | 3.318 |
| Mixed | 0.487 | 0.772 | 0.592 | 0.670 | 0.533 | 0.410 | 1.479 |
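
To make the table easier to read, the per-task score differences and the size reduction can be computed directly from its two rows. This is only an arithmetic sketch over the numbers above; the variable names are illustrative and nothing here reruns the evaluation.

```python
# Scores and sizes copied verbatim from the results table.
original = {
    "arc_challenge": 0.465, "arc_easy": 0.760, "mmlu": 0.601,
    "hellaswag": 0.683, "mmlu_stem": 0.539, "openbookqa": 0.404,
}
mixed = {
    "arc_challenge": 0.487, "arc_easy": 0.772, "mmlu": 0.592,
    "hellaswag": 0.670, "mmlu_stem": 0.533, "openbookqa": 0.410,
}
original_size_gb = 3.318
mixed_size_gb = 1.479

# Positive delta: the mixed precision model scored higher than fp16.
deltas = {task: round(mixed[task] - original[task], 3) for task in original}

# Relative reduction in model size, in percent.
size_reduction_pct = round(100 * (1 - mixed_size_gb / original_size_gb), 1)

for task, delta in deltas.items():
    print(f"{task:>14}: {delta:+.3f}")
print(f"size reduction: {size_reduction_pct}%")
```

In this table, every score of the mixed precision model is within about 0.02 of the fp16 score, while the model is roughly 55% smaller.
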