This recipe demonstrates how to use Olive to perform mixed precision (INT4/INT4) quantization, export the model to ONNX, and evaluate it using lm-evaluation-harness. For more details, see "Exploring Optimal Quantization Settings for Small Language Models with Olive".

Install Olive and other dependencies:

```bash
pip install -r requirements.txt
```

To run the mixed precision quantization recipe, execute the following command:

```bash
olive run --config mixed.json
```

To run the mixed precision quantization with embedding quantization and weight tying, execute the following command:

```bash
olive run --config mixed-tied.json
```

Note: Evaluation requires a machine with a CUDA-enabled GPU. If you don't have a GPU, you can skip the evaluation step by removing the `"evaluator": "evaluator"` line from the config JSON file.

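
If you prefer not to edit the config by hand, the evaluator reference can also be removed with a small script. The following is a minimal sketch, not part of the recipe: the helper names and the example fragment are hypothetical, and it assumes the config is plain JSON in which the reference appears as a string-valued `"evaluator"` key. It deliberately leaves any `"evaluators"` definitions block untouched.

```python
import json
from pathlib import Path


def drop_evaluator_reference(node):
    """Return a copy of `node` without any `"evaluator": "<name>"` entries.

    Only string-valued "evaluator" keys are removed, so an "evaluators"
    definitions block (a different key) is left in place.
    """
    if isinstance(node, dict):
        return {
            key: drop_evaluator_reference(value)
            for key, value in node.items()
            if not (key == "evaluator" and isinstance(value, str))
        }
    if isinstance(node, list):
        return [drop_evaluator_reference(item) for item in node]
    return node


def write_config_without_evaluation(src, dst):
    """Read the config at `src`, strip the evaluator reference, write it to `dst`."""
    config = json.loads(Path(src).read_text(encoding="utf-8"))
    stripped = drop_evaluator_reference(config)
    Path(dst).write_text(json.dumps(stripped, indent=4) + "\n", encoding="utf-8")


# Hypothetical config fragment, for illustration only (not the real mixed.json).
example = {
    "passes": {"quantize": {"type": "..."}},
    "evaluators": {"evaluator": {"type": "..."}},
    "evaluator": "evaluator",
}
print(json.dumps(drop_evaluator_reference(example), indent=4))
```

For example, calling `write_config_without_evaluation("mixed.json", "mixed-no-eval.json")` writes a copy that can then be passed to `olive run --config`; the output file name here is only an illustration.
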
| model | arc_challenge | arc_easy | mmlu | hellaswag | mmlu_stem | openbookqa | model_size_gb |
|---|---|---|---|---|---|---|---|
| Original (fp16) | 0.585 | 0.803 | 0.669 | 0.728 | 0.598 | 0.426 | 8.314 |
| Mixed | 0.593 | 0.801 | 0.664 | 0.721 | 0.592 | 0.424 | 3.844 |
| Mixed Tied | 0.578 | 0.806 | 0.649 | 0.721 | 0.594 | 0.426 | 3.285 |
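
To make the trade-off in the table easier to read, the sketch below recomputes the size reduction and the per-task score change of each quantized variant relative to the fp16 baseline. It uses only the numbers from the table; the function and variable names are illustrative and not part of Olive or lm-evaluation-harness.

```python
# Values copied from the results table above.
TASKS = ["arc_challenge", "arc_easy", "mmlu", "hellaswag", "mmlu_stem", "openbookqa"]
RESULTS = {
    "Original (fp16)": ([0.585, 0.803, 0.669, 0.728, 0.598, 0.426], 8.314),
    "Mixed": ([0.593, 0.801, 0.664, 0.721, 0.592, 0.424], 3.844),
    "Mixed Tied": ([0.578, 0.806, 0.649, 0.721, 0.594, 0.426], 3.285),
}


def compare_to_baseline(results, baseline="Original (fp16)"):
    """Return size reduction (%) and per-task score deltas versus the baseline."""
    base_scores, base_size = results[baseline]
    summary = {}
    for name, (scores, size_gb) in results.items():
        if name == baseline:
            continue
        summary[name] = {
            # Percentage of the baseline size that was removed.
            "size_reduction_pct": round(100 * (1 - size_gb / base_size), 1),
            # Positive delta means the quantized model scored higher.
            "deltas": {
                task: round(score - base, 3)
                for task, score, base in zip(TASKS, scores, base_scores)
            },
        }
    return summary


for name, stats in compare_to_baseline(RESULTS).items():
    print(f"{name}: {stats['size_reduction_pct']}% smaller")
    for task, delta in stats["deltas"].items():
        print(f"  {task}: {delta:+.3f}")
```

Both variants are more than 50% smaller than the fp16 model, and the largest drop on any single task is 0.020 (mmlu, Mixed Tied). Note that mmlu_stem is a subset of mmlu, so the two columns are not independent.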