Skip to content

Latest commit

 

History

History

README.md

Ryzen™ AI Quantization Tutorial

Ryzen AI Model Quantization and Deployment

Welcome to the Ryzen AI Model Quantization tutorial. In this guide, you'll learn how to quantize your deep learning models for efficient deployment on NPU.

AMD Quark is a comprehensive cross-platform deep learning toolkit designed to simplify and enhance the quantization of deep learning models. Quark enables developers to optimize their models for deployment on a wide range of hardware backends, achieving significant performance gains without compromising accuracy.

Quark helps restore model accuracy that may be lost after quantization through various post training quantization techniques.

For more details refer to the AMD Quark official documentation

Quantization Tutorial

In this tutorial we will provide step by step guide for quantization of CNN models using AMD Quark quantization API. Quark for ONNX leverages the power of the ONNX Runtime Quantization tool, providing a robust and flexible solution for quantizing ONNX models. We will demonstrate how to quantize and run the ResNet50 model ONNX Runtime on Ryzen AI PCs.

Setup Environment

Create a clone of the Ryzen AI installation conda environment to add required python packages

Create and Activate Conda Environment

set RYZEN_AI_CONDA_ENV_NAME=ryzen-ai-<version>s
conda create --name quark_quantization --clone %RYZEN_AI_CONDA_ENV_NAME%
conda activate quark_quantization

Add python packages needed for the tutorial:

cd ryzenai-sw\tutorial\quark_quantization
pip install -r requirements.txt

Input model

Download the pre-trained float ONNX models for ResNet50

The downloaded model should be placed in the models folder, and the script quark_quantize.py will use it as the input model for quantization as shown below.

input_model_path = "models\\resnet50.onnx"  
output_model_path = "models\\resnet50_quant.onnx" 

Alternatively you can use the python script to download the pre-trained ResNet50 model

cd models
python download_ResNet.py 

ImageNet Dataset

If you already have an ImageNet datasets, you can directly use your dataset path.

To prepare the test data, please check the download section of the main website: https://huggingface.co/datasets/ILSVRC/imagenet-1k/tree/script/data. You need to register and download val_images.tar.gz.

Then, create the validation dataset and calibration dataset:

mkdir val_data && tar -xzf val_images.tar.gz -C val_data
python prepare_data.py val_data calib_data

Quantization configuration

Users can apply their own Customer Settings for the QuantizationConfig, CalibrationMethod, or use the some of predefined configurations XINT8 or INT8_CNN_DEFAULT or INT16_CNN_DEFAULT_CONFIG they can refer to the Quark's User Guide for help.

from quark.onnx.quantization.config import Config, get_default_config
from quark.onnx.quantization.config.config import QuantizationConfig
from onnxruntime.quantization.calibrate import CalibrationMethod
from onnxruntime.quantization.quant_utils import QuantType, QuantFormat

# Use default quantization configuration
quant_config = get_default_config("XINT8")

# Define custom quantization configuration
# quant_config = QuantizationConfig(calibrate_method=PowerOfTwoMethod.MinMSE,
#                                   activation_type=QuantType.QUInt8,
#                                   weight_type=QuantType.QInt8,
#                                   enable_npu_cnn=True,
#                                   extra_options={'ActivationSymmetric': True})

# Defines the quantization configuration for the whole model
config = Config(global_quant_config=quant_config)
print("The configuration of the quantization is {}".format(config))

Calibration Dataset

The calibration dataset is used to compute the quantization parameters for the activations. Use the ImageDataReader class to setup the calibration dataset

from utils import ImageDataReader

num_calib_data = 1000
calibration_dataset = ImageDataReader(calibration_dataset_path, input_model_path, data_size=num_calib_data, batch_size=1)

Model Quantization

The model quantizer save the quantized model model at output_model_path

from quark.onnx import ModelQuantizer
# Create an ONNX Quantizer
quantizer = ModelQuantizer(config)

# Quantize the ONNX model
quant_model = quantizer.quantize_model(model_input = input_model_path, 
                                       model_output = output_model_path, 
                                       calibration_data_path = None)

Model Size

Print the original and quantized models.

print("Model Size:")
print("Float32 model size: {:.2f} MB".format(os.path.getsize(input_model_path)/(1024 * 1024)))
print("Int8 quantized model size: {:.2f} MB".format(os.path.getsize(output_model_path)/(1024 * 1024)))

Model Evaluation

Print the original and quantized models accuracy

from utils import evaluate_onnx_model  

print("Model Accuracy:")
top1_acc, top5_acc = evaluate_onnx_model(input_model_path, imagenet_data_path='calib_data')
print("Float32 model accuracy: Top1 {:.3f}, Top5 {:.3f} ".format(top1_acc, top5_acc))
top1_acc, top5_acc = evaluate_onnx_model(output_model_path, imagenet_data_path='calib_data')
print("Int8 quantized model accuracy: Top1 {:.3f}, Top5 {:.3f} ".format(top1_acc, top5_acc))
top1_acc, top5_acc = evaluate_onnx_model(output_model_path, imagenet_data_path='calib_data', device='npu')
print("Int8 quantized model accuracy (NPU): Top1 {:.3f}, Top5 {:.3f} ".format(top1_acc, top5_acc))

Accuracy Summary

Quantize and Evaluation the top1 and top5 accuracy of the models on ImageNet validation dataset using the quark_quantize.py script.

python quark_quantize.py --quantize --evaluate

ResNet50: Using XINT8 configuration

ResNet50 Model Size Top-1 Accuracy Top-5 Accuracy
Float 32 97.41 MB 80.0% 96.1%
INT8 (CPU) 24.46 MB 78.1% 95.6%
INT8 (NPU) 24.46 MB 77.4% 95.2%

Run model on CPU

quant_model = onnx.load(output_model_path)
provider = ['CPUExecutionProvider']

session = ort.InferenceSession(quant_model.SerializeToString(), providers=provider)
input_data = preprocess_image('test_image.jpg')
# run the model in onnxruntime session
outputs = session.run(None, {session.get_inputs()[0].name: input_data})
predicted_class = np.argmax(outputs[0])
print('CPU quantized outputs:')
print(predicted_class)

Run model on NPU

quant_model = onnx.load(output_model_path)
provider = ['VitisAIExecutionProvider']
cache_dir = Path(__file__).parent.resolve()
print(cache_dir)
provider_options = [{
                'config_file': 'vaip_config.json',
                'cacheDir': str(cache_dir),
                'cacheKey': 'modelcachekey'
            }]

session = ort.InferenceSession(quant_model.SerializeToString(), providers=provider,
                               provider_options=provider_options)

input_data = preprocess_image('test_image.jpg')
outputs = session.run(None, {session.get_inputs()[0].name: input_data})
predicted_class = np.argmax(outputs[0])
print('NPU quantized outputs:')
print(predicted_class)

Inference Performance

Evaluate the inference performance of the float and quantized models on CPU and NPU

python quark_quantize.py --quantize --evaluate --benchmark
ResNet50 Model Size Top-1 Accuracy Top-5 Accuracy Inference Time
Float 32 97.41 MB 80.0% 96.1% 10.70 ms
INT8 (CPU) 24.46 MB 78.1% 95.6% 34.45 ms
INT8 (NPU) 24.46 MB 77.4% 95.2% 5.74 ms

Advancted Quantization Tools

While the default quantization configurations work well for many popular models, more sophisticated models might experience a decline in accuracy due to errors introduced during the quantization process. To address this, Quark APIs offer advanced tools to help recover lost accuracy. Some of these tools are highlighted in the Advanced Quantization Tools tutorial.