Quantization Guide
AOCL-DLP supports quantized GEMM operations for efficient inference workloads. This guide covers symmetric quantization, mixed-precision workflows, and how to configure scale factors and zero-points.
Quantization maps floating-point values to lower-precision integers for faster computation and smaller memory footprint.
Symmetric quantization centers the quantized range around zero:
q = round(x * scale)
x = q / scale
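As a minimal sketch of the symmetric mapping above, the following plain C helpers quantize a float to int8 and back. The function names, the clamp to [-127, 127], and the round-half-away-from-zero rule are illustrative assumptions, not part of the AOCL-DLP API.

```c
#include <assert.h>
#include <stdint.h>

/* Round half away from zero without pulling in libm (illustrative helper). */
int32_t sym_round_to_i32(float v)
{
    return (int32_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* q = round(x * scale), clamped to [-127, 127] so the range stays
 * centered on zero. `scale` is a multiplier, e.g. 127 / max|x|. */
int8_t quantize_sym_s8(float x, float scale)
{
    assert(scale > 0.0f);
    int32_t q = sym_round_to_i32(x * scale);
    if (q > 127)  q = 127;
    if (q < -127) q = -127;
    return (int8_t)q;
}

/* Inverse mapping: x is approximately q / scale. */
float dequantize_sym_s8(int8_t q, float scale)
{
    assert(scale > 0.0f);
    return (float)q / scale;
}
```

With scale = 127 / max|x|, the largest-magnitude input lands on +127 or -127 and zero always maps to zero, which is what makes the scheme symmetric.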
Asymmetric quantization uses a zero-point offset:
q = round(x * scale) - zero_point
x = (q + zero_point) / scale
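The asymmetric formulas can be sketched the same way for a uint8 target. This is an illustration under stated assumptions: the helper names are hypothetical, the zero-point is taken as round(x_min * scale) so that x_min maps to 0 under the sign convention used in this guide, and the clamp to [0, 255] is added for safety.

```c
#include <assert.h>
#include <stdint.h>

/* Round half away from zero without pulling in libm (illustrative helper). */
int32_t asym_round_to_i32(float v)
{
    return (int32_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* q = round(x * scale) - zero_point, clamped to the uint8 range. */
uint8_t quantize_asym_u8(float x, float scale, int32_t zero_point)
{
    assert(scale > 0.0f);
    int32_t q = asym_round_to_i32(x * scale) - zero_point;
    if (q < 0)   q = 0;
    if (q > 255) q = 255;
    return (uint8_t)q;
}

/* x is approximately (q + zero_point) / scale. */
float dequantize_asym_u8(uint8_t q, float scale, int32_t zero_point)
{
    assert(scale > 0.0f);
    return (float)((int32_t)q + zero_point) / scale;
}
```

For an input range of [-1, 3], scale = 255 / 4 = 63.75 and zero_point = round(-1 * 63.75) = -64, so -1 maps to 0 and 3 maps to 255.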
For workloads where both inputs are already quantized:
| Input A | Input B | Accumulator | Outputs | Function pattern |
|---|---|---|---|---|
| u8 | s8 | s32 | s32, s8, u8, f32, bf16 | aocl_gemm_u8s8s32o* |
| s8 | s8 | s32 | s32, s8, u8, f32, bf16 | aocl_gemm_s8s8s32o* |
These variants compute the GEMM in integer arithmetic (using AVX512_VNNI when available) and can output in various types. Use SCALE and BIAS post-ops for dequantization.
#include <aocl_dlp.h>
// Quantized activations (uint8) and weights (int8)
uint8_t activations[M * K] = { /* quantized input */ };
int8_t weights[K * N] = { /* quantized weights */ };
int32_t output[M * N] = {0};
// Basic quantized GEMM (no post-ops, raw int32 accumulation)
aocl_gemm_u8s8s32os32(
'R', 'N', 'N', m, n, k,
1, // alpha (int32)
activations, lda, 'N',
weights, ldb, 'N',
0, // beta (int32)
output, ldc, NULL);
To get float output from integer GEMM with dequantization:
// Scale factors for dequantization (one per output channel)
float scale_vals[N] = { /* calibrated scales */ };
dlp_sf_t sf = {
.scale_factor = scale_vals,
.scale_factor_len = n,
.scale_factor_type = DLP_F32,
.scale_factor_dim = DLP_PARAM_DIM_PER_CHANNEL // required for SCALE
};
dlp_scale_t scale_op = { .sf = &sf, .zp = NULL };
// Bias (applied after scaling)
float bias_vals[N] = { /* bias per channel */ };
dlp_post_op_bias bias_op = {
.bias = bias_vals, .stor_type = DLP_F32, .sf = NULL, .zp = NULL
};
// Chain: SCALE then BIAS
DLP_POST_OP_TYPE seq[] = { SCALE, BIAS };
dlp_metadata_t meta = {0};
meta.seq_length = 2;
meta.seq_vector = seq;
meta.scale = &scale_op;
meta.bias = &bias_op;
float output_f32[M * N];
aocl_gemm_u8s8s32of32(
'R', 'N', 'N', m, n, k,
1, activations, lda, 'N',
weights, ldb, 'N',
0, output_f32, ldc, &meta);
// output_f32 = bias + scale * (activations * weights)
In W8A8 inference the serving stack pre-quantizes activations one row at a time: each activation row (token) is quantized to int8 using its own scale, derived from that row's dynamic range. The library then runs the integer GEMM on s8 x s8, accumulating in s32, and the per-row dequantization is applied as a SCALE post-op with one scale per output row (length m) and scale_factor_dim = DLP_PARAM_DIM_PER_TOKEN.
This is the per-row counterpart of per-channel (length n) dequantization shown above -- the scale varies along m (rows) instead of n (columns):
| Granularity | scale_factor_dim | scale_factor_len |
|---|---|---|
| Per-tensor | DLP_PARAM_DIM_PER_TENSOR | 1 |
| Per-channel | DLP_PARAM_DIM_PER_CHANNEL | n (one per column) |
| Per-token (PerM) | DLP_PARAM_DIM_PER_TOKEN | m (one per row) |
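To make the three granularities concrete, here is a library-independent reference sketch of how a scale would be selected for output element (i, j) of a row-major m x n accumulator. The enum and function names are hypothetical and only mirror the table; they are not the AOCL-DLP types, and the library's own kernels may be organized differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the scale_factor_dim choices in the table. */
typedef enum { GRAN_PER_TENSOR, GRAN_PER_CHANNEL, GRAN_PER_TOKEN } scale_gran_t;

/* out[i][j] = scales[idx] * acc[i][j], where idx depends on granularity:
 *   per-tensor  -> 0 (scales has 1 entry)
 *   per-channel -> j (scales has n entries, one per column)
 *   per-token   -> i (scales has m entries, one per row) */
void dequant_s32_to_f32(const int32_t *acc, float *out, size_t m, size_t n,
                        const float *scales, scale_gran_t gran)
{
    assert(acc && out && scales);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            size_t idx = (gran == GRAN_PER_TENSOR)  ? 0
                       : (gran == GRAN_PER_CHANNEL) ? j
                                                    : i;
            out[i * n + j] = scales[idx] * (float)acc[i * n + j];
        }
    }
}
```

A reference like this is mainly useful for checking the post-op output on small matrices.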
// Per-row dequantization scales recorded when activations were quantized
float a_dequant_sf[M] = { /* ... one value per output row */ };
dlp_sf_t sf = {
.scale_factor = a_dequant_sf,
.scale_factor_len = m, // one scale per row
.scale_factor_type = DLP_F32,
.scale_factor_dim = DLP_PARAM_DIM_PER_TOKEN // per-token / PerM granularity
};
dlp_scale_t scale_op = { .sf = &sf, .zp = NULL };
DLP_POST_OP_TYPE seq[] = { SCALE };
dlp_metadata_t meta = {0};
meta.seq_length = 1;
meta.seq_vector = seq;
meta.scale = &scale_op;
// s8 x s8 -> s32 accumulate, then per-token SCALE to f32
float output_f32[M * N];
aocl_gemm_s8s8s32of32(
'R', 'N', 'N', m, n, k,
1, activations_s8, lda, 'N',
weights_s8, ldb, 'N',
0, output_f32, ldc, &meta);
// output_f32[i][j] = a_dequant_sf[i] * (activations_s8 * weights_s8)[i][j]
Per-token SCALE also covers the decoder hot path, where the GEMM degenerates to a GEMV with n = 1 (one output column, m rows): each of the m rows still receives its own dequantization scale rather than a single broadcast value.
See simple_gemm_per_token_quant.c for a complete, self-contained example covering both the general m x n case and the n = 1 decoder case.
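The pre-quantization step that produces the per-row scales happens outside the library. A minimal sketch of what that step might look like is shown here, assuming symmetric int8 quantization with scale = 127 / max|row| and a row-major m x k activation matrix. The function name is hypothetical. If the weights carry their own scale, it would also have to be folded into the values passed to SCALE, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Quantize each row of x (m x k, row-major) to int8 with its own scale and
 * record the matching dequantization scale (1 / scale) per row. */
void quantize_rows_s8(const float *x, size_t m, size_t k,
                      int8_t *q, float *dequant_sf)
{
    assert(x && q && dequant_sf);
    for (size_t i = 0; i < m; ++i) {
        const float *row = x + i * k;
        float max_abs = 0.0f;
        for (size_t p = 0; p < k; ++p) {
            float a = row[p] < 0.0f ? -row[p] : row[p];
            if (a > max_abs) max_abs = a;
        }
        /* An all-zero row quantizes to zeros, so any scale works; use 1. */
        float scale = max_abs > 0.0f ? 127.0f / max_abs : 1.0f;
        dequant_sf[i] = 1.0f / scale;
        for (size_t p = 0; p < k; ++p) {
            float v = row[p] * scale;
            /* Round half away from zero, then clamp to [-127, 127]. */
            int32_t r = (int32_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
            if (r > 127)  r = 127;
            if (r < -127) r = -127;
            q[i * k + p] = (int8_t)r;
        }
    }
}
```

The dequant_sf array produced here has length m, which matches the scale_factor_len expected for per-token granularity.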
AOCL-DLP provides specialized symmetric quantization variants that handle grouped quantization natively:
- aocl_gemm_s8s8s32of32_sym_quant
- aocl_gemm_s8s8s32obf16_sym_quant
The reorder functions (aocl_get_reorder_buf_size_s8s8s32os32_sym_quant and aocl_reorder_s8s8s32os32_sym_quant) accept a DLP_SYMM_STAT_QUANT* parameter to pack quantization group metadata alongside the reordered matrix. The GEMM call itself uses the standard signature with dlp_metadata_t* as the last parameter.
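To illustrate what a quantization group means, the sketch below quantizes a vector in contiguous groups of group_size elements, each with its own symmetric scale. It is plain C and makes several assumptions: the function name is hypothetical, groups are contiguous runs along one vector, and the scale is 127 / max|x| per group. It does not reproduce the packed layout the reorder functions generate or the internals of DLP_SYMM_STAT_QUANT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Group-wise symmetric int8 quantization: one scale per group of
 * `group_size` consecutive elements. `scales` must hold len / group_size
 * entries; `len` must be a multiple of `group_size`. */
void quantize_groups_s8(const float *x, size_t len, size_t group_size,
                        int8_t *q, float *scales)
{
    assert(x && q && scales);
    assert(group_size > 0 && len % group_size == 0);
    for (size_t g = 0; g < len / group_size; ++g) {
        const float *xg = x + g * group_size;
        float max_abs = 0.0f;
        for (size_t i = 0; i < group_size; ++i) {
            float a = xg[i] < 0.0f ? -xg[i] : xg[i];
            if (a > max_abs) max_abs = a;
        }
        float scale = max_abs > 0.0f ? 127.0f / max_abs : 1.0f;
        scales[g] = scale;
        for (size_t i = 0; i < group_size; ++i) {
            float v = xg[i] * scale;
            /* Round half away from zero, then clamp to [-127, 127]. */
            int32_t r = (int32_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
            if (r > 127)  r = 127;
            if (r < -127) r = -127;
            q[g * group_size + i] = (int8_t)r;
        }
    }
}
```

Smaller groups track local dynamic range more closely at the cost of storing more scales. A group size of 128, as in the configuration that follows, is a common middle ground.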
// Symmetric quantization config
DLP_SYMM_STAT_QUANT symq = {
.group_size = 128 // quantization group size (e.g., 128 elements per group)
};
// Reorder weights with symmetric quantization metadata
msz_t buf_size = aocl_get_reorder_buf_size_s8s8s32os32_sym_quant(
'R', 'N', 'B', k, n, &symq, NULL);
int8_t *b_reordered = (int8_t *)malloc(buf_size); // needs <stdlib.h>; check for NULL, free() when done
aocl_reorder_s8s8s32os32_sym_quant(
'R', 'N', 'B', weights, b_reordered, k, n, ldb, &symq, NULL);
// Compute with symmetric quantization
float output_f32[M * N];
aocl_gemm_s8s8s32of32_sym_quant(
'R', 'N', 'N', m, n, k,
1, activations_s8, lda, 'N',
b_reordered, ldb, 'R',
0, output_f32, ldc, NULL);
For workloads where activations are in higher precision and weights are quantized:
| Input A (activations) | Input B (weights) | Accumulator | Outputs | Use case |
|---|---|---|---|---|
| bf16 | s8 | s32 | s32, f32, bf16, s8, u8 | BF16 activations with int8 weights |
| bf16 | s4 | f32 | f32, bf16 | BF16 activations with 4-bit weights |
| bf16 | u4 | f32 | f32, bf16 | BF16 activations with unsigned 4-bit weights |
| f32 | s8 | s32 | s32, f32, bf16, s8, u8 | F32 activations with int8 weights |
These variants handle on-the-fly quantization of the higher-precision input internally.
For advanced quantization workflows, dlp_metadata_t supports pre- and post-quantization operations via the dlp_quant_op struct:
typedef struct {
md_t group_size; // elements per quantization group
DLP_TYPE src_type; // source type (e.g., DLP_BF16)
DLP_TYPE dst_type; // destination type (e.g., DLP_S8)
dlp_sf_t* scl; // scale factors
dlp_zp_t* zp; // zero-points (NULL for symmetric)
bool symmetric; // true = symmetric, false = asymmetric
} dlp_quant_op;
These can be attached to dlp_metadata_t as:
- a_pre_quant/b_pre_quant -- quantize inputs before GEMM
- a_post_quant/b_post_quant -- quantize after GEMM
- Calibrate scales carefully -- Scale factors significantly impact accuracy. Use representative calibration data.
- Validate against float baselines -- Compare quantized output against f32 GEMM to verify acceptable accuracy loss.
- Use per-channel quantization for better accuracy at minimal performance cost compared to per-tensor.
- Use per-token (PerM) scales for activations in W8A8 inference -- set scale_factor_dim = DLP_PARAM_DIM_PER_TOKEN with scale_factor_len = m on the SCALE post-op (one scale per row/token).
- Reorder quantized weights -- Pre-reorder weights for repeated inference calls using aocl_reorder_* functions.
- Choose output type wisely -- Writing quantized output (os8, ou8) avoids a separate requantization pass.
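For the "validate against float baselines" practice above, a small reference harness is often enough. The sketch below is plain C with hypothetical helper names and assumes row-major storage. It computes an f32 reference GEMM and the largest absolute difference between two output buffers, which can then be compared against a tolerance chosen for the workload.

```c
#include <assert.h>
#include <stddef.h>

/* Reference row-major GEMM: C[m x n] = A[m x k] * B[k x n], all f32. */
void ref_gemm_f32(const float *a, const float *b, float *c,
                  size_t m, size_t n, size_t k)
{
    assert(a && b && c);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}

/* Largest absolute element-wise difference between two buffers. */
float max_abs_diff(const float *x, const float *y, size_t len)
{
    assert(x && y);
    float worst = 0.0f;
    for (size_t i = 0; i < len; ++i) {
        float d = x[i] - y[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

A typical check runs ref_gemm_f32 on the original float inputs, runs the quantized path on the same data, and compares the two m x n outputs with max_abs_diff.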
- GEMM Guide -- All GEMM variants and parameter details
- Post-Ops Guide -- SCALE and BIAS post-ops for dequantization
- Examples -- quantization.c, simple_gemm_s8.c
- API Reference -- Generated API docs