Quantization Guide

AOCL-DLP supports quantized GEMM operations for efficient inference workloads. This guide covers symmetric quantization, mixed-precision workflows, and how to configure scale factors and zero-points.

Quantization Concepts

Quantization maps floating-point values to lower-precision integers for faster computation and smaller memory footprint.

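Symmetric quantization centers the quantized range around zero, so no offset is needed. In the formulas below, scale multiplies the float value; the dequantization factors passed to the SCALE post-op later in this guide multiply the integer result instead, so they correspond to reciprocals of these scales: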

q = round(x * scale)
x = q / scale

Asymmetric quantization uses a zero-point offset:

q = round(x * scale) - zero_point
x = (q + zero_point) / scale
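The formulas above can be sketched in plain C as follows. This is a minimal illustration that does not use AOCL-DLP; the function names, the saturation ranges ([-127, 127] for s8, [0, 255] for u8) and the round-half-away-from-zero rule are assumptions made for the example, not behavior documented by the library.

```c
#include <stdint.h>

/* Round half away from zero without needing <math.h>.
 * Assumes the value fits comfortably in int32_t. */
static int32_t round_to_i32(float v) {
    return (int32_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

static int32_t clamp_i32(int32_t v, int32_t lo, int32_t hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Symmetric: q = round(x * scale), saturated to [-127, 127]. */
int8_t quantize_sym_s8(float x, float scale) {
    return (int8_t)clamp_i32(round_to_i32(x * scale), -127, 127);
}

/* Symmetric inverse: x = q / scale. */
float dequantize_sym_s8(int8_t q, float scale) {
    return (float)q / scale;
}

/* Asymmetric: q = round(x * scale) - zero_point, saturated to [0, 255]. */
uint8_t quantize_asym_u8(float x, float scale, int32_t zero_point) {
    return (uint8_t)clamp_i32(round_to_i32(x * scale) - zero_point, 0, 255);
}

/* Asymmetric inverse: x = (q + zero_point) / scale. */
float dequantize_asym_u8(uint8_t q, float scale, int32_t zero_point) {
    return (float)((int32_t)q + zero_point) / scale;
}
```

With scale = 100 and zero_point = -128, for example, the float range [-1.28, 1.27] maps onto the full u8 range [0, 255].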

Integer GEMM Variants

For workloads where both inputs are already quantized:

Input A	Input B	Accumulator	Outputs	Function pattern
u8	s8	s32	s32, s8, u8, f32, bf16	`aocl_gemm_u8s8s32o*`
s8	s8	s32	s32, s8, u8, f32, bf16	`aocl_gemm_s8s8s32o*`

These variants compute the GEMM in integer arithmetic (using AVX512_VNNI when available) and can output in various types. Use SCALE and BIAS post-ops for dequantization.

Example: Quantized Inference Layer

#include <aocl_dlp.h>

// Quantized activations (uint8) and weights (int8).
// M, N, K are compile-time sizes; m, n, k in the call below hold the same values.
uint8_t activations[M * K] = { /* quantized input */ };
int8_t  weights[K * N]     = { /* quantized weights */ };
int32_t output[M * N]      = {0};

// Basic quantized GEMM (no post-ops, raw int32 accumulation)
aocl_gemm_u8s8s32os32(
    'R', 'N', 'N', m, n, k,
    1,                          // alpha (int32)
    activations, lda, 'N',
    weights, ldb, 'N',
    0,                          // beta (int32)
    output, ldc, NULL);
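For checking results on small inputs, a naive reference for the same computation can be useful. The sketch below is library-independent and assumes the conventional GEMM meaning C = alpha * (A x B) + beta * C with row-major, non-transposed operands; the name ref_gemm_u8s8s32 is hypothetical and is not part of AOCL-DLP.

```c
#include <stddef.h>
#include <stdint.h>

/* Naive reference: C = alpha * (A x B) + beta * C, row-major, no transpose.
 * A is m x k (uint8), B is k x n (int8), C is m x n (int32). */
void ref_gemm_u8s8s32(size_t m, size_t n, size_t k,
                      int32_t alpha,
                      const uint8_t *a, size_t lda,
                      const int8_t *b, size_t ldb,
                      int32_t beta,
                      int32_t *c, size_t ldc) {
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            int32_t acc = 0;
            for (size_t p = 0; p < k; ++p) {
                acc += (int32_t)a[i * lda + p] * (int32_t)b[p * ldb + j];
            }
            /* With beta == 0 the previous contents of C are not read. */
            int32_t prev = (beta == 0) ? 0 : beta * c[i * ldc + j];
            c[i * ldc + j] = alpha * acc + prev;
        }
    }
}
```

For row-major, non-transposed matrices the leading dimensions are lda = k, ldb = n and ldc = n.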

Dequantize Output with Post-Ops

To get float output from integer GEMM with dequantization:

// Scale factors for dequantization (one per output channel)
float scale_vals[N] = { /* calibrated scales */ };
dlp_sf_t sf = {
    .scale_factor      = scale_vals,
    .scale_factor_len  = n,
    .scale_factor_type = DLP_F32,
    .scale_factor_dim  = DLP_PARAM_DIM_PER_CHANNEL  // required for SCALE
};
dlp_scale_t scale_op = { .sf = &sf, .zp = NULL };

// Bias (applied after scaling)
float bias_vals[N] = { /* bias per channel */ };
dlp_post_op_bias bias_op = {
    .bias = bias_vals, .stor_type = DLP_F32, .sf = NULL, .zp = NULL
};

// Chain: SCALE then BIAS
DLP_POST_OP_TYPE seq[] = { SCALE, BIAS };

dlp_metadata_t meta = {0};
meta.seq_length = 2;
meta.seq_vector = seq;
meta.scale      = &scale_op;
meta.bias       = &bias_op;

float output_f32[M * N];
aocl_gemm_u8s8s32of32(
    'R', 'N', 'N', m, n, k,
    1, activations, lda, 'N',
    weights, ldb, 'N',
    0, output_f32, ldc, &meta);
// output_f32[i][j] = bias_vals[j] + scale_vals[j] * (activations * weights)[i][j]
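The arithmetic of the SCALE then BIAS chain can be written out as a small reference routine, which may help when validating post-op results. This is a sketch in plain C, not library code: ref_scale_bias_per_channel and dequant_scale are hypothetical names, and the helper assumes purely symmetric quantization of both operands using the q = round(x * scale) convention from Quantization Concepts (a non-zero zero-point would need an extra correction term that is not shown here).

```c
#include <stddef.h>
#include <stdint.h>

/* If x ~ q_a / act_scale and w ~ q_w / w_scale, then
 * x * w ~ (q_a * q_w) / (act_scale * w_scale). */
float dequant_scale(float act_scale, float w_scale) {
    return 1.0f / (act_scale * w_scale);
}

/* out[i][j] = bias[j] + scale[j] * acc[i][j]  (SCALE first, then BIAS).
 * acc holds the int32 GEMM accumulators; scale and bias have n entries. */
void ref_scale_bias_per_channel(size_t m, size_t n,
                                const int32_t *acc, size_t ld_acc,
                                const float *scale, const float *bias,
                                float *out, size_t ld_out) {
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            out[i * ld_out + j] =
                bias[j] + scale[j] * (float)acc[i * ld_acc + j];
        }
    }
}
```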

Per-Token (PerM) Dequantization

In W8A8 inference the serving stack pre-quantizes activations one row at a time: each activation row (token) is quantized to int8 using its own scale, derived from that row's dynamic range. The library then runs the s8 x s8 integer GEMM, accumulating in s32, and the per-row dequantization is applied as a SCALE post-op with one scale per output row (length m) and scale_factor_dim = DLP_PARAM_DIM_PER_TOKEN.

This is the per-row counterpart of the per-channel (length n) dequantization shown above -- the scale varies along m (rows) instead of n (columns):

Granularity	`scale_factor_dim`	`scale_factor_len`
Per-tensor	`DLP_PARAM_DIM_PER_TENSOR`	`1`
Per-channel	`DLP_PARAM_DIM_PER_CHANNEL`	`n` (one per column)
Per-token (PerM)	`DLP_PARAM_DIM_PER_TOKEN`	`m` (one per row)

// Per-row dequantization scales recorded when activations were quantized
float a_dequant_sf[M] = { /* ... one value per output row */ };

dlp_sf_t sf = {
    .scale_factor      = a_dequant_sf,
    .scale_factor_len  = m,                       // one scale per row
    .scale_factor_type = DLP_F32,
    .scale_factor_dim  = DLP_PARAM_DIM_PER_TOKEN  // per-token / PerM granularity
};
dlp_scale_t scale_op = { .sf = &sf, .zp = NULL };

DLP_POST_OP_TYPE seq[] = { SCALE };

dlp_metadata_t meta = {0};
meta.seq_length = 1;
meta.seq_vector = seq;
meta.scale      = &scale_op;

// s8 x s8 -> s32 accumulate, then per-token SCALE to f32
float output_f32[M * N];
aocl_gemm_s8s8s32of32(
    'R', 'N', 'N', m, n, k,
    1, activations_s8, lda, 'N',
    weights_s8, ldb, 'N',
    0, output_f32, ldc, &meta);
// output_f32[i][j] = a_dequant_sf[i] * (activations_s8 * weights_s8)[i][j]

Per-token SCALE also covers the decoder hot path, where the GEMM degenerates to a GEMV with n = 1 (one output column, m rows): each of the m rows still receives its own dequantization scale rather than a single broadcast value.

See simple_gemm_per_token_quant.c for a complete, self-contained example covering both the general m x n case and the n = 1 decoder case.
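The serving-side step described at the start of this section, quantizing each activation row with its own scale, might look like the sketch below. It is plain C and independent of AOCL-DLP; quantize_rows_s8 is a hypothetical name, and the absmax / 127 rule is one common choice rather than something this guide prescribes. The dequant_sf array it fills holds one factor per row, which is the shape a per-token SCALE post-op expects.

```c
#include <stddef.h>
#include <stdint.h>

/* Quantize each row of an m x k float matrix to int8 with its own scale.
 * dequant_sf[i] receives the factor that maps row i back to float
 * (absmax / 127), so that x[i][p] ~ dequant_sf[i] * q[i][p]. */
void quantize_rows_s8(size_t m, size_t k,
                      const float *x, size_t ldx,
                      int8_t *q, size_t ldq,
                      float *dequant_sf) {
    for (size_t i = 0; i < m; ++i) {
        /* Dynamic range of this row. */
        float absmax = 0.0f;
        for (size_t p = 0; p < k; ++p) {
            float v = x[i * ldx + p];
            float av = v < 0.0f ? -v : v;
            if (av > absmax) absmax = av;
        }
        /* All-zero rows get factor 1 to avoid dividing by zero. */
        float sf = absmax > 0.0f ? absmax / 127.0f : 1.0f;
        dequant_sf[i] = sf;
        for (size_t p = 0; p < k; ++p) {
            float scaled = x[i * ldx + p] / sf;
            /* Round half away from zero, then saturate. */
            int32_t r = (int32_t)(scaled >= 0.0f ? scaled + 0.5f
                                                 : scaled - 0.5f);
            if (r > 127) r = 127;
            if (r < -127) r = -127;
            q[i * ldq + p] = (int8_t)r;
        }
    }
}
```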

Symmetric Quantization GEMM

AOCL-DLP provides specialized symmetric quantization variants that handle grouped quantization natively:

aocl_gemm_s8s8s32of32_sym_quant
aocl_gemm_s8s8s32obf16_sym_quant

The reorder functions (aocl_get_reorder_buf_size_s8s8s32os32_sym_quant and aocl_reorder_s8s8s32os32_sym_quant) accept a DLP_SYMM_STAT_QUANT* parameter to pack quantization group metadata alongside the reordered matrix. The GEMM call itself uses the standard signature with dlp_metadata_t* as the last parameter.

// Symmetric quantization config
DLP_SYMM_STAT_QUANT symq = {
    .group_size = 128   // quantization group size (e.g., 128 elements per group)
};

// Reorder weights with symmetric quantization metadata
msz_t buf_size = aocl_get_reorder_buf_size_s8s8s32os32_sym_quant(
    'R', 'N', 'B', k, n, &symq, NULL);

int8_t *b_reordered = (int8_t *)malloc(buf_size);  // needs <stdlib.h>; check for NULL before use
aocl_reorder_s8s8s32os32_sym_quant(
    'R', 'N', 'B', weights, b_reordered, k, n, ldb, &symq, NULL);

// Compute with symmetric quantization
float output_f32[M * N];
aocl_gemm_s8s8s32of32_sym_quant(
    'R', 'N', 'N', m, n, k,
    1, activations_s8, lda, 'N',
    b_reordered, ldb, 'R',
    0, output_f32, ldc, NULL);
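To make the idea of grouped quantization concrete, the sketch below computes one dot product in which every group of group_size consecutive elements carries its own scale. It is a numerical illustration only: it assumes groups run along the k dimension and that both operands have per-group dequantization factors, neither of which this guide confirms for the _sym_quant kernels, and grouped_sym_dot is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of two int8 vectors of length k that were quantized in
 * groups of group_size consecutive elements. a_sf[g] and b_sf[g] are the
 * dequantization factors of group g. Each group is accumulated in int32
 * and scaled once, then the partial results are summed in float. */
float grouped_sym_dot(const int8_t *a, const float *a_sf,
                      const int8_t *b, const float *b_sf,
                      size_t k, size_t group_size) {
    if (group_size == 0) group_size = k;  /* treat 0 as "one group" */
    float result = 0.0f;
    size_t g = 0;
    for (size_t start = 0; start < k; start += group_size, ++g) {
        size_t end = start + group_size < k ? start + group_size : k;
        int32_t acc = 0;
        for (size_t p = start; p < end; ++p) {
            acc += (int32_t)a[p] * (int32_t)b[p];
        }
        result += a_sf[g] * b_sf[g] * (float)acc;
    }
    return result;
}
```

Smaller groups track local value ranges more closely at the cost of storing more scale factors: a vector of length k needs ceil(k / group_size) factors per operand.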

Mixed-Precision Quantized GEMM

For workloads where activations are in higher precision and weights are quantized:

Input A (activations)	Input B (weights)	Accumulator	Outputs	Use case
bf16	s8	s32	s32, f32, bf16, s8, u8	BF16 activations with int8 weights
bf16	s4	f32	f32, bf16	BF16 activations with 4-bit weights
bf16	u4	f32	f32, bf16	BF16 activations with unsigned 4-bit weights
f32	s8	s32	s32, f32, bf16, s8, u8	F32 activations with int8 weights

These variants handle on-the-fly quantization of the higher-precision input internally.
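Callers that hold activations in f32 need to convert them before using the bf16 rows of the table. The sketch below shows the standard bit-level conversion, since bf16 is the upper 16 bits of an IEEE-754 f32. Storing bf16 values as uint16_t is an assumption made here; the element type name used by AOCL-DLP is not given in this guide, and the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* f32 -> bf16 with round-to-nearest-even. */
uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        /* NaN: truncate and force a quiet-NaN mantissa bit. */
        return (uint16_t)((bits >> 16) | 0x0040u);
    }
    /* Add 0x7FFF plus the lowest kept bit so that ties go to even. */
    uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding) >> 16);
}

/* bf16 -> f32 is exact: place the 16 bits in the upper half. */
float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```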

The `dlp_quant_op` Structure

For advanced quantization workflows, dlp_metadata_t supports pre- and post-quantization operations via the dlp_quant_op struct:

typedef struct {
    md_t      group_size;  // elements per quantization group
    DLP_TYPE  src_type;    // source type (e.g., DLP_BF16)
    DLP_TYPE  dst_type;    // destination type (e.g., DLP_S8)
    dlp_sf_t* scl;         // scale factors
    dlp_zp_t* zp;          // zero-points (NULL for symmetric)
    bool      symmetric;   // true = symmetric, false = asymmetric
} dlp_quant_op;

These can be attached to dlp_metadata_t as:

a_pre_quant / b_pre_quant -- quantize inputs before GEMM
a_post_quant / b_post_quant -- quantize after GEMM

Tips

Calibrate scales carefully -- Scale factors significantly impact accuracy. Use representative calibration data.
Validate against float baselines -- Compare quantized output against f32 GEMM to verify acceptable accuracy loss.
Prefer per-channel quantization -- It typically gives better accuracy than per-tensor scales at minimal performance cost.
Use per-token (PerM) scales for activations in W8A8 inference -- set scale_factor_dim = DLP_PARAM_DIM_PER_TOKEN with scale_factor_len = m on the SCALE post-op (one scale per row/token).
Reorder quantized weights -- Pre-reorder weights for repeated inference calls using aocl_reorder_* functions.
Choose output type wisely -- Writing quantized output (os8, ou8) avoids a separate requantization pass.
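The "validate against float baselines" tip can be turned into a small check such as the one below. It is a plain C sketch with hypothetical helper names; suitable tolerances depend on the model and on the quantization scheme, so none are suggested here.

```c
#include <stddef.h>

/* Largest absolute element-wise difference between two buffers. */
float max_abs_diff(const float *ref, const float *test, size_t count) {
    float worst = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        float d = test[i] - ref[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}

/* Returns 1 if every element satisfies
 * |test - ref| <= atol + rtol * |ref|, otherwise 0. */
int all_close(const float *ref, const float *test, size_t count,
              float rtol, float atol) {
    for (size_t i = 0; i < count; ++i) {
        float d = test[i] - ref[i];
        if (d < 0.0f) d = -d;
        float r = ref[i] < 0.0f ? -ref[i] : ref[i];
        if (d > atol + rtol * r) return 0;
    }
    return 1;
}
```

Here ref would hold the output of an f32 GEMM on the unquantized data and test the dequantized output of the quantized path.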
