Quantization Guide

Nallani Bhaskar edited this page Jun 15, 2026 · 4 revisions

AOCL-DLP supports quantized GEMM operations for efficient inference workloads. This guide covers symmetric quantization, mixed-precision workflows, and how to configure scale factors and zero-points.

Quantization Concepts

Quantization maps floating-point values to lower-precision integers for faster computation and a smaller memory footprint.

Symmetric quantization centers the quantized range around zero:

q = round(x * scale)
x = q / scale

Asymmetric quantization uses a zero-point offset:

q = round(x * scale) - zero_point
x = (q + zero_point) / scale

Integer GEMM Variants

For workloads where both inputs are already quantized:

| Input A | Input B | Accumulator | Outputs | Function pattern |
|---|---|---|---|---|
| u8 | s8 | s32 | s32, s8, u8, f32, bf16 | aocl_gemm_u8s8s32o* |
| s8 | s8 | s32 | s32, s8, u8, f32, bf16 | aocl_gemm_s8s8s32o* |

These variants compute the GEMM in integer arithmetic (using AVX512_VNNI when available) and can write the result in any of the output types listed above. Use SCALE and BIAS post-ops for dequantization.
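
To make the arithmetic concrete, here is a naive reference for the u8 x s8 -> s32 case. It is a sketch for checking results on small inputs, not the library's kernel: the function name `ref_gemm_u8s8s32` is hypothetical, row-major storage is assumed, and skipping the read of C when beta is 0 follows the usual BLAS convention rather than anything stated in this guide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naive row-major reference: C = alpha * (A x B) + beta * C, with
 * A (m x k, uint8), B (k x n, int8) and C (m x n, int32).
 * lda, ldb and ldc are row strides measured in elements. */
static void ref_gemm_u8s8s32(size_t m, size_t n, size_t k,
                             int32_t alpha,
                             const uint8_t *a, size_t lda,
                             const int8_t *b, size_t ldb,
                             int32_t beta,
                             int32_t *c, size_t ldc) {
    assert(lda >= k && ldb >= n && ldc >= n);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            int32_t acc = 0;  /* products are widened to 32 bits before summing */
            for (size_t p = 0; p < k; ++p) {
                acc += (int32_t)a[i * lda + p] * (int32_t)b[p * ldb + j];
            }
            /* With beta == 0 the old contents of C are never read. */
            int32_t prev = (beta == 0) ? 0 : beta * c[i * ldc + j];
            c[i * ldc + j] = alpha * acc + prev;
        }
    }
}
```

Comparing a library result against a reference like this on a few small shapes is a quick way to catch a wrong leading dimension or a swapped operand.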

Example: Quantized Inference Layer

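#include <stdint.h>     // uint8_t, int8_t, int32_t
#include <aocl_dlp.h>

// M, N, K are compile-time dimensions; m, n, k hold the same values at run time.
// For tightly packed row-major storage: lda = k, ldb = n, ldc = n.
// Quantized activations (uint8) and weights (int8)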
uint8_t activations[M * K] = { /* quantized input */ };
int8_t  weights[K * N]     = { /* quantized weights */ };
int32_t output[M * N]      = {0};

// Basic quantized GEMM (no post-ops, raw int32 accumulation)
aocl_gemm_u8s8s32os32(
    'R', 'N', 'N', m, n, k,
    1,                          // alpha (int32)
    activations, lda, 'N',
    weights, ldb, 'N',
    0,                          // beta (int32)
    output, ldc, NULL);

Dequantize Output with Post-Ops

To get float output from integer GEMM with dequantization:

// Scale factors for dequantization (one per output channel)
float scale_vals[N] = { /* calibrated scales */ };
dlp_sf_t sf = {
    .scale_factor      = scale_vals,
    .scale_factor_len  = n,
    .scale_factor_type = DLP_F32,
    .scale_factor_dim  = DLP_PARAM_DIM_PER_CHANNEL  // required for SCALE
};
dlp_scale_t scale_op = { .sf = &sf, .zp = NULL };

// Bias (applied after scaling)
float bias_vals[N] = { /* bias per channel */ };
dlp_post_op_bias bias_op = {
    .bias = bias_vals, .stor_type = DLP_F32, .sf = NULL, .zp = NULL
};

// Chain: SCALE then BIAS
DLP_POST_OP_TYPE seq[] = { SCALE, BIAS };

dlp_metadata_t meta = {0};
meta.seq_length = 2;
meta.seq_vector = seq;
meta.scale      = &scale_op;
meta.bias       = &bias_op;

float output_f32[M * N];
aocl_gemm_u8s8s32of32(
    'R', 'N', 'N', m, n, k,
    1, activations, lda, 'N',
    weights, ldb, 'N',
    0, output_f32, ldc, &meta);
// output_f32 = bias + scale * (activations * weights)
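
The effect of the SCALE then BIAS chain on the int32 accumulator can be written out as a plain loop. This is a reference sketch of the formula in the comment above, with a hypothetical function name; it does not describe how the library fuses the post-ops internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-channel dequantization followed by bias, applied to an m x n
 * row-major int32 accumulator:
 *   out[i][j] = bias[j] + scale[j] * acc[i][j]
 * scale and bias each hold n values, one per output column. */
static void ref_scale_then_bias(size_t m, size_t n,
                                const int32_t *acc, size_t ld_acc,
                                const float *scale, const float *bias,
                                float *out, size_t ld_out) {
    assert(ld_acc >= n && ld_out >= n);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            out[i * ld_out + j] =
                bias[j] + scale[j] * (float)acc[i * ld_acc + j];
        }
    }
}
```

Because the bias is added after scaling, it is expressed in the dequantized (float) domain, not in accumulator units.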

Per-Token (PerM) Dequantization

In W8A8 inference the serving stack pre-quantizes activations one row at a time: each activation row (token) is quantized to int8 using its own scale, derived from that row's dynamic range. The library then runs the integer GEMM (s8 x s8, accumulating in s32), and the per-row dequantization is applied as a SCALE post-op with one scale per output row (length m) and scale_factor_dim = DLP_PARAM_DIM_PER_TOKEN.

This is the per-row counterpart of per-channel (length n) dequantization shown above -- the scale varies along m (rows) instead of n (columns):

| Granularity | scale_factor_dim | scale_factor_len |
|---|---|---|
| Per-tensor | DLP_PARAM_DIM_PER_TENSOR | 1 |
| Per-channel | DLP_PARAM_DIM_PER_CHANNEL | n (one per column) |
| Per-token (PerM) | DLP_PARAM_DIM_PER_TOKEN | m (one per row) |

// Per-row dequantization scales recorded when activations were quantized
float a_dequant_sf[M] = { /* ... one value per output row */ };

dlp_sf_t sf = {
    .scale_factor      = a_dequant_sf,
    .scale_factor_len  = m,                       // one scale per row
    .scale_factor_type = DLP_F32,
    .scale_factor_dim  = DLP_PARAM_DIM_PER_TOKEN  // per-token / PerM granularity
};
dlp_scale_t scale_op = { .sf = &sf, .zp = NULL };

DLP_POST_OP_TYPE seq[] = { SCALE };

dlp_metadata_t meta = {0};
meta.seq_length = 1;
meta.seq_vector = seq;
meta.scale      = &scale_op;

// s8 x s8 -> s32 accumulate, then per-token SCALE to f32
float output_f32[M * N];
aocl_gemm_s8s8s32of32(
    'R', 'N', 'N', m, n, k,
    1, activations_s8, lda, 'N',
    weights_s8, ldb, 'N',
    0, output_f32, ldc, &meta);
// output_f32[i][j] = a_dequant_sf[i] * (activations_s8 * weights_s8)[i][j]

Per-token SCALE also covers the decoder hot path, where the GEMM degenerates to a GEMV with n = 1 (one output column, m rows): each of the m rows still receives its own dequantization scale rather than a single broadcast value.

See simple_gemm_per_token_quant.c for a complete, self-contained example covering both the general m x n case and the n = 1 decoder case.

Symmetric Quantization GEMM

AOCL-DLP provides specialized symmetric quantization variants that handle grouped quantization natively:

  • aocl_gemm_s8s8s32of32_sym_quant
  • aocl_gemm_s8s8s32obf16_sym_quant

The reorder functions (aocl_get_reorder_buf_size_s8s8s32os32_sym_quant and aocl_reorder_s8s8s32os32_sym_quant) accept a DLP_SYMM_STAT_QUANT* parameter to pack quantization group metadata alongside the reordered matrix. The GEMM call itself uses the standard signature with dlp_metadata_t* as the last parameter.

// Symmetric quantization config
DLP_SYMM_STAT_QUANT symq = {
    .group_size = 128   // quantization group size (e.g., 128 elements per group)
};

// Reorder weights with symmetric quantization metadata
msz_t buf_size = aocl_get_reorder_buf_size_s8s8s32os32_sym_quant(
    'R', 'N', 'B', k, n, &symq, NULL);

int8_t *b_reordered = (int8_t *)malloc(buf_size);  // needs <stdlib.h>; check for NULL, free() when done
aocl_reorder_s8s8s32os32_sym_quant(
    'R', 'N', 'B', weights, b_reordered, k, n, ldb, &symq, NULL);

// Compute with symmetric quantization
float output_f32[M * N];
aocl_gemm_s8s8s32of32_sym_quant(
    'R', 'N', 'N', m, n, k,
    1, activations_s8, lda, 'N',
    b_reordered, ldb, 'R',
    0, output_f32, ldc, NULL);

Mixed-Precision Quantized GEMM

For workloads where activations are in higher precision and weights are quantized:

| Input A (activations) | Input B (weights) | Accumulator | Outputs | Use case |
|---|---|---|---|---|
| bf16 | s8 | s32 | s32, f32, bf16, s8, u8 | BF16 activations with int8 weights |
| bf16 | s4 | f32 | f32, bf16 | BF16 activations with 4-bit weights |
| bf16 | u4 | f32 | f32, bf16 | BF16 activations with unsigned 4-bit weights |
| f32 | s8 | s32 | s32, f32, bf16, s8, u8 | F32 activations with int8 weights |

These variants handle on-the-fly quantization of the higher-precision input internally.

The dlp_quant_op Structure

For advanced quantization workflows, dlp_metadata_t supports pre- and post-quantization operations via the dlp_quant_op struct:

typedef struct {
    md_t      group_size;  // elements per quantization group
    DLP_TYPE  src_type;    // source type (e.g., DLP_BF16)
    DLP_TYPE  dst_type;    // destination type (e.g., DLP_S8)
    dlp_sf_t* scl;         // scale factors
    dlp_zp_t* zp;          // zero-points (NULL for symmetric)
    bool      symmetric;   // true = symmetric, false = asymmetric
} dlp_quant_op;

These can be attached to dlp_metadata_t as:

  • a_pre_quant / b_pre_quant -- quantize inputs before GEMM
  • a_post_quant / b_post_quant -- quantize after GEMM

Tips

  • Calibrate scales carefully -- Scale factors significantly impact accuracy. Use representative calibration data.
  • Validate against float baselines -- Compare quantized output against f32 GEMM to verify acceptable accuracy loss.
  • Use per-channel quantization for better accuracy at minimal performance cost compared to per-tensor.
  • Use per-token (PerM) scales for activations in W8A8 inference -- set scale_factor_dim = DLP_PARAM_DIM_PER_TOKEN with scale_factor_len = m on the SCALE post-op (one scale per row/token).
  • Reorder quantized weights -- Pre-reorder weights for repeated inference calls using aocl_reorder_* functions.
  • Choose output type wisely -- Writing quantized output (os8, ou8) avoids a separate requantization pass.
