GEMM Guide

Nallani Bhaskar edited this page Jun 15, 2026 · 4 revisions

This guide covers AOCL-DLP's General Matrix Multiplication (GEMM) operations: how to choose the right variant, how the parameters work, and how to get the best performance.

Core Operation

Every GEMM call computes:

C = alpha * op(A) * op(B) + beta * C

Where op(X) is either X (no transpose) or X^T (transpose), and optional fused post-operations can be applied to the result via dlp_metadata_t.
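
As a point of comparison, the formula can be written as a plain triple loop. The sketch below is an unoptimized reference for the row-major, no-transpose case only; the name ref_gemm_f32 and the use of size_t instead of md_t are assumptions made for illustration, and it is not how the library's kernels are implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Reference (unoptimized) row-major GEMM without transposes:
 * C = alpha * A * B + beta * C, with A m x k, B k x n, C m x n.
 * ref_gemm_f32 is a hypothetical helper, not an AOCL-DLP function. */
static void ref_gemm_f32(size_t m, size_t n, size_t k,
                         float alpha, const float *a, size_t lda,
                         const float *b, size_t ldb,
                         float beta, float *c, size_t ldc)
{
    assert(lda >= k && ldb >= n && ldc >= n);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += a[i * lda + p] * b[p * ldb + j];
            /* ASSUMPTION (common BLAS convention): beta == 0 overwrites C
             * without reading its previous contents. */
            c[i * ldc + j] = (beta == 0.0f)
                                 ? alpha * acc
                                 : alpha * acc + beta * c[i * ldc + j];
        }
    }
}
```

A loop like this is mainly useful for checking the output of an optimized call on small matrices.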

Naming Convention

Function names follow a consistent pattern:

aocl_gemm_<A_type><B_type><accumulator_type>o<output_type>

| Short Name | C Type | Bits | Description |
|---|---|---|---|
| f32 | float | 32 | Single-precision float |
| f16 / fp16 | float16 | 16 | IEEE 754 half-precision |
| bf16 | bfloat16 | 16 | Brain floating point |
| s8 | int8_t | 8 | Signed 8-bit integer |
| u8 | uint8_t | 8 | Unsigned 8-bit integer |
| s4 | int4 (packed in int8_t) | 4 | Signed 4-bit integer |
| u4 | uint4 (packed in int8_t) | 4 | Unsigned 4-bit integer |
| s32 | int32_t | 32 | Signed 32-bit integer |

Example: aocl_gemm_bf16bf16f32of32 means bfloat16 inputs, float32 accumulation, float32 output.
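
The pattern can be assembled mechanically from the short type names. The sketch below uses two hypothetical helpers (is_known_type and gemm_variant_name, neither of which is part of AOCL-DLP) to build a function name from four type tokens. A well-formed name does not imply that the combination is supported; the tables in the next section decide that.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Short type names from the naming table. */
static const char *const k_type_names[] = {
    "f32", "f16", "bf16", "s8", "u8", "s4", "u4", "s32"
};

/* Returns 1 if t is one of the short type names, 0 otherwise. */
static int is_known_type(const char *t)
{
    assert(t != NULL);
    for (size_t i = 0; i < sizeof k_type_names / sizeof k_type_names[0]; ++i)
        if (strcmp(t, k_type_names[i]) == 0)
            return 1;
    return 0;
}

/* Writes "aocl_gemm_<A><B><acc>o<out>" into buf. Returns the name length,
 * or -1 if a token is unknown or buf is too small. */
static int gemm_variant_name(char *buf, size_t buf_size,
                             const char *a_type, const char *b_type,
                             const char *acc_type, const char *out_type)
{
    assert(buf != NULL);
    if (!is_known_type(a_type) || !is_known_type(b_type) ||
        !is_known_type(acc_type) || !is_known_type(out_type))
        return -1;
    int len = snprintf(buf, buf_size, "aocl_gemm_%s%s%so%s",
                       a_type, b_type, acc_type, out_type);
    return (len < 0 || (size_t)len >= buf_size) ? -1 : len;
}
```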

Supported Data Type Combinations

Float Precision

| Input A | Input B | Accumulator | Supported Outputs | Min ISA |
|---|---|---|---|---|
| f32 | f32 | f32 | f32 | AVX2 |
| f16 | f16 | f16 | f16, f32 | AVX512_FP16 |
| f32 | f16 | f32 | f32 | AVX512 |

BFloat16 Precision

| Input A | Input B | Accumulator | Supported Outputs | Min ISA |
|---|---|---|---|---|
| bf16 | bf16 | f32 | f32, bf16 | AVX2 (*) |
| bf16 | s4 | f32 | f32, bf16 | AVX512 |
| bf16 | u4 | f32 | f32, bf16 | AVX512 |
| bf16 | s8 | s32 | s32, f32, bf16, s8, u8 | AVX512_VNNI |

Float-Quantized Mixed Precision

| Input A | Input B | Accumulator | Supported Outputs | Min ISA |
|---|---|---|---|---|
| f32 | s8 | s32 | s32, f32, bf16, s8, u8 | AVX512_VNNI |

Integer Quantized

| Input A | Input B | Accumulator | Supported Outputs | Min ISA |
|---|---|---|---|---|
| u8 | s8 | s32 | s32, s8, u8, f32, bf16, f16 | AVX512_VNNI |
| s8 | s8 | s32 | s32, s8, u8, f32, bf16, f16 | AVX512_VNNI |

Symmetric Quantization

| Input A | Input B | Accumulator | Supported Outputs | Min ISA |
|---|---|---|---|---|
| s8 | s8 (sym_quant) | s32 | f32, bf16 | AVX512_VNNI |

(*) BFloat16 operations on hardware without native AVX512_BF16 automatically fall back to float32 kernels with transparent conversion. See Library Overview for details.
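
The conversion behind such a fallback relies on the fact that bfloat16 is the upper 16 bits of an IEEE 754 float32. The sketch below illustrates that relationship with two hypothetical helpers. It is not the library's conversion code, and the rounding mode shown (round to nearest even) is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* bfloat16 keeps the sign, the 8 exponent bits, and the top 7 mantissa
 * bits of a float32, so widening is a 16-bit left shift. */
static float bf16_to_f32(uint16_t h)
{
    assert(sizeof(float) == sizeof(uint32_t));
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrowing with round-to-nearest-even (ASSUMED rounding mode). */
static uint16_t f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u)      /* NaN: keep it a quiet NaN */
        return (uint16_t)((bits >> 16) | 0x0040u);
    bits += 0x7FFFu + ((bits >> 16) & 1u);       /* ties go to the even value */
    return (uint16_t)(bits >> 16);
}
```

Widening is exact, while narrowing discards 16 mantissa bits, which is why the bf16 variants accumulate in f32.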

Function Parameters

All GEMM functions share a common parameter pattern. Here is aocl_gemm_f32f32f32of32 as the reference:

aocl_gemm_f32f32f32of32(
    const char      order,        // 'R' = row-major, 'C' = column-major
    const char      transa,       // 'N' = no transpose, 'T' = transpose
    const char      transb,       // 'N' = no transpose, 'T' = transpose
    const md_t      m,            // rows of op(A) and of C
    const md_t      n,            // columns of op(B) and of C
    const md_t      k,            // columns of op(A) / rows of op(B)
    const float     alpha,        // scalar: C = alpha * op(A)*op(B) + beta*C
    const float*    a,            // pointer to matrix A
    const md_t      lda,          // leading dimension of A
    const char      mem_format_a, // 'N' = normal, 'P' = packed, 'R' = reordered
    const float*    b,            // pointer to matrix B
    const md_t      ldb,          // leading dimension of B
    const char      mem_format_b, // 'N' = normal, 'P' = packed, 'R' = reordered
    const float     beta,         // scalar: C = alpha * op(A)*op(B) + beta*C
    float*          c,            // pointer to matrix C (output)
    const md_t      ldc,          // leading dimension of C
    dlp_metadata_t* metadata      // post-operations (NULL for none)
);

Memory Layout (order)

| Value | Layout | Data arrangement | Typical use |
|---|---|---|---|
| 'R' | Row-major | A[i][j] stored at a[i * lda + j] | C/C++ (most common) |
| 'C' | Column-major | A[i][j] stored at a[j * lda + i] | Fortran, BLAS interop |
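
The two addressing rules in the table can be expressed as a single offset function. elem_offset below is a hypothetical helper written for illustration, not an AOCL-DLP function.

```c
#include <assert.h>
#include <stddef.h>

/* Linear offset of element (i, j) for the two layouts in the table. */
static size_t elem_offset(char order, size_t i, size_t j, size_t ld)
{
    assert(order == 'R' || order == 'C');
    return (order == 'R') ? i * ld + j   /* row-major: rows are ld apart */
                          : j * ld + i;  /* column-major: columns are ld apart */
}
```

Because elem_offset('R', i, j, ld) equals elem_offset('C', j, i, ld), a buffer read as a row-major matrix holds the same data as its transpose read as column-major.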

Leading Dimension (lda, ldb, ldc)

The leading dimension is the stride, in elements, between consecutive rows (row-major) or consecutive columns (column-major) in memory. For a row-major matrix with n columns, ld >= n; for a column-major matrix with m rows, ld >= m. Using a leading dimension larger than the minimum allows embedding a matrix within a larger allocation.

Row-major example (m=3, n=4, lda=6):

Memory: [a00 a01 a02 a03 ___ ___ a10 a11 a12 a13 ___ ___ a20 a21 a22 a23 ___ ___]
         |--- lda=6 elements ---|
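
The same idea lets a sub-matrix of a larger row-major buffer be passed without copying: offset the pointer and keep the parent's leading dimension. The helpers below (submatrix_view and sum_matrix) are hypothetical and only illustrate the addressing.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer to the sub-matrix starting at (row0, col0) inside a row-major
 * parent whose rows are parent_ld elements apart. The view reuses the
 * parent's leading dimension. */
static const float *submatrix_view(const float *parent, size_t parent_ld,
                                   size_t row0, size_t col0)
{
    assert(parent != NULL && col0 < parent_ld);
    return parent + row0 * parent_ld + col0;
}

/* Sums an m x n row-major matrix, skipping the ld - n padding elements
 * at the end of each row. */
static float sum_matrix(const float *a, size_t m, size_t n, size_t ld)
{
    assert(a != NULL && ld >= n);
    float s = 0.0f;
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            s += a[i * ld + j];
    return s;
}
```

For the layout drawn above, sum_matrix(a, 3, 4, 6) visits the twelve matrix elements and never touches the padding.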

Transpose Options (transa, transb)

| Value | Meaning | Effect |
|---|---|---|
| 'N' | No transpose | Use matrix as-is |
| 'T' | Transpose | Swap rows and columns |

When transa = 'T', the matrix A is logically transposed: the M x K operand op(A) is read from a matrix stored as K x M.
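
In index terms, a transposed read swaps the roles of the two subscripts. The sketch below assumes row-major storage and the usual BLAS convention that the leading dimension always refers to the matrix as stored; both helpers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Element (i, p) of op(A) for a row-major stored matrix.
 * transa == 'N': A is stored m x k, op(A)[i][p] = a[i * lda + p].
 * transa == 'T': A is stored k x m, op(A)[i][p] = a[p * lda + i]. */
static float op_a_at(char transa, const float *a, size_t lda,
                     size_t i, size_t p)
{
    assert(a != NULL && (transa == 'N' || transa == 'T'));
    return (transa == 'N') ? a[i * lda + p] : a[p * lda + i];
}

/* Smallest valid lda for row-major A: the length of a stored row.
 * ASSUMPTION: standard BLAS convention. */
static size_t min_lda_row_major(char transa, size_t m, size_t k)
{
    assert(transa == 'N' || transa == 'T');
    return (transa == 'N') ? k : m;
}
```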

Memory Format Tags (mem_format_a, mem_format_b)

| Value | Meaning | When to use |
|---|---|---|
| 'N' | Normal (plain layout) | Default for all inputs |
| 'P' | Packed | Matrix has been packed for cache optimization |
| 'R' | Reordered | Matrix was reordered via aocl_reorder_*() |

Use 'R' when you have pre-reordered a matrix for repeated use (see Matrix Reordering below).

Alpha and Beta Scalars

| alpha | beta | Effective operation |
|---|---|---|
| 1.0 | 0.0 | C = A * B (fresh computation) |
| 1.0 | 1.0 | C = A * B + C (accumulate) |
| 2.0 | 0.0 | C = 2 * A * B (scaled) |
| 1.0 | -1.0 | C = A * B - C |

For integer GEMM variants, alpha and beta are int32_t.
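
For the integer case, the scaling happens entirely in 32-bit integer arithmetic. The sketch below computes one output element of a u8 x s8 product with s32 accumulation. ref_dot_u8s8s32 is a hypothetical reference helper; it takes a contiguous column of B for simplicity and does not guard against int32_t overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One output element of a u8 x s8 GEMM with s32 accumulation:
 * c = alpha * sum_p(a_row[p] * b_col[p]) + beta * c_old, all in int32_t. */
static int32_t ref_dot_u8s8s32(const uint8_t *a_row, const int8_t *b_col,
                               size_t k, int32_t alpha, int32_t beta,
                               int32_t c_old)
{
    assert(a_row != NULL && b_col != NULL);
    int32_t acc = 0;
    for (size_t p = 0; p < k; ++p)
        acc += (int32_t)a_row[p] * (int32_t)b_col[p];
    return alpha * acc + beta * c_old;
}
```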

Matrix Reordering

For workloads where the same matrix (typically weights) is used in repeated GEMM calls, reordering that matrix into an optimized internal layout can significantly improve performance.
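
To give an intuition for why a reordered layout helps, the sketch below packs a row-major B into column panels so that a kernel can stream through each panel with unit stride. This is a generic panel-packing illustration only: the layout actually produced by aocl_reorder_*() is internal to the library and is not described in this guide, and the panel width NR and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* HYPOTHETICAL panel width; real kernels choose this per ISA. */
enum { NR = 4 };

/* Number of floats needed for the packed copy of a k x n matrix. */
static size_t packed_b_elems(size_t k, size_t n)
{
    return ((n + NR - 1) / NR) * NR * k;
}

/* Copies row-major B (k x n, row stride ldb) into column panels of width
 * NR. Within a panel, the NR values of one row are contiguous. The last
 * panel is zero-padded when n is not a multiple of NR. */
static void pack_b_panels(const float *b, size_t k, size_t n, size_t ldb,
                          float *packed)
{
    assert(b != NULL && packed != NULL && ldb >= n);
    size_t out = 0;
    for (size_t j0 = 0; j0 < n; j0 += NR)
        for (size_t p = 0; p < k; ++p)
            for (size_t jj = 0; jj < NR; ++jj) {
                size_t j = j0 + jj;
                packed[out++] = (j < n) ? b[p * ldb + j] : 0.0f;
            }
}
```

The one-time cost of the copy is amortized when the same B is multiplied against many different A matrices.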

Workflow

// 1. Query required buffer size
msz_t buf_size = aocl_get_reorder_buf_size_f32f32f32of32(
    'R', 'N', 'B', k, n, NULL);

// 2. Allocate the reorder buffer
float *b_reordered = (float *)malloc(buf_size);
if (b_reordered == NULL) {
    // Handle allocation failure before calling the reorder routine
}

// 3. Reorder the matrix
aocl_reorder_f32f32f32of32(
    'R', 'N', 'B',        // order, transpose, matrix type ('B' = matrix B)
    b_original,            // input
    b_reordered,           // output
    k, n, ldb, NULL);

// 4. Use in GEMM with mem_format_b = 'R'
aocl_gemm_f32f32f32of32(
    'R', 'N', 'N', m, n, k,
    1.0f, a, lda, 'N',
    b_reordered, ldb, 'R',  // <-- reordered format
    0.0f, c, ldc, NULL);

// 5. (Optional) Convert back to original layout
aocl_unreorder_f32f32f32of32_reference(
    'R', 'B', b_reordered, b_output, k, n, ldb, NULL);

Reorder Support by Data Type

| Data Type | get_reorder_buf_size | reorder | unreorder |
|---|---|---|---|
| f32f32f32of32 | Yes | Yes | Yes (ref) |
| bf16bf16f32of32 | Yes | Yes | Yes |
| u8s8s32os32 | Yes | Yes | -- |
| s8s8s32os32 | Yes | Yes | Yes (ref) |
| bf16s4f32of32 | Yes | Yes | -- |
| u8s4s32os32 | Yes | Yes | -- |
| f16f16f16of16 | Yes | Yes | Yes |
| f32f16f32of32 | Yes | Yes | -- |
| s8s8s32os32 (sym_quant) | Yes | Yes | -- |
| f32 to bf16 (mixed) | -- | Yes (aocl_reorder_f32obf16) | -- |

Batch GEMM

Batch GEMM processes multiple independent GEMM operations in a single call, enabling better hardware utilization for workloads with many small matrices.

See the dedicated Batch GEMM Guide for the grouping model (group_count / group_size[]), the full availability matrix, reordered B in batch mode, and post-ops indexing. The snippet and table below are a quick teaser.

// Arrays of matrix pointers (one per operation in the batch)
const float *a_array[] = { a0, a1, a2 };
const float *b_array[] = { b0, b1, b2 };
float       *c_array[] = { c0, c1, c2 };

md_t group_size[] = { 3 };  // all 3 operations in one group

aocl_batch_gemm_f32f32f32of32(
    (const char[]){'R','R','R'},    // order per operation
    (const char[]){'N','N','N'},    // transa
    (const char[]){'N','N','N'},    // transb
    (const md_t[]){m, m, m},        // m per operation
    (const md_t[]){n, n, n},        // n
    (const md_t[]){k, k, k},        // k
    (const float[]){1.0f, 1.0f, 1.0f},  // alpha
    a_array, (const md_t[]){lda, lda, lda},
    b_array, (const md_t[]){ldb, ldb, ldb},
    (const float[]){0.0f, 0.0f, 0.0f},  // beta
    c_array, (const md_t[]){ldc, ldc, ldc},
    1,             // group_count
    group_size,    // group_size array
    (const char[]){'N','N','N'},  // mem_format_a
    (const char[]){'N','N','N'},  // mem_format_b
    NULL           // metadata (NULL = no post-ops)
);
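
Since the operations in a batch are independent, the result of a batch call should match running the same GEMMs one after another. The sketch below expresses that reference meaning in plain C. The gemm_op_t struct and all three functions are hypothetical helpers, and the reading of group_size[] as "operations per group, summing to the batch size" is inferred from the example above rather than taken from the API reference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type describing one row-major, no-transpose GEMM.
 * It is not part of AOCL-DLP. */
typedef struct {
    size_t m, n, k;
    float alpha, beta;
    const float *a; size_t lda;
    const float *b; size_t ldb;
    float *c;       size_t ldc;
} gemm_op_t;

/* Unoptimized reference for a single operation. */
static void ref_gemm(const gemm_op_t *op)
{
    for (size_t i = 0; i < op->m; ++i)
        for (size_t j = 0; j < op->n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < op->k; ++p)
                acc += op->a[i * op->lda + p] * op->b[p * op->ldb + j];
            float *cij = &op->c[i * op->ldc + j];
            *cij = op->alpha * acc + op->beta * *cij;
        }
}

/* ASSUMPTION: the batch size is the sum of the per-group sizes. */
static size_t total_ops(const size_t *group_size, size_t group_count)
{
    assert(group_size != NULL);
    size_t total = 0;
    for (size_t g = 0; g < group_count; ++g)
        total += group_size[g];
    return total;
}

/* Reference meaning of a batch: run each independent operation in turn. */
static void ref_batch_gemm(const gemm_op_t *ops, size_t count)
{
    assert(ops != NULL);
    for (size_t i = 0; i < count; ++i)
        ref_gemm(&ops[i]);
}
```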

Batch GEMM Availability

The families below all have batch variants; the full per-output matrix (including f16, f32xf16, bf16xu4, bf16xs8, f32xs8, and sym-quant) lives in the Batch GEMM Guide.

| Data Type Family | Batch GEMM |
|---|---|
| f32 x f32 | Yes |
| bf16 x bf16 (f32 and bf16 output) | Yes |
| bf16 x s4 (f32 and bf16 output) | Yes |
| u8 x s8 (s32, s8, u8, f32, bf16 output) | Yes |
| s8 x s8 (s32, s8, f32, bf16, u8 output) | Yes |

Choosing the Right Variant

By precision need:

| Need | Recommended | Why |
|---|---|---|
| Maximum accuracy | f32f32f32of32 | Full 32-bit precision throughout |
| Good accuracy, less memory | bf16bf16f32of32 | BF16 inputs save memory, f32 accumulation preserves range |
| Quantized inference | u8s8s32os32 or s8s8s32os32 | Integer math is fastest on VNNI hardware |
| Weight-quantized inference | bf16s4f32of32 | BF16 activations with 4-bit weights |
| Half-precision pipeline | f16f16f16of16 | Native FP16 end-to-end (requires AVX512_FP16) |

By hardware:

| Your CPU | Best variants |
|---|---|
| AMD Zen1-3 (AVX2 only) | f32, bf16 (auto-fallback to f32) |
| AMD Zen4 (AVX512, VNNI, BF16) | All variants including native bf16 and integer |
| AMD Zen5 / Zen6 (AVX512_FP16) | All variants including native f16 |
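
The two tables can be combined into a small selection routine. The sketch below is one possible policy, not a library feature: the feature flags are supplied by the caller (how they are detected is out of scope here), every type and function name is hypothetical, and falling back to f32 when the required ISA is missing is a design choice made for this example.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {          /* hypothetical flags, filled in by the caller */
    bool avx512_vnni;
    bool avx512_fp16;
} cpu_features_t;

typedef enum {
    NEED_MAX_ACCURACY,
    NEED_LESS_MEMORY,
    NEED_QUANTIZED,
    NEED_HALF_PIPELINE
} precision_need_t;

typedef enum { V_F32, V_BF16, V_S8, V_F16, V_COUNT } variant_t;

/* Variant suffix as used in the function names. */
static const char *variant_name(variant_t v)
{
    static const char *const names[V_COUNT] = {
        "f32f32f32of32", "bf16bf16f32of32", "s8s8s32os32", "f16f16f16of16"
    };
    assert((int)v >= 0 && (int)v < V_COUNT);
    return names[v];
}

/* Picks the recommended variant when its minimum ISA is present and
 * falls back to f32 otherwise (fallback policy is an ASSUMPTION). */
static variant_t pick_variant(precision_need_t need, cpu_features_t cpu)
{
    switch (need) {
    case NEED_LESS_MEMORY:
        return V_BF16;                          /* AVX2 minimum */
    case NEED_QUANTIZED:
        return cpu.avx512_vnni ? V_S8 : V_F32;  /* needs AVX512_VNNI */
    case NEED_HALF_PIPELINE:
        return cpu.avx512_fp16 ? V_F16 : V_F32; /* needs AVX512_FP16 */
    case NEED_MAX_ACCURACY:
    default:
        return V_F32;
    }
}
```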

Post-Operations

Any GEMM variant can accept a dlp_metadata_t* as the last parameter to apply fused post-operations (BIAS, activations, SCALE, etc.) to the result before writing to C. See the Post-Operations Guide for complete documentation.

Error Handling

GEMM functions validate parameters and report errors via dlp_metadata_t.error_hndl when a metadata pointer is provided:

dlp_metadata_t meta = {0};
aocl_gemm_f32f32f32of32('R', 'N', 'N', m, n, k,
    1.0f, a, lda, 'N', b, ldb, 'N', 0.0f, c, ldc, &meta);

if (meta.error_hndl.error_code != DLP_CLSC_SUCCESS) {
    // Handle error -- see dlp_errors.h for error codes
}

Common error codes (from dlp_clsc_err_t):

| Code | Meaning |
|---|---|
| DLP_CLSC_SUCCESS | Operation completed successfully |
| DLP_CLSC_NULL_POINTER | NULL pointer passed as argument |
| DLP_CLSC_INVALID_MATRIX_DIMENSION | Invalid m, n, or k |
| DLP_CLSC_INVALID_LEADING_DIMENSION | Leading dimension too small |
| DLP_CLSC_INVALID_ORDER | Invalid memory layout character |
| DLP_CLSC_INVALID_TRANSPOSE | Invalid transpose character |
| DLP_CLSC_INVALID_MEMORY_TAG | Invalid mem_format character |
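
Many of these conditions can be caught on the caller's side before the call is made. The sketch below checks the arguments that describe matrix A. It uses its own local result codes rather than dlp_clsc_err_t, treats non-positive dimensions as invalid, and applies the usual BLAS rule for the minimum leading dimension of a plain ('N') input; all of these are assumptions for illustration, and the library's own checks may differ.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {            /* hypothetical local codes, NOT dlp_clsc_err_t */
    ARG_OK = 0,
    ARG_BAD_ORDER,
    ARG_BAD_TRANSPOSE,
    ARG_BAD_MEM_FORMAT,
    ARG_BAD_DIMENSION,
    ARG_BAD_LEADING_DIMENSION
} arg_check_t;

/* Returns 1 if c appears in the NUL-terminated set of allowed characters. */
static int is_one_of(char c, const char *set)
{
    assert(set != NULL);
    for (; *set != '\0'; ++set)
        if (*set == c)
            return 1;
    return 0;
}

/* Validates the arguments describing A, where op(A) is m x k. */
static arg_check_t check_a_args(char order, char transa, char mem_format_a,
                                long m, long k, long lda)
{
    if (!is_one_of(order, "RC"))         return ARG_BAD_ORDER;
    if (!is_one_of(transa, "NT"))        return ARG_BAD_TRANSPOSE;
    if (!is_one_of(mem_format_a, "NPR")) return ARG_BAD_MEM_FORMAT;
    if (m <= 0 || k <= 0)                return ARG_BAD_DIMENSION;

    /* Stored shape of A: m x k without transpose, k x m with it. */
    long rows = (transa == 'N') ? m : k;
    long cols = (transa == 'N') ? k : m;
    long min_ld = (order == 'R') ? cols : rows;
    if (mem_format_a == 'N' && lda < min_ld)
        return ARG_BAD_LEADING_DIMENSION;
    return ARG_OK;
}
```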
