GEMM Guide

This guide covers AOCL-DLP's General Matrix Multiplication (GEMM) operations: how to choose the right variant, how the parameters work, and how to get the best performance.

Core Operation

Every GEMM call computes:

C = alpha * op(A) * op(B) + beta * C

where op(X) is either X (no transpose) or X^T (transpose). Optional fused post-operations can be applied to the result via dlp_metadata_t.
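To make the formula concrete, here is a minimal reference sketch in plain C. It is not the library's implementation: `ref_gemm_f32` and `op_at` are hypothetical names, the layout is assumed to be row-major, and no blocking or vectorization is attempted. A loop like this is mainly useful for checking results from an optimized call on small inputs.

```c
#include <assert.h>
#include <stddef.h>

/* Element (i, j) of op(X) for a row-major X with leading dimension ldx. */
static float op_at(const float *x, size_t ldx, char trans, size_t i, size_t j)
{
    return (trans == 'T') ? x[j * ldx + i] : x[i * ldx + j];
}

/* Naive row-major reference: C = alpha * op(A) * op(B) + beta * C.
 * op(A) is m x k, op(B) is k x n, C is m x n. */
void ref_gemm_f32(char transa, char transb,
                  size_t m, size_t n, size_t k,
                  float alpha, const float *a, size_t lda,
                  const float *b, size_t ldb,
                  float beta, float *c, size_t ldc)
{
    assert(transa == 'N' || transa == 'T');
    assert(transb == 'N' || transb == 'T');
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += op_at(a, lda, transa, i, p) * op_at(b, ldb, transb, p, j);
            c[i * ldc + j] = alpha * acc + beta * c[i * ldc + j];
        }
    }
}
```

Note that this sketch always reads C, even when beta is 0, so C must be initialized before the call.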

Naming Convention

Function names follow a consistent pattern:

aocl_gemm_<A_type><B_type><accumulator_type>o<output_type>

Short Name	C Type	Bits	Description
`f32`	`float`	32	Single-precision float
`f16` / `fp16`	`float16`	16	IEEE 754 half-precision
`bf16`	`bfloat16`	16	Brain floating point
`s8`	`int8_t`	8	Signed 8-bit integer
`u8`	`uint8_t`	8	Unsigned 8-bit integer
`s4`	int4 (packed in `int8_t`)	4	Signed 4-bit integer
`u4`	uint4 (packed in `int8_t`)	4	Unsigned 4-bit integer
`s32`	`int32_t`	32	Signed 32-bit integer

Example: aocl_gemm_bf16bf16f32of32 means bfloat16 inputs, float32 accumulation, float32 output.
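As a small illustration of the pattern, the sketch below assembles a function name from the four short type names. `gemm_name` is a hypothetical helper written for this guide, not part of the library, and it does not check that the resulting combination is actually supported.

```c
#include <stdio.h>

/* Compose "aocl_gemm_<A><B><acc>o<out>" into out.
 * Returns the name length, as reported by snprintf. */
int gemm_name(char *out, size_t out_size,
              const char *a_type, const char *b_type,
              const char *acc_type, const char *out_type)
{
    return snprintf(out, out_size, "aocl_gemm_%s%s%so%s",
                    a_type, b_type, acc_type, out_type);
}
```

Calling `gemm_name(buf, sizeof buf, "bf16", "bf16", "f32", "f32")` writes `aocl_gemm_bf16bf16f32of32` into `buf`.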

Supported Data Type Combinations

Float Precision

Input A	Input B	Accumulator	Supported Outputs	Min ISA
f32	f32	f32	f32	AVX2
f16	f16	f16	f16, f32	AVX512_FP16
f32	f16	f32	f32	AVX512

BFloat16 Precision

Input A	Input B	Accumulator	Supported Outputs	Min ISA
bf16	bf16	f32	f32, bf16	AVX2 (*)
bf16	s4	f32	f32, bf16	AVX512
bf16	u4	f32	f32, bf16	AVX512
bf16	s8	s32	s32, f32, bf16, s8, u8	AVX512_VNNI

Float-Quantized Mixed Precision

Input A	Input B	Accumulator	Supported Outputs	Min ISA
f32	s8	s32	s32, f32, bf16, s8, u8	AVX512_VNNI

Integer Quantized

Input A	Input B	Accumulator	Supported Outputs	Min ISA
u8	s8	s32	s32, s8, u8, f32, bf16, f16	AVX512_VNNI
s8	s8	s32	s32, s8, u8, f32, bf16, f16	AVX512_VNNI

Symmetric Quantization

Input A	Input B	Accumulator	Supported Outputs	Min ISA
s8	s8 (sym_quant)	s32	f32, bf16	AVX512_VNNI

(*) BFloat16 operations on hardware without native AVX512_BF16 automatically fall back to float32 kernels with transparent conversion. See Library Overview for details.
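For readers unfamiliar with what such a conversion involves, the sketch below shows the standard relationship between bfloat16 and float32: a bfloat16 value is the upper 16 bits of an IEEE 754 binary32. The function names are hypothetical, and this is not necessarily how the library converts internally; in particular, the round-to-nearest-even narrowing is an assumption.

```c
#include <stdint.h>
#include <string.h>

/* Widen bf16 (raw 16 bits) to float by placing it in the upper half of a binary32. */
float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrow float to bf16 with round-to-nearest-even; NaNs stay NaN. */
uint16_t f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u)      /* NaN: keep sign, set quiet bit */
        return (uint16_t)((bits >> 16) | 0x0040u);
    uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding) >> 16);
}
```

Widening is exact, while narrowing discards 16 mantissa bits, which is why the bf16 variants accumulate in f32.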

Function Parameters

All GEMM functions share a common parameter pattern. Here is aocl_gemm_f32f32f32of32 as the reference:

aocl_gemm_f32f32f32of32(
    const char      order,        // 'R' = row-major, 'C' = column-major
    const char      transa,       // 'N' = no transpose, 'T' = transpose
    const char      transb,       // 'N' = no transpose, 'T' = transpose
    const md_t      m,            // rows of A (and C)
    const md_t      n,            // columns of B (and C)
    const md_t      k,            // columns of A / rows of B
    const float     alpha,        // scalar: C = alpha * A*B + beta*C
    const float*    a,            // pointer to matrix A
    const md_t      lda,          // leading dimension of A
    const char      mem_format_a, // 'N' = normal, 'P' = packed, 'R' = reordered
    const float*    b,            // pointer to matrix B
    const md_t      ldb,          // leading dimension of B
    const char      mem_format_b, // 'N' = normal, 'P' = packed, 'R' = reordered
    const float     beta,         // scalar: C = alpha * A*B + beta*C
    float*          c,            // pointer to matrix C (output)
    const md_t      ldc,          // leading dimension of C
    dlp_metadata_t* metadata      // post-operations (NULL for none)
);

Memory Layout (`order`)

Value	Layout	Data arrangement	Typical use
`'R'`	Row-major	`A[i][j]` stored at `a[i * lda + j]`	C/C++ (most common)
`'C'`	Column-major	`A[i][j]` stored at `a[j * lda + i]`	Fortran, BLAS interop
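The two addressing rules in the table can be expressed as a single helper. This is an illustrative sketch; `elem_offset` is a hypothetical name, not a library function.

```c
#include <assert.h>
#include <stddef.h>

/* Linear offset of logical element (i, j) for the given order
 * and leading dimension ld, following the table above. */
size_t elem_offset(char order, size_t i, size_t j, size_t ld)
{
    assert(order == 'R' || order == 'C');
    return (order == 'R') ? i * ld + j : j * ld + i;
}
```

For a 2 x 3 matrix, element (1, 2) sits at offset 5 in row-major storage with ld = 3 and at offset 5 in column-major storage with ld = 2: in both cases it is the last element of the buffer.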

Leading Dimension (`lda`, `ldb`, `ldc`)

The leading dimension is the stride, in elements, between consecutive rows (row-major) or columns (column-major) in memory. For a row-major matrix with n columns, ld >= n; for a column-major matrix with m rows, ld >= m. Using a leading dimension larger than the minimum allows embedding a matrix within a larger allocation.

Row-major example (m=3, n=4, lda=6):

Memory: [a00 a01 a02 a03 ___ ___ a10 a11 a12 a13 ___ ___ a20 a21 a22 a23 ___ ___]
         |--- lda=6 elements ---|
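The following sketch walks the same kind of strided view and copies it into a dense buffer, which shows how the padding elements are skipped. `copy_view_to_dense` is a hypothetical helper for illustration only; a GEMM call does not need this copy, since it accepts the leading dimension directly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy an m x n row-major view with row stride ld_src (in elements)
 * into a dense m x n buffer whose leading dimension is n. */
void copy_view_to_dense(const float *src, size_t ld_src,
                        size_t m, size_t n, float *dst)
{
    assert(ld_src >= n);  /* a row must fit inside its stride */
    for (size_t i = 0; i < m; ++i)
        memcpy(dst + i * n, src + i * ld_src, n * sizeof *dst);
}
```

With m=3, n=4, and ld_src=6, row i starts at `src + 6 * i` and only its first four elements are copied.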

Transpose Options (`transa`, `transb`)

Value	Meaning	Effect
`'N'`	No transpose	Use matrix as-is
`'T'`	Transpose	Swap rows and columns

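When transa = 'T', matrix A is logically transposed: the M x K operand op(A) is read from a matrix stored as K x M. The same applies to transb and B: the K x N operand op(B) is read from a matrix stored as N x K.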
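Because transposition changes the stored shape, it also changes the smallest leading dimension that makes sense. The sketch below derives that minimum from the layout and transpose flags. `min_leading_dim` is a hypothetical helper that follows the usual BLAS-style convention; the library's own validation may apply additional rules.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest leading dimension for an operand whose op(X) is rows x cols.
 * For A pass (m, k); for B pass (k, n). */
size_t min_leading_dim(char order, char trans, size_t rows, size_t cols)
{
    assert(order == 'R' || order == 'C');
    assert(trans == 'N' || trans == 'T');
    /* Transposition swaps the logical dimensions to give the stored shape. */
    size_t stored_rows = (trans == 'T') ? cols : rows;
    size_t stored_cols = (trans == 'T') ? rows : cols;
    /* Row-major strides over stored columns, column-major over stored rows. */
    return (order == 'R') ? stored_cols : stored_rows;
}
```

For example, a row-major A with m=3 and k=5 needs lda >= 5 when transa = 'N' and lda >= 3 when transa = 'T'.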

Memory Format Tags (`mem_format_a`, `mem_format_b`)

Value	Meaning	When to use
`'N'`	Normal (plain layout)	Default for all inputs
`'P'`	Packed	Matrix has been packed for cache optimization
`'R'`	Reordered	Matrix was reordered via `aocl_reorder_*()`

Use 'R' when you have pre-reordered a matrix for repeated use (see Matrix Reordering below).

Alpha and Beta Scalars

alpha	beta	Effective operation
1.0	0.0	`C = A * B` (fresh computation)
1.0	1.0	`C = A * B + C` (accumulate)
2.0	0.0	`C = 2 * A * B` (scaled)
1.0	-1.0	`C = A * B - C`

For integer GEMM variants, alpha and beta are int32_t.
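A minimal sketch of what integer scaling looks like for a u8 x s8 product with s32 accumulation and output. `ref_gemm_u8s8s32` is a hypothetical reference loop (row-major, no transpose), not the library kernel, and it makes no attempt to detect or saturate on int32 overflow; the library's overflow behavior is not described in this guide.

```c
#include <stddef.h>
#include <stdint.h>

/* Naive row-major reference: C = alpha * A * B + beta * C
 * with u8 A (m x k), s8 B (k x n), and s32 accumulation and output. */
void ref_gemm_u8s8s32(size_t m, size_t n, size_t k,
                      int32_t alpha, const uint8_t *a, size_t lda,
                      const int8_t *b, size_t ldb,
                      int32_t beta, int32_t *c, size_t ldc)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            int32_t acc = 0;
            for (size_t p = 0; p < k; ++p)   /* widen before multiplying */
                acc += (int32_t)a[i * lda + p] * (int32_t)b[p * ldb + j];
            c[i * ldc + j] = alpha * acc + beta * c[i * ldc + j];
        }
    }
}
```

With A = [255, 1], B = [-128, 127]^T, C = [10], alpha = 2, and beta = -1, the accumulator is -32513 and the stored result is -65036.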

Matrix Reordering

For workloads where the same matrix (typically weights) is used in repeated GEMM calls, reordering that matrix into an optimized internal layout can significantly improve performance.

Workflow

// 1. Query required buffer size
msz_t buf_size = aocl_get_reorder_buf_size_f32f32f32of32(
    'R', 'N', 'B', k, n, NULL);

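// 2. Allocate the reorder buffer (buf_size is a byte count)
float *b_reordered = (float *)malloc(buf_size);
if (b_reordered == NULL) { /* handle allocation failure before reordering */ }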

// 3. Reorder the matrix
aocl_reorder_f32f32f32of32(
    'R', 'N', 'B',        // order, transpose, matrix type ('B' = matrix B)
    b_original,            // input
    b_reordered,           // output
    k, n, ldb, NULL);

// 4. Use in GEMM with mem_format_b = 'R'
aocl_gemm_f32f32f32of32(
    'R', 'N', 'N', m, n, k,
    1.0f, a, lda, 'N',
    b_reordered, ldb, 'R',  // <-- reordered format
    0.0f, c, ldc, NULL);

// 5. (Optional) Convert back to original layout
aocl_unreorder_f32f32f32of32_reference(
    'R', 'B', b_reordered, b_output, k, n, ldb, NULL);

Reorder Support by Data Type

Data Type	get_reorder_buf_size	reorder	unreorder
f32f32f32of32	Yes	Yes	Yes (ref)
bf16bf16f32of32	Yes	Yes	Yes
u8s8s32os32	Yes	Yes	--
s8s8s32os32	Yes	Yes	Yes (ref)
bf16s4f32of32	Yes	Yes	--
u8s4s32os32	Yes	Yes	--
f16f16f16of16	Yes	Yes	Yes
f32f16f32of32	Yes	Yes	--
s8s8s32os32 (sym_quant)	Yes	Yes	--
f32 to bf16 (mixed)	--	Yes (`aocl_reorder_f32obf16`)	--

Batch GEMM

Batch GEMM processes multiple independent GEMM operations in a single call, enabling better hardware utilization for workloads with many small matrices.

See the dedicated Batch GEMM Guide for the grouping model (group_count / group_size[]), the full availability matrix, reordered B in batch mode, and post-ops indexing. The snippet and table below are a quick teaser.

// Arrays of matrix pointers (one per operation in the batch)
const float *a_array[] = { a0, a1, a2 };
const float *b_array[] = { b0, b1, b2 };
float       *c_array[] = { c0, c1, c2 };

md_t group_size[] = { 3 };  // all 3 operations in one group

aocl_batch_gemm_f32f32f32of32(
    (const char[]){'R','R','R'},    // order per operation
    (const char[]){'N','N','N'},    // transa
    (const char[]){'N','N','N'},    // transb
    (const md_t[]){m, m, m},        // m per operation
    (const md_t[]){n, n, n},        // n
    (const md_t[]){k, k, k},        // k
    (const float[]){1.0f, 1.0f, 1.0f},  // alpha
    a_array, (const md_t[]){lda, lda, lda},
    b_array, (const md_t[]){ldb, ldb, ldb},
    (const float[]){0.0f, 0.0f, 0.0f},  // beta
    c_array, (const md_t[]){ldc, ldc, ldc},
    1,             // group_count
    group_size,    // group_size array
    (const char[]){'N','N','N'},  // mem_format_a
    (const char[]){'N','N','N'},  // mem_format_b
    NULL           // metadata (NULL = no post-ops)
);

Batch GEMM Availability

The families below all have batch variants; the full per-output matrix (including f16, f32xf16, bf16xu4, bf16xs8, f32xs8, and sym-quant) lives in the Batch GEMM Guide.

Data Type Family	Batch GEMM
f32 x f32	Yes
bf16 x bf16 (f32 and bf16 output)	Yes
bf16 x s4 (f32 and bf16 output)	Yes
u8 x s8 (s32, s8, u8, f32, bf16 output)	Yes
s8 x s8 (s32, s8, f32, bf16, u8 output)	Yes

Choosing the Right Variant

By precision need:

Need	Recommended	Why
Maximum accuracy	`f32f32f32of32`	Full 32-bit precision throughout
Good accuracy, less memory	`bf16bf16f32of32`	BF16 inputs halve input memory versus f32; f32 accumulation limits rounding error
Quantized inference	`u8s8s32os32` or `s8s8s32os32`	Integer math is fastest on VNNI hardware
Weight-quantized inference	`bf16s4f32of32`	BF16 activations with 4-bit weights
Half-precision pipeline	`f16f16f16of16`	Native FP16 end-to-end (requires AVX512_FP16)

By hardware:

Your CPU	Best variants
AMD Zen1-3 (AVX2 only)	f32, bf16 (auto-fallback to f32)
AMD Zen4 (AVX512, VNNI, BF16)	All variants including native bf16 and integer
AMD Zen5 / Zen6 (AVX512_FP16)	All variants including native f16

Post-Operations

Any GEMM variant can accept a dlp_metadata_t* as the last parameter to apply fused post-operations (BIAS, activations, SCALE, etc.) to the result before writing to C. See the Post-Operations Guide for complete documentation.

Error Handling

GEMM functions validate parameters and report errors via dlp_metadata_t.error_hndl when a metadata pointer is provided:

dlp_metadata_t meta = {0};
aocl_gemm_f32f32f32of32('R', 'N', 'N', m, n, k,
    1.0f, a, lda, 'N', b, ldb, 'N', 0.0f, c, ldc, &meta);

if (meta.error_hndl.error_code != DLP_CLSC_SUCCESS) {
    // Handle error -- see dlp_errors.h for error codes
}

Common error codes (from dlp_clsc_err_t):

Code	Meaning
`DLP_CLSC_SUCCESS`	Operation completed successfully
`DLP_CLSC_NULL_POINTER`	NULL pointer passed as argument
`DLP_CLSC_INVALID_MATRIX_DIMENSION`	Invalid m, n, or k
`DLP_CLSC_INVALID_LEADING_DIMENSION`	Leading dimension too small
`DLP_CLSC_INVALID_ORDER`	Invalid memory layout character
`DLP_CLSC_INVALID_TRANSPOSE`	Invalid transpose character
`DLP_CLSC_INVALID_MEMORY_TAG`	Invalid mem_format character
