GEMM Guide
This guide covers everything you need to know about using AOCL-DLP's General Matrix Multiplication (GEMM) operations: how to choose the right variant, how parameters work, and how to get the best performance.
Every GEMM call computes:
C = alpha * op(A) * op(B) + beta * C
Where op(X) is either X (no transpose) or X^T (transpose), and optional fused post-operations can be applied to the result via dlp_metadata_t.
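To make the formula concrete, here is a minimal plain-C reference for the row-major case. It is an illustrative sketch only, not AOCL-DLP code: the names `op_at` and `ref_gemm_f32` are hypothetical, and the choice to skip reading C when beta is zero follows the usual BLAS convention, which is assumed here rather than taken from the library.

```c
#include <assert.h>
#include <stddef.h>

// Element (i, j) of op(X) for a row-major X with leading dimension ldx.
// trans == 'T' reads the stored matrix transposed.
static float op_at(const float *x, size_t ldx, char trans, size_t i, size_t j)
{
    return (trans == 'T') ? x[j * ldx + i] : x[i * ldx + j];
}

// Naive row-major reference (hypothetical helper):
//   C = alpha * op(A) * op(B) + beta * C
// op(A) is m x k, op(B) is k x n, C is m x n.
static void ref_gemm_f32(char transa, char transb,
                         size_t m, size_t n, size_t k,
                         float alpha, const float *a, size_t lda,
                         const float *b, size_t ldb,
                         float beta, float *c, size_t ldc)
{
    assert(transa == 'N' || transa == 'T');
    assert(transb == 'N' || transb == 'T');
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += op_at(a, lda, transa, i, p) * op_at(b, ldb, transb, p, j);
            // ASSUMPTION: with beta == 0 the old C is not read (BLAS convention),
            // so uninitialized or NaN values in C cannot leak into the result.
            float old = (beta == 0.0f) ? 0.0f : beta * c[i * ldc + j];
            c[i * ldc + j] = alpha * acc + old;
        }
    }
}
```

A triple loop like this is useful as a correctness oracle when validating results from an optimized kernel on small inputs; it is not meant to be fast.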
Function names follow a consistent pattern:
aocl_gemm_<A_type><B_type><accumulator_type>o<output_type>
| Short Name | C Type | Bits | Description |
|---|---|---|---|
| f32 | float | 32 | Single-precision float |
| f16 / fp16 | float16 | 16 | IEEE 754 half-precision |
| bf16 | bfloat16 | 16 | Brain floating point |
| s8 | int8_t | 8 | Signed 8-bit integer |
| u8 | uint8_t | 8 | Unsigned 8-bit integer |
| s4 | int4 (packed in int8_t) | 4 | Signed 4-bit integer |
| u4 | uint4 (packed in int8_t) | 4 | Unsigned 4-bit integer |
| s32 | int32_t | 32 | Signed 32-bit integer |
Example: aocl_gemm_bf16bf16f32of32 means bfloat16 inputs, float32 accumulation, float32 output.
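Since the 4-bit types are stored two per `int8_t`, it can help to see what packing and unpacking look like. The sketch below is hypothetical: the helper names are invented for illustration, and the nibble order (even index in the low nibble, odd index in the high nibble) is an assumption, not something this guide specifies. Check the library headers for the layout AOCL-DLP actually expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Sign-extend a 4-bit two's-complement value (0..15) to the range -8..7.
static int8_t s4_sign_extend(uint8_t nibble)
{
    nibble &= 0x0F;
    return (int8_t)((nibble ^ 0x08) - 0x08);
}

// Read element idx from a packed s4 buffer.
// ASSUMPTION: even idx lives in the low nibble, odd idx in the high nibble.
static int8_t s4_get(const int8_t *packed, size_t idx)
{
    assert(packed != NULL);
    uint8_t byte = (uint8_t)packed[idx / 2];
    uint8_t nibble = (idx % 2 == 0) ? (uint8_t)(byte & 0x0F) : (uint8_t)(byte >> 4);
    return s4_sign_extend(nibble);
}

// Write a value in -8..7 to element idx, leaving the neighbouring nibble intact.
static void s4_set(int8_t *packed, size_t idx, int8_t value)
{
    assert(packed != NULL);
    assert(value >= -8 && value <= 7);
    uint8_t byte = (uint8_t)packed[idx / 2];
    uint8_t nibble = (uint8_t)((uint8_t)value & 0x0F);
    if (idx % 2 == 0)
        byte = (uint8_t)((byte & 0xF0) | nibble);
    else
        byte = (uint8_t)((byte & 0x0F) | (nibble << 4));
    packed[idx / 2] = (int8_t)byte;
}
```

With this layout, a buffer of n 4-bit elements occupies (n + 1) / 2 bytes.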
| Input A | Input B | Accumulator | Supported Outputs | Min ISA |
|---|---|---|---|---|
| f32 | f32 | f32 | f32 | AVX2 |
| f16 | f16 | f16 | f16, f32 | AVX512_FP16 |
| f32 | f16 | f32 | f32 | AVX512 |
| Input A | Input B | Accumulator | Supported Outputs | Min ISA |
|---|---|---|---|---|
| bf16 | bf16 | f32 | f32, bf16 | AVX2 (*) |
| bf16 | s4 | f32 | f32, bf16 | AVX512 |
| bf16 | u4 | f32 | f32, bf16 | AVX512 |
| bf16 | s8 | s32 | s32, f32, bf16, s8, u8 | AVX512_VNNI |
| Input A | Input B | Accumulator | Supported Outputs | Min ISA |
|---|---|---|---|---|
| f32 | s8 | s32 | s32, f32, bf16, s8, u8 | AVX512_VNNI |
| Input A | Input B | Accumulator | Supported Outputs | Min ISA |
|---|---|---|---|---|
| u8 | s8 | s32 | s32, s8, u8, f32, bf16, f16 | AVX512_VNNI |
| s8 | s8 | s32 | s32, s8, u8, f32, bf16, f16 | AVX512_VNNI |
| Input A | Input B | Accumulator | Supported Outputs | Min ISA |
|---|---|---|---|---|
| s8 | s8 (sym_quant) | s32 | f32, bf16 | AVX512_VNNI |
(*) BFloat16 operations on hardware without native AVX512_BF16 automatically fall back to float32 kernels with transparent conversion. See Library Overview for details.
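For intuition about what a bf16/f32 conversion involves, the following sketch shows the standard bit-level relationship: bfloat16 is the upper 16 bits of an IEEE 754 float32. The helper names are hypothetical, and the round-to-nearest-even policy is an assumption chosen for illustration; the guide does not state which rounding the library's fallback path uses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

// Widen bf16 to f32: the 16 stored bits become the top half of the float.
static float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

// Narrow f32 to bf16 with round-to-nearest-even (ASSUMED rounding mode).
static uint16_t f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u)       // NaN: keep it a quiet NaN
        return (uint16_t)((bits >> 16) | 0x0040u);
    uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding) >> 16);
}
```

Widening is exact, while narrowing discards 16 mantissa bits. This is why the bf16 variants accumulate in f32.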
All GEMM functions share a common parameter pattern. Here is aocl_gemm_f32f32f32of32 as the reference:
aocl_gemm_f32f32f32of32(
const char order, // 'R' = row-major, 'C' = column-major
const char transa, // 'N' = no transpose, 'T' = transpose
const char transb, // 'N' = no transpose, 'T' = transpose
const md_t m, // rows of A (and C)
const md_t n, // columns of B (and C)
const md_t k, // columns of A / rows of B
const float alpha, // scalar: C = alpha * A*B + beta*C
const float* a, // pointer to matrix A
const md_t lda, // leading dimension of A
const char mem_format_a, // 'N' = normal, 'P' = packed, 'R' = reordered
const float* b, // pointer to matrix B
const md_t ldb, // leading dimension of B
const char mem_format_b, // 'N' = normal, 'P' = packed, 'R' = reordered
const float beta, // scalar: C = alpha * A*B + beta*C
float* c, // pointer to matrix C (output)
const md_t ldc, // leading dimension of C
dlp_metadata_t* metadata // post-operations (NULL for none)
);

| Value | Layout | Data arrangement | Typical use |
|---|---|---|---|
| 'R' | Row-major | A[i][j] stored at a[i * lda + j] | C/C++ (most common) |
| 'C' | Column-major | A[i][j] stored at a[j * lda + i] | Fortran, BLAS interop |
The leading dimension is the stride, in elements, between consecutive rows (row-major) or columns (column-major) in memory. For a row-major matrix with n columns, ld >= n; for a column-major matrix with m rows, ld >= m. Using a leading dimension larger than the minimum allows embedding a matrix within a larger allocation.
Row-major example (m=3, n=4, lda=6):
Memory: [a00 a01 a02 a03 ___ ___ a10 a11 a12 a13 ___ ___ a20 a21 a22 a23 ___ ___]
|--- lda=6 elements ---|
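The indexing rule behind the diagram can be written as a small helper. This is a plain-C sketch with hypothetical names (`elem_offset`, `copy_submatrix_row_major`); it only illustrates the address arithmetic for the two layouts and is not part of the library.

```c
#include <assert.h>
#include <stddef.h>

// Offset of logical element (i, j) for the given layout and leading dimension.
static size_t elem_offset(char order, size_t i, size_t j, size_t ld)
{
    assert(order == 'R' || order == 'C');
    return (order == 'R') ? i * ld + j : j * ld + i;
}

// Copy an m x n row-major view with stride ld_src into a dense m x n buffer
// (dense means the destination leading dimension equals n).
static void copy_submatrix_row_major(const float *src, size_t ld_src,
                                     size_t m, size_t n, float *dst)
{
    assert(ld_src >= n);
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            dst[i * n + j] = src[elem_offset('R', i, j, ld_src)];
}
```

In the m=3, n=4, lda=6 example above, element a12 sits at offset 1 * 6 + 2 = 8, and the two padding slots at the end of each row are never touched.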
| Value | Meaning | Effect |
|---|---|---|
| 'N' | No transpose | Use matrix as-is |
| 'T' | Transpose | Swap rows and columns |
When transa = 'T', the matrix A is logically transposed: the M x K operand op(A) is read from a matrix stored as K x M.
| Value | Meaning | When to use |
|---|---|---|
| 'N' | Normal (plain layout) | Default for all inputs |
| 'P' | Packed | Matrix has been packed for cache optimization |
| 'R' | Reordered | Matrix was reordered via aocl_reorder_*() |
Use 'R' when you have pre-reordered a matrix for repeated use (see Matrix Reordering below).
| alpha | beta | Effective operation |
|---|---|---|
| 1.0 | 0.0 | C = A * B (fresh computation) |
| 1.0 | 1.0 | C = A * B + C (accumulate) |
| 2.0 | 0.0 | C = 2 * A * B (scaled) |
| 1.0 | -1.0 | C = A * B - C |
For integer GEMM variants, alpha and beta are int32_t.
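As an illustration of how integer scalars combine with a 32-bit accumulator, here is a plain-C reference for the u8 x s8 case with s32 accumulation and output. It is a sketch under stated assumptions: `ref_gemm_u8s8s32` is a hypothetical name, only the row-major, no-transpose case is handled, and overflow of the 32-bit accumulator is not guarded against.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Row-major, no-transpose reference (hypothetical helper):
//   C = alpha * A * B + beta * C
// A is m x k (uint8_t), B is k x n (int8_t), C is m x n (int32_t).
static void ref_gemm_u8s8s32(size_t m, size_t n, size_t k,
                             int32_t alpha, const uint8_t *a, size_t lda,
                             const int8_t *b, size_t ldb,
                             int32_t beta, int32_t *c, size_t ldc)
{
    assert(lda >= k && ldb >= n && ldc >= n);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            int32_t acc = 0;
            // Each product fits easily in 32 bits: |u8 * s8| <= 255 * 128.
            for (size_t p = 0; p < k; ++p)
                acc += (int32_t)a[i * lda + p] * (int32_t)b[p * ldb + j];
            int32_t old = (beta == 0) ? 0 : beta * c[i * ldc + j];
            c[i * ldc + j] = alpha * acc + old;
        }
    }
}
```

Because alpha and beta are integers here, any fractional scaling (for example dequantization) has to happen elsewhere, such as in a post-operation.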
For workloads where the same matrix (typically weights) is used in repeated GEMM calls, reordering that matrix into an optimized internal layout can significantly improve performance.
// 1. Query required buffer size
msz_t buf_size = aocl_get_reorder_buf_size_f32f32f32of32(
'R', 'N', 'B', k, n, NULL);
// 2. Allocate the reorder buffer
float *b_reordered = (float *)malloc(buf_size);
// 3. Reorder the matrix
aocl_reorder_f32f32f32of32(
'R', 'N', 'B', // order, transpose, matrix type ('B' = matrix B)
b_original, // input
b_reordered, // output
k, n, ldb, NULL);
// 4. Use in GEMM with mem_format_b = 'R'
aocl_gemm_f32f32f32of32(
'R', 'N', 'N', m, n, k,
1.0f, a, lda, 'N',
b_reordered, ldb, 'R', // <-- reordered format
0.0f, c, ldc, NULL);
// 5. (Optional) Convert back to original layout
aocl_unreorder_f32f32f32of32_reference(
'R', 'B', b_reordered, b_output, k, n, ldb, NULL);

| Data Type | get_reorder_buf_size | reorder | unreorder |
|---|---|---|---|
| f32f32f32of32 | Yes | Yes | Yes (ref) |
| bf16bf16f32of32 | Yes | Yes | Yes |
| u8s8s32os32 | Yes | Yes | -- |
| s8s8s32os32 | Yes | Yes | Yes (ref) |
| bf16s4f32of32 | Yes | Yes | -- |
| u8s4s32os32 | Yes | Yes | -- |
| f16f16f16of16 | Yes | Yes | Yes |
| f32f16f32of32 | Yes | Yes | -- |
| s8s8s32os32 (sym_quant) | Yes | Yes | -- |
| f32 to bf16 (mixed) | -- | Yes (aocl_reorder_f32obf16) | -- |
Batch GEMM processes multiple independent GEMM operations in a single call, enabling better hardware utilization for workloads with many small matrices.
See the dedicated Batch GEMM Guide for the grouping model (group_count/group_size[]), the full availability matrix, reordered B in batch mode, and post-ops indexing. The snippet and table below are a quick teaser.
// Arrays of matrix pointers (one per operation in the batch)
const float *a_array[] = { a0, a1, a2 };
const float *b_array[] = { b0, b1, b2 };
float *c_array[] = { c0, c1, c2 };
md_t group_size[] = { 3 }; // all 3 operations in one group
aocl_batch_gemm_f32f32f32of32(
(const char[]){'R','R','R'}, // order per operation
(const char[]){'N','N','N'}, // transa
(const char[]){'N','N','N'}, // transb
(const md_t[]){m, m, m}, // m per operation
(const md_t[]){n, n, n}, // n
(const md_t[]){k, k, k}, // k
(const float[]){1.0f, 1.0f, 1.0f}, // alpha
a_array, (const md_t[]){lda, lda, lda},
b_array, (const md_t[]){ldb, ldb, ldb},
(const float[]){0.0f, 0.0f, 0.0f}, // beta
c_array, (const md_t[]){ldc, ldc, ldc},
1, // group_count
group_size, // group_size array
(const char[]){'N','N','N'}, // mem_format_a
(const char[]){'N','N','N'}, // mem_format_b
NULL // metadata (NULL = no post-ops)
);

The families below all have batch variants; the full per-output matrix (including f16, f32xf16, bf16xu4, bf16xs8, f32xs8, and sym-quant) lives in the Batch GEMM Guide.
| Data Type Family | Batch GEMM |
|---|---|
| f32 x f32 | Yes |
| bf16 x bf16 (f32 and bf16 output) | Yes |
| bf16 x s4 (f32 and bf16 output) | Yes |
| u8 x s8 (s32, s8, u8, f32, bf16 output) | Yes |
| s8 x s8 (s32, s8, f32, bf16, u8 output) | Yes |
By precision need:
| Need | Recommended | Why |
|---|---|---|
| Maximum accuracy | f32f32f32of32 | Full 32-bit precision throughout |
| Good accuracy, less memory | bf16bf16f32of32 | BF16 inputs save memory, f32 accumulation preserves range |
| Quantized inference | u8s8s32os32 or s8s8s32os32 | Integer math is fastest on VNNI hardware |
| Weight-quantized inference | bf16s4f32of32 | BF16 activations with 4-bit weights |
| Half-precision pipeline | f16f16f16of16 | Native FP16 end-to-end (requires AVX512_FP16) |
By hardware:
| Your CPU | Best variants |
|---|---|
| AMD Zen1-3 (AVX2 only) | f32, bf16 (auto-fallback to f32) |
| AMD Zen4 (AVX512, VNNI, BF16) | All variants including native bf16 and integer |
| AMD Zen5 / Zen6 (AVX512_FP16) | All variants including native f16 |
Any GEMM variant can accept a dlp_metadata_t* as the last parameter to apply fused post-operations (BIAS, activations, SCALE, etc.) to the result before writing to C. See the Post-Operations Guide for complete documentation.
GEMM functions validate parameters and report errors via dlp_metadata_t.error_hndl when a metadata pointer is provided:
dlp_metadata_t meta = {0};
aocl_gemm_f32f32f32of32('R', 'N', 'N', m, n, k,
1.0f, a, lda, 'N', b, ldb, 'N', 0.0f, c, ldc, &meta);
if (meta.error_hndl.error_code != DLP_CLSC_SUCCESS) {
// Handle error -- see dlp_errors.h for error codes
}

Common error codes (from dlp_clsc_err_t):
| Code | Meaning |
|---|---|
| DLP_CLSC_SUCCESS | Operation completed successfully |
| DLP_CLSC_NULL_POINTER | NULL pointer passed as argument |
| DLP_CLSC_INVALID_MATRIX_DIMENSION | Invalid m, n, or k |
| DLP_CLSC_INVALID_LEADING_DIMENSION | Leading dimension too small |
| DLP_CLSC_INVALID_ORDER | Invalid memory layout character |
| DLP_CLSC_INVALID_TRANSPOSE | Invalid transpose character |
| DLP_CLSC_INVALID_MEMORY_TAG | Invalid mem_format character |
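To catch the most common mistakes before calling into the library, a caller can run its own pre-flight check. The sketch below is hypothetical throughout: the enum and `check_operand` are invented names, not AOCL-DLP symbols, and the rules (dimensions must be non-negative, leading dimension at least max(1, stored row length)) follow the usual BLAS conventions, which may differ from the library's exact validation.

```c
#include <stdint.h>

// Hypothetical result codes for this sketch; NOT the library's dlp_clsc_err_t.
typedef enum {
    CHECK_OK = 0,
    CHECK_BAD_ORDER,
    CHECK_BAD_TRANSPOSE,
    CHECK_BAD_DIMENSION,
    CHECK_BAD_LEADING_DIMENSION
} check_result_t;

// Validate one operand. rows x cols is the shape of op(X); the stored matrix
// has the swapped shape when trans == 'T'.
static check_result_t check_operand(char order, char trans,
                                    int64_t rows, int64_t cols, int64_t ld)
{
    if (order != 'R' && order != 'C') return CHECK_BAD_ORDER;
    if (trans != 'N' && trans != 'T') return CHECK_BAD_TRANSPOSE;
    if (rows < 0 || cols < 0) return CHECK_BAD_DIMENSION;

    int64_t stored_rows = (trans == 'T') ? cols : rows;
    int64_t stored_cols = (trans == 'T') ? rows : cols;
    // Row-major strides over a stored row, column-major over a stored column.
    int64_t min_ld = (order == 'R') ? stored_cols : stored_rows;
    if (min_ld < 1) min_ld = 1;  // ASSUMPTION: BLAS-style ld >= max(1, ...)
    return (ld >= min_ld) ? CHECK_OK : CHECK_BAD_LEADING_DIMENSION;
}
```

For a GEMM call, A would be checked as an m x k operand, B as k x n, and C as m x n with trans fixed to 'N'.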
- Post-Operations Guide -- Fusing BIAS, activations, and scaling with GEMM
- Quantization Guide -- Symmetric quantization and mixed-precision workflows
- Performance Guide -- Threading, NUMA, and memory optimization
- Examples & Tutorials -- Working code examples
- API Reference -- Generated API documentation