Library Overview

Vishal edited this page Feb 11, 2026 · 6 revisions

High-level overview of AOCL-DLP architecture, components, and design goals.

Components

  • GEMM kernels and drivers
  • Post-operations framework (metadata-driven)
  • Element-wise utilities
  • Threading and parallelization controls

Data Types

BFloat16 API Behavior

On hardware without native AVX512_BF16 ISA support, AOCL-DLP transparently reroutes BF16 operations to F32 implementations, so callers do not need a separate code path.

Hardware Support:

  • Native BF16: Intel Cooper Lake/Sapphire Rapids+, AMD Zen4+ (uses AVX512_BF16 instructions)
  • F32 Fallback: Automatically activated on:
    • AVX2 machines (uses AVX2 F32 kernels)
    • AVX512 without BF16 support: Intel Skylake, Cascade Lake, Ice Lake (uses AVX512 F32 kernels)
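
The Hardware Support list above amounts to a three-way decision on CPU features. The following is a minimal sketch of that decision in plain C, not AOCL-DLP's actual dispatch code: the names `bf16_path_t`, `cpu_features_t`, `select_bf16_path`, and `probe_cpu_features` are hypothetical and introduced only for illustration. The probe reads CPUID leaf 7 through the GCC/Clang `<cpuid.h>` helper and, for brevity, assumes the OS has enabled the corresponding register state (a production check would also consult XGETBV).

```c
#include <stdbool.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#else
#define HAVE_X86_CPUID 0
#endif

/* Hypothetical names for illustration; not AOCL-DLP identifiers. */
typedef enum {
    BF16_PATH_NATIVE,      /* AVX512_BF16 instructions available */
    BF16_PATH_AVX512_F32,  /* F32 fallback using AVX512 kernels */
    BF16_PATH_AVX2_F32,    /* F32 fallback using AVX2 kernels */
    BF16_PATH_NONE         /* none of the listed ISAs detected */
} bf16_path_t;

typedef struct {
    bool avx2;
    bool avx512f;
    bool avx512_bf16;
} cpu_features_t;

/* Pure decision function mirroring the Hardware Support list. */
bf16_path_t select_bf16_path(cpu_features_t f)
{
    if (f.avx512f && f.avx512_bf16) return BF16_PATH_NATIVE;
    if (f.avx512f)                  return BF16_PATH_AVX512_F32;
    if (f.avx2)                     return BF16_PATH_AVX2_F32;
    return BF16_PATH_NONE;
}

/* Read feature bits from CPUID leaf 7. On non-x86 builds, or with a
 * compiler lacking <cpuid.h>, every flag stays false. */
cpu_features_t probe_cpu_features(void)
{
    cpu_features_t f = { false, false, false };
#if HAVE_X86_CPUID
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        unsigned int max_subleaf = eax;   /* highest valid subleaf of leaf 7 */
        f.avx2    = (ebx >> 5) & 1u;      /* leaf 7.0 EBX bit 5  */
        f.avx512f = (ebx >> 16) & 1u;     /* leaf 7.0 EBX bit 16 */
        if (max_subleaf >= 1 &&
            __get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx)) {
            f.avx512_bf16 = (eax >> 5) & 1u;  /* leaf 7.1 EAX bit 5 */
        }
    }
#endif
    return f;
}
```

Calling `select_bf16_path(probe_cpu_features())` once at startup would classify the current machine. Keeping the decision separate from the probe makes the mapping easy to check against the list without needing each CPU generation on hand.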

Key Points:

  • BF16 API calls work unchanged across all hardware
  • Library performs runtime detection and automatic rerouting
  • When fallback is active, the sequence is: BF16→F32 conversion, F32 computation, then F32→BF16 conversion (if needed)
  • Performance impact on fallback: conversion overhead, plus roughly 2x memory bandwidth usage, since F32 elements occupy 4 bytes versus 2 bytes for BF16
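
To make the fallback sequence concrete, here is a small sketch of the widen, compute, narrow pattern in portable C. It is an illustration only and does not reproduce the library's kernels: `bf16_t`, `bf16_to_f32`, `f32_to_bf16`, and `dot_bf16_via_f32` are hypothetical names. It relies on the fact that a BF16 value is the upper 16 bits of an IEEE-754 binary32 value, and it assumes round-to-nearest-even when narrowing (the rounding mode AOCL-DLP uses is not stated here).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t bf16_t;  /* raw BF16 bit pattern; hypothetical alias */

/* BF16 -> F32: place the 16 bits in the upper half of a binary32 word. */
float bf16_to_f32(bf16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* F32 -> BF16 with round-to-nearest-even; NaN inputs stay NaN. */
bf16_t f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        /* NaN: truncate and force the quiet bit so the result stays NaN. */
        return (bf16_t)((bits >> 16) | 0x0040u);
    }
    /* Add 0x7fff plus the lowest kept bit, so exact ties round to even. */
    uint32_t rounding_bias = 0x7fffu + ((bits >> 16) & 1u);
    return (bf16_t)((bits + rounding_bias) >> 16);
}

/* Fallback-style dot product: widen inputs, accumulate in F32,
 * and narrow the result once at the end. */
bf16_t dot_bf16_via_f32(const bf16_t *a, const bf16_t *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += bf16_to_f32(a[i]) * bf16_to_f32(b[i]);
    }
    return f32_to_bf16(acc);
}
```

The 2x memory figure follows directly from the types: `sizeof(float)` is 4 bytes on typical targets while `sizeof(bf16_t)` is 2, so any buffer materialized in F32 moves twice as many bytes as its BF16 original.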

Call Layers

  1. Prepare data (layouts, leading dimensions)
  2. Optional reordering for repeated use
  3. Configure dlp_metadata_t for fused post-ops
  4. Call the GEMM or element-wise routine
  5. Optionally de-reorder or reorder outputs
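
As a rough illustration of steps 1, 3, and 4, the sketch below is a plain-C reference GEMM that takes explicit leading dimensions and applies post-ops inside the same loop that writes the output. It makes no AOCL-DLP calls: `ref_postops_t` and `ref_gemm_f32` are hypothetical, and `ref_postops_t` is not the layout of `dlp_metadata_t`. It only shows what "leading dimension" and "fused post-op" mean at the data level. Reordering (steps 2 and 5) is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical post-op description for illustration only;
 * this is NOT the layout of dlp_metadata_t. */
typedef struct {
    const float *bias;  /* per-column bias of length n, or NULL */
    bool relu;          /* apply max(x, 0) after the bias */
} ref_postops_t;

/* Row-major reference: C[m x n] = A[m x k] * B[k x n], then post-ops.
 * lda, ldb, and ldc are row strides in elements, so each may exceed the
 * logical row width when a matrix sits inside a larger padded buffer. */
void ref_gemm_f32(size_t m, size_t n, size_t k,
                  const float *a, size_t lda,
                  const float *b, size_t ldb,
                  float *c, size_t ldc,
                  const ref_postops_t *ops)
{
    assert(lda >= k && ldb >= n && ldc >= n);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p) {
                acc += a[i * lda + p] * b[p * ldb + j];
            }
            /* Fused post-ops: applied before the single store to C,
             * so the output is not re-read in a second pass. */
            if (ops != NULL) {
                if (ops->bias != NULL) acc += ops->bias[j];
                if (ops->relu && acc < 0.0f) acc = 0.0f;
            }
            c[i * ldc + j] = acc;
        }
    }
}
```

Passing `NULL` for `ops` yields a bare matrix product. The point of fusing is that bias and activation are applied while the accumulator is still in a register, which is the motivation usually given for metadata-driven post-op frameworks.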

Hardware Features

The library targets AVX2/FMA3, AVX512, AVX512_VNNI, and AVX512_BF16 on supported AMD CPUs.
