JIT
AOCL-DLP uses Just In Time (JIT) compilation to generate optimized code for specific matrix sizes and data types at runtime. This approach allows the library to produce highly efficient implementations tailored to the exact parameters of the GEMM operations being performed.
- Kernel Generation: When a specific operation is requested, AOCL-DLP analyzes the parameters and generates a tailored kernel optimized for those parameters.
- Caching: Once a kernel is generated, it can be cached for future use, reducing the overhead of recompilation.
- Dynamic Optimization: The JIT compiler can apply various optimization techniques based on the current execution context, such as loop unrolling, vectorization, and more.
- Performance: By generating optimized code on the fly, JIT compilation can significantly improve the performance of operations, especially for non-standard configurations.
- Flexibility: JIT allows for greater flexibility in supporting a wide range of hardware and software configurations without the need for extensive pre-compilation.
- Reduced Latency: For workloads with varying parameters, JIT combined with caching can reduce latency by reusing a previously generated kernel whenever a configuration recurs, instead of regenerating it.
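The generate-once, reuse-afterwards behavior in the list above can be sketched as a small cache keyed on kernel parameters. This is a minimal illustration, not AOCL-DLP's implementation: `KernelKey`, `KernelCache`, and the placeholder `generate` function are hypothetical names, and the "kernel" is an ordinary callable standing in for generated machine code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <tuple>

// Hypothetical parameter set that identifies one specialized kernel.
struct KernelKey {
    int mr;     // micro-tile rows
    int nr;     // micro-tile columns
    int dtype;  // illustrative data-type tag (e.g. 0 = F32)

    // Ordering so the key can be used in std::map.
    bool operator<(const KernelKey& o) const {
        return std::tie(mr, nr, dtype) < std::tie(o.mr, o.nr, o.dtype);
    }
};

// Stand-in for a generated kernel: it only reports its tile area.
using Kernel = std::function<int()>;

class KernelCache {
public:
    // Returns the cached kernel for `key`, generating it on first request.
    const Kernel& get(const KernelKey& key) {
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            assert(key.mr > 0 && key.nr > 0);
            ++generated_;  // counts how often the expensive path runs
            it = cache_.emplace(key, generate(key)).first;
        }
        return it->second;
    }

    std::size_t generated() const { return generated_; }

private:
    // Placeholder for a real code generator (assumption, not the library's).
    static Kernel generate(const KernelKey& key) {
        return [key] { return key.mr * key.nr; };
    }

    std::map<KernelKey, Kernel> cache_;
    std::size_t generated_ = 0;
};
```

Requesting `{16, 64, 0}` twice runs the generator once; a different key such as `{6, 64, 0}` triggers a second generation. A production cache would also need thread safety and a key covering every parameter that affects the emitted code.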
AOCL-DLP leverages the Xbyak JIT assembler (currently v7.36.1) to generate optimized assembly code on the fly. Xbyak provides a high-level C++ interface for writing JIT-compiled code, allowing developers to focus on the algorithm rather than the intricacies of assembly language.
- Parameter Specification: When a GEMM operation is requested, the user specifies the matrix dimensions, data types, and any additional parameters (e.g., post-operations).
- Code Generation: AOCL-DLP uses Xbyak to generate assembly code optimized for the specified parameters. This code is tailored to leverage the specific capabilities of the underlying hardware (e.g., AVX2, AVX512).
- Compilation: The generated instructions are assembled into machine code in an executable memory buffer at runtime.
- Execution: The compiled code is executed to perform the GEMM operation, providing high performance for the specific use case.
- Caching: To avoid the overhead of regenerating code for the same parameters, AOCL-DLP caches the generated code for reuse in future operations with identical parameters.
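The generation and execution steps above can be illustrated without any library at all. The sketch below assumes an x86-64 POSIX system and hand-encodes the bytes for `mov eax, imm32; ret`, which is the kind of output an assembler such as Xbyak would produce from a higher-level description. It is not how AOCL-DLP emits its kernels, and `emit_return_constant` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <sys/mman.h>
#include <vector>

using IntFn = int (*)();

// Emits a tiny function that returns `value` and hands it back as a pointer.
// Returns nullptr if the platform is not x86-64 or executable memory
// cannot be obtained. The buffer is intentionally never unmapped here.
inline IntFn emit_return_constant(std::int32_t value) {
#if defined(__x86_64__)
    // Code generation: opcode 0xB8 (mov eax, imm32), 4 immediate bytes,
    // then 0xC3 (ret).
    std::vector<unsigned char> code = {0xB8, 0, 0, 0, 0, 0xC3};
    assert(code.size() == 6);
    std::memcpy(code.data() + 1, &value, sizeof(value));  // little-endian

    // Place the bytes in a fresh writable mapping.
    void* buf = mmap(nullptr, code.size(), PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return nullptr;
    std::memcpy(buf, code.data(), code.size());

    // Switch the mapping from writable to executable before calling it.
    if (mprotect(buf, code.size(), PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, code.size());
        return nullptr;
    }

    // Execution happens when the caller invokes the returned pointer.
    return reinterpret_cast<IntFn>(buf);
#else
    (void)value;
    return nullptr;
#endif
}
```

Calling `emit_return_constant(42)` yields a function pointer which, when non-null, returns 42 on invocation. In a real JIT the returned pointer is what would be stored in the cache so that later requests with identical parameters skip generation.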
In addition to the GEMM micro-kernels, F32 GEMM also uses JIT-generated pack-B (B-matrix packing) kernels, available for both AVX-512 and AVX2 paths. These complement the existing micro-kernel JIT so that the data-reordering step is also tailored to the runtime parameters and target ISA.
JIT-generated kernels maintain a frame pointer (RBP). This produces unwindable stacks for the runtime-generated code, so profilers such as perf can attribute samples and reconstruct correct call stacks through JIT kernels. This is useful when debugging or profiling the generated code described below.
To dump the JIT-generated code for inspection or debugging purposes:
Method 1: Build flag
cmake -DCMAKE_CXX_FLAGS="-DDLP_DUMP_JIT_CODE" ...
Method 2: Source modification
Add #define DLP_DUMP_JIT_CODE at the top of src/jit/amdzen/amdzen_generator.cc before building.
Dumped files are created in the current working directory with names like:
- jit_kernel_16x64.bin (GEMM kernel for MR=16, NR=64)
- jit_gemv_n1_kernel_16x5.bin (GEMV N=1, MR=16, config index 5)
- jit_gemv_m1_kernel_32x2.bin (GEMV M=1, NR=32, config index 2)
To disassemble:
objdump -D -b binary -m i386:x86-64 jit_kernel_16x64.bin