Unit tests for OPUS (AI Operator Micro Std). Contains a host-only C++ test and GPU device-kernel tests compiled into a shared library (opus_device_test.so) via hipcc and loaded from Python via ctypes.
OPUS headers live under csrc/include/opus/; all kernel code uses #include "opus/opus.hpp".
op_tests/opus/
├── test_opus_basic.cpp # Host-only C++ test (no GPU)
├── build.sh # Builds test_opus_basic
├── device/ # GPU kernel tests (hipcc -> .so, loaded via ctypes)
│ ├── test_mfma_f16.cu # MFMA fp16/bf16 kernels
│ ├── test_mfma_f32.cu # MFMA fp32 kernels
│ ├── test_mfma_f8.cu # MFMA fp8/bf8 kernels
│ ├── test_mxfp.cu # MXFP8/MXFP4 kernels (gfx950 only)
│ ├── test_wmma_f16.cu # WMMA fp16/bf16 kernels (gfx1250 only)
│ ├── test_wmma_f32.cu # WMMA fp32 kernels (gfx1250 only)
│ ├── test_wmma_f8.cu # WMMA fp8/bf8 kernels (gfx1250 only)
│ ├── test_wmma_scale.cu # WMMA scaled f8f6f4/f4 kernels (gfx1250 only)
│ ├── test_mma_step_k.cu # tiled_mma_adaptor::step_k bf16 kernel
│ ├── test_vector_add.cu # Vector addition kernel
│ ├── test_async_load.cu # Async global->LDS->global copy kernel
│ ├── test_tr_load_f16.cu # LDS transpose load kernel (gfx950 only)
│ ├── test_dtype_convert.cu # FP32<->BF16/FP16/FP8/FP4 round-trip kernels
│ ├── test_load_store_if.cu # Predicated load/store + free function API tests
│ ├── test_numeric_limits.cu # opus::numeric_limits kernel
│ ├── test_finfo.cu # opus::finfo kernel
│ ├── test_mdiv.cu # opus::magic_div kernel
│ ├── test_workgroup_barrier.cu# Workgroup barrier kernel
│ ├── setup.py # Parallel hipcc build: 18 .cu -> .o -> .so
│ └── test_opus_device.py # Python test runner (builds .so, runs all tests)
├── run_tests.sh # Runs host test + device tests
└── README.md
Run both host and device tests with a single command:
./run_tests.sh # host test + device tests (requires ROCm + PyTorch)This builds and runs the C++ host test (test_opus_basic), then builds the device .so and runs all GPU kernel tests.
To run them individually:
./build.sh --test # host-only C++ test (no GPU needed)
python3 device/test_opus_device.py # device kernel tests onlyThe device test build applies several techniques from the OPUS best-practices guide
(csrc/include/opus/README.md) to minimize hipcc compile time:
| # | Optimization | Build time | run_tests.sh |
Speedup |
|---|---|---|---|---|
| 0 | Original (PyBind11 ext, sequential hipcc) | 18,400 ms | ~20 s | 1.0x |
| 1 | Replace PyBind11/torch with ctypes + extern "C" |
12,500 ms | ~14 s | 1.5x |
| 2 | Parallel compilation (each .cu -> .o, then link) | 3,950 ms | ~9 s | 4.7x |
| 3 | Split test_mfma.cu into 3 files (f16/f32/f8) |
2,670 ms | ~8 s | 6.9x |
| 4 | Add -D__HIPCC_RTC__ to reduce device-pass header parsing |
2,070 ms | ~7.5 s | 8.9x |
| 5 | Replace <hip/hip_runtime.h> with opus/hip_minimal.hpp |
1,740 ms | ~6.9 s | 10.5x |
-
ctypes instead of PyBind11 — The old
opus_device_test_ext.cpp(490 lines) included<torch/extension.h>and<pybind11/pybind11.h>, adding ~100K lines of C++ template headers to every compilation. Replaced with plainextern "C"host launchers in.cufiles, loaded viactypes.CDLLin Python. -
Parallel compilation —
setup.pycompiles each.cufile to a.oin parallel usingProcessPoolExecutor, then links all.ofiles into a single.so. This turns the build from sequential (sum of all file times) into parallel (bounded by the slowest file). -
MFMA file splitting — The original
test_mfma.cuhad 14 template instantiations and took ~3.9s alone (the parallel build bottleneck). Split intotest_mfma_f16.cu,test_mfma_f32.cu, andtest_mfma_f8.cuto balance the workload across cores. -
-D__HIPCC_RTC__— This flag tells the HIP headers to skip declarations not needed for runtime compilation, reducing preprocessing load on both host and device passes. Saves ~500ms per file on header parsing. -
opus/hip_minimal.hpp— A ~70-line header (csrc/include/opus/hip_minimal.hpp) that declares only the dozen HIP APIs the host launcher code actually uses (dim3,hipLaunchKernelGGL,hipMalloc,hipDeviceSynchronize, etc.), replacing<hip/hip_runtime.h>(~100K preprocessed lines) on the host pass. Combined with-D__HIPCC_RTC__for the device pass, this eliminates nearly all unnecessary header parsing.
Measured on MI355 (gfx950) with ROCm 7.1.1:
Phase Time
──────────────────────────────── ──────
Host build (hipcc test_opus_basic) 738 ms
Host run (13 unit tests) 12 ms
Device .so build (18 .cu, parallel) 625 ms
compile (18 parallel jobs) 599 ms
link (.o -> .so) 25 ms
Device tests (torch import + GPU) 1,800 ms
torch import + .so build ~800 ms
kernel execution (60+ tests) ~1,000 ms
──────────────────────────────── ──────
Total wall clock ~3.2 s
test_async_load.cu 119 ms
test_finfo.cu 124 ms
test_vector_add.cu 125 ms
test_numeric_limits.cu 128 ms
test_workgroup_barrier.cu 139 ms
test_wmma_f16.cu 152 ms
test_wmma_f32.cu 156 ms
test_mdiv.cu 161 ms
test_wmma_f8.cu 165 ms
test_wmma_scale.cu 172 ms
test_load_store_if.cu 187 ms
test_mxfp.cu 214 ms
test_dtype_convert.cu 282 ms
test_mma_step_k.cu 421 ms
test_tr_load_f16.cu 434 ms
test_mfma_f32.cu 522 ms
test_mfma_f8.cu 572 ms
test_mfma_f16.cu 587 ms <-- critical path
link 25 ms
Before opus.hpp compile-time optimizations, the MFMA-heavy files took 750-890ms each (930ms total parallel build). The optimizations reduced these by 1.3-2.1x, bringing the parallel build from 930ms to 625ms (33% faster).
All GPU kernel tests live in device/ and are compiled into opus_device_test.so.
To add a new kernel test (e.g. my_kernel):
Add device/test_my_kernel.cu. The file uses conditional compilation to separate
device and host passes:
// device/test_my_kernel.cu
#ifdef __HIP_DEVICE_COMPILE__
// ── Device pass: use opus.hpp + builtins, no hip_runtime.h ──
#include "opus/opus.hpp"
__global__ void my_kernel(const float* in, float* out, int n) {
int tid = __builtin_amdgcn_workitem_id_x()
+ __builtin_amdgcn_workgroup_id_x() * __builtin_amdgcn_workgroup_size_x();
if (tid < n) out[tid] = in[tid] * 2.0f;
}
#else
// ── Host pass: minimal HIP header for kernel launch API ──
// #include <hip/hip_runtime.h> // replaced by opus/hip_minimal.hpp for faster builds
#include "opus/hip_minimal.hpp"
__global__ void my_kernel(const float* in, float* out, int n);
extern "C" void run_my_kernel(const void* d_in, void* d_out, int n) {
dim3 grid((n + 255) / 256), block(256);
hipLaunchKernelGGL(my_kernel, grid, block, 0, nullptr,
(const float*)d_in, (float*)d_out, n);
hipDeviceSynchronize();
}
#endifAdd the filename to _CU_SOURCES in device/setup.py:
_CU_SOURCES = [
...
"test_my_kernel.cu",
]In device/test_opus_device.py:
-
Add a method to the
OpusDeviceLibclass:def run_my_kernel(self, In, Out, n): fn = self._lib.run_my_kernel fn.restype = None fn.argtypes = [_VP, _VP, _I] fn(self._ptr(In), self._ptr(Out), n)
-
Add a test function and call it from
main():def test_my_kernel(mod): # Create input tensors, call mod.run_my_kernel(...), compare with reference ...
./run_tests.sh| Test | Variant | OPUS APIs exercised | Arch |
|---|---|---|---|
test_mfma_f16 |
32x32x8 fp16/bf16 | make_tiled_mma, mfma_adaptor_swap_ab, partition_layout_a/b/c, make_gmem, cast |
gfx942 |
test_mfma_f16 |
16x16x16 fp16/bf16 | (same as above) | gfx942 |
test_mfma_f16 |
32x32x16 fp16/bf16 | (same, uses base 32x32x8 with K-loop on gfx942; native on gfx950) | gfx942 + gfx950 |
test_mfma_f16 |
16x16x32 fp16/bf16 | (same, uses base 16x16x16 with K-loop on gfx942; native on gfx950) | gfx942 + gfx950 |
test_mfma_f32 |
32x32x2 f32 | make_tiled_mma, partition_layout_a/b/c, make_gmem |
gfx942 |
test_mfma_f32 |
16x16x4 f32 | (same as above) | gfx942 |
test_mfma_f8 |
32x32x16 fp8/bf8 | make_tiled_mma, partition_layout_a/b/c, make_gmem (fp32 output, no cast) |
gfx942 + gfx950 |
test_mfma_f8 |
16x16x32 fp8/bf8 | (same as above) | gfx942 + gfx950 |
test_mxfp |
mxfp8_32x32x64 | mfma<fp8_t,fp8_t,fp32_t,32,32,64> (scaled overload), direct data-layout load/store |
gfx950 |
test_mxfp |
mxfp8_16x16x128 | mfma<fp8_t,fp8_t,fp32_t,16,16,128> (scaled overload) |
gfx950 |
test_mxfp |
mxfp4_32x32x64 | mfma<fp4_t,fp4_t,fp32_t,32,32,64> (scaled overload), fp4x2 packed nibble handling |
gfx950 |
test_mxfp |
mxfp4_16x16x128 | mfma<fp4_t,fp4_t,fp32_t,16,16,128> (scaled overload) |
gfx950 |
test_vector_add |
— | make_gmem, vectorized load<N> / store<N> |
all |
test_async_load |
— | make_gmem, gmem::async_load, s_waitcnt_vmcnt |
all |
test_tr_load_f16 |
32x16 fp16 | tr_load, async_load, make_tiled_mma 32×32×16, partition_layout_b |
gfx950 |
test_dtype_convert |
fp32<->bf16 scalar | cast<bf16_t>(fp32_t) RNE (explicit 0_I on gfx942, hw default on gfx950) |
all |
test_dtype_convert |
fp32<->bf16 x4 vec | cast<bf16_t>(fp32x4_t) generic vectorized |
all |
test_dtype_convert |
fp32<->fp16 scalar | cast<fp16_t>(fp32_t) / cast<fp32_t>(fp16_t) |
all |
test_dtype_convert |
fp32<->fp16 x4 vec | cast<fp16_t>(fp32x4_t) generic vectorized |
all |
test_dtype_convert |
fp32<->fp8 scalar | cast<fp8_t>(fp32_t) via cvt_pk_fp8_f32 lo / cvt_f32_fp8 |
gfx942 + gfx950 |
test_dtype_convert |
fp32<->fp8 x2 pk | cast<fp8_t>(fp32x2_t) packed x2 |
gfx942 + gfx950 |
test_dtype_convert |
fp32<->fp8 x4 pk | cast<fp8_t>(fp32x4_t) packed x4 |
gfx942 + gfx950 |
test_dtype_convert |
fp32<->fp8 x8 fold | cast<fp8_t>(fp32x8_t) auto-fold 2x4 + unfold_from_container |
gfx942 + gfx950 |
test_dtype_convert |
fp32<->fp4 x2 pk | cast<fp4_t>(fp32x2_t) packed x2, e2m1 |
gfx950 |
test_dtype_convert |
fp32<->fp4 x4 pk | cast<fp4_t>(fp32x4_t) packed x4, e2m1 |
gfx950 |
test_dtype_convert |
fp32<->fp4 x8 pk | cast<fp4_t>(fp32x8_t) packed x8, e2m1 |
gfx950 |
test_load_store_if |
predicated_copy | gmem::load_if, gmem::store_if, free functions opus::load_if/opus::store_if, layout_linear::operator+ |
all |
test_load_store_if |
predicated_copy_2d | 2D layout load_if/store_if with multi-index (i_row, i_col) predicate, unfold_x_stride, unfold_p_coord |
all |
test_load_store_if |
free_func_vector_add | Free functions opus::load/opus::store, is_gmem_v/is_mem_v type traits |
all |
test_load_store_if |
predicated_async_load | gmem::async_load_if, free function opus::async_load_if, layout_linear::operator+ |
all |
test_numeric_limits |
all types | opus::numeric_limits<T> for fp32/fp16/bf16/fp8/bf8/i32/i16/i8/u8 |
all |
test_finfo |
all float types | opus::finfo<T> (eps/max/min/tiny/bits) for fp32/fp16/bf16/fp8/bf8/fp4/e8m0 |
all |
test_mdiv |
11 divisors | opus::magic_div integer division by magic multiply |
all |
test_wmma_scale |
16x16x128 fp8 BX32/BX16 | wmma<fp8_t,...,16,16,128> scaled overload, per-lane E8M0 scale |
gfx1250 |
test_wmma_scale |
16x16x128 fp4 BX32/BX16 | wmma<fp4_t,...,16,16,128> scaled overload via f8f6f4 (fmt=4) |
gfx1250 |
test_wmma_scale |
32x16x128 fp4 BX32/BX16 | wmma<fp4_t,...,32,16,128> scaled overload (dedicated f4 inst) |
gfx1250 |
test_wmma_scale |
tiled 16x16x128 fp8 1x1 | make_tiled_mma + wmma_adaptor_swap_ab, scaled, 1 wave |
gfx1250 |
test_wmma_scale |
tiled 16x16x128 fp8 2x2 | make_tiled_mma, 4 waves (32x32 block), scaled |
gfx1250 |
test_wmma_scale |
tiled 16x16x128 fp8 4x1 | make_tiled_mma, 4 waves (64x16 block), scaled |
gfx1250 |
test_wmma_scale |
per-lane scale fp8 | Random E8M0 per m-row/n-col, bitwise exact | gfx1250 |
test_mma_step_k |
32x32x128 bf16 step_k | make_tiled_mma, step_k |
gfx942 + gfx950 |
test_workgroup_barrier |
cumulative + streamk | opus::workgroup_barrier cross-workgroup synchronization |
all |
Total: 60+ test calls (14 MFMA + 4 MXFP + 11 WMMA + 10 WMMA-scale + 1 mma_step_k + 1 vector_add + 1 async_load + 1 tr_load + 11 dtype_convert + 4 load_store_if + 9 numeric_limits + 7 finfo + 11 mdiv + 4 workgroup_barrier).
- The extension compiles with
--offload-arch=<detected>(seedevice/setup.py) to target only the current GPU and speed up builds. - MFMA tests are runtime-gated by GPU architecture (
gcnArchName). Tests for unsupported architectures are automatically skipped.- 32x32x8 and 16x16x16 variants: gfx942 only.
- 32x32x16 and 16x16x32 fp16/bf16 variants: gfx942 (via step-K decomposition) + gfx950 (native instruction).
- 32x32x16 and 16x16x32 fp8/bf8 variants: gfx942 + gfx950 (native instruction on both). Output is raw fp32 accumulator.
- BF16 rounding:
opus::cast<bf16_t>default rounding mode differs by architecture:- gfx942: default is truncation (rm=2). Pass
0_Ias 2nd argument to select round-to-nearest-even (RNE). - gfx950: default is already RNE (hardware). No 2nd argument needed.
The dtype_convert bf16 test and MFMA bf16 tests both use RNE so that the kernel result matches PyTorch
.to(bfloat16).
- gfx942: default is truncation (rm=2). Pass
- FP8 =
float8_e4m3fnuz(gfx942) /float8_e4m3fn(gfx950), BF8 =float8_e5m2fnuz(gfx942) /float8_e5m2(gfx950). - FP4 = E2M1 (4-bit: 1 sign, 2 exponent, 1 mantissa). Representable values: +/-{0, 0.5, 1, 1.5, 2, 3, 4, 6}. gfx950 only.
- MXFP (unified into
struct mfma, scaledoperator()overload): gfx950-only__builtin_amdgcn_mfma_scale_f32_{32x32x64,16x16x128}_f8f6f4intrinsics. Support MXFP8 (fp8fp8) and MXFP4 (fp4fp4) with E8M0 block exponent scaling. Tests usescale=127(2^0=1.0, no scaling) and verifyC = A @ B(standard matmul, not swap_ab). The data layout follows the CDNA4 Matrix Core specification. - WMMA Scale (unified into
struct wmma, BX32/BX16 scaledoperator()overloads): gfx1250-only scaled WMMA intrinsics. Supports 16x16x128 f8f6f4 (fp8 via fmt=0, fp4 via fmt=4) and 32x16x128 dedicated f4. Scale is per-lane E8M0 exponent (scale_sel=0reads from lanes 0-15,=1from lanes 16-31). Tests verify bothscale=127(no scaling) and random per-lane scales in [122..133], plus multi-wave tiled_mma configurations (1x1, 2x2, 4x1). test_opus_device.pydoes a fresh build on every run (cleans previous.so) to ensure changes are picked up.