Name	Name	Last commit message	Last commit date
parent directory ..
device	device
README.md	README.md
build.sh	build.sh
run_tests.sh	run_tests.sh
test_opus_basic.cpp	test_opus_basic.cpp

OPUS Tests

Unit tests for OPUS (AI Operator Micro Std). Contains a host-only C++ test and GPU device-kernel tests compiled into a shared library (opus_device_test.so) via hipcc and loaded from Python via ctypes.

OPUS headers live under csrc/include/opus/; all kernel code uses #include "opus/opus.hpp".

Folder structure

op_tests/opus/
├── test_opus_basic.cpp          # Host-only C++ test (no GPU)
├── build.sh                     # Builds test_opus_basic
├── device/                      # GPU kernel tests (hipcc -> .so, loaded via ctypes)
│   ├── test_mfma_f16.cu         # MFMA fp16/bf16 kernels
│   ├── test_mfma_f32.cu         # MFMA fp32 kernels
│   ├── test_mfma_f8.cu          # MFMA fp8/bf8 kernels
│   ├── test_mxfp.cu             # MXFP8/MXFP4 kernels (gfx950 only)
│   ├── test_wmma_f16.cu         # WMMA fp16/bf16 kernels (gfx1250 only)
│   ├── test_wmma_f32.cu         # WMMA fp32 kernels (gfx1250 only)
│   ├── test_wmma_f8.cu          # WMMA fp8/bf8 kernels (gfx1250 only)
│   ├── test_wmma_scale.cu       # WMMA scaled f8f6f4/f4 kernels (gfx1250 only)
│   ├── test_mma_step_k.cu       # tiled_mma_adaptor::step_k bf16 kernel
│   ├── test_vector_add.cu       # Vector addition kernel
│   ├── test_async_load.cu       # Async global->LDS->global copy kernel
│   ├── test_tr_load_f16.cu      # LDS transpose load kernel (gfx950 only)
│   ├── test_dtype_convert.cu    # FP32<->BF16/FP16/FP8/FP4 round-trip kernels
│   ├── test_load_store_if.cu    # Predicated load/store + free function API tests
│   ├── test_numeric_limits.cu   # opus::numeric_limits kernel
│   ├── test_finfo.cu            # opus::finfo kernel
│   ├── test_mdiv.cu             # opus::magic_div kernel
│   ├── test_workgroup_barrier.cu# Workgroup barrier kernel
│   ├── setup.py                 # Parallel hipcc build: 18 .cu -> .o -> .so
│   └── test_opus_device.py      # Python test runner (builds .so, runs all tests)
├── run_tests.sh                 # Runs host test + device tests
└── README.md

Running tests

Run both host and device tests with a single command:

./run_tests.sh   # host test + device tests (requires ROCm + PyTorch)

This builds and runs the C++ host test (test_opus_basic), then builds the device .so and runs all GPU kernel tests.

To run them individually:

./build.sh --test                  # host-only C++ test (no GPU needed)
python3 device/test_opus_device.py # device kernel tests only

Build-time optimizations

The device test build applies several techniques from the OPUS best-practices guide (csrc/include/opus/README.md) to minimize hipcc compile time:

#	Optimization	Build time	`run_tests.sh`	Speedup
0	Original (PyBind11 ext, sequential hipcc)	18,400 ms	~20 s	1.0x
1	Replace PyBind11/torch with ctypes + `extern "C"`	12,500 ms	~14 s	1.5x
2	Parallel compilation (each .cu -> .o, then link)	3,950 ms	~9 s	4.7x
3	Split `test_mfma.cu` into 3 files (f16/f32/f8)	2,670 ms	~8 s	6.9x
4	Add `-D__HIPCC_RTC__` to reduce device-pass header parsing	2,070 ms	~7.5 s	8.9x
5	Replace `<hip/hip_runtime.h>` with `opus/hip_minimal.hpp`	1,740 ms	~6.9 s	10.5x

What each optimization does

ctypes instead of PyBind11 — The old opus_device_test_ext.cpp (490 lines) included <torch/extension.h> and <pybind11/pybind11.h>, adding ~100K lines of C++ template headers to every compilation. Replaced with plain extern "C" host launchers in .cu files, loaded via ctypes.CDLL in Python.
Parallel compilation — setup.py compiles each .cu file to a .o in parallel using ProcessPoolExecutor, then links all .o files into a single .so. This turns the build from sequential (sum of all file times) into parallel (bounded by the slowest file).
MFMA file splitting — The original test_mfma.cu had 14 template instantiations and took ~3.9s alone (the parallel build bottleneck). Split into test_mfma_f16.cu, test_mfma_f32.cu, and test_mfma_f8.cu to balance the workload across cores.
-D__HIPCC_RTC__ — This flag tells the HIP headers to skip declarations not needed for runtime compilation, reducing preprocessing load on both host and device passes. Saves ~500ms per file on header parsing.
opus/hip_minimal.hpp — A ~70-line header (csrc/include/opus/hip_minimal.hpp) that declares only the dozen HIP APIs the host launcher code actually uses (dim3, hipLaunchKernelGGL, hipMalloc, hipDeviceSynchronize, etc.), replacing <hip/hip_runtime.h> (~100K preprocessed lines) on the host pass. Combined with -D__HIPCC_RTC__ for the device pass, this eliminates nearly all unnecessary header parsing.

Current time breakdown (`time ./run_tests.sh`)

Measured on MI355 (gfx950) with ROCm 7.1.1:

Phase                              Time
────────────────────────────────  ──────
Host build (hipcc test_opus_basic)    738 ms
Host run (13 unit tests)               12 ms
Device .so build (18 .cu, parallel)   625 ms
  compile (18 parallel jobs)           599 ms
  link (.o -> .so)                      25 ms
Device tests (torch import + GPU)   1,800 ms
  torch import + .so build              ~800 ms
  kernel execution (60+ tests)        ~1,000 ms
────────────────────────────────  ──────
Total wall clock                    ~3.2 s

Per-file device compile times (MI355 / gfx950, 18 parallel jobs)

test_async_load.cu         119 ms
test_finfo.cu              124 ms
test_vector_add.cu         125 ms
test_numeric_limits.cu     128 ms
test_workgroup_barrier.cu  139 ms
test_wmma_f16.cu           152 ms
test_wmma_f32.cu           156 ms
test_mdiv.cu               161 ms
test_wmma_f8.cu            165 ms
test_wmma_scale.cu         172 ms
test_load_store_if.cu      187 ms
test_mxfp.cu               214 ms
test_dtype_convert.cu      282 ms
test_mma_step_k.cu         421 ms
test_tr_load_f16.cu        434 ms
test_mfma_f32.cu           522 ms
test_mfma_f8.cu            572 ms
test_mfma_f16.cu           587 ms  <-- critical path
link                        25 ms

Before opus.hpp compile-time optimizations, the MFMA-heavy files took 750-890ms each (930ms total parallel build). The optimizations reduced these by 1.3-2.1x, bringing the parallel build from 930ms to 625ms (33% faster).

How to add a new device test

All GPU kernel tests live in device/ and are compiled into opus_device_test.so. To add a new kernel test (e.g. my_kernel):

1. Create the kernel source

Add device/test_my_kernel.cu. The file uses conditional compilation to separate device and host passes:

// device/test_my_kernel.cu
#ifdef __HIP_DEVICE_COMPILE__
// ── Device pass: use opus.hpp + builtins, no hip_runtime.h ──
#include "opus/opus.hpp"

__global__ void my_kernel(const float* in, float* out, int n) {
    int tid = __builtin_amdgcn_workitem_id_x()
            + __builtin_amdgcn_workgroup_id_x() * __builtin_amdgcn_workgroup_size_x();
    if (tid < n) out[tid] = in[tid] * 2.0f;
}

#else
// ── Host pass: minimal HIP header for kernel launch API ──
// #include <hip/hip_runtime.h>   // replaced by opus/hip_minimal.hpp for faster builds
#include "opus/hip_minimal.hpp"

__global__ void my_kernel(const float* in, float* out, int n);

extern "C" void run_my_kernel(const void* d_in, void* d_out, int n) {
    dim3 grid((n + 255) / 256), block(256);
    hipLaunchKernelGGL(my_kernel, grid, block, 0, nullptr,
                       (const float*)d_in, (float*)d_out, n);
    hipDeviceSynchronize();
}
#endif

2. Register the source in setup.py

Add the filename to _CU_SOURCES in device/setup.py:

_CU_SOURCES = [
    ...
    "test_my_kernel.cu",
]

3. Add the ctypes wrapper and Python test

In device/test_opus_device.py:

Add a method to the OpusDeviceLib class:

def run_my_kernel(self, In, Out, n):
    fn = self._lib.run_my_kernel
    fn.restype = None
    fn.argtypes = [_VP, _VP, _I]
    fn(self._ptr(In), self._ptr(Out), n)

Add a test function and call it from main():

def test_my_kernel(mod):
    # Create input tensors, call mod.run_my_kernel(...), compare with reference
    ...

4. Verify

./run_tests.sh

Device test summary

Test	Variant	OPUS APIs exercised	Arch
`test_mfma_f16`	32x32x8 fp16/bf16	`make_tiled_mma`, `mfma_adaptor_swap_ab`, `partition_layout_a/b/c`, `make_gmem`, `cast`	gfx942
`test_mfma_f16`	16x16x16 fp16/bf16	(same as above)	gfx942
`test_mfma_f16`	32x32x16 fp16/bf16	(same, uses base 32x32x8 with K-loop on gfx942; native on gfx950)	gfx942 + gfx950
`test_mfma_f16`	16x16x32 fp16/bf16	(same, uses base 16x16x16 with K-loop on gfx942; native on gfx950)	gfx942 + gfx950
`test_mfma_f32`	32x32x2 f32	`make_tiled_mma`, `partition_layout_a/b/c`, `make_gmem`	gfx942
`test_mfma_f32`	16x16x4 f32	(same as above)	gfx942
`test_mfma_f8`	32x32x16 fp8/bf8	`make_tiled_mma`, `partition_layout_a/b/c`, `make_gmem` (fp32 output, no cast)	gfx942 + gfx950
`test_mfma_f8`	16x16x32 fp8/bf8	(same as above)	gfx942 + gfx950
`test_mxfp`	mxfp8_32x32x64	`mfma<fp8_t,fp8_t,fp32_t,32,32,64>` (scaled overload), direct data-layout load/store	gfx950
`test_mxfp`	mxfp8_16x16x128	`mfma<fp8_t,fp8_t,fp32_t,16,16,128>` (scaled overload)	gfx950
`test_mxfp`	mxfp4_32x32x64	`mfma<fp4_t,fp4_t,fp32_t,32,32,64>` (scaled overload), fp4x2 packed nibble handling	gfx950
`test_mxfp`	mxfp4_16x16x128	`mfma<fp4_t,fp4_t,fp32_t,16,16,128>` (scaled overload)	gfx950
`test_vector_add`	—	`make_gmem`, vectorized `load<N>` / `store<N>`	all
`test_async_load`	—	`make_gmem`, `gmem::async_load`, `s_waitcnt_vmcnt`	all
`test_tr_load_f16`	32x16 fp16	`tr_load`, `async_load`, `make_tiled_mma` 32×32×16, `partition_layout_b`	gfx950
`test_dtype_convert`	fp32<->bf16 scalar	`cast<bf16_t>(fp32_t)` RNE (explicit `0_I` on gfx942, hw default on gfx950)	all
`test_dtype_convert`	fp32<->bf16 x4 vec	`cast<bf16_t>(fp32x4_t)` generic vectorized	all
`test_dtype_convert`	fp32<->fp16 scalar	`cast<fp16_t>(fp32_t)` / `cast<fp32_t>(fp16_t)`	all
`test_dtype_convert`	fp32<->fp16 x4 vec	`cast<fp16_t>(fp32x4_t)` generic vectorized	all
`test_dtype_convert`	fp32<->fp8 scalar	`cast<fp8_t>(fp32_t)` via `cvt_pk_fp8_f32` lo / `cvt_f32_fp8`	gfx942 + gfx950
`test_dtype_convert`	fp32<->fp8 x2 pk	`cast<fp8_t>(fp32x2_t)` packed x2	gfx942 + gfx950
`test_dtype_convert`	fp32<->fp8 x4 pk	`cast<fp8_t>(fp32x4_t)` packed x4	gfx942 + gfx950
`test_dtype_convert`	fp32<->fp8 x8 fold	`cast<fp8_t>(fp32x8_t)` auto-fold 2x4 + `unfold_from_container`	gfx942 + gfx950
`test_dtype_convert`	fp32<->fp4 x2 pk	`cast<fp4_t>(fp32x2_t)` packed x2, e2m1	gfx950
`test_dtype_convert`	fp32<->fp4 x4 pk	`cast<fp4_t>(fp32x4_t)` packed x4, e2m1	gfx950
`test_dtype_convert`	fp32<->fp4 x8 pk	`cast<fp4_t>(fp32x8_t)` packed x8, e2m1	gfx950
`test_load_store_if`	predicated_copy	`gmem::load_if`, `gmem::store_if`, free functions `opus::load_if`/`opus::store_if`, `layout_linear::operator+`	all
`test_load_store_if`	predicated_copy_2d	2D layout `load_if`/`store_if` with multi-index `(i_row, i_col)` predicate, `unfold_x_stride`, `unfold_p_coord`	all
`test_load_store_if`	free_func_vector_add	Free functions `opus::load`/`opus::store`, `is_gmem_v`/`is_mem_v` type traits	all
`test_load_store_if`	predicated_async_load	`gmem::async_load_if`, free function `opus::async_load_if`, `layout_linear::operator+`	all
`test_numeric_limits`	all types	`opus::numeric_limits<T>` for fp32/fp16/bf16/fp8/bf8/i32/i16/i8/u8	all
`test_finfo`	all float types	`opus::finfo<T>` (eps/max/min/tiny/bits) for fp32/fp16/bf16/fp8/bf8/fp4/e8m0	all
`test_mdiv`	11 divisors	`opus::magic_div` integer division by magic multiply	all
`test_wmma_scale`	16x16x128 fp8 BX32/BX16	`wmma<fp8_t,...,16,16,128>` scaled overload, per-lane E8M0 scale	gfx1250
`test_wmma_scale`	16x16x128 fp4 BX32/BX16	`wmma<fp4_t,...,16,16,128>` scaled overload via f8f6f4 (fmt=4)	gfx1250
`test_wmma_scale`	32x16x128 fp4 BX32/BX16	`wmma<fp4_t,...,32,16,128>` scaled overload (dedicated f4 inst)	gfx1250
`test_wmma_scale`	tiled 16x16x128 fp8 1x1	`make_tiled_mma` + `wmma_adaptor_swap_ab`, scaled, 1 wave	gfx1250
`test_wmma_scale`	tiled 16x16x128 fp8 2x2	`make_tiled_mma`, 4 waves (32x32 block), scaled	gfx1250
`test_wmma_scale`	tiled 16x16x128 fp8 4x1	`make_tiled_mma`, 4 waves (64x16 block), scaled	gfx1250
`test_wmma_scale`	per-lane scale fp8	Random E8M0 per m-row/n-col, bitwise exact	gfx1250
`test_mma_step_k`	32x32x128 bf16 step_k	`make_tiled_mma`, `step_k`	gfx942 + gfx950
`test_workgroup_barrier`	cumulative + streamk	`opus::workgroup_barrier` cross-workgroup synchronization	all

Total: 60+ test calls (14 MFMA + 4 MXFP + 11 WMMA + 10 WMMA-scale + 1 mma_step_k + 1 vector_add + 1 async_load + 1 tr_load + 11 dtype_convert + 4 load_store_if + 9 numeric_limits + 7 finfo + 11 mdiv + 4 workgroup_barrier).

Notes

The extension compiles with --offload-arch=<detected> (see device/setup.py) to target only the current GPU and speed up builds.
MFMA tests are runtime-gated by GPU architecture (gcnArchName). Tests for unsupported architectures are automatically skipped.
- 32x32x8 and 16x16x16 variants: gfx942 only.
- 32x32x16 and 16x16x32 fp16/bf16 variants: gfx942 (via step-K decomposition) + gfx950 (native instruction).
- 32x32x16 and 16x16x32 fp8/bf8 variants: gfx942 + gfx950 (native instruction on both). Output is raw fp32 accumulator.
BF16 rounding: opus::cast<bf16_t> default rounding mode differs by architecture:
- gfx942: default is truncation (rm=2). Pass 0_I as 2nd argument to select round-to-nearest-even (RNE).
- gfx950: default is already RNE (hardware). No 2nd argument needed. The dtype_convert bf16 test and MFMA bf16 tests both use RNE so that the kernel result matches PyTorch .to(bfloat16).
FP8 = float8_e4m3fnuz (gfx942) / float8_e4m3fn (gfx950), BF8 = float8_e5m2fnuz (gfx942) / float8_e5m2 (gfx950).
FP4 = E2M1 (4-bit: 1 sign, 2 exponent, 1 mantissa). Representable values: +/-{0, 0.5, 1, 1.5, 2, 3, 4, 6}. gfx950 only.
MXFP (unified into struct mfma, scaled operator() overload): gfx950-only __builtin_amdgcn_mfma_scale_f32_{32x32x64,16x16x128}_f8f6f4 intrinsics. Support MXFP8 (fp8fp8) and MXFP4 (fp4fp4) with E8M0 block exponent scaling. Tests use scale=127 (2^0=1.0, no scaling) and verify C = A @ B (standard matmul, not swap_ab). The data layout follows the CDNA4 Matrix Core specification.
WMMA Scale (unified into struct wmma, BX32/BX16 scaled operator() overloads): gfx1250-only scaled WMMA intrinsics. Supports 16x16x128 f8f6f4 (fp8 via fmt=0, fp4 via fmt=4) and 32x16x128 dedicated f4. Scale is per-lane E8M0 exponent (scale_sel=0 reads from lanes 0-15, =1 from lanes 16-31). Tests verify both scale=127 (no scaling) and random per-lane scales in [122..133], plus multi-wave tiled_mma configurations (1x1, 2x2, 4x1).
test_opus_device.py does a fresh build on every run (cleans previous .so) to ensure changes are picked up.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

OPUS Tests

Folder structure

Running tests

Build-time optimizations

What each optimization does

Current time breakdown (`time ./run_tests.sh`)

Per-file device compile times (MI355 / gfx950, 18 parallel jobs)

How to add a new device test

1. Create the kernel source

2. Register the source in setup.py

3. Add the ctypes wrapper and Python test

4. Verify

Device test summary

Notes

FilesExpand file tree

opus

Directory actions

More options

Directory actions

More options

Latest commit

History

opus

Folders and files

parent directory

README.md

OPUS Tests

Folder structure

Running tests

Build-time optimizations

What each optimization does

Current time breakdown (time ./run_tests.sh)

Per-file device compile times (MI355 / gfx950, 18 parallel jobs)

How to add a new device test

1. Create the kernel source

2. Register the source in setup.py

3. Add the ctypes wrapper and Python test

4. Verify

Device test summary

Notes

Current time breakdown (`time ./run_tests.sh`)