A collection of AMD GPU programming examples targeting CDNA / RDNA
architectures (primarily gfx942 / MI300), covering hand-written GCN
assembly kernels, HIP C++ device code, and PyTorch/Triton extensions.
Legend
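| Tag   | Meaning                                                                 |
|-------|-------------------------------------------------------------------------|
| [A]   | Hand-written GCN assembly kernel (.s)                                   |
| [H]   | HIP / C++ / CUDA device code                                            |
| [A/H] | Both hand-written assembly and HIP host code                            |
| +     | Has Python / PyTorch / Triton interface (can run from Python directly)  |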
Highlighted Examples
bandwidth_memread/ [H] -- The go-to memory bandwidth microbenchmark.
Measures peak read-only and read+write GPU memory bandwidth using persistent
kernels with float4-vectorized, non-temporal accesses. Supports both ROCm and
CUDA. Sweeps buffer sizes from ~78 KB to ~1.7 GB and reports GB/s per size.
Peak observed: ~4.56 TB/s read-only on MI308X (gfx942).
See the detailed README.
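As a rough illustration of how a size sweep and a per-size GB/s figure can be derived, here is a minimal Python sketch. It is not the benchmark's code: the function names, the doubling step, and the convention that a read+write pass counts the buffer twice are all assumptions made for this example, and the timing itself (normally taken from GPU events around the kernel) is left to the caller.

```python
def sweep_sizes(min_bytes=78 * 1024, max_bytes=1_700_000_000, factor=2):
    """Yield buffer sizes in bytes, growing geometrically.

    Sizes are rounded down to a multiple of 16 bytes (one float4).
    The endpoints mirror the README's ~78 KB and ~1.7 GB; the doubling
    step is an assumption, not the benchmark's actual schedule.
    """
    size = min_bytes
    while size <= max_bytes:
        yield size // 16 * 16
        size *= factor


def bandwidth_gb_per_s(buffer_bytes, elapsed_s, read_write=False):
    """Effective bandwidth in GB/s, with 1 GB = 1e9 bytes.

    Assumption: a read-only pass moves the buffer once, and a
    read+write pass moves it twice (one read plus one write).
    """
    traffic_bytes = buffer_bytes * (2 if read_write else 1)
    return traffic_bytes / elapsed_s / 1e9


if __name__ == "__main__":
    # Hypothetical timing: pretend every size took 1 ms per pass.
    for size in sweep_sizes():
        print(f"{size:>12d} B  {bandwidth_gb_per_s(size, 1e-3):10.2f} GB/s")
```

Under this convention, the ~4.56 TB/s peak quoted above corresponds to roughly 4560 GB/s.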
vector_add_asm/ [A/H] -- A minimal but complete hand-written GCN assembly
kernel for C[i] = A[i] + B[i] on gfx942, demonstrating persistent kernel
launch, double-buffered LDS with deep pipeline fill,
buffer_load_dword ... offen lds (async load to LDS), out-of-bounds (OOB)-based
control flow (no exec mask), and vmcnt(3) pipelining.
See the detailed README.
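The double-buffering and OOB ideas can be modelled on the CPU in a few lines of Python. This is only a conceptual sketch, not a translation of the assembly: the helper names and the tile size are hypothetical, a two-slot list stands in for the LDS buffers, and only one tile is kept in flight, whereas the kernel's deeper pipeline fill keeps more. It assumes bounds-checked buffer accesses in which out-of-range loads return 0 and out-of-range stores are dropped, which is what lets the loop prefetch past the end without a tail branch.

```python
def oob_load(buf, index):
    """Model of a bounds-checked load: out-of-range reads return 0."""
    return buf[index] if 0 <= index < len(buf) else 0


def oob_store(buf, index, value):
    """Model of a bounds-checked store: out-of-range writes are dropped."""
    if 0 <= index < len(buf):
        buf[index] = value


def vector_add_double_buffered(a, b, tile=4):
    """Compute c[i] = a[i] + b[i] tile by tile with two staging buffers."""
    n = len(a)
    c = [0] * n
    num_tiles = (n + tile - 1) // tile

    def load_tile(t):
        # Gather one tile of (a, b) pairs; elements past the end load as 0.
        base = t * tile
        return [(oob_load(a, base + k), oob_load(b, base + k))
                for k in range(tile)]

    stage = [None, None]      # two slots standing in for the LDS buffers
    stage[0] = load_tile(0)   # pipeline fill: first tile loaded up front

    for t in range(num_tiles):
        # Prefetch the next tile into the other slot. No "is there a next
        # tile?" branch is needed: a prefetch past the end just yields zeros.
        stage[(t + 1) % 2] = load_tile(t + 1)

        # Consume the current slot; stores past the end are silently dropped.
        base = t * tile
        for k, (x, y) in enumerate(stage[t % 2]):
            oob_store(c, base + k, x + y)
    return c


if __name__ == "__main__":
    print(vector_add_double_buffered([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], tile=2))
```

On the GPU the prefetch and the compute overlap in time, and a wait on the outstanding-load counter decides when a slot is safe to read; in this sequential model they simply run one after the other.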