A minimal but complete example of a hand-written AMDGPU assembly kernel that
performs C[i] = A[i] + B[i] on float32 arrays, using the
buffer_load_dword ... offen lds instruction to load data directly from
global memory into LDS, bypassing VGPRs entirely.
The kernel demonstrates several GPU programming techniques:
- Persistent kernel: grid size = number of CUs (detected at runtime); each workgroup processes multiple chunks via a grid-stride loop.
- Double LDS buffering with deep pipeline fill: both buffers are loaded in the prologue; each loop half reads one buffer while prefetching it for 2 iterations ahead.
- OOB-based control flow: SRDs use `num_records = N * 4` so that out-of-bounds buffer loads return 0 and stores are silently dropped, eliminating all exec mask manipulation.
- `vmcnt(3)` accounting: carefully chosen wait count that drains exactly the previous half-iteration's 2 prefetch loads + 1 store, keeping the current half's operations fully in flight.
| File | Description |
|---|---|
| `vector_add_kernel.s` | GCN assembly kernel (gfx942, CDNA3) |
| `main.cpp` | HIP host code -- loads `.hsaco`, launches kernel, verifies results |
| `build.sh` | Assembles the `.s` and compiles the host into a single executable |
Must be run inside a ROCm docker container with gfx942 hardware available:
```
# Launch docker (adjust image as needed)
docker run -it --privileged --network=host \
    --device=/dev/kfd --device=/dev/dri --group-add video \
    -v /home/$USER:/dockerx -v /mnt/raid0:/raid0 \
    rocm/atom:nightly_202601190317

# Inside docker
cd /raid0/<path>/gcnasm/vector_add_asm
bash build.sh
./vector_add_asm.exe
```

Build steps performed by `build.sh`:
- Assemble `.s` to `.hsaco` code object:
  `clang++ -x assembler -target amdgcn--amdhsa -mcpu=gfx942 vector_add_kernel.s -o vector_add_kernel.hsaco`
- Compile host code:
  `hipcc main.cpp -o vector_add_asm.exe`
```
Global Memory (A,B)
      |
      |  buffer_load_dword ... offen lds   (VGPR bypassed, OOB → 0)
      v
     LDS
      |
      |  ds_read_b32
      v
    VGPRs
      |
      |  v_add_f32
      v
    VGPRs
      |
      |  buffer_store_dword ... offen      (OOB → silently dropped)
      v
Global Memory (C)
```
```
PROLOGUE (fill both buffers):
    buffer_load A[iter0] -> buf0        vmcnt = 1
    buffer_load B[iter0] -> buf0        vmcnt = 2
    buffer_load A[iter1] -> buf1        vmcnt = 3
    buffer_load B[iter1] -> buf1        vmcnt = 4
    s_waitcnt vmcnt(2)                  buf0 ready, buf1 in flight

MAIN LOOP (each half-iteration, steady state entry vmcnt = 3):
    ┌── ds_read A, B from current buffer
    │   s_waitcnt lgkmcnt(0)
    │
    │   buffer_load A[idx+2*stride] ──┐  prefetch into SAME buffer
    │   buffer_load B[idx+2*stride] ──┘  (we already read it above)
    │
    │   v_add_f32 A, B
    │   buffer_store C[idx]              (OOB → dropped)
    │
    │   idx += stride
    │   s_cbranch_vccz L_done            (exit when all lanes done)
    └── s_waitcnt vmcnt(3)               drain prev half's loads + store
        swap to other buffer, repeat
```
Key design points:
- Prologue fills both buffers so the loop body never stalls on the first read.
- `vmcnt(2)` after 4 loads drains the oldest 2 (buf0), leaving buf1 in flight.
- Prefetch goes into the same buffer just read (not the alternate), loading data for `idx + 2*stride` (2 half-iterations ahead).
- `vmcnt(3)` in the loop drains exactly the previous half's 2 prefetch loads + 1 store, keeping the current half's 2 prefetches + store in flight. See section 13 for the detailed FIFO trace.
On gfx942 (CDNA3), when a scalar base address (saddr) is provided,
global_load_dword / global_store_dword take a single VGPR as the 32-bit
signed byte offset -- not a VGPR pair like older GFX9 targets (gfx900/gfx906):
```
; gfx942 -- single VGPR offset
global_load_dword v_dst, v_offset, s[base:base+1]

; gfx900 -- VGPR pair (only lower half used as offset)
global_load_dword v_dst, v[offset:offset+1], s[base:base+1]
```

gfx942 with ROCm 6.x+ requires `amdhsa.version: [1, 2]` (code object v5). Using the older `[1, 0]` (v3) format causes "invalid HSA metadata" assembler errors. Key differences from v3:

- `.value_type` and `.is_const` fields are no longer valid in kernel arg metadata.
- Fields like `.reqd_workgroup_size` are optional.
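A minimal sketch of what a v5 metadata block for this kernel might look like. The kernarg size, LDS size, VGPR count, and workgroup size follow numbers used elsewhere in this README; the SGPR count and argument names are illustrative, not copied from `vector_add_kernel.s`:

```
.amdgpu_metadata
---
amdhsa.version: [1, 2]
amdhsa.kernels:
  - .name: vector_add_kernel
    .symbol: vector_add_kernel.kd
    .kernarg_segment_size: 32        # 3 pointers + 2 uint32_t
    .kernarg_segment_align: 8
    .group_segment_fixed_size: 4096  # 4 KiB LDS (two double buffers)
    .private_segment_fixed_size: 0
    .wavefront_size: 64
    .sgpr_count: 32                  # illustrative
    .vgpr_count: 8
    .max_flat_workgroup_size: 256
    .args:
      - { .name: A,      .size: 8, .offset: 0,  .value_kind: global_buffer, .address_space: global }
      - { .name: B,      .size: 8, .offset: 8,  .value_kind: global_buffer, .address_space: global }
      - { .name: C,      .size: 8, .offset: 16, .value_kind: global_buffer, .address_space: global }
      - { .name: N,      .size: 4, .offset: 24, .value_kind: by_value }
      - { .name: stride, .size: 4, .offset: 28, .value_kind: by_value }
...
.end_amdgpu_metadata
```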
The LLVM MC assembler (as of ROCm 7.1 / clang 20) does not accept the lds
text modifier on buffer_load_dword for any GFX9/CDNA target. The instruction
encoding exists in hardware and the disassembler understands it, but the
assembler rejects it.
Workaround: emit the 64-bit MUBUF encoding manually via .long, setting
bit 16 (the LDS bit). A clean macro makes this readable:
```
.macro buffer_load_dword_offen_lds vdata, vaddr, srsrc_base
    ; DWORD 0: MUBUF major opcode (0x38), OP=0x14 (buffer_load_dword),
    ;          LDS=1 (bit 16), OFFEN=1 (bit 12)
    .long 0xE0511000
    ; DWORD 1: SOFFSET=0x80 (literal 0), SRSRC, VDATA, VADDR
    .long (0x80 << 24) | ((\srsrc_base / 4) << 16) | (\vdata << 8) | \vaddr
.endm
```

The encoding `0xE0511000` was derived by assembling a regular `buffer_load_dword ... offen` (which gives `0xE0501000`) and flipping bit 16. The disassembler confirms: `buffer_load_dword v2, s[16:19], 0 offen lds`.
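Usage then reads like a normal instruction. A sketch, assuming `m0` has already been set (see the M0 section below), that `vdata` is a don't-care when LDS=1, and using the register numbers from the disassembly above (`s_lds_base` is a hypothetical symbolic name):

```
s_mov_b32 m0, s_lds_base               ; LDS destination base for this wave
buffer_load_dword_offen_lds 0, 2, 16   ; vaddr = v2, SRD = s[16:19]
                                       ; disassembles to:
                                       ; buffer_load_dword v2, s[16:19], 0 offen lds
```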
An alternative instruction that is accepted by the assembler is
global_load_lds_dword, but it uses the global (flat) memory path rather than
the buffer path.
When the LDS bit is set, loaded data bypasses VGPRs and goes directly to LDS:
- M0 register holds the per-wave LDS byte base address.
- Each lane writes to LDS at address `M0 + lane_id * sizeof(element)`.
- M0 must be set (via `s_mov_b32 m0, ...`) before each `buffer_load ... lds`.
- Use `v_readfirstlane_b32` to extract the wave's first lane threadIdx into an SGPR, then shift left by 2 (multiply by `sizeof(float)`) to get M0.

```
v_readfirstlane_b32 s_tmp, v_threadIdx   ; wave_id * 64
s_lshl_b32          s_tmp, s_tmp, 2      ; * sizeof(float)
s_mov_b32           m0, s_tmp            ; set LDS write base
buffer_load_dword_offen_lds ...          ; data -> LDS[M0 + lane*4]
```

Buffer instructions require a 128-bit SRD (buffer resource descriptor) in 4 consecutive SGPRs:
| Word | Value | Meaning |
|---|---|---|
| 0 | `base_address[31:0]` | Low 32 bits of pointer |
| 1 | `base_address[47:32]` | High 16 bits; upper bits zero for stride=0 |
| 2 | `N * 4` | num_records in bytes (enables OOB for idx >= N) |
| 3 | `0x00020000` | gfx942 config: DATA_FORMAT=32, TYPE=0 (raw) |
Setting num_records = N * sizeof(float) rather than 0xFFFFFFFF enables the
OOB (out-of-bounds) behavior used for branchless control flow (see section 12).
Word 3 value 0x00020000 matches the
opus library's buffer_default_config() for
gfx942/gfx90a. TYPE=0 (raw buffer) means OOB loads return 0 and OOB stores
are silently dropped.
Word 1 must mask off bits [31:16] to zero the stride and swizzle fields:
```
s_and_b32 s[srd+1], s[ptr_hi], 0xFFFF
```
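Putting the four words together, SRD construction for A might look like the following sketch (the symbolic register names are illustrative; only the `num_records` line appears verbatim in section 12):

```
s_mov_b32  s[s_res_a+0], s[s_ptr_a+0]           ; word 0: base_address[31:0]
s_and_b32  s[s_res_a+1], s[s_ptr_a+1], 0xFFFF   ; word 1: base[47:32], stride/swizzle = 0
s_lshl_b32 s[s_res_a+2], s[s_n], 2              ; word 2: num_records = N * sizeof(float)
s_mov_b32  s[s_res_a+3], 0x00020000             ; word 3: DATA_FORMAT=32, TYPE=0 (raw)
```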
gfx90a / gfx940 / gfx942 have a unified VGPR/AGPR register file. The `.amdhsa_accum_offset` directive tells the allocator where accumulator registers (AGPRs) begin. If no AGPRs are used, set it equal to `.amdhsa_next_free_vgpr`:
```
.amdhsa_next_free_vgpr 8
.amdhsa_accum_offset   8   ; no AGPRs used
```

Omitting this directive causes an assembler error on CDNA targets.
The AMDHSA calling convention places registers in this order:
| Register | Source | Enabled by |
|---|---|---|
| `s[0:1]` | kernarg segment pointer | `.amdhsa_user_sgpr_kernarg_segment_ptr 1` |
| `s2` | workgroup_id_x | `.amdhsa_system_sgpr_workgroup_id_x 1` |
| `v0` | workitem_id_x | `.amdhsa_system_vgpr_workitem_id 0` |
If you also enable dispatch_ptr (2 SGPRs), it occupies s[0:1] and pushes
kernarg_ptr to s[2:3] and workgroup_id_x to s4.
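A sketch of the directive combination that produces that shifted layout (this kernel itself leaves `dispatch_ptr` disabled):

```
.amdhsa_user_sgpr_dispatch_ptr        1   ; s[0:1] = dispatch packet pointer
.amdhsa_user_sgpr_kernarg_segment_ptr 1   ; s[2:3] = kernarg pointer   (was s[0:1])
.amdhsa_system_sgpr_workgroup_id_x    1   ; s4     = workgroup_id_x    (was s2)
```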
The kernel argument layout in the .s file must exactly match the
__attribute__((packed)) struct on the host side. Every pointer is 8 bytes,
every int/uint32_t is 4 bytes, and total size must match
.kernarg_segment_size in the metadata. An alignment mismatch silently
produces wrong results.
```
struct __attribute__((packed)) {
    float*   A;       // offset 0
    float*   B;       // offset 8
    float*   C;       // offset 16
    uint32_t N;       // offset 24 -- number of elements
    uint32_t stride;  // offset 28 -- num_CUs * 256 (grid-stride step)
} args;
```
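On the device side, the matching fetch is a handful of scalar loads from the kernarg pointer at the same offsets. A sketch with illustrative symbolic register names, assuming `s[0:1]` holds the kernarg segment pointer as in the register table above:

```
s_load_dwordx2 s[s_ptr_a:s_ptr_a+1], s[0:1], 0x00   ; A      (offset 0)
s_load_dwordx2 s[s_ptr_b:s_ptr_b+1], s[0:1], 0x08   ; B      (offset 8)
s_load_dwordx2 s[s_ptr_c:s_ptr_c+1], s[0:1], 0x10   ; C      (offset 16)
s_load_dword   s[s_n],               s[0:1], 0x18   ; N      (offset 24)
s_load_dword   s[s_stride],          s[0:1], 0x1C   ; stride (offset 28)
s_waitcnt      lgkmcnt(0)                           ; scalar loads count against lgkmcnt
```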
Since the kernel is a standalone code object (not compiled into the host binary), it is loaded at runtime:

```
hipModuleLoad(&module, "vector_add_kernel.hsaco");
hipModuleGetFunction(&kernel_func, module, "vector_add_kernel");
hipModuleLaunchKernel(kernel_func, gdx,1,1, bdx,1,1, 0, 0, NULL, (void**)&config);
```

The `config` array uses `HIP_LAUNCH_PARAM_BUFFER_POINTER` / `HIP_LAUNCH_PARAM_BUFFER_SIZE` / `HIP_LAUNCH_PARAM_END` to pass the raw argument buffer.
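A sketch of that plumbing, assuming the packed `args` struct from the previous section and `<hip/hip_runtime.h>` included (variable names are illustrative):

```
// Pass the packed kernarg struct as a raw buffer via the "extra" launch path;
// kernelParams is NULL because the arguments are not passed individually.
size_t args_size = sizeof(args);
void* config[] = {
    HIP_LAUNCH_PARAM_BUFFER_POINTER, &args,       // pointer to the packed struct
    HIP_LAUNCH_PARAM_BUFFER_SIZE,    &args_size,  // its size in bytes
    HIP_LAUNCH_PARAM_END                          // terminator
};
```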
Two independent wait counters control memory ordering:
| Counter | Tracks | Wait instruction |
|---|---|---|
| `vmcnt` | Global/buffer memory ops (loads, stores, `buffer_load...lds`, `buffer_store`) | `s_waitcnt vmcnt(N)` |
| `lgkmcnt` | LDS and scalar memory (kernarg loads, `ds_read`, `ds_write`) | `s_waitcnt lgkmcnt(0)` |
For the async-to-LDS pattern, the basic sequence is:
```
buffer_load_dword ... lds   ; global -> LDS   (increments vmcnt)
s_waitcnt vmcnt(0)          ; ensure LDS is written
ds_read_b32 ...             ; LDS -> VGPR     (increments lgkmcnt)
s_waitcnt lgkmcnt(0)        ; ensure VGPRs are ready
```

No `s_barrier` is needed because each wave only reads its own LDS region.
Important: vmcnt is a FIFO -- s_waitcnt vmcnt(N) means "wait until at
most N operations remain outstanding." Both buffer_load ... lds and
buffer_store push onto the same FIFO, so stores must be accounted for when
choosing the wait value. See section 13 for the detailed vmcnt(3) analysis.
Instead of launching one workgroup per chunk of 256 elements, the host launches
exactly num_CUs workgroups (one per Compute Unit, detected at runtime via
hipGetDeviceProperties). Each workgroup then processes all its elements via a
grid-stride loop:
```
idx    = global_id       // initial element index
stride = num_CUs * 256   // passed as kernel arg

for (; idx < N; idx += stride) {
    C[idx] = A[idx] + B[idx]
}
```
Benefits:
- Avoids kernel launch overhead for large N (one launch covers everything).
- Every CU is occupied for the entire kernel duration, maximizing utilization.
- Works correctly for any N, including N < total threads (out-of-range lanes are handled by the buffer OOB behavior rather than by exec masking).
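On the kernel side, the index bookkeeping for this loop might look like the following sketch (symbolic register names are illustrative; `s2` and `v0` are the workgroup and workitem IDs from the register table above):

```
; idx = workgroup_id_x * 256 + workitem_id_x
s_lshl_b32 s_tmp, s2, 8                      ; workgroup_id_x * 256
v_add_u32  v[v_idx], s_tmp, v0               ; + workitem_id_x

; ... at the bottom of each half-iteration:
v_add_u32  v[v_idx], s[s_stride], v[v_idx]   ; idx += stride (num_CUs * 256)
```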
On the host side this is straightforward:
```
int num_cu = props.multiProcessorCount;   // e.g. 304 on MI300X
int bdx    = 256;
int gdx    = num_cu;                      // persistent: 1 workgroup per CU
uint32_t stride = gdx * bdx;
```

Traditional GPU kernels use exec mask manipulation (`s_and_saveexec_b64`, `s_cbranch_execz`, etc.) to prevent out-of-bounds memory accesses. This kernel
takes a simpler approach by exploiting the hardware OOB behavior of buffer
instructions:
| Buffer operation | OOB behavior (TYPE=0, raw) |
|---|---|
| `buffer_load ... offen lds` | Returns 0 to LDS (harmless) |
| `buffer_store ... offen` | Silently dropped (no side effects) |
By setting num_records = N * 4 in every SRD (A, B, and C), any lane whose
byte offset >= N * 4 automatically gets this safe behavior. The kernel
never touches the exec mask -- all 64 lanes always execute every
instruction.
```
; SRD construction -- the key line
s_lshl_b32 s[s_res_a+2], s[s_n], 2   ; num_records = N * sizeof(float)
```

Loop termination uses a single vector compare plus a scalar branch on `vcc`:
```
v_cmp_gt_u32   vcc, s[s_n], v[v_idx]   ; any lane still in-bounds?
s_cbranch_vccz L_done                  ; if none → exit
```

This eliminates all `s_and_b64 exec`, `s_or_b64 exec`, `s_mov_b64 exec` and `s_cbranch_execz` instructions, producing cleaner and shorter code.
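For contrast, a sketch of the exec-mask pattern this approach replaces (registers and label are illustrative; not taken from this kernel):

```
v_cmp_gt_u32       vcc, s[s_n], v[v_idx]             ; lanes with idx < N
s_and_saveexec_b64 s[s_save:s_save+1], vcc           ; disable out-of-bounds lanes, save old exec
s_cbranch_execz    L_skip                            ; skip the body if no lane is in bounds
; ... guarded loads, add, store ...
L_skip:
s_or_b64           exec, exec, s[s_save:s_save+1]    ; restore the full exec mask
```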
The kernel uses two LDS buffers in a ping-pong arrangement with a deep prologue that fills both buffers before the loop begins.
```
Byte offset      Contents
──────────────────────────────────────
[   0, 1024)     Buffer 0 -- A values (256 threads x 4 bytes)
[1024, 2048)     Buffer 0 -- B values
[2048, 3072)     Buffer 1 -- A values
[3072, 4096)     Buffer 1 -- B values
```

```
s_m0_buf0_a = wave_lds_base + 0      ; buffer 0, A region
s_m0_buf0_b = wave_lds_base + 1024   ; buffer 0, B region
s_m0_buf1_a = wave_lds_base + 2048   ; buffer 1, A region
s_m0_buf1_b = wave_lds_base + 3072   ; buffer 1, B region
```

```
; iter 0 → buf0
buffer_load A[idx]   -> buf0   ; vmcnt 1
buffer_load B[idx]   -> buf0   ; vmcnt 2
; iter 1 → buf1
buffer_load A[idx+s] -> buf1   ; vmcnt 3
buffer_load B[idx+s] -> buf1   ; vmcnt 4
s_waitcnt vmcnt(2)             ; drain oldest 2 → buf0 is ready
```

Each half reads from its buffer, then prefetches into that same buffer
for 2 half-iterations ahead (idx + 2*stride):
```
L_process_buf0:                      L_process_buf1:
  ds_read A, B from buf0               ds_read A, B from buf1
  lgkmcnt(0)                           lgkmcnt(0)
  prefetch A[idx+2s] -> buf0           prefetch A[idx+2s] -> buf1
  prefetch B[idx+2s] -> buf0           prefetch B[idx+2s] -> buf1
  v_add_f32                            v_add_f32
  buffer_store C[idx]                  buffer_store C[idx]
  advance idx                          advance idx
  if no lanes valid → L_done           if no lanes valid → L_done
  s_waitcnt vmcnt(3)                   s_waitcnt vmcnt(3)
  ──► L_process_buf1                   ──► L_process_buf0
```
Choosing the right vmcnt(N) value is arguably the most subtle part of a
double-buffered GPU kernel. Getting it wrong either causes correctness bugs
(too loose) or kills performance (too tight).
The vmcnt FIFO tracks all outstanding buffer_load and buffer_store
operations. s_waitcnt vmcnt(N) means "wait until at most N entries remain
in the FIFO."
Consider the steady-state at the top of L_process_buf0, just after the
vmcnt(3) at the end of L_process_buf1:
```
vmcnt FIFO (oldest → newest):
  [already drained by vmcnt(3)]   ← previous buf0's 2 loads + 1 store
  entry 3: prefetch A -> buf1     (from L_process_buf1 we just left)
  entry 2: prefetch B -> buf1     (from L_process_buf1 we just left)
  entry 1: buffer_store C         (from L_process_buf1 we just left)
  ─── vmcnt = 3, exactly what we specified ───
```
At this point buf0's data was loaded 2 half-iterations ago -- it is
guaranteed ready because vmcnt(3) drained everything older than the 3
entries from the half-iteration we just completed.
If we used:
- `vmcnt(0)`: Everything drained -- safe but kills pipelining. No overlap between memory ops and compute.
- `vmcnt(2)`: Would only keep 2 entries (the 2 prefetches), meaning the store is also drained. This is slightly tighter than needed.
- `vmcnt(3)`: Keeps all 3 entries from the previous half in flight while guaranteeing the current buffer is ready. Optimal overlap.
| Location | Wait value | Reason |
|---|---|---|
| After prologue | `vmcnt(2)` | 4 loads issued; drain oldest 2 (buf0 ready), buf1's 2 loads stay in flight |
| Inside loop | `vmcnt(3)` | Drain prev half's 2 loads + 1 store; keep current half's 2 prefetches + 1 store |
| `L_done` (epilogue) | `vmcnt(0)` | Drain everything, ensuring the final stores have completed |