A minimal but complete example of a hand-written AMDGPU assembly kernel that
performs C[i] = A[i] + B[i] on float32 arrays, using the
buffer_load_dword ... offen lds instruction to load data directly from
global memory into LDS, bypassing VGPRs entirely.
The kernel demonstrates several GPU programming techniques:
- Persistent kernel: grid size = number of CUs (detected at runtime); each workgroup processes multiple chunks via a grid-stride loop.
- Double LDS buffering with deep pipeline fill: both buffers are loaded in the prologue; each loop half reads one buffer while prefetching it for 2 iterations ahead.
- OOB-based control flow: SRDs use `num_records = N * 4` so that out-of-bounds buffer loads return 0 and stores are silently dropped, eliminating all exec mask manipulation.
- `vmcnt(3)` accounting: carefully chosen wait count that drains exactly the previous half-iteration's 2 prefetch loads + 1 store, keeping the current half's operations fully in flight.
| File | Description |
|---|---|
| `vector_add_kernel.s` | GCN assembly kernel (gfx942, CDNA3) |
| `main.cpp` | HIP host code -- loads `.hsaco`, launches kernel, verifies results |
| `build.sh` | Assembles the `.s` and compiles the host into a single executable |
Must be run inside a ROCm docker container with gfx942 hardware available:
```
# Launch docker (adjust image as needed)
docker run -it --privileged --network=host \
    --device=/dev/kfd --device=/dev/dri --group-add video \
    -v /home/$USER:/dockerx -v /mnt/raid0:/raid0 \
    rocm/atom:nightly_202601190317

# Inside docker
cd /raid0/<path>/gcnasm/vector_add_asm
bash build.sh
./vector_add_asm.exe
```

Build steps performed by `build.sh`:
- Assemble `.s` to `.hsaco` code object:
  `clang++ -x assembler -target amdgcn--amdhsa -mcpu=gfx942 vector_add_kernel.s -o vector_add_kernel.hsaco`
- Compile host code:
  `hipcc main.cpp -o vector_add_asm.exe`
```
Global Memory (A,B)
      |
      |  buffer_load_dword ... offen lds   (VGPR bypassed, OOB → 0)
      v
     LDS
      |
      |  ds_read_b32
      v
    VGPRs
      |
      |  v_add_f32
      v
    VGPRs
      |
      |  buffer_store_dword ... offen      (OOB → silently dropped)
      v
Global Memory (C)
```
```
PROLOGUE (fill both buffers):
    buffer_load A[iter0] -> buf0        vmcnt = 1
    buffer_load B[iter0] -> buf0        vmcnt = 2
    buffer_load A[iter1] -> buf1        vmcnt = 3
    buffer_load B[iter1] -> buf1        vmcnt = 4
    s_waitcnt vmcnt(2)                  buf0 ready, buf1 in flight

MAIN LOOP (each half-iteration, steady state entry vmcnt = 3):
    ┌── ds_read A, B from current buffer
    │   s_waitcnt lgkmcnt(0)
    │
    │   buffer_load A[idx+2*stride] ──┐  prefetch into SAME buffer
    │   buffer_load B[idx+2*stride] ──┘  (we already read it above)
    │
    │   v_add_f32 A, B
    │   buffer_store C[idx]              (OOB → dropped)
    │
    │   idx += stride
    │   s_cbranch_vccz L_done            (exit when all lanes done)
    └── s_waitcnt vmcnt(3)               drain prev half's loads + store
        swap to other buffer, repeat
```
Key design points:
- Prologue fills both buffers so the loop body never stalls on the first read.
- `vmcnt(2)` after 4 loads drains the oldest 2 (buf0), leaving buf1 in flight.
- Prefetch goes into the same buffer just read (not the alternate), loading data for `idx + 2*stride` (2 half-iterations ahead).
- `vmcnt(3)` in the loop drains exactly the previous half's 2 prefetch loads + 1 store, keeping the current half's 2 prefetches + store in flight. See section 13 for the detailed FIFO trace.
On gfx942 (CDNA3), when a scalar base address (saddr) is provided,
global_load_dword / global_store_dword take a single VGPR as the 32-bit
signed byte offset -- not a VGPR pair like older GFX9 targets (gfx900/gfx906):
```
; gfx942 -- single VGPR offset
global_load_dword v_dst, v_offset, s[base:base+1]

; gfx900 -- VGPR pair (only lower half used as offset)
global_load_dword v_dst, v[offset:offset+1], s[base:base+1]
```

gfx942 with ROCm 6.x+ requires `amdhsa.version: [1, 2]` (code object v5). Using the older `[1, 0]` (v3) format causes "invalid HSA metadata" assembler errors. Key differences from v3:

- `.value_type` and `.is_const` fields are no longer valid in kernel arg metadata.
- Fields like `.reqd_workgroup_size` are optional.
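A minimal sketch of what a v5 metadata block for this kernel might look like. The kernarg size, LDS size, VGPR count, and workgroup size follow numbers used elsewhere in this README; the SGPR count and argument names are illustrative, not copied from `vector_add_kernel.s`:

```
.amdgpu_metadata
---
amdhsa.version: [1, 2]
amdhsa.kernels:
  - .name: vector_add_kernel
    .symbol: vector_add_kernel.kd
    .kernarg_segment_size: 32        # 3 pointers + 2 uint32_t
    .kernarg_segment_align: 8
    .group_segment_fixed_size: 4096  # 4 KiB LDS (two double buffers)
    .private_segment_fixed_size: 0
    .wavefront_size: 64
    .sgpr_count: 32                  # illustrative
    .vgpr_count: 8
    .max_flat_workgroup_size: 256
    .args:
      - { .name: A,      .size: 8, .offset: 0,  .value_kind: global_buffer, .address_space: global }
      - { .name: B,      .size: 8, .offset: 8,  .value_kind: global_buffer, .address_space: global }
      - { .name: C,      .size: 8, .offset: 16, .value_kind: global_buffer, .address_space: global }
      - { .name: N,      .size: 4, .offset: 24, .value_kind: by_value }
      - { .name: stride, .size: 4, .offset: 28, .value_kind: by_value }
...
.end_amdgpu_metadata
```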
The LLVM MC assembler (as of ROCm 7.1 / clang 20) does not accept the lds
text modifier on buffer_load_dword for any GFX9/CDNA target. The instruction
encoding exists in hardware and the disassembler understands it, but the
assembler rejects it.
Workaround: emit the 64-bit MUBUF encoding manually via .long, setting
bit 16 (the LDS bit). A clean macro makes this readable:
```
.macro buffer_load_dword_offen_lds vdata, vaddr, srsrc_base
    ; DWORD 0: MUBUF major opcode (0x38), OP=0x14 (buffer_load_dword),
    ;          LDS=1 (bit 16), OFFEN=1 (bit 12)
    .long 0xE0511000
    ; DWORD 1: SOFFSET=0x80 (literal 0), SRSRC, VDATA, VADDR
    .long (0x80 << 24) | ((\srsrc_base / 4) << 16) | (\vdata << 8) | \vaddr
.endm
```

The encoding `0xE0511000` was derived by assembling a regular `buffer_load_dword ... offen` (which gives `0xE0501000`) and flipping bit 16. The disassembler confirms: `buffer_load_dword v2, s[16:19], 0 offen lds`.
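Usage then reads like a normal instruction. A sketch, assuming `m0` has already been set (see the M0 section below), that `vdata` is a don't-care when LDS=1, and using the register numbers from the disassembly above (`s_lds_base` is a hypothetical symbolic name):

```
s_mov_b32 m0, s_lds_base               ; LDS destination base for this wave
buffer_load_dword_offen_lds 0, 2, 16   ; vaddr = v2, SRD = s[16:19]
                                       ; disassembles to:
                                       ; buffer_load_dword v2, s[16:19], 0 offen lds
```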
An alternative instruction that is accepted by the assembler is
global_load_lds_dword, but it uses the global (flat) memory path rather than
the buffer path.
When the LDS bit is set, loaded data bypasses VGPRs and goes directly to LDS:
- M0 register holds the per-wave LDS byte base address.
- Each lane writes to LDS at address `M0 + lane_id * sizeof(element)`.
- M0 must be set (via `s_mov_b32 m0, ...`) before each `buffer_load ... lds`.
- Use `v_readfirstlane_b32` to extract the wave's first lane threadIdx into an SGPR, then shift left by 2 (multiply by `sizeof(float)`) to get M0.

```
v_readfirstlane_b32 s_tmp, v_threadIdx   ; wave_id * 64
s_lshl_b32          s_tmp, s_tmp, 2      ; * sizeof(float)
s_mov_b32           m0, s_tmp            ; set LDS write base
buffer_load_dword_offen_lds ...          ; data -> LDS[M0 + lane*4]
```

Buffer instructions require a 128-bit SRD (buffer resource descriptor) in 4 consecutive SGPRs:
| Word | Value | Meaning |
|---|---|---|
| 0 | `base_address[31:0]` | Low 32 bits of pointer |
| 1 | `base_address[47:32]` | High 16 bits; upper bits zero for stride=0 |
| 2 | `N * 4` | num_records in bytes (enables OOB for idx >= N) |
| 3 | `0x00020000` | gfx942 config: DATA_FORMAT=32, TYPE=0 (raw) |
Setting num_records = N * sizeof(float) rather than 0xFFFFFFFF enables the
OOB (out-of-bounds) behavior used for branchless control flow (see section 12).
Word 3 value 0x00020000 matches the
opus library's buffer_default_config() for
gfx942/gfx90a. TYPE=0 (raw buffer) means OOB loads return 0 and OOB stores
are silently dropped.
Word 1 must mask off bits [31:16] to zero the stride and swizzle fields:
```
s_and_b32 s[srd+1], s[ptr_hi], 0xFFFF
```
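Putting the four words together, SRD construction for A might look like the following sketch (the symbolic register names are illustrative; only the `num_records` line appears verbatim in section 12):

```
s_mov_b32  s[s_res_a+0], s[s_ptr_a+0]           ; word 0: base_address[31:0]
s_and_b32  s[s_res_a+1], s[s_ptr_a+1], 0xFFFF   ; word 1: base[47:32], stride/swizzle = 0
s_lshl_b32 s[s_res_a+2], s[s_n], 2              ; word 2: num_records = N * sizeof(float)
s_mov_b32  s[s_res_a+3], 0x00020000             ; word 3: DATA_FORMAT=32, TYPE=0 (raw)
```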
gfx90a / gfx940 / gfx942 have a unified VGPR/AGPR register file. The `.amdhsa_accum_offset` directive tells the allocator where accumulator registers (AGPRs) begin. If no AGPRs are used, set it equal to `.amdhsa_next_free_vgpr`:
```
.amdhsa_next_free_vgpr 8
.amdhsa_accum_offset   8   ; no AGPRs used
```

Omitting this directive causes an assembler error on CDNA targets.
The AMDHSA calling convention places registers in this order:
| Register | Source | Enabled by |
|---|---|---|
| `s[0:1]` | kernarg segment pointer | `.amdhsa_user_sgpr_kernarg_segment_ptr 1` |
| `s2` | workgroup_id_x | `.amdhsa_system_sgpr_workgroup_id_x 1` |
| `v0` | workitem_id_x | `.amdhsa_system_vgpr_workitem_id 0` |
If you also enable dispatch_ptr (2 SGPRs), it occupies s[0:1] and pushes
kernarg_ptr to s[2:3] and workgroup_id_x to s4.
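A sketch of the directive combination that produces that shifted layout (this kernel itself leaves `dispatch_ptr` disabled):

```
.amdhsa_user_sgpr_dispatch_ptr        1   ; s[0:1] = dispatch packet pointer
.amdhsa_user_sgpr_kernarg_segment_ptr 1   ; s[2:3] = kernarg pointer   (was s[0:1])
.amdhsa_system_sgpr_workgroup_id_x    1   ; s4     = workgroup_id_x    (was s2)
```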
The kernel argument layout in the .s file must exactly match the
__attribute__((packed)) struct on the host side. Every pointer is 8 bytes,
every int/uint32_t is 4 bytes, and total size must match
.kernarg_segment_size in the metadata. An alignment mismatch silently
produces wrong results.
```
struct __attribute__((packed)) {
    float*   A;       // offset 0
    float*   B;       // offset 8
    float*   C;       // offset 16
    uint32_t N;       // offset 24 -- number of elements
    uint32_t stride;  // offset 28 -- num_CUs * 256 (grid-stride step)
} args;
```
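On the device side, the matching fetch is a handful of scalar loads from the kernarg pointer at the same offsets. A sketch with illustrative symbolic register names, assuming `s[0:1]` holds the kernarg segment pointer as in the register table above:

```
s_load_dwordx2 s[s_ptr_a:s_ptr_a+1], s[0:1], 0x00   ; A      (offset 0)
s_load_dwordx2 s[s_ptr_b:s_ptr_b+1], s[0:1], 0x08   ; B      (offset 8)
s_load_dwordx2 s[s_ptr_c:s_ptr_c+1], s[0:1], 0x10   ; C      (offset 16)
s_load_dword   s[s_n],               s[0:1], 0x18   ; N      (offset 24)
s_load_dword   s[s_stride],          s[0:1], 0x1C   ; stride (offset 28)
s_waitcnt      lgkmcnt(0)                           ; scalar loads count against lgkmcnt
```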
Since the kernel is a standalone code object (not compiled into the host binary), it is loaded at runtime:

```
hipModuleLoad(&module, "vector_add_kernel.hsaco");
hipModuleGetFunction(&kernel_func, module, "vector_add_kernel");
hipModuleLaunchKernel(kernel_func, gdx,1,1, bdx,1,1, 0, 0, NULL, (void**)&config);
```

The `config` array uses `HIP_LAUNCH_PARAM_BUFFER_POINTER` / `HIP_LAUNCH_PARAM_BUFFER_SIZE` / `HIP_LAUNCH_PARAM_END` to pass the raw argument buffer.
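A sketch of that plumbing, assuming the packed `args` struct from the previous section and `<hip/hip_runtime.h>` included (variable names are illustrative):

```
// Pass the packed kernarg struct as a raw buffer via the "extra" launch path;
// kernelParams is NULL because the arguments are not passed individually.
size_t args_size = sizeof(args);
void* config[] = {
    HIP_LAUNCH_PARAM_BUFFER_POINTER, &args,       // pointer to the packed struct
    HIP_LAUNCH_PARAM_BUFFER_SIZE,    &args_size,  // its size in bytes
    HIP_LAUNCH_PARAM_END                          // terminator
};
```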
Two independent wait counters control memory ordering:
| Counter | Tracks | Wait instruction |
|---|---|---|
| `vmcnt` | Global/buffer memory ops (loads, stores, `buffer_load...lds`, `buffer_store`) | `s_waitcnt vmcnt(N)` |
| `lgkmcnt` | LDS and scalar memory (kernarg loads, `ds_read`, `ds_write`) | `s_waitcnt lgkmcnt(0)` |
For the async-to-LDS pattern, the basic sequence is:
```
buffer_load_dword ... lds   ; global -> LDS   (increments vmcnt)
s_waitcnt vmcnt(0)          ; ensure LDS is written
ds_read_b32 ...             ; LDS -> VGPR     (increments lgkmcnt)
s_waitcnt lgkmcnt(0)        ; ensure VGPRs are ready
```

No `s_barrier` is needed because each wave only reads its own LDS region.
Important: vmcnt is a FIFO -- s_waitcnt vmcnt(N) means "wait until at
most N operations remain outstanding." Both buffer_load ... lds and
buffer_store push onto the same FIFO, so stores must be accounted for when
choosing the wait value. See section 13 for the detailed vmcnt(3) analysis.
Instead of launching one workgroup per chunk of 256 elements, the host launches
exactly num_CUs workgroups (one per Compute Unit, detected at runtime via
hipGetDeviceProperties). Each workgroup then processes all its elements via a
grid-stride loop:
```
idx    = global_id       // initial element index
stride = num_CUs * 256   // passed as kernel arg

for (; idx < N; idx += stride) {
    C[idx] = A[idx] + B[idx]
}
```
Benefits:
- Avoids kernel launch overhead for large N (one launch covers everything).
- Every CU is occupied for the entire kernel duration, maximizing utilization.
- Works correctly for any N, including N < total threads (out-of-range lanes are handled by the buffer OOB behavior rather than by exec masking).
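On the kernel side, the index bookkeeping for this loop might look like the following sketch (symbolic register names are illustrative; `s2` and `v0` are the workgroup and workitem IDs from the register table above):

```
; idx = workgroup_id_x * 256 + workitem_id_x
s_lshl_b32 s_tmp, s2, 8                      ; workgroup_id_x * 256
v_add_u32  v[v_idx], s_tmp, v0               ; + workitem_id_x

; ... at the bottom of each half-iteration:
v_add_u32  v[v_idx], s[s_stride], v[v_idx]   ; idx += stride (num_CUs * 256)
```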
On the host side this is straightforward:
```
int num_cu = props.multiProcessorCount;   // e.g. 304 on MI300X
int bdx    = 256;
int gdx    = num_cu;                      // persistent: 1 workgroup per CU
uint32_t stride = gdx * bdx;
```

Traditional GPU kernels use exec mask manipulation (`s_and_saveexec_b64`, `s_cbranch_execz`, etc.) to prevent out-of-bounds memory accesses. This kernel
takes a simpler approach by exploiting the hardware OOB behavior of buffer
instructions:
| Buffer operation | OOB behavior (TYPE=0, raw) |
|---|---|
| `buffer_load ... offen lds` | Returns 0 to LDS (harmless) |
| `buffer_store ... offen` | Silently dropped (no side effects) |
By setting num_records = N * 4 in every SRD (A, B, and C), any lane whose
byte offset >= N * 4 automatically gets this safe behavior. The kernel
never touches the exec mask -- all 64 lanes always execute every
instruction.
```
; SRD construction -- the key line
s_lshl_b32 s[s_res_a+2], s[s_n], 2   ; num_records = N * sizeof(float)
```

Loop termination uses a single vector compare plus a scalar branch on `vcc`:
```
v_cmp_gt_u32   vcc, s[s_n], v[v_idx]   ; any lane still in-bounds?
s_cbranch_vccz L_done                  ; if none → exit
```

This eliminates all `s_and_b64 exec`, `s_or_b64 exec`, `s_mov_b64 exec` and `s_cbranch_execz` instructions, producing cleaner and shorter code.
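For contrast, a sketch of the exec-mask pattern this approach replaces (registers and label are illustrative; not taken from this kernel):

```
v_cmp_gt_u32       vcc, s[s_n], v[v_idx]             ; lanes with idx < N
s_and_saveexec_b64 s[s_save:s_save+1], vcc           ; disable out-of-bounds lanes, save old exec
s_cbranch_execz    L_skip                            ; skip the body if no lane is in bounds
; ... guarded loads, add, store ...
L_skip:
s_or_b64           exec, exec, s[s_save:s_save+1]    ; restore the full exec mask
```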
The kernel uses two LDS buffers in a ping-pong arrangement with a deep prologue that fills both buffers before the loop begins.
```
Byte offset      Contents
──────────────────────────────────────
[   0, 1024)     Buffer 0 -- A values (256 threads x 4 bytes)
[1024, 2048)     Buffer 0 -- B values
[2048, 3072)     Buffer 1 -- A values
[3072, 4096)     Buffer 1 -- B values
```

```
s_m0_buf0_a = wave_lds_base + 0      ; buffer 0, A region
s_m0_buf0_b = wave_lds_base + 1024   ; buffer 0, B region
s_m0_buf1_a = wave_lds_base + 2048   ; buffer 1, A region
s_m0_buf1_b = wave_lds_base + 3072   ; buffer 1, B region
```

```
; iter 0 → buf0
buffer_load A[idx]   -> buf0   ; vmcnt 1
buffer_load B[idx]   -> buf0   ; vmcnt 2
; iter 1 → buf1
buffer_load A[idx+s] -> buf1   ; vmcnt 3
buffer_load B[idx+s] -> buf1   ; vmcnt 4
s_waitcnt vmcnt(2)             ; drain oldest 2 → buf0 is ready
```

Each half reads from its buffer, then prefetches into that same buffer
for 2 half-iterations ahead (idx + 2*stride):
```
L_process_buf0:                      L_process_buf1:
  ds_read A, B from buf0               ds_read A, B from buf1
  lgkmcnt(0)                           lgkmcnt(0)
  prefetch A[idx+2s] -> buf0           prefetch A[idx+2s] -> buf1
  prefetch B[idx+2s] -> buf0           prefetch B[idx+2s] -> buf1
  v_add_f32                            v_add_f32
  buffer_store C[idx]                  buffer_store C[idx]
  advance idx                          advance idx
  if no lanes valid → L_done           if no lanes valid → L_done
  s_waitcnt vmcnt(3)                   s_waitcnt vmcnt(3)
  ──► L_process_buf1                   ──► L_process_buf0
```
Choosing the right vmcnt(N) value is arguably the most subtle part of a
double-buffered GPU kernel. Getting it wrong either causes correctness bugs
(too loose) or kills performance (too tight).
The vmcnt FIFO tracks all outstanding buffer_load and buffer_store
operations. s_waitcnt vmcnt(N) means "wait until at most N entries remain
in the FIFO."
Consider the steady-state at the top of L_process_buf0, just after the
vmcnt(3) at the end of L_process_buf1:
```
vmcnt FIFO (oldest → newest):
  [already drained by vmcnt(3)]   ← previous buf0's 2 loads + 1 store
  entry 3: prefetch A -> buf1     (from L_process_buf1 we just left)
  entry 2: prefetch B -> buf1     (from L_process_buf1 we just left)
  entry 1: buffer_store C         (from L_process_buf1 we just left)
  ─── vmcnt = 3, exactly what we specified ───
```
At this point buf0's data was loaded 2 half-iterations ago -- it is
guaranteed ready because vmcnt(3) drained everything older than the 3
entries from the half-iteration we just completed.
If we used:
- `vmcnt(0)`: Everything drained -- safe but kills pipelining. No overlap between memory ops and compute.
- `vmcnt(2)`: Would only keep 2 entries (the 2 prefetches), meaning the store is also drained. This is slightly tighter than needed.
- `vmcnt(3)`: Keeps all 3 entries from the previous half in flight while guaranteeing the current buffer is ready. Optimal overlap.
| Location | Wait value | Reason |
|---|---|---|
| After prologue | `vmcnt(2)` | 4 loads issued; drain oldest 2 (buf0 ready), buf1's 2 loads stay in flight |
| Inside loop | `vmcnt(3)` | Drain prev half's 2 loads + 1 store; keep current half's 2 prefetches + 1 store |
| `L_done` (epilogue) | `vmcnt(0)` | Drain everything, ensuring the final stores have completed |