Commit a790251

erwei-xilinx and claude committed
[multi-gpu] Phase 2: kernel-driven producer/consumer rewrite
Per @mawad-amd's review feedback on PR Xilinx#1577: replace the host-orchestrated mgpuMemcpy reference test with a kernel-driven producer/consumer pair. Cross-rank data movement is now performed by GPU compute units issuing loads/stores directly into peer HBM over XGMI, not by the HIP copy engine.

Changes:

- air_sym_handwritten.mlir is rewritten as one gpu.module with two gpu.func kernels:
  * producer (rank 0): each thread writes 42.0 into rank 1's `data` via memref.store on a peer memref produced by air.translate. Lane 0 of each warp signals the per-warp flag with a release atomicrmw on rank 1's `flags`.
  * consumer (rank 1): lane 0 of each warp spins on its flag with an acquire atomic load until producer signals; gpu.barrier then releases all 64 lanes to read their data slot and copy it into a verify buffer. Host D2H reads verify_buf and checks 42.0.
  The host driver (func.func @main) initializes the symmetric heap, copies heap_bases into a device-resident buffer (workaround for the fact that mgpuGetHeapBases returns a host pointer), and dispatches the producer or consumer kernel based on rank.
- run.sh adds the GPU compilation chain (rocdl-attach-target, convert-gpu-to-rocdl, gpu-module-to-binary, gpu-async-region, gpu-to-llvm) before mlir-runner.
- run.sh sets HIP_VISIBLE_DEVICES=$i + LOCAL_RANK=0 per process so each rank sees only its own GPU as device 0. This eliminates the device-binding ambiguity between airgpu's hipSetDevice and MLIR's built-in gpu.launch_func handling that would otherwise cause rank N>0 to fail with hipErrorInvalidDevice when launching kernels.

Validated on rad-mi325x-1 (8x MI325X, ROCm 7.1.1):

  W=2: rank 1 (consumer): cross-rank kernel write PASS (verify[0]=42.0)
  W=4: ALL 4 RANKS PASSED (rank 0/1 active, ranks 2-3 idle)
  W=8: ALL 8 RANKS PASSED (rank 0/1 active, ranks 2-7 idle)

This is the first time GPU compute units (not the HIP copy engine) have been observed driving cross-rank data movement over XGMI in this stack.
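
The per-process device binding described for run.sh can be sketched roughly as follows. This is an illustration, not the script from this commit: the `launch_ranks` helper and its arguments are hypothetical, and only the `HIP_VISIBLE_DEVICES=$i` and `LOCAL_RANK=0` settings come from the message above. How each process learns its rank is not described in the message and is left out here.

```shell
#!/usr/bin/env bash
# Sketch only: start one process per rank, each restricted to a single GPU.
# launch_ranks is a hypothetical helper; the real run.sh may be structured
# differently. Usage: launch_ranks <world_size> <command> [args...]
launch_ranks() {
  local world_size=$1; shift
  local pids=() status=0 i pid
  for ((i = 0; i < world_size; i++)); do
    # Rank i sees only physical GPU i, which then appears as device 0
    # inside that process, hence LOCAL_RANK=0 for every rank.
    HIP_VISIBLE_DEVICES=$i LOCAL_RANK=0 "$@" &
    pids+=("$!")
  done
  # Wait for every rank and report failure if any of them failed.
  for pid in "${pids[@]}"; do
    wait "$pid" || status=1
  done
  return "$status"
}
```

For example, `launch_ranks 2 sh -c 'echo "gpu=$HIP_VISIBLE_DEVICES local_rank=$LOCAL_RANK"'` starts two processes that each report a different visible GPU but the same local rank. Because every process addresses its GPU as device 0, the runtime and the generated launch code cannot disagree about which device is current.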
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent e8492fb commit a790251

2 files changed

Lines changed: 303 additions & 132 deletions
