Add a HIP/ROCm device backend for KV-transfer and Hamming kernels by jeffdaily · Pull Request #1021 · ModelEngine-Group/unified-cache-management

jeffdaily · 2026-06-11T22:34:38Z

This adds a new RUNTIME_ENVIRONMENT=rocm device backend (PLATFORM=rocm in
setup.py) alongside the existing cuda/ascend/maca/musa/simu backends, so the
KV block-transfer kernels, the H2D/D2H store path, and the sparse Hamming
scoring kernel build and run on AMD GPUs via HIP. The backend is purely
additive: the existing backends are not modified. We have made every effort
to leave the NVIDIA build unchanged -- every source change to a shared file
sits behind a __CUDA_ARCH__ or USE_ROCM guard, so the CUDA build never
compiles the new code, and the compat shim is placed on the include path only
for the rocm build, never the cuda one. Select it with export PLATFORM=rocm (or
-DRUNTIME_ENVIRONMENT=rocm when invoking CMake directly).

Review order: start with ucm/shared/vendor/hip_compat/ (the compat shim), then
the three new rocm/CMakeLists.txt arms (trans, store, sparse ham_dist), then
the two guarded kernel sources, then the docs and Windows host-build guards.

The compat shim resolves the project's #include <cuda_runtime.h>/<cuda.h>
to <hip/hip_runtime.h> and aliases the small cuda* runtime surface in use
(Malloc/Free/Memcpy[Async]/HostMalloc/HostRegister/Stream*/Event*) to hip*.
Each rocm CMake arm calls enable_language(HIP)/find_package(hip), reuses the
existing cuda .cc/.cu sources marked LANGUAGE HIP, and reads
CMAKE_HIP_ARCHITECTURES (defaulting to gfx90a only when unset) so other AMD
targets need no source edit.

Root cause of the one non-mechanical change: the two grid-stride copy kernels
used inline PTX (ld.global.cs / st.volatile.global vectorized loads/stores)
that does not exist on AMD. The PTX is now guarded by
#if defined(__CUDA_ARCH__) with a HIP branch doing a plain vectorized uint4
load/store (32-byte and 16-byte units). ROCm 7.2.1 does not provide the
__ldcs/__stcg/__stcs cache-streaming builtins, and those PTX qualifiers are
cache-policy hints rather than visibility semantics for this memcpy (each
thread writes a disjoint unit and the only consumer is the host after a stream
sync), so the plain copy is functionally equivalent. FlashInfer's cp_async.cuh
already selects its portable non-PTX fallback under hipcc, so it needed no
change.
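
The equivalence argument above depends on every copy unit being written by exactly one thread. A minimal host-side sketch of grid-stride indexing illustrates that property. It is a Python simulation only, not the kernel source: `grid_stride_copy`, `UNIT`, and the launch dimensions are hypothetical names chosen for illustration, and the 16-byte unit is assumed to correspond to one uint4 load/store.

```python
UNIT = 16  # bytes per copy unit; assumed to mirror one uint4 load/store


def grid_stride_copy(src, num_blocks, threads_per_block):
    """Simulate a grid-stride copy; return (dst, per-unit write counts)."""
    if len(src) % UNIT != 0:
        raise ValueError("length must be a multiple of UNIT")
    num_units = len(src) // UNIT
    total_threads = num_blocks * threads_per_block
    dst = bytearray(len(src))
    writes = [0] * num_units
    for tid in range(total_threads):
        # Each simulated thread starts at its global id and advances by the
        # total thread count, so no two threads ever visit the same unit.
        for unit in range(tid, num_units, total_threads):
            lo = unit * UNIT
            dst[lo:lo + UNIT] = src[lo:lo + UNIT]
            writes[unit] += 1
    return bytes(dst), writes


payload = bytes(range(256)) * 5  # 1280 bytes, i.e. 80 units
copied, writes = grid_stride_copy(payload, num_blocks=3, threads_per_block=8)
print(copied == payload, set(writes))  # True {1}
```

Because the units are disjoint and the host reads the buffer only after a stream sync, no ordering between device threads is needed, which is why a cache-policy hint can be dropped without changing the result.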

The sparse Hamming module links libtorch. operator.h now includes
<ATen/hip/HIPContext.h> under USE_ROCM (the cuda-spelled context header pulls
in NVIDIA-only cuda_runtime_api.h/cusparse.h, while the hipified header exposes
the same c10::cuda::getCurrentCUDAStream backed by HIP). It builds at C++20
(torch 2.x headers use requires-clauses), without -ffast-math (the kernel uses
INFINITY as a masking sentinel that finite-math would drop), and with
pybind11_add_module(NO_EXTRAS) to avoid pybind's default LTO+strip dropping the
module init symbol under -fvisibility=hidden.

A set of WIN32-guarded host-build fixes lets the backend also compile with the
clang-cl toolchain on Windows ROCm (Linux-only compiler/linker flags guarded,
three header-only infra sub-libraries changed from OBJECT to INTERFACE so the
linker language is determinable under Ninja+clang-cl, a getpid shim, metrics
symbol export, and excluding a POSIX-only thread test). The GPU device code is
unchanged by those guards.

Docs: the supported-platform matrix gains a ROCm/AMD row, and the vLLM and
SGLang quickstarts document PLATFORM=rocm beside PLATFORM=cuda.

This work was authored with the assistance of Claude, an AI assistant.

Test Plan:
Built and validated on real GPUs: Linux gfx90a (MI250X) and gfx1100 (Radeon
Pro W7800), and Windows gfx1201 (RX 9070 XT), gfx1101 (Radeon PRO V710), and
gfx1151 (Radeon 8060S), all on ROCm. All GPU tests run serially with one GPU
visible (HIP_VISIBLE_DEVICES=0).

Store/trans C++ surface plus unit tests:

cmake -S . -B build_rocm -DRUNTIME_ENVIRONMENT=rocm -DBUILD_UCM_STORE=ON \
  -DBUILD_UNIT_TESTS=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \
  -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_FLAGS="-Wno-error=unused-result"
cmake --build build_rocm -j16
HIP_VISIBLE_DEVICES=0 ctest --test-dir build_rocm -j1

79/80 pass on Linux. The copy-kernel correctness gates all pass:
UCTransUnitTest.{CopyDataWithCE,CopyDataWithSM,CopyDataBatchWithSM} (byte-exact
host<->device round-trip) and the UCPosixTrans*/UCCacheTransBuffer cases
(store H2D/D2H batch copy with readback). The one failure,
UCMetricsUT.ConcurrentUpdateAndCollect, is a pre-existing CPU-only
multi-threaded metrics counter test (untouched ucm/shared/metrics), unrelated
to the GPU backend.

Hamming kernel (against a ROCm PyTorch):

cmake -S . -B build_sparse -DRUNTIME_ENVIRONMENT=rocm -DBUILD_UCM_STORE=OFF \
  -DBUILD_UCM_SPARSE=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \
  -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ -DCMAKE_BUILD_TYPE=Release \
  -DPython_EXECUTABLE=<rocm-torch-python> -DCMAKE_CXX_FLAGS="-Wno-error=unused-result"
cmake --build build_sparse -j16 --target hamming
HIP_VISIBLE_DEVICES=0 HAMMING_DIR=<dir with hamming*.so> \
  python ucm/sparse/test/gsa/test_hamming_rocm_ref.py

The new test computes an independent CPU popcount reference for the paged
block-mode score: mla (no kv reduction) matches within fp16 rounding, gqa
(min over kv heads) matches exactly, and the output of two runs is
bit-identical.
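
For readers unfamiliar with the reference approach, a popcount-based Hamming score can be sketched in a few lines of Python. This is not the contents of test_hamming_rocm_ref.py: the function names, the packed-word layout, and the sample values are all assumptions for illustration, and the real kernel's paging and score definition may differ.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def hamming(a, b):
    """Hamming distance between two equal-length lists of packed words."""
    return sum(popcount(x ^ y) for x, y in zip(a, b))


def scores_mla(q_hash, k_blocks):
    # Single kv head: one score per block, no reduction.
    return [hamming(q_hash, k) for k in k_blocks]


def scores_gqa(q_hash, k_blocks_per_head):
    # k_blocks_per_head[h][b] is the hash of block b for kv head h;
    # the per-block score is the minimum over kv heads.
    per_head = [scores_mla(q_hash, head) for head in k_blocks_per_head]
    return [min(col) for col in zip(*per_head)]


q = [0b1010, 0xFF]
head0 = [[0b1010, 0xFF], [0b0101, 0x0F]]
head1 = [[0b1000, 0xFF], [0b1010, 0xF0]]
print(scores_mla(q, head0))           # [0, 8]
print(scores_gqa(q, [head0, head1]))  # [0, 4]
```

Integer popcount is exact on the CPU, so any mismatch in the gqa case would point at the kernel rather than at the reference.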

The NVIDIA path was checked to be unaffected: with PLATFORM=cuda the guards
select the original inline-PTX branch (compile-checked with nvcc 12.8 at
sm_80; the emitted PTX still contains the ld.global.cs / st.volatile.global
streaming ops), so the CUDA build compiles the same device code as before.


flesher0813 · 2026-06-12T01:36:31Z

        match PLATFORM:
            case "cuda":
                cmake_args += ["-DRUNTIME_ENVIRONMENT=cuda"]
+            case "rocm":


setup.py adds PLATFORM=rocm support, but the warning message still lists only cuda/ascend/ascend-a3/musa/maca. Please include rocm there as well.
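
One way to keep the warning from drifting out of sync with the accepted values is to build the message from the same list that is used for validation. The sketch below is an assumption about how that could look, not the actual setup.py code: `KNOWN_PLATFORMS`, `check_platform`, and the message wording are hypothetical, and only the platform names themselves come from this thread.

```python
import warnings

# Platform names as listed in this review thread, plus rocm.
KNOWN_PLATFORMS = ("cuda", "rocm", "ascend", "ascend-a3", "musa", "maca")


def check_platform(platform):
    """Return True for a known platform; otherwise warn and return False."""
    if platform in KNOWN_PLATFORMS:
        return True
    # The list in the message is derived from KNOWN_PLATFORMS, so adding a
    # backend updates the warning automatically.
    warnings.warn(
        f"Unsupported PLATFORM={platform!r}; expected one of: "
        + ", ".join(KNOWN_PLATFORMS)
    )
    return False


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ok = check_platform("tpu")
print(ok, caught[0].message)
```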

jeffdaily requested review from FangRun2, Infinite666, Tarrei, flesher0813, mag1c-h, wangwenxin0312, wuhuxiao and ygwpz as code owners June 11, 2026 22:34

jeffdaily added a commit to jeffdaily/moat that referenced this pull request Jun 11, 2026

UCM: lead -> pr-open (ModelEngine-Group/unified-cache-management#1021)

8eead7c

flesher0813 reviewed Jun 12, 2026

jeffdaily wants to merge 1 commit into ModelEngine-Group:develop from jeffdaily:moat-port
