Description
Problem Description
Following guidance in the HIP porting guide, HIP cpp extensions guide, and ROCm 6.4 release notes, I'm trying to modify some code that has been ported to HIP to check the warp size at runtime. There are existing kernels that use warp size as a template parameter, so I followed the general technique described in the reduction tutorial of using a switch to promote warp size to a compile time constant (they use some neat C++20 tricks that I alas cannot, however).
I'm using a MI200 (gfx90a) GPU right now, but I want to also be able to compile for RDNA. But if I try to compile my code with --offload-arch=gfx1100
I hit static asserts in amd_hip_cooperative_groups.h. Just instantiating a template that would have a tile size of 64 is a compilation error.
static assertion failed: Tile size is either not a power of 2 or greater than the wavefront size
In file included from rocm-dynamic-warp-size-test.hip:2:
In file included from /opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/hip_cooperative_groups.h:38:
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:660:17: error: static assertion failed due to requirement 'integral_constant::value': Tile size is either not a power of 2 or greater than the wavefront size
660 | static_assert(is_valid_tile_size::value,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:746:39: note: in instantiation of template class 'cooperative_groups::thread_block_tile_base<64>' requested here
746 | class thread_block_tile_type : public thread_block_tile_base,
| ^
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:843:43: note: in instantiation of template class 'cooperative_groups::thread_block_tile_type<64, cooperative_groups::thread_block>' requested here
843 | class thread_block_tile_internal : public thread_block_tile_type {
| ^
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:856:34: note: in instantiation of template class 'cooperative_groups::impl::thread_block_tile_internal<64, cooperative_groups::thread_block>' requested here
856 | class thread_block_tile : public impl::thread_block_tile_internal {
| ^
rocm-dynamic-warp-size-test.hip:24:7: note: in instantiation of template class 'cooperative_groups::thread_block_tile<64, cooperative_groups::thread_block>' requested here
24 | cg::tiled_partition(cg::this_thread_block());
| ^
rocm-dynamic-warp-size-test.hip:57:9: note: in instantiation of function template specialization 'kernel<64>' requested here
57 | kernel<<<1, block_size, 0, stream>>>(d_counter);
| ^
static assertion failed: Tiled partition with size > wavefront size. Currently not supported
In file included from rocm-dynamic-warp-size-test.hip:2:
In file included from /opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/hip_cooperative_groups.h:38:
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:899:17: error: static assertion failed due to requirement 'integral_constant::value': Tiled partition with size > wavefront size. Currently not supported
899 | static_assert(is_valid_tile_size::value,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
rocm-dynamic-warp-size-test.hip:24:11: note: in instantiation of function template specialization 'cooperative_groups::tiled_partition<64U, cooperative_groups::thread_block>' requested here
24 | cg::tiled_partition(cg::this_thread_block());
| ^
rocm-dynamic-warp-size-test.hip:57:9: note: in instantiation of function template specialization 'kernel<64>' requested here
57 | kernel<<<1, block_size, 0, stream>>>(d_counter);
| ^
2 errors generated when compiling for gfx1100.
I'm especially confused because I get the impression that RDNA is the main reason for insisting that warp size is dynamic. From the docs:
RDNA architectures have a configurable wavefront size. The native wavefront size is 32, but they can run in “CU mode”, which has an effective wavefront size of 64. This affects the number of resident wavefronts and blocks per compute Unit.
https://rocm.docs.amd.com/projects/HIP/en/latest/reference/hardware_features.html#id13
AFAICT this is something that's decided at runtime
RDNA architectures [warp size] can even differ between kernel launches, depending on whether they run in CU or WGP mode... Since warpSize can differ between devices, it can not be assumed to be a compile-time constant on the host
https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warp-size
It seems like HIP itself is still assuming that warp size is a compile-time constant. How is this supposed to work? I can't just trivially drop the template parameter and make the kernel look up the warp size itself because it's used for things like array sizes.
Operating System
Ubuntu 22.04.5 LTS (Jammy Jellyfish)
CPU
AMD Ryzen Threadripper PRO 7975WX 32-Cores
GPU
AMD Instinct MI210
ROCm Version
6.4
ROCm Component
clr
Steps to Reproduce
rocm-dynamic-warp-size-test.hip
// rocm-dynamic-warp-size-test.hip
#include <hip/hip_cooperative_groups.h>
#include "hip/hip_runtime.h"
namespace cg = cooperative_groups;
#define HIP_CHECK(val) \
{ \
hip_check_((val), __FILE__, __LINE__); \
}
class HipException : public std::runtime_error {
public:
HipException(const std::string& what) : runtime_error(what) {}
};
inline void hip_check_(hipError_t val, const char* file, int line) {
if (val != hipSuccess) {
throw HipException(
std::string(file) + ":" + std::to_string(line) + ": HIP error " +
std::to_string(val) + ": " + hipGetErrorString(val));
}
}
template <int warp_size>
__global__ void kernel(size_t* d_counter) {
cg::thread_block_tile<warp_size> warp_tile =
cg::tiled_partition<warp_size>(cg::this_thread_block());
const size_t lane_idx = warp_tile.thread_rank();
atomicAdd(d_counter, 1);
}
int main(int argc, char** argv) {
hipStream_t stream;
size_t h_counter;
size_t* d_counter;
HIP_CHECK(hipMalloc((void**)&d_counter, sizeof(size_t)));
HIP_CHECK(hipStreamCreate(&stream));
HIP_CHECK(hipStreamSynchronize(stream));
int device;
HIP_CHECK(hipGetDevice(&device));
int warp_size = 0;
HIP_CHECK(
hipDeviceGetAttribute(&warp_size, hipDeviceAttributeWarpSize, device));
switch (warp_size) {
case 32: {
constexpr int warp_size = 32;
constexpr int block_count = 1;
constexpr int block_size = warp_size;
kernel<warp_size><<<1, block_size, 0, stream>>>(d_counter);
} break;
case 64: {
constexpr int warp_size = 64;
constexpr int block_count = 1;
constexpr int block_size = warp_size;
kernel<warp_size><<<1, block_size, 0, stream>>>(d_counter);
} break;
default:
throw HipException(
"Expected warp size 32 or 64 but got " + std::to_string(warp_size));
}
HIP_CHECK(hipMemcpyAsync(
&h_counter, d_counter, sizeof(size_t), hipMemcpyDeviceToHost, stream));
HIP_CHECK(hipStreamSynchronize(stream));
printf("h_counter=%zu\n", h_counter);
HIP_CHECK(hipStreamDestroy(stream));
HIP_CHECK(hipFree(d_counter));
return 0;
}
dynamic_warp_size_test.sh
#!/bin/bash
# dynamic_warp_size_test.sh
set -euo pipefail
BASENAME=rocm-dynamic-warp-size-test
ARCH="$1"
/opt/rocm/lib/llvm/bin/clang++ \
-DHIP_ENABLE_WARP_SYNC_BUILTINS \
-D__HIP_ROCclr__=1 \
-g3 \
-ggdb3 \
--offload-arch="${ARCH}" \
-fPIE \
-O1 \
-std=gnu++17 \
-o "${BASENAME}.hip.o" \
-x hip \
-c "${BASENAME}.hip"
/opt/rocm/lib/llvm/bin/clang++ \
-O0 \
-g3 \
-ggdb3 \
--offload-arch=gfx90a \
--hip-link \
--rtlib=compiler-rt \
-unwindlib=libgcc \
"${BASENAME}.hip.o" \
-o "${BASENAME}" \
/opt/rocm/lib/libamdhip64.so.6.4.60400
"./${BASENAME}"
./dynamic_warp_size_test.sh gfx90a
./dynamic_warp_size_test.sh gfx1100
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response