Skip to content

[Issue]: How to actually build for dynamic warp size (cooperative groups static_assert) #154

Open
@GMNGeoffrey

Description

@GMNGeoffrey

Problem Description

Following guidance in the HIP porting guide, HIP cpp extensions guide, and ROCm 6.4 release notes, I'm trying to modify some code that has been ported to HIP to check the warp size at runtime. There are existing kernels that use warp size as a template parameter, so I followed the general technique described in the reduction tutorial of using a switch to promote warp size to a compile time constant (they use some neat C++20 tricks that I alas cannot, however).

I'm using a MI200 (gfx90a) GPU right now, but I want to also be able to compile for RDNA. But if I try to compile my code with --offload-arch=gfx1100 I hit static asserts in amd_hip_cooperative_groups.h. Just instantiating a template that would have a tile size of 64 is a compilation error.

static assertion failed: Tile size is either not a power of 2 or greater than the wavefront size In file included from rocm-dynamic-warp-size-test.hip:2: In file included from /opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/hip_cooperative_groups.h:38: /opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:660:17: error: static assertion failed due to requirement 'integral_constant::value': Tile size is either not a power of 2 or greater than the wavefront size 660 | static_assert(is_valid_tile_size::value, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:746:39: note: in instantiation of template class 'cooperative_groups::thread_block_tile_base<64>' requested here 746 | class thread_block_tile_type : public thread_block_tile_base, | ^ /opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:843:43: note: in instantiation of template class 'cooperative_groups::thread_block_tile_type<64, cooperative_groups::thread_block>' requested here 843 | class thread_block_tile_internal : public thread_block_tile_type { | ^ /opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:856:34: note: in instantiation of template class 'cooperative_groups::impl::thread_block_tile_internal<64, cooperative_groups::thread_block>' requested here 856 | class thread_block_tile : public impl::thread_block_tile_internal { | ^ rocm-dynamic-warp-size-test.hip:24:7: note: in instantiation of template class 'cooperative_groups::thread_block_tile<64, cooperative_groups::thread_block>' requested here 24 | cg::tiled_partition(cg::this_thread_block()); | ^ rocm-dynamic-warp-size-test.hip:57:9: note: in instantiation of function template specialization 'kernel<64>' requested here 57 | kernel<<<1, block_size, 0, stream>>>(d_counter); | ^
static assertion failed: Tiled partition with size > wavefront size. Currently not supported In file included from rocm-dynamic-warp-size-test.hip:2: In file included from /opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/hip_cooperative_groups.h:38: /opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:899:17: error: static assertion failed due to requirement 'integral_constant::value': Tiled partition with size > wavefront size. Currently not supported 899 | static_assert(is_valid_tile_size::value, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ rocm-dynamic-warp-size-test.hip:24:11: note: in instantiation of function template specialization 'cooperative_groups::tiled_partition<64U, cooperative_groups::thread_block>' requested here 24 | cg::tiled_partition(cg::this_thread_block()); | ^ rocm-dynamic-warp-size-test.hip:57:9: note: in instantiation of function template specialization 'kernel<64>' requested here 57 | kernel<<<1, block_size, 0, stream>>>(d_counter); | ^ 2 errors generated when compiling for gfx1100.

I'm especially confused because I get the impression that RDNA is the main reason for insisting that warp size is dynamic. From the docs:

RDNA architectures have a configurable wavefront size. The native wavefront size is 32, but they can run in “CU mode”, which has an effective wavefront size of 64. This affects the number of resident wavefronts and blocks per compute Unit.

https://rocm.docs.amd.com/projects/HIP/en/latest/reference/hardware_features.html#id13

AFAICT this is something that's decided at runtime

RDNA architectures [warp size] can even differ between kernel launches, depending on whether they run in CU or WGP mode... Since warpSize can differ between devices, it can not be assumed to be a compile-time constant on the host

https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warp-size

It seems like HIP itself is still assuming that warp size is a compile-time constant. How is this supposed to work? I can't just trivially drop the template parameter and make the kernel look up the warp size itself because it's used for things like array sizes.

Operating System

Ubuntu 22.04.5 LTS (Jammy Jellyfish)

CPU

AMD Ryzen Threadripper PRO 7975WX 32-Cores

GPU

AMD Instinct MI210

ROCm Version

6.4

ROCm Component

clr

Steps to Reproduce

rocm-dynamic-warp-size-test.hip
// rocm-dynamic-warp-size-test.hip
#include <hip/hip_cooperative_groups.h>

#include "hip/hip_runtime.h"

namespace cg = cooperative_groups;

#define HIP_CHECK(val)                     \
  {                                        \
    hip_check_((val), __FILE__, __LINE__); \
  }

class HipException : public std::runtime_error {
 public:
  HipException(const std::string& what) : runtime_error(what) {}
};

inline void hip_check_(hipError_t val, const char* file, int line) {
  if (val != hipSuccess) {
    throw HipException(
        std::string(file) + ":" + std::to_string(line) + ": HIP error " +
        std::to_string(val) + ": " + hipGetErrorString(val));
  }
}

template <int warp_size>
__global__ void kernel(size_t* d_counter) {
  cg::thread_block_tile<warp_size> warp_tile =
      cg::tiled_partition<warp_size>(cg::this_thread_block());
  const size_t lane_idx = warp_tile.thread_rank();
  atomicAdd(d_counter, 1);
}

int main(int argc, char** argv) {
  hipStream_t stream;
  size_t h_counter;
  size_t* d_counter;

  HIP_CHECK(hipMalloc((void**)&d_counter, sizeof(size_t)));
  HIP_CHECK(hipStreamCreate(&stream));

  HIP_CHECK(hipStreamSynchronize(stream));

  int device;
  HIP_CHECK(hipGetDevice(&device));
  int warp_size = 0;
  HIP_CHECK(
      hipDeviceGetAttribute(&warp_size, hipDeviceAttributeWarpSize, device));

  switch (warp_size) {
    case 32: {
      constexpr int warp_size = 32;
      constexpr int block_count = 1;
      constexpr int block_size = warp_size;
      kernel<warp_size><<<1, block_size, 0, stream>>>(d_counter);
    } break;
    case 64: {
      constexpr int warp_size = 64;
      constexpr int block_count = 1;
      constexpr int block_size = warp_size;
      kernel<warp_size><<<1, block_size, 0, stream>>>(d_counter);
    } break;
    default:
      throw HipException(
          "Expected warp size 32 or 64 but got " + std::to_string(warp_size));
  }

  HIP_CHECK(hipMemcpyAsync(
      &h_counter, d_counter, sizeof(size_t), hipMemcpyDeviceToHost, stream));
  HIP_CHECK(hipStreamSynchronize(stream));
  printf("h_counter=%zu\n", h_counter);

  HIP_CHECK(hipStreamDestroy(stream));
  HIP_CHECK(hipFree(d_counter));
  return 0;
}
dynamic_warp_size_test.sh
#!/bin/bash
# dynamic_warp_size_test.sh

set -euo pipefail

BASENAME=rocm-dynamic-warp-size-test
ARCH="$1"

/opt/rocm/lib/llvm/bin/clang++ \
    -DHIP_ENABLE_WARP_SYNC_BUILTINS \
    -D__HIP_ROCclr__=1 \
    -g3 \
    -ggdb3 \
    --offload-arch="${ARCH}" \
    -fPIE \
    -O1 \
    -std=gnu++17 \
    -o "${BASENAME}.hip.o" \
    -x hip \
    -c "${BASENAME}.hip"

/opt/rocm/lib/llvm/bin/clang++ \
   -O0 \
   -g3 \
   -ggdb3 \
   --offload-arch=gfx90a \
   --hip-link \
   --rtlib=compiler-rt \
   -unwindlib=libgcc \
   "${BASENAME}.hip.o" \
   -o "${BASENAME}" \
   /opt/rocm/lib/libamdhip64.so.6.4.60400

"./${BASENAME}"
./dynamic_warp_size_test.sh gfx90a
./dynamic_warp_size_test.sh gfx1100

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions