[Issue]: How to actually build for dynamic warp size (cooperative groups `static_assert`)

### Problem Description

Following guidance in the [HIP porting guide](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_porting_guide.html#warpsize), [HIP cpp extensions guide](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warp-size), and [ROCm 6.4 release notes](https://github.com/ROCm/ROCm/releases/tag/rocm-6.4.0#:~:text=AMDGPU%20wavefront%20size%20compiler%20macro%20deprecation), I'm trying to modify some code that has been ported to HIP to check the warp size at runtime. There are existing kernels that use warp size as a template parameter, so I followed the general technique described in the [reduction tutorial](https://rocm.docs.amd.com/projects/HIP/en/docs-develop/tutorial/reduction.html) of using a switch to promote warp size to a compile time constant (they use some neat C++20 tricks that I alas cannot, however).

I'm using a MI200 (gfx90a) GPU right now, but I want to also be able to compile for RDNA. But if I try to compile my code with `--offload-arch=gfx1100` I hit static asserts in amd_hip_cooperative_groups.h. Just instantiating a template that would have a tile size of 64 is a compilation error.

<details>
<summary><code>static assertion failed: Tile size is either not a power of 2 or greater than the wavefront size</code></summary>
<code>
In file included from rocm-dynamic-warp-size-test.hip:2:
In file included from /opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/hip_cooperative_groups.h:38:
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:660:17: error: static assertion failed due to requirement 'integral_constant<bool, false>::value': Tile size is either not a power of 2 or greater than the wavefront size
  660 |   static_assert(is_valid_tile_size<size>::value,
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:746:39: note: in instantiation of template class 'cooperative_groups::thread_block_tile_base<64>' requested here
  746 | class thread_block_tile_type : public thread_block_tile_base<tileSize>,
      |                                       ^
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:843:43: note: in instantiation of template class 'cooperative_groups::thread_block_tile_type<64, cooperative_groups::thread_block>' requested here
  843 | class thread_block_tile_internal : public thread_block_tile_type<size, ParentCGTy> {
      |                                           ^
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:856:34: note: in instantiation of template class 'cooperative_groups::impl::thread_block_tile_internal<64, cooperative_groups::thread_block>' requested here
  856 | class thread_block_tile : public impl::thread_block_tile_internal<size, ParentCGTy> {
      |                                  ^
rocm-dynamic-warp-size-test.hip:24:7: note: in instantiation of template class 'cooperative_groups::thread_block_tile<64, cooperative_groups::thread_block>' requested here
   24 |       cg::tiled_partition<warp_size>(cg::this_thread_block());
      |       ^
rocm-dynamic-warp-size-test.hip:57:9: note: in instantiation of function template specialization 'kernel<64>' requested here
   57 |         kernel<warp_size><<<1, block_size, 0, stream>>>(d_counter);
      |         ^
</code>
</details>

<details><summary><code>static assertion failed: Tiled partition with size > wavefront size. Currently not supported</code></summary>
<code>
In file included from rocm-dynamic-warp-size-test.hip:2:
In file included from /opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/hip_cooperative_groups.h:38:
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:899:17: error: static assertion failed due to requirement 'integral_constant<bool, false>::value': Tiled partition with size > wavefront size. Currently not supported
  899 |   static_assert(is_valid_tile_size<size>::value,
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
rocm-dynamic-warp-size-test.hip:24:11: note: in instantiation of function template specialization 'cooperative_groups::tiled_partition<64U, cooperative_groups::thread_block>' requested here
   24 |       cg::tiled_partition<warp_size>(cg::this_thread_block());
      |           ^
rocm-dynamic-warp-size-test.hip:57:9: note: in instantiation of function template specialization 'kernel<64>' requested here
   57 |         kernel<warp_size><<<1, block_size, 0, stream>>>(d_counter);
      |         ^
2 errors generated when compiling for gfx1100.
</code>
</details>

I'm especially confused because I get the impression that RDNA is the main reason for insisting that warp size is dynamic. From the docs:

> RDNA architectures have a configurable wavefront size. The native wavefront size is 32, but they can run in “CU mode”, which has an effective wavefront size of 64. This affects the number of resident wavefronts and blocks per compute Unit.

https://rocm.docs.amd.com/projects/HIP/en/latest/reference/hardware_features.html#id13

AFAICT this is something that's decided at runtime

> RDNA architectures [warp size] can even differ between kernel launches, depending on whether they run in CU or WGP mode... Since warpSize can differ between devices, it can not be assumed to be a compile-time constant on the host

https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warp-size

It seems like HIP itself is still assuming that warp size is a compile-time constant. How is this supposed to work? I can't just trivially drop the template parameter and make the kernel look up the warp size itself because it's used for things like array sizes.

### Operating System

Ubuntu 22.04.5 LTS (Jammy Jellyfish)

### CPU

AMD Ryzen Threadripper PRO 7975WX 32-Cores

### GPU

AMD Instinct MI210

### ROCm Version

6.4

### ROCm Component

clr

### Steps to Reproduce

<details><summary><code>rocm-dynamic-warp-size-test.hip</code></summary>

```c++
// rocm-dynamic-warp-size-test.hip
#include <hip/hip_cooperative_groups.h>

#include "hip/hip_runtime.h"

namespace cg = cooperative_groups;

#define HIP_CHECK(val)                     \
  {                                        \
    hip_check_((val), __FILE__, __LINE__); \
  }

class HipException : public std::runtime_error {
 public:
  HipException(const std::string& what) : runtime_error(what) {}
};

inline void hip_check_(hipError_t val, const char* file, int line) {
  if (val != hipSuccess) {
    throw HipException(
        std::string(file) + ":" + std::to_string(line) + ": HIP error " +
        std::to_string(val) + ": " + hipGetErrorString(val));
  }
}

template <int warp_size>
__global__ void kernel(size_t* d_counter) {
  cg::thread_block_tile<warp_size> warp_tile =
      cg::tiled_partition<warp_size>(cg::this_thread_block());
  const size_t lane_idx = warp_tile.thread_rank();
  atomicAdd(d_counter, 1);
}

int main(int argc, char** argv) {
  hipStream_t stream;
  size_t h_counter;
  size_t* d_counter;

  HIP_CHECK(hipMalloc((void**)&d_counter, sizeof(size_t)));
  HIP_CHECK(hipStreamCreate(&stream));

  HIP_CHECK(hipStreamSynchronize(stream));

  int device;
  HIP_CHECK(hipGetDevice(&device));
  int warp_size = 0;
  HIP_CHECK(
      hipDeviceGetAttribute(&warp_size, hipDeviceAttributeWarpSize, device));

  switch (warp_size) {
    case 32: {
      constexpr int warp_size = 32;
      constexpr int block_count = 1;
      constexpr int block_size = warp_size;
      kernel<warp_size><<<1, block_size, 0, stream>>>(d_counter);
    } break;
    case 64: {
      constexpr int warp_size = 64;
      constexpr int block_count = 1;
      constexpr int block_size = warp_size;
      kernel<warp_size><<<1, block_size, 0, stream>>>(d_counter);
    } break;
    default:
      throw HipException(
          "Expected warp size 32 or 64 but got " + std::to_string(warp_size));
  }

  HIP_CHECK(hipMemcpyAsync(
      &h_counter, d_counter, sizeof(size_t), hipMemcpyDeviceToHost, stream));
  HIP_CHECK(hipStreamSynchronize(stream));
  printf("h_counter=%zu\n", h_counter);

  HIP_CHECK(hipStreamDestroy(stream));
  HIP_CHECK(hipFree(d_counter));
  return 0;
}

```

</details>

<details><summary><code>dynamic_warp_size_test.sh</code></summary>

```bash
#!/bin/bash
# dynamic_warp_size_test.sh

set -euo pipefail

BASENAME=rocm-dynamic-warp-size-test
ARCH="$1"

/opt/rocm/lib/llvm/bin/clang++ \
    -DHIP_ENABLE_WARP_SYNC_BUILTINS \
    -D__HIP_ROCclr__=1 \
    -g3 \
    -ggdb3 \
    --offload-arch="${ARCH}" \
    -fPIE \
    -O1 \
    -std=gnu++17 \
    -o "${BASENAME}.hip.o" \
    -x hip \
    -c "${BASENAME}.hip"

/opt/rocm/lib/llvm/bin/clang++ \
   -O0 \
   -g3 \
   -ggdb3 \
   --offload-arch=gfx90a \
   --hip-link \
   --rtlib=compiler-rt \
   -unwindlib=libgcc \
   "${BASENAME}.hip.o" \
   -o "${BASENAME}" \
   /opt/rocm/lib/libamdhip64.so.6.4.60400

"./${BASENAME}"
```

</details>

```shell
./dynamic_warp_size_test.sh gfx90a
./dynamic_warp_size_test.sh gfx1100
```



### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

_No response_

### Additional Information

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Issue]: How to actually build for dynamic warp size (cooperative groups `static_assert`) #154

Problem Description

Operating System

CPU

GPU

ROCm Version

ROCm Component

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

Additional Information

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Issue]: How to actually build for dynamic warp size (cooperative groups static_assert) #154

Description

Problem Description

Operating System

CPU

GPU

ROCm Version

ROCm Component

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

Additional Information

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

[Issue]: How to actually build for dynamic warp size (cooperative groups `static_assert`) #154