[Issue]: `maybe_undef` on `cooperative_groups::thread_block_tile_base::shfl` fails causing miscompile

### Problem Description

The argument to some GPU functions is frequently only initialized in some threads. Clang created the `maybe_undef` annotation for [precisely this use case](https://discourse.llvm.org/t/llvm-dev-rfc-d130224-introduce-maybe-undef-attribute-for-function-arguments-which-accepts-undef-values/63980). Calling `cooperative_groups::thread_block_tile_base::shfl` with a maybe-uninitialized argument results in a miscompile (at least vs the user intent. Not sure if the library or the compiler is at fault) where the compiler assumes that all threads have to take a branch where the variable is initialized.


### Operating System

22.04.5 LTS (Jammy Jellyfish)

### CPU

AMD Ryzen Threadripper PRO 7975WX 32-Cores

### GPU

AMD Instinct MI210

### ROCm Version

ROCm 6.2.4

### ROCm Component

clr

### Steps to Reproduce

```C++
#include "hip/hip_runtime.h"
#include <hip/hip_cooperative_groups.h>

namespace cg = cooperative_groups;

#define HIP_CHECK(val) \
  { hip_check_((val), __FILE__, __LINE__); }

class HipException : public std::runtime_error {
 public:
  HipException(const std::string& what) : runtime_error(what) {}
};

inline void hip_check_(hipError_t val, const char* file, int line) {
  if (val != hipSuccess) {
    throw HipException(std::string(file) + ":" + std::to_string(line) + ": HIP error " +
                        std::to_string(val) + ": " + hipGetErrorString(val));
  }
}

template <int warp_size>
__global__ void get_kernel(size_t* d_block_count) {
  cg::thread_block_tile<warp_size> warp_tile =
      cg::tiled_partition<warp_size>(cg::this_thread_block());
  const size_t lane_idx = warp_tile.thread_rank();

  size_t warp_position;
  // size_t warp_position = 0;
  printf(
      "Before branch: lane_idx=%zu, recomputed_lane_idx=%u, d_block_count=%zu\n",
      lane_idx, warp_tile.thread_rank(), *d_block_count);
  if (lane_idx == 0) {
    warp_position = atomicAdd(d_block_count, 1);
    printf(
      "Incremented d_block_count: lane_idx=%zu, recomputed_lane_idx=%u, d_block_count=%zu, warp_position=%zu\n",
      lane_idx, warp_tile.thread_rank(), *d_block_count, warp_position);
  }
  warp_position = warp_tile.shfl(warp_position, 0);
  // warp_position used as index into an array or something
}

int main(int argc, char** argv) {
  hipStream_t stream;
  size_t h_block_count;
  size_t* d_block_count;

  HIP_CHECK(hipMalloc((void**)&d_block_count, sizeof(size_t)));
  HIP_CHECK(hipStreamCreate(&stream));

  HIP_CHECK(hipStreamSynchronize(stream));

  constexpr int warp_size = 64;
  constexpr int block_count = 1;
  constexpr int block_size = 64;
  get_kernel<warp_size><<<block_count, block_size, 0, stream>>>(d_block_count);

  HIP_CHECK(hipMemcpyAsync(&h_block_count, d_block_count, sizeof(size_t),
                              hipMemcpyDeviceToHost, stream));
  HIP_CHECK(hipStreamSynchronize(stream));
  printf("h_block_count=%zu\n", h_block_count);
  assert(h_block_count == block_count);

  HIP_CHECK(hipStreamDestroy(stream));
  HIP_CHECK(hipFree(d_block_count));
  return 0;
}
```

Execute with

```bash
#!/bin/bash

/opt/rocm/lib/llvm/bin/clang++ \
    -DHIP_ENABLE_WARP_SYNC_BUILTINS \
    -D__HIP_ROCclr__=1 \
    -g3 \
    -ggdb3 \
    --offload-arch=gfx90a \
    -fPIE \
    -O1 \
    -std=gnu++17 \
    -o miscompile_test.hip.o \
    -x hip \
    -c miscompile_test.hip


/opt/rocm/lib/llvm/bin/clang++ \
    -O0 \
    -g3 \
    -ggdb3 \
    --offload-arch=gfx90a \
    --hip-link \
    --rtlib=compiler-rt \
    -unwindlib=libgcc \
    miscompile_test.hip.o \
    -o miscompile_test \
    /opt/rocm/lib/libamdhip64.so.6.2.60204

./miscompile_test
```

### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

<details>

```
ROCk module version 6.10.5 is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen Threadripper PRO 7975WX 32-Cores
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen Threadripper PRO 7975WX 32-Cores
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   5352
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            64
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Memory Properties:
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    263382908(0xfb2e77c) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    263382908(0xfb2e77c) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    263382908(0xfb2e77c) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx90a
  Uuid:                    GPU-29273e62ca00396c
  Marketing Name:          AMD Instinct MI210
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      8192(0x2000) KB
  Chip ID:                 29711(0x740f)
  ASIC Revision:           1(0x1)
  Cacheline Size:          128(0x80)
  Max Clock Freq. (MHz):   1700
  BDFID:                   58112
  Internal Node ID:        1
  Compute Unit:            104
  SIMDs per CU:            4
  Shader Engines:          8
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 92
  SDMA engine uCode::      9
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 3
*******
  Name:                    gfx90a
  Uuid:                    GPU-beaf2359215679f8
  Marketing Name:          AMD Instinct MI210
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    2
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      8192(0x2000) KB
  Chip ID:                 29711(0x740f)
  ASIC Revision:           1(0x1)
  Cacheline Size:          128(0x80)
  Max Clock Freq. (MHz):   1700
  BDFID:                   49920
  Internal Node ID:        2
  Compute Unit:            104
  SIMDs per CU:            4
  Shader Engines:          8
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 92
  SDMA engine uCode::      9
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 4
*******
  Name:                    gfx90a
  Uuid:                    GPU-0c409370be5107fb
  Marketing Name:          AMD Instinct MI210
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    3
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      8192(0x2000) KB
  Chip ID:                 29711(0x740f)
  ASIC Revision:           1(0x1)
  Cacheline Size:          128(0x80)
  Max Clock Freq. (MHz):   1700
  BDFID:                   768
  Internal Node ID:        3
  Compute Unit:            104
  SIMDs per CU:            4
  Shader Engines:          8
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 92
  SDMA engine uCode::      9
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 5
*******
  Name:                    gfx90a
  Uuid:                    GPU-3644c2b4f8c805bb
  Marketing Name:          AMD Instinct MI210
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    4
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      8192(0x2000) KB
  Chip ID:                 29711(0x740f)
  ASIC Revision:           1(0x1)
  Cacheline Size:          128(0x80)
  Max Clock Freq. (MHz):   1700
  BDFID:                   1536
  Internal Node ID:        4
  Compute Unit:            104
  SIMDs per CU:            4
  Shader Engines:          8
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 92
  SDMA engine uCode::      9
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***
```

</details>

### Additional Information

With normal C++ semantics, if `warp_position` is uninitialized, the only way it gets defined before it gets shuffled is if it enters the if block, so the compiler concludes that it has to have entered the `if` block and all threads do so.

The `_shfl` operations have `__attribute__((maybe_undef))`:

https://github.com/ROCm/clr/blob/3c863dad9146be24ccec93816c3cb0752d40d9ca/hipamd/include/hip/amd_detail/amd_warp_functions.h#L130-L136

The groups-level version don't though:

https://github.com/ROCm/clr/blob/3c863dad9146be24ccec93816c3cb0752d40d9ca/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups.h#L451-L474

I'm not really clear on whether that's something that would be propagated to the calling function.

As an aside, the documentation also says that the value argument can only be a 32-bit int or float, which is a bit weird, especially since it's templated and has a `static_assert` that just checks if it's integral or float, and delegates to the warp functions that support way more types than that. That documentation was [just added](https://github.com/ROCm/clr/commit/3c863dad9146be24ccec93816c3cb0752d40d9ca). Seems maybe wrong? Changing `warp_position` to `int32_t` doesn't fix anything though.

I tried throwing `MAYBE_UNDEF` (the HIP Macro around the clang attribute) on `thread_block_tile_base::shfl` (confirmed that this is the one getting called via printf and the debugger), but that doesn't fix the issue.

Calling `__shfl` directly like `__shfl(warp_position, 0, 64)` also fixes the issue. I'm not sure whether this is a bug in ROCm or [hip] clang.

	__device__
	inline
	int __shfl(MAYBE_UNDEF int var, int src_lane, int width = warpSize) {
	int self = __lane_id();
	int index = (src_lane & (width - 1)) + (self & ~(width-1));
	return __builtin_amdgcn_ds_bpermute(index<<2, var);
	}

	/** \brief Shuffle operation on group level.
	*
	* \details Exchanging variables between threads without use of shared memory.
	* Shuffle operation is a direct copy of ``var`` from ``srcRank``
	* thread ID of group.
	*
	* \tparam T The type can be a 32-bit integer or single-precision
	* floating point.
	* \param var [in] The source variable to copy. Only the srcRank thread ID of
	* group is copied to other threads.
	* \param srcRank [in] The source thread ID of the group for copy.
	*/
	template <class T>
	__CG_QUALIFIER__ T shfl(T var, int srcRank) const {
	static_assert(is_valid_type<T>::value, "Neither an integer or float type.");

	srcRank = srcRank % static_cast<int>(size());

	int lane = (size() == __AMDGCN_WAVEFRONT_SIZE) ? srcRank
	: (__AMDGCN_WAVEFRONT_SIZE == 64) ? __fns64(coalesced_info.member_mask, 0, (srcRank + 1))
	: __fns32(coalesced_info.member_mask, 0, (srcRank + 1));

	return __shfl(var, lane, __AMDGCN_WAVEFRONT_SIZE);
	}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Issue]: `maybe_undef` on `cooperative_groups::thread_block_tile_base::shfl` fails causing miscompile #146

Problem Description

Operating System

CPU

GPU

ROCm Version

ROCm Component

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

Additional Information

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Issue]: maybe_undef on cooperative_groups::thread_block_tile_base::shfl fails causing miscompile #146

Description

Problem Description

Operating System

CPU

GPU

ROCm Version

ROCm Component

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

Additional Information

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

[Issue]: `maybe_undef` on `cooperative_groups::thread_block_tile_base::shfl` fails causing miscompile #146