### Describe the bug

Any use of `shfl_sync` throws an error saying `shfl_recurse` is a dynamic function.
### To reproduce

The Minimal Working Example (MWE) for this bug, attempting a stream compaction:
```julia
using CUDA

# define a new array of 64 elements and fill it with random ones and zeros
a = rand(0:1, 64)
a_gpu = CuArray(a)
b_gpu = CUDA.zeros(Int64, 64)
count = CUDA.zeros(Int64, 1)

function mykernel!(in, out, count)
    threadNum = threadIdx().x + blockDim().x * (blockIdx().x - 1) # 1-indexed
    warpNum = (threadIdx().x - 1) ÷ 32 # 0-indexed
    laneNum = (threadIdx().x - 1) % 32 # 0-indexed
    shared_count = CuDynamicSharedArray(Int64, 1)
    if threadNum == 1
        shared_count[1] = 0
    end
    sync_threads()
    if threadNum <= 64
        is_nonzero = in[threadNum] != 0
        mask = CUDA.vote_ballot_sync(0xffffffff, is_nonzero)
        warp_count = count_ones(mask)
        warp_offset = 0
        if laneNum == 0
            warp_offset = CUDA.atomic_add!(pointer(shared_count, 1), warp_count)
        end
        warp_offset = CUDA.shfl_sync(0xffffffff, warp_offset, Int32(0)) # <<<<< This is the BUG code.
        if is_nonzero
            index = count_ones(mask & ((UInt32(1) << laneNum) - 1)) + warp_offset
            out[index + 1] = threadNum
        end
    end
    sync_threads()
    if threadIdx().x == 1
        CUDA.atomic_add!(CUDA.pointer(count), shared_count[1])
    end
    return
end

@cuda threads=64 blocks=1 shmem=sizeof(Int64) mykernel!(a_gpu, b_gpu, count)
println("nonzeros: $(collect(count))")
println(collect(b_gpu))
```
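For what it's worth, the index arithmetic the kernel relies on (a ballot mask plus an exclusive per-lane bit count) can be checked on the CPU, independently of the shuffle. This is just an illustrative sketch; `compact_indices` is a made-up helper, not part of CUDA.jl:

```julia
# CPU sketch of the warp-level compaction index math used in the kernel.
# `ballot` plays the role of vote_ballot_sync's result: bit i is set when
# lane i (0-indexed) holds a nonzero element.
function compact_indices(flags::Vector{Bool})
    @assert length(flags) == 32
    ballot = UInt32(0)
    for (lane, f) in enumerate(flags)
        f && (ballot |= UInt32(1) << (lane - 1))
    end
    # exclusive prefix count for each lane: number of set bits strictly below it,
    # i.e. count_ones(ballot & ((UInt32(1) << lane) - 1)) as in the kernel
    return [count_ones(ballot & ((UInt32(1) << lane) - 1)) for lane in 0:31]
end

# lanes 0, 2, and 3 hold nonzero elements
flags = falses(32); flags[1] = true; flags[3] = true; flags[4] = true
offsets = compact_indices(collect(flags))
```

Each active lane's offset is the number of active lanes before it, which is exactly the slot it should write into within the warp's chunk of `out`.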
### Manifest.toml

Package versions:

```
Status `~/.julia/environments/v1.11/Project.toml`
[052768ef] CUDA v5.5.2
```
CUDA details:

```
CUDA runtime version: 12.6.0
CUDA driver version: 12.6.0
CUDA capability: 9.0.0
```
### Expected behavior

The shuffle call should not throw an error, and all zeros in `a` should be removed when the data is compacted into `b`.
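To make the expectation concrete, here is a plain-CPU reference for the result the kernel should produce (`compact_reference` is a hypothetical helper for illustration, not part of the MWE): `out` holds the 1-based indices of the nonzero entries, and the count equals the number of nonzeros.

```julia
# CPU reference for the kernel's intended output: the kernel writes each
# nonzero element's thread index (1-based) into the next free slot of `out`
# and accumulates the total in `count`.
function compact_reference(a::Vector{<:Integer})
    out = zeros(Int, length(a))
    n = 0
    for (i, v) in enumerate(a)
        if v != 0
            n += 1
            out[n] = i
        end
    end
    return out, n
end

out, n = compact_reference([0, 1, 1, 0, 1])
```

(The GPU version is only guaranteed to match this ordering within a warp; the reference just states which indices must appear and how many.)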
### Version info

Details on Julia:

```
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × Intel(R) Xeon(R) Platinum 8462Y+
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, sapphirerapids)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)
```
Details on CUDA:

```
CUDA driver 12.6
NVIDIA driver 550.90.7

CUDA libraries:
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+550.90.7

Julia packages:
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0

Toolchain:
- Julia: 1.11.1
- LLVM: 16.0.6

1 device:
  0: NVIDIA H100 80GB HBM3 (sm_90, 77.409 GiB / 79.647 GiB available)
```