
Unconditional errors result in dynamic invocations #649

Open
@kichappa

Description

Describe the bug

Any use of shfl_sync throws an error saying shfl_recurse is a dynamic function.
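
For context, the failure does not seem tied to the stream-compaction kernel below; a stripped-down sketch like the following should hit the same error. This is a hypothetical reduction on my part (the kernel name, and the assumption that the 64-bit value is what routes the call through the shfl_recurse fallback, are not from the original report):

using CUDA

# Minimal sketch: shuffle a 64-bit value across the warp.
# Assumption: Int64 is not handled by the native shfl intrinsic and goes
# through the shfl_recurse fallback, triggering the dynamic-invocation error.
function shuffle_kernel!(out)
    val = Int64(threadIdx().x)
    out[threadIdx().x] = CUDA.shfl_sync(0xffffffff, val, Int32(0))
    return
end

out = CUDA.zeros(Int64, 32)
@cuda threads=32 shuffle_kernel!(out)   # fails during kernel compilation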

To reproduce

The minimal working example (MWE) for this bug, attempting a stream compaction:

using CUDA

# define a new array of 64 elements and fill it with random ones and zeros
a = rand(0:1, 64)

a_gpu = CuArray(a)
b_gpu = CUDA.zeros(Int64, 64)
count = CUDA.zeros(Int64, 1)

function mykernel!(in, out, count)
    threadNum = threadIdx().x + blockDim().x * (blockIdx().x - 1) # 1-indexed
    warpNum = (threadIdx().x - 1) ÷ 32 # 0-indexed
    laneNum = (threadIdx().x - 1) % 32 # 0-indexed

    shared_count = CuDynamicSharedArray(Int64, 1)
    
    if threadNum == 1
        shared_count[1] = 0
    end
    sync_threads()

    if threadNum <= 64
        is_nonzero = in[threadNum] != 0
        mask = CUDA.vote_ballot_sync(0xffffffff, is_nonzero)
        warp_count = count_ones(mask)

        warp_offset = 0
        if laneNum == 0
            warp_offset = CUDA.atomic_add!(pointer(shared_count, 1), warp_count)
        end
        warp_offset = CUDA.shfl_sync(0xffffffff, warp_offset, Int32(0)) #<<<<< This is the BUG code.

        if is_nonzero
            index = count_ones(mask & ((UInt32(1) << laneNum) - UInt32(1))) + warp_offset
            out[index+1] = threadNum
        end
    end
    sync_threads()

    if threadIdx().x == 1
        CUDA.atomic_add!(CUDA.pointer(count), shared_count[1])
    end
    return
end

@cuda threads=64 blocks=1 shmem=sizeof(Int64) mykernel!(a_gpu, b_gpu, count)

println("nonzeros:$(collect(count))")
println(collect(b_gpu))
Manifest.toml

Package versions:
Status `~/.julia/environments/v1.11/Project.toml`
  [052768ef] CUDA v5.5.2

CUDA details:
CUDA runtime version: 12.6.0
CUDA driver version: 12.6.0
CUDA capability: 9.0.0

Expected behavior

The shuffle function should not throw an error, and all zeros in a should be removed when it is compacted into b.
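
A possible workaround (an untested sketch on my part, assuming the shfl_recurse fallback is only taken for values wider than 32 bits) is to keep the shuffled value as an Int32, which should be covered by the native shuffle intrinsic:

using CUDA

# Workaround sketch: shuffle a 32-bit value instead of an Int64.
# Assumption: Int32 is handled directly by the shfl intrinsic, so the
# shfl_recurse fallback (and the reported error) is never reached.
function shuffle32_kernel!(out)
    val = Int32(threadIdx().x)
    out[threadIdx().x] = CUDA.shfl_sync(0xffffffff, val, Int32(0))
    return
end

out = CUDA.zeros(Int32, 32)
@cuda threads=32 shuffle32_kernel!(out)
println(collect(out))

Applied to the MWE above, that would mean declaring warp_offset as Int32(0) and wrapping the atomic_add! result in Int32(...) before the shfl_sync call.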

Version info

Details on Julia:

Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × Intel(R) Xeon(R) Platinum 8462Y+
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, sapphirerapids)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)

Details on CUDA:

CUDA driver 12.6
NVIDIA driver 550.90.7

CUDA libraries: 
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+550.90.7

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0

Toolchain:
- Julia: 1.11.1
- LLVM: 16.0.6

1 device:
  0: NVIDIA H100 80GB HBM3 (sm_90, 77.409 GiB / 79.647 GiB available)

Labels

enhancement (New feature or request)
