Description
As reported in luraess/JuliaGPUPerf#2 and luraess/JuliaGPUPerf#3, using the ^ operation within GPU kernels significantly degrades performance.
The Int32 on Int32 case (luraess/JuliaGPUPerf#2) may have been fixed, following a suggestion from @vchuravy, by using:
my_pow(x, p) = ccall("llvm.powi.f32.i32", llvmcall, Float32, (Float32, Int32), x, p)
#[...]
A[ix,iy] = B[ix,iy] + s*my_pow(C[ix,iy], pow_int)
But the Float32 and Float64 cases are still lagging behind.
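
For reference, here is a minimal CPU-only sketch for sanity-checking that the my_pow wrapper gives the same result as ^ for an integer exponent. It says nothing about GPU performance. The array sizes and values are made up for illustration, and it assumes a recent Julia (roughly 1.8 or later, built on LLVM 13+), where the intrinsic carries the .i32 suffix; older versions may need "llvm.powi.f32" instead.

```julia
# Wrapper from the workaround: calls LLVM's powi intrinsic directly.
# Assumption: the intrinsic is named "llvm.powi.f32.i32" (LLVM >= 13).
my_pow(x::Float32, p::Int32) =
    ccall("llvm.powi.f32.i32", llvmcall, Float32, (Float32, Int32), x, p)

# Hypothetical small inputs, chosen only for this check.
B = ones(Float32, 4, 4)
C = fill(2.0f0, 4, 4)
s = 0.5f0
pow_int = Int32(3)

# Reference result using the generic ^ operator.
A_ref = B .+ s .* C .^ pow_int

# Same update using the intrinsic-based wrapper.
A_new = B .+ s .* my_pow.(C, pow_int)

@assert A_new ≈ A_ref
```

The same comparison could be repeated inside a kernel to confirm that the wrapper is a drop-in replacement before timing it.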