Julia 1.11 introduces BFloat16 codegen support, so let's use this issue to track supporting that here.
Right now it looks like we support the type, but the arithmetic still goes through Float32 conversions:
```julia
julia> using CUDA, BFloat16s

julia> BFloat16s.llvm_storage
true

julia> BFloat16s.llvm_arithmetic
true

julia> function kernel(x)
           @inbounds x[threadIdx().x] += BFloat16(1)
           return
       end

julia> x = CuArray{BFloat16}(undef, 1024);

julia> @device_code_llvm debuginfo=:none @cuda kernel(x)
; PTX CompilerJob of MethodInstance for kernel(::CuDeviceVector{BFloat16, 1}) for sm_89
define ptx_kernel void @_Z6kernel13CuDeviceArrayI8BFloat16Li1ELi1EE({ i64, i32 } %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0) local_unnamed_addr {
conversion:
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0
  %1 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %2 = bitcast i8 addrspace(1)* %.fca.0.extract to bfloat addrspace(1)*
  %3 = zext i32 %1 to i64
  %4 = getelementptr inbounds bfloat, bfloat addrspace(1)* %2, i64 %3
  %5 = load bfloat, bfloat addrspace(1)* %4, align 2
  %6 = fpext bfloat %5 to float
  %7 = fadd float %6, 1.000000e+00
  %8 = fptrunc float %7 to bfloat
  store bfloat %8, bfloat addrspace(1)* %4, align 2
  ret void
}
```
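For comparison, with truly native bfloat arithmetic the `fpext`/`fadd float`/`fptrunc` sequence should collapse into a single `fadd` on `bfloat`. A rough, hand-written sketch of the IR one would hope to see (not actual compiler output; `0xR3F80` is `1.0` as a bfloat immediate):

```llvm
  %5 = load bfloat, bfloat addrspace(1)* %4, align 2
  %6 = fadd bfloat %5, 0xR3F80
  store bfloat %6, bfloat addrspace(1)* %4, align 2
```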
In addition, the logic in BFloat16s.jl isn't great: we determine support based on the host processor. It's not clear whether we can do better, though; this looks a lot like the literal `Int` issue (where we can't make GPU code use `Int32` when the host uses `Int64`).
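A minimal sketch of the mismatch (hypothetical, not BFloat16s.jl's actual internals): the support flags are constants computed from the host at load time, so any code branching on them makes the same choice for host and device compilation alike, even though the GPU target may differ.

```julia
using CUDA, BFloat16s

# Decided once, by the *host* CPU/LLVM, not by the GPU we later compile for.
const NATIVE_BF16 = BFloat16s.llvm_arithmetic

function kernel_add_one(x)
    i = threadIdx().x
    if NATIVE_BF16
        # hoped-for path: native bfloat arithmetic in device code
        @inbounds x[i] += BFloat16(1)
    else
        # fallback: explicitly widen to Float32, then narrow back
        @inbounds x[i] = BFloat16(Float32(x[i]) + 1f0)
    end
    return
end
```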