BFloat16s.jl support in kernels #2441

Open
@maleadt

Description

Julia 1.11 introduces BFloat16 codegen support, so let's use this issue to track support for that.

Right now, it looks like we support the type, but we still emit conversions through Float32 around the arithmetic:

julia> BFloat16s.llvm_storage
true

julia> BFloat16s.llvm_arithmetic
true

julia> function kernel(x)
           @inbounds x[threadIdx().x] += BFloat16(1)
           return
       end

julia> x = CuArray{BFloat16}(undef, 1024);

julia> @device_code_llvm debuginfo=:none @cuda kernel(x)
; PTX CompilerJob of MethodInstance for kernel(::CuDeviceVector{BFloat16, 1}) for sm_89
define ptx_kernel void @_Z6kernel13CuDeviceArrayI8BFloat16Li1ELi1EE({ i64, i32 } %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0) local_unnamed_addr {
conversion:
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0
  %1 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %2 = bitcast i8 addrspace(1)* %.fca.0.extract to bfloat addrspace(1)*
  %3 = zext i32 %1 to i64
  %4 = getelementptr inbounds bfloat, bfloat addrspace(1)* %2, i64 %3
  %5 = load bfloat, bfloat addrspace(1)* %4, align 2
  %6 = fpext bfloat %5 to float
  %7 = fadd float %6, 1.000000e+00
  %8 = fptrunc float %7 to bfloat
  store bfloat %8, bfloat addrspace(1)* %4, align 2
  ret void
}
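
One way to narrow down where the round-trip comes from (BFloat16s.jl's arithmetic definitions vs. the PTX back-end) would be to compare against host codegen for the same scalar operation. A sketch, with bf16add just a throwaway helper name:

julia> bf16add(x) = x + BFloat16(1)
bf16add (generic function with 1 method)

julia> @code_llvm debuginfo=:none bf16add(BFloat16(2))

If the host IR shows the same fpext/fptrunc through float, the conversions come from how BFloat16s.jl lowers +, not from anything GPU-specific.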

In addition, the detection logic in BFloat16s.jl isn't great for this use case: llvm_storage and llvm_arithmetic are determined from the host processor, while for kernels the relevant target is the GPU. It's not clear we can do much better, though; this looks a lot like the literal Int issue, where we can't make GPU code use Int32 when the host Int is Int64.
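
If we do want the decision to depend on the GPU rather than the host, one direction (only a sketch; native_bf16 is a hypothetical helper, and the real difficulty is feeding this into BFloat16s.jl at the right time) would be to gate native arithmetic on the target's compute capability, since bf16 arithmetic is, as far as I know, an sm_80+ feature:

julia> native_bf16(dev::CuDevice=device()) = capability(dev) >= v"8.0"
native_bf16 (generic function with 2 methods)

That only covers the query side, though; the decision is still made host-side when BFloat16s.jl's methods are defined, which is exactly why this resembles the literal Int problem.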
