WMMA TensorFloat32 (TF32) #1419
Conversation
The initial naive attempt didn't work. For the following test kernel,

```julia
function kernel_wmma_tf32_lowlevel(a_dev, b_dev, c_dev, d_dev)
    a_frag = WMMA.llvm_wmma_load_a_col_m16n16k8_global_stride_tf32(pointer(a_dev), 16)
    b_frag = WMMA.llvm_wmma_load_b_col_m16n16k8_global_stride_tf32(pointer(b_dev), 8)
    c_frag = WMMA.llvm_wmma_load_c_col_m16n16k8_global_stride_f32(pointer(c_dev), 16)
    d_frag = WMMA.llvm_wmma_mma_col_col_m16n16k8_tf32_tf32(a_frag, b_frag, c_frag)
    WMMA.llvm_wmma_store_d_col_m16n16k8_global_stride_f32(pointer(d_dev), d_frag, 16)
    return nothing
end

function call_kernel()
    m = n = 16
    k = 8
    dtype_a = dtype_b = Float32
    dtype_c = dtype_d = Float32
    d_a = CUDA.rand(dtype_a, m, k)
    d_b = CUDA.rand(dtype_b, k, n)
    d_c = CUDA.rand(dtype_c, m, n)
    d_d = CUDA.zeros(dtype_d, m, n)
    CUDA.@sync @cuda kernel_wmma_tf32_lowlevel(d_a, d_b, d_c, d_d)
    return nothing
end
```

I get

```
julia> call_kernel()
ERROR: InvalidIRError: compiling kernel kernel_wmma_tf32_lowlevel(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to llvm.nvvm.wmma.m16n16k8.load.a.col.stride.tf32.p1i8)
Stacktrace:
[1] llvm_wmma_load_a_col_m16n16k8_global_stride_tf32
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/device/intrinsics/wmma.jl:214
[2] kernel_wmma_tf32_lowlevel
@ ./REPL[2]:2
Reason: unsupported call to an unknown function (call to llvm.nvvm.wmma.m16n16k8.load.b.col.stride.tf32.p1i8)
Stacktrace:
[1] llvm_wmma_load_b_col_m16n16k8_global_stride_tf32
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/device/intrinsics/wmma.jl:214
[2] kernel_wmma_tf32_lowlevel
@ ./REPL[2]:3
Reason: unsupported call to an unknown function (call to llvm.nvvm.wmma.m16n16k8.load.c.col.stride.f32.p1i8)
Stacktrace:
[1] llvm_wmma_load_c_col_m16n16k8_global_stride_f32
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/device/intrinsics/wmma.jl:214
[2] kernel_wmma_tf32_lowlevel
@ ./REPL[2]:4
Reason: unsupported use of an undefined name (use of 'llvm_wmma_mma_col_col_m16n16k8_tf32_tf32')
Stacktrace:
[1] getproperty
@ ./Base.jl:35
[2] kernel_wmma_tf32_lowlevel
@ ./REPL[2]:6
Reason: unsupported dynamic function invocation
Stacktrace:
[1] kernel_wmma_tf32_lowlevel
@ ./REPL[2]:6
Reason: unsupported dynamic function invocation (call to llvm_wmma_store_d_col_m16n16k8_global_stride_f32)
Stacktrace:
[1] kernel_wmma_tf32_lowlevel
@ ./REPL[2]:8
HINT: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code
Stacktrace:
[1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(kernel_wmma_tf32_lowlevel), NTuple{4, CuDeviceMatrix{Float32, 1}}}}, args::LLVM.Module)
@ GPUCompiler /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/validation.jl:119
[2] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/driver.jl:327 [inlined]
[3] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/TimerOutputs/5tW2E/src/TimerOutput.jl:252 [inlined]
[4] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/driver.jl:325 [inlined]
[5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
@ GPUCompiler /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/utils.jl:64
[6] cufunction_compile(job::GPUCompiler.CompilerJob)
@ CUDA /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/compiler/execution.jl:326
[7] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
@ GPUCompiler /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/cache.jl:90
[8] cufunction(f::typeof(kernel_wmma_tf32_lowlevel), tt::Type{NTuple{4, CuDeviceMatrix{Float32, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ CUDA /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/compiler/execution.jl:297
[9] cufunction
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/compiler/execution.jl:291 [inlined]
[10] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/compiler/execution.jl:102 [inlined]
[11] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/utilities.jl:25 [inlined]
[12] call_kernel()
@ Main ./REPL[3]:12
[13] top-level scope
@ REPL[4]:1
[14] top-level scope
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/initialization.jl:52
```

Any form of help would be highly appreciated :)
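As an aside, the hint in the error message itself gives a way to introspect the failing code; a minimal sketch, assuming the same `call_kernel` as above:

```julia
# Catch the InvalidIRError and inspect the offending code interactively,
# as suggested by the compiler hint in the output above.
err = try
    call_kernel()
catch e
    e
end
code_typed(err; interactive = true)
```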
On first sight, there are two issues:

(1) This name is generated at CUDA.jl/src/device/intrinsics/wmma.jl line 334 in c24234d (cf. line 103 in c24234d): it should be `WMMA.llvm_wmma_mma_col_col_m16n16k8_f32_f32` instead of `WMMA.llvm_wmma_mma_col_col_m16n16k8_tf32_tf32`.

(2) Have you verified that this is the correct name of the intrinsic? You can do so by compiling a CUDA C source file that uses TF32 WMMA using Clang, and using the command-line option […]
Actually, it might be easier to just look at the tests of Clang for the expected intrinsic names.
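A Clang-free alternative is to inspect the intrinsic names CUDA.jl itself emits; a sketch using the reflection macro that also comes up later in this thread (assuming the arrays from `call_kernel` are in scope):

```julia
# Dump the whole module and look at the `declare ... @llvm.nvvm.wmma.*`
# lines to see the generated intrinsic names.
using CUDA
@device_code_llvm dump_module=true @cuda kernel_wmma_tf32_lowlevel(d_a, d_b, d_c, d_d)
```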
Thanks for your comments. I fixed (1):

```julia
function kernel_wmma_tf32_lowlevel(a_dev, b_dev, c_dev, d_dev)
    a_frag = WMMA.llvm_wmma_load_a_col_m16n16k8_global_stride_tf32(pointer(a_dev), 16)
    b_frag = WMMA.llvm_wmma_load_b_col_m16n16k8_global_stride_tf32(pointer(b_dev), 8)
    c_frag = WMMA.llvm_wmma_load_c_col_m16n16k8_global_stride_f32(pointer(c_dev), 16)
    d_frag = WMMA.llvm_wmma_mma_col_col_m16n16k8_f32_f32(a_frag, b_frag, c_frag)
    WMMA.llvm_wmma_store_d_col_m16n16k8_global_stride_f32(pointer(d_dev), d_frag, 16)
    return nothing
end

function call_kernel()
    m = n = 16
    k = 8
    dtype_a = dtype_b = Float32
    dtype_c = dtype_d = Float32
    d_a = CUDA.rand(dtype_a, m, k)
    d_b = CUDA.rand(dtype_b, k, n)
    d_c = CUDA.rand(dtype_c, m, n)
    d_d = CUDA.zeros(dtype_d, m, n)
    CUDA.@sync @cuda kernel_wmma_tf32_lowlevel(d_a, d_b, d_c, d_d)
    return nothing
end
```

As for (2), here it says […], so I think that the intrinsic is correct. (Am I right in assuming that the names in the comments are the ones to check?)
Can you elaborate? Do you mean that our LLVM is too old and we need specific NVPTX patches / backports? If so, I guess that would imply that there is nothing that can be done here (and that things can only work from Julia 1.9 on?)
I should probably say that I've used Julia 1.7 above. Trying Julia 1.8-beta1, which uses LLVM 13 (instead of 12), the loads seem to pass and I get

```
julia> call_kernel()
ERROR: InvalidIRError: compiling kernel kernel_wmma_tf32_lowlevel(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to llvm.nvvm.wmma.m16n16k8.mma.col.col.f32.f32)
Stacktrace:
[1] llvm_wmma_mma_col_col_m16n16k8_f32_f32
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/device/intrinsics/wmma.jl:355
[2] kernel_wmma_tf32_lowlevel
@ ./REPL[3]:6
HINT: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code
Stacktrace:
[1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(kernel_wmma_tf32_lowlevel), NTuple{4, CuDeviceMatrix{Float32, 1}}}}, args::LLVM.Module)
@ GPUCompiler /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/validation.jl:119
[2] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/driver.jl:327 [inlined]
[3] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/TimerOutputs/5tW2E/src/TimerOutput.jl:252 [inlined]
[4] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/driver.jl:325 [inlined]
[5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
@ GPUCompiler /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/utils.jl:64
[6] cufunction_compile(job::GPUCompiler.CompilerJob)
@ CUDA /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/compiler/execution.jl:326
[7] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
@ GPUCompiler /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/cache.jl:90
[8] cufunction(f::typeof(kernel_wmma_tf32_lowlevel), tt::Type{NTuple{4, CuDeviceMatrix{Float32, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ CUDA /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/compiler/execution.jl:297
[9] cufunction
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/compiler/execution.jl:291 [inlined]
[10] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/compiler/execution.jl:102 [inlined]
[11] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/utilities.jl:25 [inlined]
[12] call_kernel()
@ Main ./REPL[4]:12
[13] top-level scope
@ REPL[5]:1
[14] top-level scope
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/initialization.jl:52
```

So, now it complains about `llvm.nvvm.wmma.m16n16k8.mma.col.col.f32.f32`.
Good news: with the current state (and Julia >= 1.8) the errors from above are gone with this corrected example:

```julia
function kernel_wmma_tf32_lowlevel(a_dev, b_dev, c_dev, d_dev)
    a_frag = WMMA.llvm_wmma_load_a_col_m16n16k8_global_stride_tf32(pointer(a_dev), 16)
    b_frag = WMMA.llvm_wmma_load_b_col_m16n16k8_global_stride_tf32(pointer(b_dev), 8)
    c_frag = WMMA.llvm_wmma_load_c_col_m16n16k8_global_stride_f32(pointer(c_dev), 16)
    d_frag = WMMA.llvm_wmma_mma_col_col_m16n16k8_tf32(a_frag, b_frag, c_frag)
    WMMA.llvm_wmma_store_d_col_m16n16k8_global_stride_f32(pointer(d_dev), d_frag, 16)
    return nothing
end

function call_kernel()
    m = n = 16
    k = 8
    dtype_a = dtype_b = Float32
    dtype_c = dtype_d = Float32
    d_a = CUDA.rand(dtype_a, m, k)
    d_b = CUDA.rand(dtype_b, k, n)
    d_c = CUDA.rand(dtype_c, m, n)
    d_d = CUDA.zeros(dtype_d, m, n)
    CUDA.@sync @cuda kernel_wmma_tf32_lowlevel(d_a, d_b, d_c, d_d)
    return nothing
end
```

Bad news: I now get a segfault instead 😄
Yes, exactly.
Yes, this is probably the patch that you need, and is missing in Julia 1.7: https://reviews.llvm.org/D104847.
Looks like a bug in instruction selection in LLVM. I think @maleadt can help you more than I can with that issue.
Note that I see a similar (the same?) segfault with f64 wmma: #1426 (comment)
Try using an assertions build. That should become easier as soon as I tag the next version of LLVM.jl (today, normally).
When I use an assertions build of Julia I get

```
julia> using CUDA
[ Info: Precompiling CUDA [052768ef-5323-5732-b1bb-66c8b64840ba]
┌ Warning: You are using a version of Julia that links against a build of LLVM with assertions enabled.
│
│ This is not supported out-of-the-box, and you need a build of libLLVMExtra that supports this.
│ Use `deps/build_local.jl` for that, add the resulting LocalPreferences.toml to your project
│ and add a direct dependency on LLVMExtra_jll to pick up those preferences.
└ @ LLVM /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LLVM/MJqe4/src/LLVM.jl:24
[...]
```

AFAICT, there is no […]
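For reference, the workflow that warning points at looks roughly like this (a sketch; the invocation details are assumptions, the LLVM.jl README is the authoritative source):

```julia
# From a checkout of LLVM.jl, build a local libLLVMExtra:
#     julia deps/build_local.jl
# Copy the resulting LocalPreferences.toml next to your Project.toml,
# then add a direct dependency so the preference is picked up:
using Pkg
Pkg.add("LLVMExtra_jll")
```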
Alright, I […]

```
julia> call_kernel()
julia: /workspace/srcdir/llvm-project/llvm/include/llvm/CodeGen/SelectionDAGNodes.h:1105: llvm::SDValue::SDValue(llvm::SDNode*, unsigned int): Assertion `(!Node || !ResNo || ResNo < Node->getNumValues()) && "Invalid result number for the given node!"' failed.
signal (6): Aborted
in expression starting at REPL[9]:1
gsignal at /lib64/libc.so.6 (unknown line)
abort at /lib64/libc.so.6 (unknown line)
__assert_fail_base.cold.0 at /lib64/libc.so.6 (unknown line)
__assert_fail at /lib64/libc.so.6 (unknown line)
_ZN4llvm7SDValueC1EPNS_6SDNodeEj at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm16SelectionDAGISel9MorphNodeEPNS_6SDNodeEjNS_8SDVTListENS_8ArrayRefINS_7SDValueEEEj at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm16SelectionDAGISel16SelectCodeCommonEPNS_6SDNodeEPKhj at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm16SelectionDAGISel22DoInstructionSelectionEv at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm16SelectionDAGISel17CodeGenAndEmitDAGEv at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm16SelectionDAGISel20SelectAllBasicBlocksERKNS_8FunctionE at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm16SelectionDAGISel20runOnMachineFunctionERNS_15MachineFunctionE.part.972 at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm19MachineFunctionPass13runOnFunctionERNS_8FunctionE at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm13FPPassManager13runOnFunctionERNS_8FunctionE at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm13FPPassManager11runOnModuleERNS_6ModuleE at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm6legacy15PassManagerImpl3runERNS_6ModuleE at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZL21LLVMTargetMachineEmitP23LLVMOpaqueTargetMachineP16LLVMOpaqueModuleRN4llvm17raw_pwrite_streamE19LLVMCodeGenFileTypePPc at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
LLVMTargetMachineEmitToMemoryBuffer at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
LLVMTargetMachineEmitToMemoryBuffer at /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LLVM/P1t7P/lib/13/libLLVM_h.jl:947 [inlined]
emit at /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LLVM/P1t7P/src/targetmachine.jl:45
mcgen at /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/mcgen.jl:74
unknown function (ip: 0x152e1961abdf)
[...]
```
Can you show the IR?
You want the LLVM IR, i.e.

```
julia> @device_code_llvm dump_module=true call_kernel()
; PTX CompilerJob of kernel kernel_wmma_tf32_lowlevel(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}) for sm_80
; ModuleID = 'text'
source_filename = "text"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
; Function Attrs: argmemonly nounwind readonly
declare { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.a.col.stride.tf32.p1i8(i8 addrspace(1)* nocapture readonly, i32) #0
; Function Attrs: argmemonly nounwind readonly
declare { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.b.col.stride.tf32.p1i8(i8 addrspace(1)* nocapture readonly, i32) #0
; Function Attrs: argmemonly nounwind readonly
declare { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.c.col.stride.f32.p1i8(i8 addrspace(1)* nocapture readonly, i32) #0
; Function Attrs: nounwind readnone
declare { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.mma.col.col.tf32(float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float) #1
; Function Attrs: argmemonly nounwind writeonly
declare void @llvm.nvvm.wmma.m16n16k8.store.d.col.stride.f32.p1i8(i8 addrspace(1)* nocapture writeonly, float, float, float, float, float, float, float, float, i32) #2
; @ REPL[1]:1 within `kernel_wmma_tf32_lowlevel`
define ptx_kernel void @_Z36julia_kernel_wmma_tf32_lowlevel_263813CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE([1 x i64] %state, { i8 addrspace(1)*, i64, [2 x i64], i64 } %0, { i8 addrspace(1)*, i64, [2 x i64], i64 } %1, { i8 addrspace(1)*, i64, [2 x i64], i64 } %2, { i8 addrspace(1)*, i64, [2 x i64], i64 } %3) local_unnamed_addr #3 {
entry:
%.fca.0.extract72 = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %0, 0
%.fca.0.extract62 = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %1, 0
%.fca.0.extract52 = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %2, 0
%.fca.0.extract49 = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, 0
; @ REPL[1]:2 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:214 within `llvm_wmma_load_a_col_m16n16k8_global_stride_tf32`
%4 = call { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.a.col.stride.tf32.p1i8(i8 addrspace(1)* %.fca.0.extract72, i32 16)
%.fca.0.extract33 = extractvalue { float, float, float, float, float, float, float, float } %4, 0
%.fca.1.extract34 = extractvalue { float, float, float, float, float, float, float, float } %4, 1
%.fca.2.extract35 = extractvalue { float, float, float, float, float, float, float, float } %4, 2
%.fca.3.extract36 = extractvalue { float, float, float, float, float, float, float, float } %4, 3
%.fca.4.extract37 = extractvalue { float, float, float, float, float, float, float, float } %4, 4
%.fca.5.extract38 = extractvalue { float, float, float, float, float, float, float, float } %4, 5
%.fca.6.extract39 = extractvalue { float, float, float, float, float, float, float, float } %4, 6
%.fca.7.extract40 = extractvalue { float, float, float, float, float, float, float, float } %4, 7
; └
; @ REPL[1]:3 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:214 within `llvm_wmma_load_b_col_m16n16k8_global_stride_tf32`
%5 = call { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.b.col.stride.tf32.p1i8(i8 addrspace(1)* %.fca.0.extract62, i32 8)
%.fca.0.extract17 = extractvalue { float, float, float, float, float, float, float, float } %5, 0
%.fca.1.extract18 = extractvalue { float, float, float, float, float, float, float, float } %5, 1
%.fca.2.extract19 = extractvalue { float, float, float, float, float, float, float, float } %5, 2
%.fca.3.extract20 = extractvalue { float, float, float, float, float, float, float, float } %5, 3
%.fca.4.extract21 = extractvalue { float, float, float, float, float, float, float, float } %5, 4
%.fca.5.extract22 = extractvalue { float, float, float, float, float, float, float, float } %5, 5
%.fca.6.extract23 = extractvalue { float, float, float, float, float, float, float, float } %5, 6
%.fca.7.extract24 = extractvalue { float, float, float, float, float, float, float, float } %5, 7
; └
; @ REPL[1]:4 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:214 within `llvm_wmma_load_c_col_m16n16k8_global_stride_f32`
%6 = call { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.c.col.stride.f32.p1i8(i8 addrspace(1)* %.fca.0.extract52, i32 16)
%.fca.0.extract1 = extractvalue { float, float, float, float, float, float, float, float } %6, 0
%.fca.1.extract2 = extractvalue { float, float, float, float, float, float, float, float } %6, 1
%.fca.2.extract3 = extractvalue { float, float, float, float, float, float, float, float } %6, 2
%.fca.3.extract4 = extractvalue { float, float, float, float, float, float, float, float } %6, 3
%.fca.4.extract5 = extractvalue { float, float, float, float, float, float, float, float } %6, 4
%.fca.5.extract6 = extractvalue { float, float, float, float, float, float, float, float } %6, 5
%.fca.6.extract7 = extractvalue { float, float, float, float, float, float, float, float } %6, 6
%.fca.7.extract8 = extractvalue { float, float, float, float, float, float, float, float } %6, 7
; └
; @ REPL[1]:6 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:355 within `llvm_wmma_mma_col_col_m16n16k8_tf32`
%7 = call { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.mma.col.col.tf32(float %.fca.0.extract33, float %.fca.1.extract34, float %.fca.2.extract35, float %.fca.3.extract36, float %.fca.4.extract37, float %.fca.5.extract38, float %.fca.6.extract39, float %.fca.7.extract40, float %.fca.0.extract17, float %.fca.1.extract18, float %.fca.2.extract19, float %.fca.3.extract20, float %.fca.4.extract21, float %.fca.5.extract22, float %.fca.6.extract23, float %.fca.7.extract24, float %.fca.0.extract1, float %.fca.1.extract2, float %.fca.2.extract3, float %.fca.3.extract4, float %.fca.4.extract5, float %.fca.5.extract6, float %.fca.6.extract7, float %.fca.7.extract8)
%.fca.0.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 0
%.fca.1.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 1
%.fca.2.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 2
%.fca.3.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 3
%.fca.4.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 4
%.fca.5.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 5
%.fca.6.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 6
%.fca.7.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 7
; └
; @ REPL[1]:8 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:277 within `llvm_wmma_store_d_col_m16n16k8_global_stride_f32`
call void @llvm.nvvm.wmma.m16n16k8.store.d.col.stride.f32.p1i8(i8 addrspace(1)* %.fca.0.extract49, float %.fca.0.extract, float %.fca.1.extract, float %.fca.2.extract, float %.fca.3.extract, float %.fca.4.extract, float %.fca.5.extract, float %.fca.6.extract, float %.fca.7.extract, i32 16)
; └
; @ REPL[1]:9 within `kernel_wmma_tf32_lowlevel`
ret void
}
attributes #0 = { argmemonly nounwind readonly }
attributes #1 = { nounwind readnone }
attributes #2 = { argmemonly nounwind writeonly }
attributes #3 = { "probe-stack"="inline-asm" }
!llvm.module.flags = !{!0, !1}
!nvvm.annotations = !{!2}
!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{void ([1 x i64], { i8 addrspace(1)*, i64, [2 x i64], i64 }, { i8 addrspace(1)*, i64, [2 x i64], i64 }, { i8 addrspace(1)*, i64, [2 x i64], i64 }, { i8 addrspace(1)*, i64, [2 x i64], i64 })* @_Z36julia_kernel_wmma_tf32_lowlevel_263813CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE, !"kernel", i32 1}
julia: /workspace/srcdir/llvm-project/llvm/include/llvm/CodeGen/SelectionDAGNodes.h:1105: llvm::SDValue::SDValue(llvm::SDNode*, unsigned int): Assertion `(!Node || !ResNo || ResNo < Node->getNumValues()) && "Invalid result number for the given node!"' failed.
signal (6): Aborted
[...]
```
You want to include `dump_module=true`.
I updated my last comment and included `dump_module=true`.
Endlines were corrupted. Here's the fixed version:

```
; PTX CompilerJob of kernel kernel_wmma_tf32_lowlevel(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}) for sm_80
; ModuleID = 'text'
source_filename = "text"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
; Function Attrs: argmemonly nounwind readonly
declare { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.a.col.stride.tf32.p1i8(i8 addrspace(1)* nocapture readonly, i32) #0
; Function Attrs: argmemonly nounwind readonly
declare { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.b.col.stride.tf32.p1i8(i8 addrspace(1)* nocapture readonly, i32) #0
; Function Attrs: argmemonly nounwind readonly
declare { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.c.col.stride.f32.p1i8(i8 addrspace(1)* nocapture readonly, i32) #0
; Function Attrs: nounwind readnone
declare { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.mma.col.col.tf32(float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float) #1
; Function Attrs: argmemonly nounwind writeonly
declare void @llvm.nvvm.wmma.m16n16k8.store.d.col.stride.f32.p1i8(i8 addrspace(1)* nocapture writeonly, float, float, float, float, float, float, float, float, i32) #2
; @ REPL[1]:1 within `kernel_wmma_tf32_lowlevel`
define ptx_kernel void @_Z36julia_kernel_wmma_tf32_lowlevel_263813CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE([1 x i64] %state, { i8 addrspace(1)*, i64, [2 x i64], i64 } %0, { i8 addrspace(1)*, i64, [2 x i64], i64 } %1, { i8 addrspace(1)*, i64, [2 x i64], i64 } %2, { i8 addrspace(1)*, i64, [2 x i64], i64 } %3) local_unnamed_addr #3 {
entry:
%.fca.0.extract72 = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %0, 0
%.fca.0.extract62 = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %1, 0
%.fca.0.extract52 = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %2, 0
%.fca.0.extract49 = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, 0
; @ REPL[1]:2 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:214 within `llvm_wmma_load_a_col_m16n16k8_global_stride_tf32`
%4 = call { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.a.col.stride.tf32.p1i8(i8 addrspace(1)* %.fca.0.extract72, i32 16)
%.fca.0.extract33 = extractvalue { float, float, float, float, float, float, float, float } %4, 0
%.fca.1.extract34 = extractvalue { float, float, float, float, float, float, float, float } %4, 1
%.fca.2.extract35 = extractvalue { float, float, float, float, float, float, float, float } %4, 2
%.fca.3.extract36 = extractvalue { float, float, float, float, float, float, float, float } %4, 3
%.fca.4.extract37 = extractvalue { float, float, float, float, float, float, float, float } %4, 4
%.fca.5.extract38 = extractvalue { float, float, float, float, float, float, float, float } %4, 5
%.fca.6.extract39 = extractvalue { float, float, float, float, float, float, float, float } %4, 6
%.fca.7.extract40 = extractvalue { float, float, float, float, float, float, float, float } %4, 7
; └
; @ REPL[1]:3 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:214 within `llvm_wmma_load_b_col_m16n16k8_global_stride_tf32`
%5 = call { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.b.col.stride.tf32.p1i8(i8 addrspace(1)* %.fca.0.extract62, i32 8)
%.fca.0.extract17 = extractvalue { float, float, float, float, float, float, float, float } %5, 0
%.fca.1.extract18 = extractvalue { float, float, float, float, float, float, float, float } %5, 1
%.fca.2.extract19 = extractvalue { float, float, float, float, float, float, float, float } %5, 2
%.fca.3.extract20 = extractvalue { float, float, float, float, float, float, float, float } %5, 3
%.fca.4.extract21 = extractvalue { float, float, float, float, float, float, float, float } %5, 4
%.fca.5.extract22 = extractvalue { float, float, float, float, float, float, float, float } %5, 5
%.fca.6.extract23 = extractvalue { float, float, float, float, float, float, float, float } %5, 6
%.fca.7.extract24 = extractvalue { float, float, float, float, float, float, float, float } %5, 7
; └
; @ REPL[1]:4 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:214 within `llvm_wmma_load_c_col_m16n16k8_global_stride_f32`
%6 = call { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.c.col.stride.f32.p1i8(i8 addrspace(1)* %.fca.0.extract52, i32 16)
%.fca.0.extract1 = extractvalue { float, float, float, float, float, float, float, float } %6, 0
%.fca.1.extract2 = extractvalue { float, float, float, float, float, float, float, float } %6, 1
%.fca.2.extract3 = extractvalue { float, float, float, float, float, float, float, float } %6, 2
%.fca.3.extract4 = extractvalue { float, float, float, float, float, float, float, float } %6, 3
%.fca.4.extract5 = extractvalue { float, float, float, float, float, float, float, float } %6, 4
%.fca.5.extract6 = extractvalue { float, float, float, float, float, float, float, float } %6, 5
%.fca.6.extract7 = extractvalue { float, float, float, float, float, float, float, float } %6, 6
%.fca.7.extract8 = extractvalue { float, float, float, float, float, float, float, float } %6, 7
; └
; @ REPL[1]:6 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:355 within `llvm_wmma_mma_col_col_m16n16k8_tf32`
%7 = call { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.mma.col.col.tf32(float %.fca.0.extract33, float %.fca.1.extract34, float %.fca.2.extract35, float %.fca.3.extract36, float %.fca.4.extract37, float %.fca.5.extract38, float %.fca.6.extract39, float %.fca.7.extract40, float %.fca.0.extract17, float %.fca.1.extract18, float %.fca.2.extract19, float %.fca.3.extract20, float %.fca.4.extract21, float %.fca.5.extract22, float %.fca.6.extract23, float %.fca.7.extract24, float %.fca.0.extract1, float %.fca.1.extract2, float %.fca.2.extract3, float %.fca.3.extract4, float %.fca.4.extract5, float %.fca.5.extract6, float %.fca.6.extract7, float %.fca.7.extract8)
%.fca.0.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 0
%.fca.1.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 1
%.fca.2.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 2
%.fca.3.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 3
%.fca.4.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 4
%.fca.5.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 5
%.fca.6.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 6
%.fca.7.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 7
; └
; @ REPL[1]:8 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:277 within `llvm_wmma_store_d_col_m16n16k8_global_stride_f32`
call void @llvm.nvvm.wmma.m16n16k8.store.d.col.stride.f32.p1i8(i8 addrspace(1)* %.fca.0.extract49, float %.fca.0.extract, float %.fca.1.extract, float %.fca.2.extract, float %.fca.3.extract, float %.fca.4.extract, float %.fca.5.extract, float %.fca.6.extract, float %.fca.7.extract, i32 16)
; └
; @ REPL[1]:9 within `kernel_wmma_tf32_lowlevel`
ret void
}
attributes #0 = { argmemonly nounwind readonly }
attributes #1 = { nounwind readnone }
attributes #2 = { argmemonly nounwind writeonly }
attributes #3 = { "probe-stack"="inline-asm" }
!llvm.module.flags = !{!0, !1}
!nvvm.annotations = !{!2}
!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{void ([1 x i64], { i8 addrspace(1)*, i64, [2 x i64], i64 }, { i8 addrspace(1)*, i64, [2 x i64], i64 }, { i8 addrspace(1)*, i64, [2 x i64], i64 }, { i8 addrspace(1)*, i64, [2 x i64], i64 })* @_Z36julia_kernel_wmma_tf32_lowlevel_263813CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE, !"kernel", i32 1}
```

If I try to process this with […]

So probably an intrinsic misuse?
Thanks for helping me to debug this, Tim! I was using the wrong fragment sizes. I fixed this in the latest commit, for which […]
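This matches the PTX documentation, where the m16n16k8 TF32 A and B fragments hold four elements each rather than eight. A quick numerical sanity check for the fixed kernel might look like this (a sketch; the tolerance is an assumption, since TF32 only keeps 10 mantissa bits):

```julia
using CUDA, Test

# Compare the WMMA result d = a * b + c against a CPU reference.
function check_wmma_tf32()
    a = CUDA.rand(Float32, 16, 8)
    b = CUDA.rand(Float32, 8, 16)
    c = CUDA.rand(Float32, 16, 16)
    d = CUDA.zeros(Float32, 16, 16)
    CUDA.@sync @cuda kernel_wmma_tf32_lowlevel(a, b, c, d)
    @test Array(d) ≈ Array(a) * Array(b) + Array(c) rtol=1e-2
end
```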
TODOs
Note: tests are failing since they're running under Julia 1.6 (and 1.8 is required for this PR). Anything I can / should do about it on my end, @maleadt?
Only 1.6 is used for draft PRs, but the tests shouldn't crash there; instead, you should check […]
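A guard along these lines would keep the 1.6 CI from running the new tests (a sketch; the exact predicate is an assumption, based on the Julia >= 1.8 requirement above and the sm_80 target in the IR dumps):

```julia
# Only run the TF32 WMMA tests where they can work:
# Julia 1.8+ (for LLVM 13) and an Ampere-class (sm_80+) device.
if VERSION >= v"1.8" && capability(device()) >= v"8.0"
    @testset "WMMA TF32" begin
        # ... TF32 test kernels from this PR ...
    end
end
```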
If the tests pass (which they should), we can mark this as non-draft IMHO.
Done (I think).
CI failures look related.
In this PR I will try to add TensorFloat32 (TF32) support for WMMAs. I'm surely the wrong person for the job, but let's see where things will take me / us 😄
Some resources: […]
cc: @HenriDeh @thomasfaingnaert
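For context, this is roughly what the existing high-level WMMA API looks like for the FP16 case (adapted from the CUDA.jl documentation); a TF32 path would presumably mirror it with M = N = 16, K = 8 and Float32 inputs:

```julia
using CUDA

# High-level FP16 WMMA: 16x16x16 fragments with a Float32 accumulator.
function kernel_wmma_f16_highlevel(a_dev, b_dev, c_dev, d_dev)
    conf = WMMA.Config{16, 16, 16, Float32}
    a_frag = WMMA.load_a(pointer(a_dev), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b_dev), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.load_c(pointer(c_dev), 16, WMMA.ColMajor, conf)
    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)
    WMMA.store_d(pointer(d_dev), d_frag, 16, WMMA.ColMajor, conf)
    return nothing
end
```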