WMMA TensorFloat32 (TF32) #1419
Conversation
The initial naive attempt didn't work. For the following test kernel,

```julia
function kernel_wmma_tf32_lowlevel(a_dev, b_dev, c_dev, d_dev)
    a_frag = WMMA.llvm_wmma_load_a_col_m16n16k8_global_stride_tf32(pointer(a_dev), 16)
    b_frag = WMMA.llvm_wmma_load_b_col_m16n16k8_global_stride_tf32(pointer(b_dev), 8)
    c_frag = WMMA.llvm_wmma_load_c_col_m16n16k8_global_stride_f32(pointer(c_dev), 16)
    d_frag = WMMA.llvm_wmma_mma_col_col_m16n16k8_tf32_tf32(a_frag, b_frag, c_frag)
    WMMA.llvm_wmma_store_d_col_m16n16k8_global_stride_f32(pointer(d_dev), d_frag, 16)
    return nothing
end

function call_kernel()
    m = n = 16
    k = 8
    dtype_a = dtype_b = Float32
    dtype_c = dtype_d = Float32
    d_a = CUDA.rand(dtype_a, m, k)
    d_b = CUDA.rand(dtype_b, k, n)
    d_c = CUDA.rand(dtype_c, m, n)
    d_d = CUDA.zeros(dtype_d, m, n)
    CUDA.@sync @cuda kernel_wmma_tf32_lowlevel(d_a, d_b, d_c, d_d)
    return nothing
end
```

I get

```
julia> call_kernel()
ERROR: InvalidIRError: compiling kernel kernel_wmma_tf32_lowlevel(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to llvm.nvvm.wmma.m16n16k8.load.a.col.stride.tf32.p1i8)
Stacktrace:
[1] llvm_wmma_load_a_col_m16n16k8_global_stride_tf32
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/device/intrinsics/wmma.jl:214
[2] kernel_wmma_tf32_lowlevel
@ ./REPL[2]:2
Reason: unsupported call to an unknown function (call to llvm.nvvm.wmma.m16n16k8.load.b.col.stride.tf32.p1i8)
Stacktrace:
[1] llvm_wmma_load_b_col_m16n16k8_global_stride_tf32
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/device/intrinsics/wmma.jl:214
[2] kernel_wmma_tf32_lowlevel
@ ./REPL[2]:3
Reason: unsupported call to an unknown function (call to llvm.nvvm.wmma.m16n16k8.load.c.col.stride.f32.p1i8)
Stacktrace:
[1] llvm_wmma_load_c_col_m16n16k8_global_stride_f32
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/device/intrinsics/wmma.jl:214
[2] kernel_wmma_tf32_lowlevel
@ ./REPL[2]:4
Reason: unsupported use of an undefined name (use of 'llvm_wmma_mma_col_col_m16n16k8_tf32_tf32')
Stacktrace:
[1] getproperty
@ ./Base.jl:35
[2] kernel_wmma_tf32_lowlevel
@ ./REPL[2]:6
Reason: unsupported dynamic function invocation
Stacktrace:
[1] kernel_wmma_tf32_lowlevel
@ ./REPL[2]:6
Reason: unsupported dynamic function invocation (call to llvm_wmma_store_d_col_m16n16k8_global_stride_f32)
Stacktrace:
[1] kernel_wmma_tf32_lowlevel
@ ./REPL[2]:8
HINT: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code
Stacktrace:
[1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(kernel_wmma_tf32_lowlevel), NTuple{4, CuDeviceMatrix{Float32, 1}}}}, args::LLVM.Module)
@ GPUCompiler /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/validation.jl:119
[2] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/driver.jl:327 [inlined]
[3] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/TimerOutputs/5tW2E/src/TimerOutput.jl:252 [inlined]
[4] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/driver.jl:325 [inlined]
[5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
@ GPUCompiler /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/utils.jl:64
[6] cufunction_compile(job::GPUCompiler.CompilerJob)
@ CUDA /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/compiler/execution.jl:326
[7] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
@ GPUCompiler /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/cache.jl:90
[8] cufunction(f::typeof(kernel_wmma_tf32_lowlevel), tt::Type{NTuple{4, CuDeviceMatrix{Float32, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ CUDA /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/compiler/execution.jl:297
[9] cufunction
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/compiler/execution.jl:291 [inlined]
[10] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/compiler/execution.jl:102 [inlined]
[11] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/utilities.jl:25 [inlined]
[12] call_kernel()
@ Main ./REPL[3]:12
[13] top-level scope
@ REPL[4]:1
[14] top-level scope
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/initialization.jl:52
```

Any form of help would be highly appreciated :)
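As an aside, the hint in the error message itself gives a way to introspect the failing code; a minimal sketch, assuming the same `call_kernel` as above:

```julia
# Catch the InvalidIRError and inspect the offending code interactively,
# as suggested by the compiler hint in the output above.
err = try
    call_kernel()
catch e
    e
end
code_typed(err; interactive = true)
```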
On first sight, there are two issues:

(1) This name is generated at CUDA.jl/src/device/intrinsics/wmma.jl line 334 in c24234d (cf. line 103 in c24234d): it should be `WMMA.llvm_wmma_mma_col_col_m16n16k8_f32_f32` instead of `WMMA.llvm_wmma_mma_col_col_m16n16k8_tf32_tf32`.

(2) Have you verified that this is the correct name of the intrinsic? You can do so by compiling a CUDA C source file that uses TF32 WMMA using Clang, and using the command-line option […]
Actually, it might be easier to just look at the tests of Clang for the expected intrinsic names.
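A Clang-free alternative is to inspect the intrinsic names CUDA.jl itself emits; a sketch using the reflection macro that also comes up later in this thread (assuming the arrays from `call_kernel` are in scope):

```julia
# Dump the whole module and look at the `declare ... @llvm.nvvm.wmma.*`
# lines to see the generated intrinsic names.
using CUDA
@device_code_llvm dump_module=true @cuda kernel_wmma_tf32_lowlevel(d_a, d_b, d_c, d_d)
```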
Thanks for your comments. I fixed (1):

```julia
function kernel_wmma_tf32_lowlevel(a_dev, b_dev, c_dev, d_dev)
    a_frag = WMMA.llvm_wmma_load_a_col_m16n16k8_global_stride_tf32(pointer(a_dev), 16)
    b_frag = WMMA.llvm_wmma_load_b_col_m16n16k8_global_stride_tf32(pointer(b_dev), 8)
    c_frag = WMMA.llvm_wmma_load_c_col_m16n16k8_global_stride_f32(pointer(c_dev), 16)
    d_frag = WMMA.llvm_wmma_mma_col_col_m16n16k8_f32_f32(a_frag, b_frag, c_frag)
    WMMA.llvm_wmma_store_d_col_m16n16k8_global_stride_f32(pointer(d_dev), d_frag, 16)
    return nothing
end

function call_kernel()
    m = n = 16
    k = 8
    dtype_a = dtype_b = Float32
    dtype_c = dtype_d = Float32
    d_a = CUDA.rand(dtype_a, m, k)
    d_b = CUDA.rand(dtype_b, k, n)
    d_c = CUDA.rand(dtype_c, m, n)
    d_d = CUDA.zeros(dtype_d, m, n)
    CUDA.@sync @cuda kernel_wmma_tf32_lowlevel(d_a, d_b, d_c, d_d)
    return nothing
end
```

As for (2), here it says […], so I think that the intrinsic is correct. (Am I right in assuming that the names in the comments are the ones to check?)
Can you elaborate? Do you mean that our LLVM is too old and we need specific NVPTX patches / backports? If so, I guess that would imply that there is nothing that can be done here (and that things can only work from Julia 1.9 on?)
I should probably say that I've used Julia 1.7 above. Trying Julia 1.8-beta1, which uses LLVM 13 (instead of 12), the loads seem to pass and I get

```
julia> call_kernel()
ERROR: InvalidIRError: compiling kernel kernel_wmma_tf32_lowlevel(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to llvm.nvvm.wmma.m16n16k8.mma.col.col.f32.f32)
Stacktrace:
[1] llvm_wmma_mma_col_col_m16n16k8_f32_f32
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/device/intrinsics/wmma.jl:355
[2] kernel_wmma_tf32_lowlevel
@ ./REPL[3]:6
HINT: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code
Stacktrace:
[1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(kernel_wmma_tf32_lowlevel), NTuple{4, CuDeviceMatrix{Float32, 1}}}}, args::LLVM.Module)
@ GPUCompiler /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/validation.jl:119
[2] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/driver.jl:327 [inlined]
[3] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/TimerOutputs/5tW2E/src/TimerOutput.jl:252 [inlined]
[4] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/driver.jl:325 [inlined]
[5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
@ GPUCompiler /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/utils.jl:64
[6] cufunction_compile(job::GPUCompiler.CompilerJob)
@ CUDA /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/compiler/execution.jl:326
[7] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
@ GPUCompiler /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/cache.jl:90
[8] cufunction(f::typeof(kernel_wmma_tf32_lowlevel), tt::Type{NTuple{4, CuDeviceMatrix{Float32, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ CUDA /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/compiler/execution.jl:297
[9] cufunction
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/compiler/execution.jl:291 [inlined]
[10] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/compiler/execution.jl:102 [inlined]
[11] macro expansion
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/utilities.jl:25 [inlined]
[12] call_kernel()
@ Main ./REPL[4]:12
[13] top-level scope
@ REPL[5]:1
[14] top-level scope
@ /scratch/pc2-mitarbeiter/bauerc/devel/PC2GPUBenchmarks.jl/dev/CUDA/src/initialization.jl:52
```

So, now it complains about `llvm.nvvm.wmma.m16n16k8.mma.col.col.f32.f32`.
Good news: with the current state (and Julia >= 1.8) the errors from above are gone with this corrected example:

```julia
function kernel_wmma_tf32_lowlevel(a_dev, b_dev, c_dev, d_dev)
    a_frag = WMMA.llvm_wmma_load_a_col_m16n16k8_global_stride_tf32(pointer(a_dev), 16)
    b_frag = WMMA.llvm_wmma_load_b_col_m16n16k8_global_stride_tf32(pointer(b_dev), 8)
    c_frag = WMMA.llvm_wmma_load_c_col_m16n16k8_global_stride_f32(pointer(c_dev), 16)
    d_frag = WMMA.llvm_wmma_mma_col_col_m16n16k8_tf32(a_frag, b_frag, c_frag)
    WMMA.llvm_wmma_store_d_col_m16n16k8_global_stride_f32(pointer(d_dev), d_frag, 16)
    return nothing
end

function call_kernel()
    m = n = 16
    k = 8
    dtype_a = dtype_b = Float32
    dtype_c = dtype_d = Float32
    d_a = CUDA.rand(dtype_a, m, k)
    d_b = CUDA.rand(dtype_b, k, n)
    d_c = CUDA.rand(dtype_c, m, n)
    d_d = CUDA.zeros(dtype_d, m, n)
    CUDA.@sync @cuda kernel_wmma_tf32_lowlevel(d_a, d_b, d_c, d_d)
    return nothing
end
```

Bad news: I now get a segfault instead 😄
Yes, exactly.
Yes, this is probably the patch that you need, and is missing in Julia 1.7: https://reviews.llvm.org/D104847.
Looks like a bug in instruction selection in LLVM. I think @maleadt can help you more than I can with that issue.
Note that I see a similar (the same?) segfault with f64 wmma: #1426 (comment)
Try using an assertions build. That should become easier as soon as I tag the next version of LLVM.jl (today, normally).
When I use an assertions build of Julia I get

```
julia> using CUDA
[ Info: Precompiling CUDA [052768ef-5323-5732-b1bb-66c8b64840ba]
┌ Warning: You are using a version of Julia that links against a build of LLVM with assertions enabled.
│
│ This is not supported out-of-the-box, and you need a build of libLLVMExtra that supports this.
│ Use `deps/build_local.jl` for that, add the resulting LocalPreferences.toml to your project
│ and add a direct dependency on LLVMExtra_jll to pick up those preferences.
└ @ LLVM /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LLVM/MJqe4/src/LLVM.jl:24
[...]
```

AFAICT, there is no […]
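For reference, the workflow that warning points at looks roughly like this (a sketch; the invocation details are assumptions, the LLVM.jl README is the authoritative source):

```julia
# From a checkout of LLVM.jl, build a local libLLVMExtra:
#     julia deps/build_local.jl
# Copy the resulting LocalPreferences.toml next to your Project.toml,
# then add a direct dependency so the preference is picked up:
using Pkg
Pkg.add("LLVMExtra_jll")
```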
Alright, I […]

```
julia> call_kernel()
julia: /workspace/srcdir/llvm-project/llvm/include/llvm/CodeGen/SelectionDAGNodes.h:1105: llvm::SDValue::SDValue(llvm::SDNode*, unsigned int): Assertion `(!Node || !ResNo || ResNo < Node->getNumValues()) && "Invalid result number for the given node!"' failed.
signal (6): Aborted
in expression starting at REPL[9]:1
gsignal at /lib64/libc.so.6 (unknown line)
abort at /lib64/libc.so.6 (unknown line)
__assert_fail_base.cold.0 at /lib64/libc.so.6 (unknown line)
__assert_fail at /lib64/libc.so.6 (unknown line)
_ZN4llvm7SDValueC1EPNS_6SDNodeEj at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm16SelectionDAGISel9MorphNodeEPNS_6SDNodeEjNS_8SDVTListENS_8ArrayRefINS_7SDValueEEEj at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm16SelectionDAGISel16SelectCodeCommonEPNS_6SDNodeEPKhj at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm16SelectionDAGISel22DoInstructionSelectionEv at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm16SelectionDAGISel17CodeGenAndEmitDAGEv at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm16SelectionDAGISel20SelectAllBasicBlocksERKNS_8FunctionE at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm16SelectionDAGISel20runOnMachineFunctionERNS_15MachineFunctionE.part.972 at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm19MachineFunctionPass13runOnFunctionERNS_8FunctionE at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm13FPPassManager13runOnFunctionERNS_8FunctionE at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm13FPPassManager11runOnModuleERNS_6ModuleE at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZN4llvm6legacy15PassManagerImpl3runERNS_6ModuleE at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
_ZL21LLVMTargetMachineEmitP23LLVMOpaqueTargetMachineP16LLVMOpaqueModuleRN4llvm17raw_pwrite_streamE19LLVMCodeGenFileTypePPc at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
LLVMTargetMachineEmitToMemoryBuffer at /scratch/pc2-mitarbeiter/bauerc/building/julia/julia-source/usr/bin/../lib/libLLVM-13jl.so (unknown line)
LLVMTargetMachineEmitToMemoryBuffer at /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LLVM/P1t7P/lib/13/libLLVM_h.jl:947 [inlined]
emit at /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LLVM/P1t7P/src/targetmachine.jl:45
mcgen at /scratch/pc2-mitarbeiter/bauerc/.julia/packages/GPUCompiler/I9fZc/src/mcgen.jl:74
unknown function (ip: 0x152e1961abdf)
[...]
```
Can you show the IR?
You want the LLVM IR, i.e.

```
julia> @device_code_llvm dump_module=true call_kernel()
; PTX CompilerJob of kernel kernel_wmma_tf32_lowlevel(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}) for sm_80
; ModuleID = 'text'
source_filename = "text"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
; Function Attrs: argmemonly nounwind readonly
declare { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.a.col.stride.tf32.p1i8(i8 addrspace(1)* nocapture readonly, i32) #0
; Function Attrs: argmemonly nounwind readonly
declare { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.b.col.stride.tf32.p1i8(i8 addrspace(1)* nocapture readonly, i32) #0
; Function Attrs: argmemonly nounwind readonly
declare { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.c.col.stride.f32.p1i8(i8 addrspace(1)* nocapture readonly, i32) #0
; Function Attrs: nounwind readnone
declare { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.mma.col.col.tf32(float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float) #1
; Function Attrs: argmemonly nounwind writeonly
declare void @llvm.nvvm.wmma.m16n16k8.store.d.col.stride.f32.p1i8(i8 addrspace(1)* nocapture writeonly, float, float, float, float, float, float, float, float, i32) #2
; @ REPL[1]:1 within `kernel_wmma_tf32_lowlevel`
define ptx_kernel void @_Z36julia_kernel_wmma_tf32_lowlevel_263813CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE([1 x i64] %state, { i8 addrspace(1)*, i64, [2 x i64], i64 } %0, { i8 addrspace(1)*, i64, [2 x i64], i64 } %1, { i8 addrspace(1)*, i64, [2 x i64], i64 } %2, { i8 addrspace(1)*, i64, [2 x i64], i64 } %3) local_unnamed_addr #3 {
entry:
%.fca.0.extract72 = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %0, 0
%.fca.0.extract62 = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %1, 0
%.fca.0.extract52 = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %2, 0
%.fca.0.extract49 = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, 0
; @ REPL[1]:2 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:214 within `llvm_wmma_load_a_col_m16n16k8_global_stride_tf32`
%4 = call { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.a.col.stride.tf32.p1i8(i8 addrspace(1)* %.fca.0.extract72, i32 16)
%.fca.0.extract33 = extractvalue { float, float, float, float, float, float, float, float } %4, 0
%.fca.1.extract34 = extractvalue { float, float, float, float, float, float, float, float } %4, 1
%.fca.2.extract35 = extractvalue { float, float, float, float, float, float, float, float } %4, 2
%.fca.3.extract36 = extractvalue { float, float, float, float, float, float, float, float } %4, 3
%.fca.4.extract37 = extractvalue { float, float, float, float, float, float, float, float } %4, 4
%.fca.5.extract38 = extractvalue { float, float, float, float, float, float, float, float } %4, 5
%.fca.6.extract39 = extractvalue { float, float, float, float, float, float, float, float } %4, 6
%.fca.7.extract40 = extractvalue { float, float, float, float, float, float, float, float } %4, 7
; └
; @ REPL[1]:3 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:214 within `llvm_wmma_load_b_col_m16n16k8_global_stride_tf32`
%5 = call { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.b.col.stride.tf32.p1i8(i8 addrspace(1)* %.fca.0.extract62, i32 8)
%.fca.0.extract17 = extractvalue { float, float, float, float, float, float, float, float } %5, 0
%.fca.1.extract18 = extractvalue { float, float, float, float, float, float, float, float } %5, 1
%.fca.2.extract19 = extractvalue { float, float, float, float, float, float, float, float } %5, 2
%.fca.3.extract20 = extractvalue { float, float, float, float, float, float, float, float } %5, 3
%.fca.4.extract21 = extractvalue { float, float, float, float, float, float, float, float } %5, 4
%.fca.5.extract22 = extractvalue { float, float, float, float, float, float, float, float } %5, 5
%.fca.6.extract23 = extractvalue { float, float, float, float, float, float, float, float } %5, 6
%.fca.7.extract24 = extractvalue { float, float, float, float, float, float, float, float } %5, 7
; └
; @ REPL[1]:4 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:214 within `llvm_wmma_load_c_col_m16n16k8_global_stride_f32`
%6 = call { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.c.col.stride.f32.p1i8(i8 addrspace(1)* %.fca.0.extract52, i32 16)
%.fca.0.extract1 = extractvalue { float, float, float, float, float, float, float, float } %6, 0
%.fca.1.extract2 = extractvalue { float, float, float, float, float, float, float, float } %6, 1
%.fca.2.extract3 = extractvalue { float, float, float, float, float, float, float, float } %6, 2
%.fca.3.extract4 = extractvalue { float, float, float, float, float, float, float, float } %6, 3
%.fca.4.extract5 = extractvalue { float, float, float, float, float, float, float, float } %6, 4
%.fca.5.extract6 = extractvalue { float, float, float, float, float, float, float, float } %6, 5
%.fca.6.extract7 = extractvalue { float, float, float, float, float, float, float, float } %6, 6
%.fca.7.extract8 = extractvalue { float, float, float, float, float, float, float, float } %6, 7
; └
; @ REPL[1]:6 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:355 within `llvm_wmma_mma_col_col_m16n16k8_tf32`
%7 = call { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.mma.col.col.tf32(float %.fca.0.extract33, float %.fca.1.extract34, float %.fca.2.extract35, float %.fca.3.extract36, float %.fca.4.extract37, float %.fca.5.extract38, float %.fca.6.extract39, float %.fca.7.extract40, float %.fca.0.extract17, float %.fca.1.extract18, float %.fca.2.extract19, float %.fca.3.extract20, float %.fca.4.extract21, float %.fca.5.extract22, float %.fca.6.extract23, float %.fca.7.extract24, float %.fca.0.extract1, float %.fca.1.extract2, float %.fca.2.extract3, float %.fca.3.extract4, float %.fca.4.extract5, float %.fca.5.extract6, float %.fca.6.extract7, float %.fca.7.extract8)
%.fca.0.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 0
%.fca.1.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 1
%.fca.2.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 2
%.fca.3.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 3
%.fca.4.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 4
%.fca.5.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 5
%.fca.6.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 6
%.fca.7.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 7
; └
; @ REPL[1]:8 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:277 within `llvm_wmma_store_d_col_m16n16k8_global_stride_f32`
call void @llvm.nvvm.wmma.m16n16k8.store.d.col.stride.f32.p1i8(i8 addrspace(1)* %.fca.0.extract49, float %.fca.0.extract, float %.fca.1.extract, float %.fca.2.extract, float %.fca.3.extract, float %.fca.4.extract, float %.fca.5.extract, float %.fca.6.extract, float %.fca.7.extract, i32 16)
; └
; @ REPL[1]:9 within `kernel_wmma_tf32_lowlevel`
ret void
}
attributes #0 = { argmemonly nounwind readonly }
attributes #1 = { nounwind readnone }
attributes #2 = { argmemonly nounwind writeonly }
attributes #3 = { "probe-stack"="inline-asm" }
!llvm.module.flags = !{!0, !1}
!nvvm.annotations = !{!2}
!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{void ([1 x i64], { i8 addrspace(1)*, i64, [2 x i64], i64 }, { i8 addrspace(1)*, i64, [2 x i64], i64 }, { i8 addrspace(1)*, i64, [2 x i64], i64 }, { i8 addrspace(1)*, i64, [2 x i64], i64 })* @_Z36julia_kernel_wmma_tf32_lowlevel_263813CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE, !"kernel", i32 1}
julia: /workspace/srcdir/llvm-project/llvm/include/llvm/CodeGen/SelectionDAGNodes.h:1105: llvm::SDValue::SDValue(llvm::SDNode*, unsigned int): Assertion `(!Node || !ResNo || ResNo < Node->getNumValues()) && "Invalid result number for the given node!"' failed.
signal (6): Aborted
[...]
```
You want to include `dump_module=true`.
I updated my last comment and included `dump_module=true`.
Endlines were corrupted. Here's the fixed version:

```
; PTX CompilerJob of kernel kernel_wmma_tf32_lowlevel(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}) for sm_80
; ModuleID = 'text'
source_filename = "text"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
; Function Attrs: argmemonly nounwind readonly
declare { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.a.col.stride.tf32.p1i8(i8 addrspace(1)* nocapture readonly, i32) #0
; Function Attrs: argmemonly nounwind readonly
declare { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.b.col.stride.tf32.p1i8(i8 addrspace(1)* nocapture readonly, i32) #0
; Function Attrs: argmemonly nounwind readonly
declare { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.c.col.stride.f32.p1i8(i8 addrspace(1)* nocapture readonly, i32) #0
; Function Attrs: nounwind readnone
declare { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.mma.col.col.tf32(float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float) #1
; Function Attrs: argmemonly nounwind writeonly
declare void @llvm.nvvm.wmma.m16n16k8.store.d.col.stride.f32.p1i8(i8 addrspace(1)* nocapture writeonly, float, float, float, float, float, float, float, float, i32) #2
; @ REPL[1]:1 within `kernel_wmma_tf32_lowlevel`
define ptx_kernel void @_Z36julia_kernel_wmma_tf32_lowlevel_263813CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE([1 x i64] %state, { i8 addrspace(1)*, i64, [2 x i64], i64 } %0, { i8 addrspace(1)*, i64, [2 x i64], i64 } %1, { i8 addrspace(1)*, i64, [2 x i64], i64 } %2, { i8 addrspace(1)*, i64, [2 x i64], i64 } %3) local_unnamed_addr #3 {
entry:
%.fca.0.extract72 = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %0, 0
%.fca.0.extract62 = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %1, 0
%.fca.0.extract52 = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %2, 0
%.fca.0.extract49 = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, 0
; @ REPL[1]:2 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:214 within `llvm_wmma_load_a_col_m16n16k8_global_stride_tf32`
%4 = call { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.a.col.stride.tf32.p1i8(i8 addrspace(1)* %.fca.0.extract72, i32 16)
%.fca.0.extract33 = extractvalue { float, float, float, float, float, float, float, float } %4, 0
%.fca.1.extract34 = extractvalue { float, float, float, float, float, float, float, float } %4, 1
%.fca.2.extract35 = extractvalue { float, float, float, float, float, float, float, float } %4, 2
%.fca.3.extract36 = extractvalue { float, float, float, float, float, float, float, float } %4, 3
%.fca.4.extract37 = extractvalue { float, float, float, float, float, float, float, float } %4, 4
%.fca.5.extract38 = extractvalue { float, float, float, float, float, float, float, float } %4, 5
%.fca.6.extract39 = extractvalue { float, float, float, float, float, float, float, float } %4, 6
%.fca.7.extract40 = extractvalue { float, float, float, float, float, float, float, float } %4, 7
; └
; @ REPL[1]:3 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:214 within `llvm_wmma_load_b_col_m16n16k8_global_stride_tf32`
%5 = call { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.b.col.stride.tf32.p1i8(i8 addrspace(1)* %.fca.0.extract62, i32 8)
%.fca.0.extract17 = extractvalue { float, float, float, float, float, float, float, float } %5, 0
%.fca.1.extract18 = extractvalue { float, float, float, float, float, float, float, float } %5, 1
%.fca.2.extract19 = extractvalue { float, float, float, float, float, float, float, float } %5, 2
%.fca.3.extract20 = extractvalue { float, float, float, float, float, float, float, float } %5, 3
%.fca.4.extract21 = extractvalue { float, float, float, float, float, float, float, float } %5, 4
%.fca.5.extract22 = extractvalue { float, float, float, float, float, float, float, float } %5, 5
%.fca.6.extract23 = extractvalue { float, float, float, float, float, float, float, float } %5, 6
%.fca.7.extract24 = extractvalue { float, float, float, float, float, float, float, float } %5, 7
; └
; @ REPL[1]:4 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:214 within `llvm_wmma_load_c_col_m16n16k8_global_stride_f32`
%6 = call { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.load.c.col.stride.f32.p1i8(i8 addrspace(1)* %.fca.0.extract52, i32 16)
%.fca.0.extract1 = extractvalue { float, float, float, float, float, float, float, float } %6, 0
%.fca.1.extract2 = extractvalue { float, float, float, float, float, float, float, float } %6, 1
%.fca.2.extract3 = extractvalue { float, float, float, float, float, float, float, float } %6, 2
%.fca.3.extract4 = extractvalue { float, float, float, float, float, float, float, float } %6, 3
%.fca.4.extract5 = extractvalue { float, float, float, float, float, float, float, float } %6, 4
%.fca.5.extract6 = extractvalue { float, float, float, float, float, float, float, float } %6, 5
%.fca.6.extract7 = extractvalue { float, float, float, float, float, float, float, float } %6, 6
%.fca.7.extract8 = extractvalue { float, float, float, float, float, float, float, float } %6, 7
; └
; @ REPL[1]:6 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:355 within `llvm_wmma_mma_col_col_m16n16k8_tf32`
%7 = call { float, float, float, float, float, float, float, float } @llvm.nvvm.wmma.m16n16k8.mma.col.col.tf32(float %.fca.0.extract33, float %.fca.1.extract34, float %.fca.2.extract35, float %.fca.3.extract36, float %.fca.4.extract37, float %.fca.5.extract38, float %.fca.6.extract39, float %.fca.7.extract40, float %.fca.0.extract17, float %.fca.1.extract18, float %.fca.2.extract19, float %.fca.3.extract20, float %.fca.4.extract21, float %.fca.5.extract22, float %.fca.6.extract23, float %.fca.7.extract24, float %.fca.0.extract1, float %.fca.1.extract2, float %.fca.2.extract3, float %.fca.3.extract4, float %.fca.4.extract5, float %.fca.5.extract6, float %.fca.6.extract7, float %.fca.7.extract8)
%.fca.0.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 0
%.fca.1.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 1
%.fca.2.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 2
%.fca.3.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 3
%.fca.4.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 4
%.fca.5.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 5
%.fca.6.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 6
%.fca.7.extract = extractvalue { float, float, float, float, float, float, float, float } %7, 7
; └
; @ REPL[1]:8 within `kernel_wmma_tf32_lowlevel`
; ┌ @ /scratch/pc2-mitarbeiter/bauerc/devel/CUDA/src/device/intrinsics/wmma.jl:277 within `llvm_wmma_store_d_col_m16n16k8_global_stride_f32`
call void @llvm.nvvm.wmma.m16n16k8.store.d.col.stride.f32.p1i8(i8 addrspace(1)* %.fca.0.extract49, float %.fca.0.extract, float %.fca.1.extract, float %.fca.2.extract, float %.fca.3.extract, float %.fca.4.extract, float %.fca.5.extract, float %.fca.6.extract, float %.fca.7.extract, i32 16)
; └
; @ REPL[1]:9 within `kernel_wmma_tf32_lowlevel`
ret void
}
attributes #0 = { argmemonly nounwind readonly }
attributes #1 = { nounwind readnone }
attributes #2 = { argmemonly nounwind writeonly }
attributes #3 = { "probe-stack"="inline-asm" }
!llvm.module.flags = !{!0, !1}
!nvvm.annotations = !{!2}
!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{void ([1 x i64], { i8 addrspace(1)*, i64, [2 x i64], i64 }, { i8 addrspace(1)*, i64, [2 x i64], i64 }, { i8 addrspace(1)*, i64, [2 x i64], i64 }, { i8 addrspace(1)*, i64, [2 x i64], i64 })* @_Z36julia_kernel_wmma_tf32_lowlevel_263813CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE, !"kernel", i32 1}
```

If I try to process this with […]

So probably an intrinsic misuse?
Thanks for helping me to debug this, Tim! I was using the wrong fragment sizes. I fixed this in the latest commit, for which […]
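This matches the PTX documentation, where the m16n16k8 TF32 A and B fragments hold four elements each rather than eight. A quick numerical sanity check for the fixed kernel might look like this (a sketch; the tolerance is an assumption, since TF32 only keeps 10 mantissa bits):

```julia
using CUDA, Test

# Compare the WMMA result d = a * b + c against a CPU reference.
function check_wmma_tf32()
    a = CUDA.rand(Float32, 16, 8)
    b = CUDA.rand(Float32, 8, 16)
    c = CUDA.rand(Float32, 16, 16)
    d = CUDA.zeros(Float32, 16, 16)
    CUDA.@sync @cuda kernel_wmma_tf32_lowlevel(a, b, c, d)
    @test Array(d) ≈ Array(a) * Array(b) + Array(c) rtol=1e-2
end
```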
TODOs
Note: tests are failing since they're running under Julia 1.6 (and 1.8 is required for this PR). Anything I can / should do about it on my end, @maleadt?
Only 1.6 is used for draft PRs, but the tests shouldn't crash there; instead, you should check […]
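A guard along these lines would keep the 1.6 CI from running the new tests (a sketch; the exact predicate is an assumption, based on the Julia >= 1.8 requirement above and the sm_80 target in the IR dumps):

```julia
# Only run the TF32 WMMA tests where they can work:
# Julia 1.8+ (for LLVM 13) and an Ampere-class (sm_80+) device.
if VERSION >= v"1.8" && capability(device()) >= v"8.0"
    @testset "WMMA TF32" begin
        # ... TF32 test kernels from this PR ...
    end
end
```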
If the tests pass (which they should), we can mark this as non-draft IMHO.
Done (I think).
CI failures look related.
In this PR I will try to add TensorFloat32 (TF32) support for WMMAs. I'm surely the wrong person for the job, but let's see where things will take me / us 😄
Some resources: […]
cc: @HenriDeh @thomasfaingnaert
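For context, this is roughly what the existing high-level WMMA API looks like for the FP16 case (adapted from the CUDA.jl documentation); a TF32 path would presumably mirror it with M = N = 16, K = 8 and Float32 inputs:

```julia
using CUDA

# High-level FP16 WMMA: 16x16x16 fragments with a Float32 accumulator.
function kernel_wmma_f16_highlevel(a_dev, b_dev, c_dev, d_dev)
    conf = WMMA.Config{16, 16, 16, Float32}
    a_frag = WMMA.load_a(pointer(a_dev), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b_dev), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.load_c(pointer(c_dev), 16, WMMA.ColMajor, conf)
    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)
    WMMA.store_d(pointer(d_dev), d_frag, 16, WMMA.ColMajor, conf)
    return nothing
end
```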