Closed
Description
Describe the bug
Dot product of a complex CuArray with a real CuArray: a function that works on pre-allocated arrays is slower than one that allocates a temporary array on every call.
To reproduce
The Minimal Working Example (MWE) for this bug:
using CUDA, BenchmarkTools, LinearAlgebra

N = 10000
a = CUDA.ones(Float32, N)
b = CUDA.ones(ComplexF32, N)
b_re = real.(b)
b_im = imag.(b)

# Allocates a temporary complex copy of a on every call
function dot_complex(a::CuArray{Float32}, b::CuArray{ComplexF32})
    dot(complex.(a, CUDA.zeros(length(a))), b)
end

# Works on the pre-split real and imaginary parts; no temporary arrays
function dot_real(a::CuArray{Float32}, b_re::CuArray{Float32}, b_im::CuArray{Float32})
    complex.(dot(a, b_re), dot(a, b_im))
end

@btime CUDA.@sync dot_complex($a, $b)         # 60.400 μs (45 allocations: 1.02 KiB)
@btime CUDA.@sync dot_real($a, $b_re, $b_im)  # 76.700 μs (17 allocations: 288 bytes)
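As a sanity check that the two functions in the MWE compute the same quantity, here is a minimal CPU-only sketch. It assumes plain Arrays as stand-ins for the CuArrays, so it runs without a GPU. The names dot_complex_cpu and dot_real_cpu are hypothetical and not part of the original report. It says nothing about GPU timings, only about the results agreeing.

```julia
using LinearAlgebra

# CPU stand-ins for the GPU arrays in the MWE (assumption: plain Arrays, random data)
N = 10_000
a = rand(Float32, N)
b = rand(ComplexF32, N)
b_re = real.(b)
b_im = imag.(b)

# Hypothetical CPU counterparts of dot_complex and dot_real
dot_complex_cpu(a, b) = dot(complex.(a, zero(a)), b)   # builds a temporary complex array
dot_real_cpu(a, b_re, b_im) = complex(dot(a, b_re), dot(a, b_im))  # two real dot products

r1 = dot_complex_cpu(a, b)
r2 = dot_real_cpu(a, b_re, b_im)

# Both formulations should agree up to Float32 rounding
@assert isapprox(r1, r2; rtol = 1f-3)
```

Because a is real, conjugating it inside dot has no effect, so both expressions reduce to the sum of a[i] * b[i].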
Manifest.toml
[[CUDA]]
deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CompilerSupportLibraries_jll", "DataStructures", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "Libdl", "LinearAlgebra", "Logging", "MacroTools", "NNlib", "Pkg", "Printf", "Random", "Reexport", "Requires", "SparseArrays", "Statistics", "TimerOutputs"]
git-tree-sha1 = "39f6f584bec264ace76f924d1c8637c85617697e"
uuid = "052768ef-5323-5732-b1bb-66c8b64840ba"
version = "2.4.0"
[[GPUArrays]]
deps = ["AbstractFFTs", "Adapt", "LinearAlgebra", "Printf", "Random", "Serialization"]
git-tree-sha1 = "f99a25fe0313121f2f9627002734c7d63b4dd3bd"
uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
version = "6.2.0"
[[LLVM]]
deps = ["CEnum", "Libdl", "Printf", "Unicode"]
git-tree-sha1 = "d0d99629d6ae4a3e211ae83d8870907bd842c811"
uuid = "929cbde3-209d-540e-8aea-75f648917ca0"
version = "3.5.2"
Expected behavior
I would expect dot_real to be faster than dot_complex, since it does not allocate any temporary arrays at run time.
Version info
Details on Julia:
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
JULIA_EDITOR = "C:\Program Files\Microsoft VS Code\Code.exe"
JULIA_NUM_THREADS =
Details on CUDA:
CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.1.0
NVIDIA driver 456.71.0
Libraries:
- CUBLAS: 11.3.0
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: 11.0.0+451.22
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)
Toolchain:
- Julia: 1.5.3
- LLVM: 9.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
1 device:
0: GeForce GTX 1050 Ti (sm_61, 3.216 GiB / 4.000 GiB available)
Additional context
Originally posted as #667.
Apologies for the duplicate issue.
I have reduced the problem to pure CUDA.jl, without any StructArrays.jl considerations.