Use preferences to switch between SIMD and KernelAbstractions #133
b-fg merged 22 commits into WaterLily-jl:master
Conversation
|
Thanks for catching that; I was aware of it. Also, not specifying the workgroup size did not yield a noticeable performance increase. |
It is a bit tricky between CPU and GPU. Right now the KA backend on the CPU is rather slow since the basecase size is small; the CPU does much better with larger basecases. We don't have a way to calculate that basecase automatically, so we use 1024 on the CPU as a default. On the GPU a static basecase is nice since it allows some of the index integer operations to be optimized away. |
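To make the trade-off concrete, here is a minimal sketch of how the workgroup size is fixed when a KernelAbstractions kernel is instantiated (`copy_kernel!` is a hypothetical kernel for illustration, not WaterLily code):

```julia
using KernelAbstractions

# Hypothetical kernel: one work-item per Cartesian index.
@kernel function copy_kernel!(a, @Const(b))
    I = @index(Global, Cartesian)
    @inbounds a[I] = b[I]
end

a, b = zeros(64, 64, 64), rand(64, 64, 64)
backend = CPU()
# The workgroup size is fixed at instantiation time: a large value
# (e.g. 1024) amortizes per-workgroup task overhead on the CPU, while
# a small static value lets a GPU optimize away index integer operations.
kernel! = copy_kernel!(backend, 64)
kernel!(a, b, ndrange=size(a))
KernelAbstractions.synchronize(backend)
```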
|
I did some preliminary benchmarks with different mesh sizes |
|
It would be interesting to use |
|
On my laptop GPU, I found no regression with this PR; in fact, a very small speed-up. TGV (b01cdce is this PR, 5c78c37 is this PR with a 64 workgroup size, f38bea4 is master) and Jelly |
|
I did some more benchmarks after a local merge of master with this PR. All looks good except removing the workgroup size as we had it before. Benchmarks |
|
So the default workgroupsize for KA is 1024. With 64 you create a lot of small tasks; what is the typical ndrange you use? |
|
For example, the TGV case is a 3D case for which I tested domain sizes of 64^3 and 128^3; the arrays we use then match those sizes. |
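As a quick sanity check on those numbers (my own arithmetic, assuming a linearized ndrange):

```julia
# 64^3 grid with a workgroup size of 64
ndrange = 64^3                  # 262144 work-items
nworkgroups = cld(ndrange, 64)  # 4096 workgroups, with no remainder
```

Since 64^3 divides evenly by 64, every workgroup is full, which is consistent with the "perfectly sized blocks" observation below.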
|
Ah so you are getting perfectly sized blocks, by accident xD You may want to use |
|
Sure, I will do some tests after my summer break. But does this mean that we cannot use the default workgroup size (as in this PR)? Could this be something to improve in KA, where it would try to automatically determine it based on ndrange? |
|
Yeah I will need to improve this on the KA side |
|
I just tagged a new KA version with the fix. This might remove the need for the SIMD variant entirely. |
|
I have tested the changes and, while the results improve, it is still not there. Benchmarks |
|
@b-fg Something I picked up today. It's kind of related to this, I suppose; that's why I added it here. |
|
You mean these test lines, right? Line 6 in c4d3500. This is not related to this PR, though. The problem is how to automatically detect that CUDA is available without loading CUDA.jl. |
When the SIMD backend is selected, the loop macro generates only the for loop, without a function wrapper. Also, dispatch based on the number of threads has been removed, and now only the backend-specific kernel is compiled. The CI needs to be fixed for the allocation tests: it first needs to set the SIMD backend and then re-run the tests.
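A minimal sketch of what compile-time backend selection with Preferences can look like (hypothetical names throughout; the real `@loop` macro in src/util.jl is more involved):

```julia
using Preferences

# Read once at compile/precompile time; flipping the preference
# re-precompiles the package rather than branching at runtime.
const BACKEND = @load_preference("backend", "KernelAbstractions")

macro loop(ex)
    if BACKEND == "SIMD"
        # SIMD path: emit the bare loop, with no function wrapper
        esc(:(@fastmath @inbounds @simd $ex))
    else
        # KA path (omitted here): build and launch a @kernel instead
        esc(:(error("KA path not shown in this sketch")))
    end
end
```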
|
As a result of our conversation in #198, I thought it was about time to put this to use... So I have cleaned up the Preferences.jl routines a bit with the new API. The only thing left to figure out is the allocation tests: I currently do not know how to update the CI so that the SIMD backend is set before they run. |
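For reference, this kind of preference is typically recorded in a LocalPreferences.toml next to the active Project.toml (a sketch; the key name `backend` is an assumption, not necessarily the one used in this PR):

```toml
[WaterLily]
backend = "SIMD"
```

The same entry can be written programmatically with `Preferences.set_preferences!(WaterLily, "backend" => "SIMD")`, which is one way a CI job could flip the backend before re-running the tests.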
|
To do:
|
…0pc slower than master...
|
I have been experimenting again with the workgroup size, and the best results are almost always with a constant workgroup size. Also, this generated code:

```julia
@simd for $I ∈ $R
    @fastmath @inbounds $ex
end
```

is 40% slower than this one:

```julia
function $kern($(rep.(sym)...))
    @simd for $I ∈ $R
        @fastmath @inbounds $ex
    end
end
$kern($(sym...))
```

I do not understand why. In any case, I have reverted back to the wrapped one, which is similar to what we have in master, but without the dynamic dispatch based on the number of threads. And now I think we should really try the specialization for each argument as we discussed, @weymouth. With the current PR state, these are the benchmarks: |
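One plausible explanation for the gap (my speculation, illustrated with generic code rather than the actual macro output) is Julia's function barrier: the wrapper gives the compiler a fresh specialization on the concrete array types, while the bare loop inherits whatever type information the surrounding scope has:

```julia
# `data` is deliberately untyped, mimicking a scope with poor type info.
data = Any[rand(10^6)]

function sum_bare(data)
    a = data[1]          # a::Any here, so the loop below is not specialized
    s = 0.0
    @simd for i in eachindex(a)
        @inbounds s += a[i]
    end
    s
end

# Function barrier: a specialized method is compiled for Vector{Float64}.
function inner_sum(a)
    s = 0.0
    @simd for i in eachindex(a)
        @inbounds s += a[i]
    end
    s
end
sum_wrapped(data) = inner_sum(data[1])
```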
|
New CI working nicely with LocalPreferences! |
|
@vchuravy any ideas on how to bypass the warning about using KA in single-threaded mode (instead of SIMD) during precompilation? Since we use KA as the default backend and precompilation typically happens with a single thread, the warning always pops up, which is not ideal. I am not sure how this can be addressed. |
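Not sure what ended up being adopted, but one common pattern is to skip the warning while precompile output is being generated, which can be detected with a `ccall` (a sketch; the function name and warning text are made up):

```julia
function maybe_warn_backend()
    # True while Julia is generating precompile output (.ji / pkgimage),
    # so the warning is suppressed during precompilation only.
    ccall(:jl_generating_output, Cint, ()) == 1 && return
    if Threads.nthreads() == 1
        @warn "KA backend selected but Julia runs with a single thread; consider the SIMD backend."
    end
end
```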
|
The kernel is now wrapped in a function that specializes on every argument type:

```julia
function kern(a::A, b::B, c::C, ...) where {A,B,C,...}
    ...
end
```

Below is an example of the expansion:

```julia
@macroexpand WaterLily.@loop a[I] += b[I] over I in CartesianIndices(a)
```

```julia
quote
    #= /home/b-fg/Workspace/tudelft1/WaterLily.jl/src/util.jl:155 =#
    function var"##kern#242"(a::DBGS, b::FHGH) where {DBGS, FHGH}
        #= /home/b-fg/Workspace/tudelft1/WaterLily.jl/src/util.jl:155 =#
        #= /home/b-fg/Workspace/tudelft1/WaterLily.jl/src/util.jl:156 =#
        begin
            #= simdloop.jl:69 =#
            let var"##r#244" = CartesianIndices(a)
                #= simdloop.jl:70 =#
                for var"##i#245" = Base.simd_outer_range(var"##r#244")
                    #= simdloop.jl:71 =#
                    let var"##n#246" = Base.simd_inner_length(var"##r#244", var"##i#245")
                        #= simdloop.jl:72 =#
                        if zero(var"##n#246") < var"##n#246"
                            #= simdloop.jl:74 =#
                            let var"##i#247" = zero(var"##n#246")
                                #= simdloop.jl:75 =#
                                while var"##i#247" < var"##n#246"
                                    #= simdloop.jl:76 =#
                                    local I = Base.simd_index(var"##r#244", var"##i#245", var"##i#247")
                                    #= simdloop.jl:77 =#
                                    begin
                                        #= /home/b-fg/Workspace/tudelft1/WaterLily.jl/src/util.jl:157 =#
                                        begin
                                            $(Expr(:inbounds, true))
                                            local var"#4#val" = (a[I] += b[I])
                                            $(Expr(:inbounds, :pop))
                                            var"#4#val"
                                        end
                                        #= /home/b-fg/Workspace/tudelft1/WaterLily.jl/src/util.jl:158 =#
                                    end
                                    #= simdloop.jl:78 =#
                                    var"##i#247" += 1
                                    #= simdloop.jl:79 =#
                                    $(Expr(:loopinfo, Symbol("julia.simdloop"), nothing))
                                    #= simdloop.jl:80 =#
                                end
                            end
                        end
                    end
                    #= simdloop.jl:84 =#
                end
            end
            #= simdloop.jl:86 =#
            nothing
        end
    end
    #= /home/b-fg/Workspace/tudelft1/WaterLily.jl/src/util.jl:160 =#
    var"##kern#242"(a, b)
end
```
|
Consistent 1-2% speedup on all backends, and we will (hopefully) be able to specialize kernels passing a function. So this ticks all the boxes :) Benchmarks |
|
We should address the warning issue, and this PR is good to go! |
Nice! It's interesting that allocations are down for a single thread but not for multiple threads.
|
Yes! Removing the dynamic dispatch based on the number of threads, and instead just compiling the SIMD kernel based on LocalPreferences, brought down allocations significantly for a single thread. The small general speedup then resulted from specializing the wrapper function. |
…nd not for every ext module when loading.
I was experimenting with using PrecompileTools on WaterLily, and the choice to dispatch to the SIMD backend depending on the `nthreads` variable caused issues:

- `nthreads` is no longer constant.
- `nthreads == 1` in the precompilation process, thus exercising the wrong code path.

Opening this as a draft for now to solicit feedback. One would probably need to change the tests such that both code paths are tested.