Skip to content

GPU optimized symmetry operations#1097

Merged
mfherbst merged 6 commits into
JuliaMolSim:masterfrom
abussy:symmetry_gpu
Jun 7, 2025
Merged

GPU optimized symmetry operations#1097
mfherbst merged 6 commits into
JuliaMolSim:masterfrom
abussy:symmetry_gpu

Conversation

@abussy

@abussy abussy commented May 16, 2025

Copy link
Copy Markdown
Collaborator

When simulating large systems with symmetries, the symmetrization of the density is extremely slow. This is because the array is first transferred to the CPU, before a double loop over symmetries and G vectors takes place. By comparison to loops running on the GPU, this is slow, and the operation becomes a major bottleneck.

This PR introduces GPU specific implementations for accumulate_over_symmetries! and lowpass_for_symemtry!, using the map! construct to run on the device, and thus saving data transfers. The loop itself is also greatly accelerated. Note that these new implementations do run on the CPU, but slower than the current solution.

For illustration, using the input below, I observe these timings (seconds):

GPU master this PR
A100 195 14
H100 135 9
Mi200 212 26
using DFTK
using PseudoPotentialData
using CUDA
# using AMDGPU
setup_threading(;)

arch = DFTK.GPU(CuArray)
# arch = DFTK.CPU(ROCArray)

Ecut = 32.0
kgrid = [2, 1, 1]
maxiter = 5
tol = 1.0e-8
temperature = 1e-4

factor = 4
a = 3.8267
lattice = factor*a * [[0.0 1.0 1.0];
                      [1.0 0.0 1.0];
                      [1.0 1.0 0.0]]
Al = ElementPsp(:Al, PseudoFamily("dojo.nc.sr.pbe.v0_4_1.stringent.upf"))
atoms = [Al for i in 1:factor^3]
positions = Vector{Vector{Float64}}([])
for i = 1:factor, j = 1:factor, k=1:factor
   push!(positions, [i/factor, j/factor, k/factor])
end

model = model_DFT(lattice, atoms, positions; temperature=temperature,
                  functionals=PBE(), smearing=DFTK.Smearing.Gaussian())

# warmup
basis  = PlaneWaveBasis(model; Ecut, kgrid, architecture=arch)
scfres = self_consistent_field(basis; maxiter=3, tol=tol);

#actual calculations
DFTK.reset_timer!(DFTK.timer)
basis  = PlaneWaveBasis(model; Ecut, kgrid, architecture=arch)
scfres = self_consistent_field(basis; maxiter=maxiter, tol=tol);
@show DFTK.timer

@mfherbst mfherbst left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, some thoughts.

Comment thread src/gpu/symmetry.jl Outdated
Comment thread src/gpu/symmetry.jl Outdated
@abussy

abussy commented May 16, 2025

Copy link
Copy Markdown
Collaborator Author

Changing the order of the loops reduces the number of kernel calls and large copies. As a result, performance is greatly improved:

GPU master original PR new loop order
A100 195 14 8.5
H100 135 9 6.5
Mi200 212 26 24

Note that this introduces some level of clunkiness. For example, all data accessed within the map! mus be isbit. Therefore, various fields of the symmetry vector must first be copied in their own arrays and transferred to the GPU. The cost of these actual copies is negligible due to the small size of the data and the low frequency of the transfer (I compared timings to a case were I did these transfer once and for all, for peace of mind).

@mfherbst mfherbst left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Your current implementation, spiked another idea: Fuse both functions. If it's too complicated don't bother, but I think it should work.

Comment thread src/gpu/symmetry.jl Outdated
Comment thread src/gpu/symmetry.jl Outdated
Comment thread src/gpu/symmetry.jl Outdated

@mfherbst mfherbst left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks !

Comment thread src/symmetry.jl
Comment thread src/symmetry.jl Outdated
@mfherbst

mfherbst commented Jun 5, 2025

Copy link
Copy Markdown
Member

@abussy Failing test is the Inlining test of test/symmetry_issues.jl. Maybe your refactoring broke or invalidated that ?

@abussy

abussy commented Jun 5, 2025

Copy link
Copy Markdown
Collaborator Author

There is something off with the way that test is set up. If I switch the order of these 2 lines, suddenly the test passes. The actual numbers also change if I do a warmup before the test (calling the same 2 lines without the @allocated).

My interpretation is that, since the G_vectors_generator function is not called anymore in DFTK, there is some compilation taking place during the test.

In any case, that test is not well designed (that's on me). I'd suggest we get rid of it.

@mfherbst

mfherbst commented Jun 5, 2025

Copy link
Copy Markdown
Member

The G_vectors_generator has always been a hack to get the symmetrisation fast. Since it's now done differently and this function no longer needed, we can completely remove it.

Yeah I agree the test does not really make sense here any more. Is it still worth it to test that index_G_vectors is properly inlined in the new order of the loops ?

@abussy

abussy commented Jun 6, 2025

Copy link
Copy Markdown
Collaborator Author

The G_vectors_generator has always been a hack to get the symmetrisation fast. Since it's now done differently and this function no longer needed, we can completely remove it.

I guess we can. However it might be good to keep a trace of it, in case a similarly slow iterations pop up in the future. The function itself does not crowd to code base too much.

Yeah I agree the test does not really make sense here any more. Is it still worth it to test that index_G_vectors is properly inlined in the new order of the loops ?

I think that testing inlining cannot hurt. I modified the test to analyze the output of code_typed directly. It should be a lot more robust.

@mfherbst mfherbst enabled auto-merge (squash) June 6, 2025 14:05
@mfherbst mfherbst merged commit cf49fe2 into JuliaMolSim:master Jun 7, 2025
12 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants