Skip to content

KernelAbstractions for FFTs#1099

Merged
mfherbst merged 3 commits into
JuliaMolSim:masterfrom
abussy:fft_kernels
Jun 5, 2025
Merged

KernelAbstractions for FFTs#1099
mfherbst merged 3 commits into
JuliaMolSim:masterfrom
abussy:fft_kernels

Conversation

@abussy

@abussy abussy commented May 19, 2025

Copy link
Copy Markdown
Collaborator

Careful GPU profiling shows that a lot of time is wasted reshuffling arrays during K-point specific FFT calls. Since FFTs are a large part of DftHamiltonian multiplication and compute_density, accelerating them can have a large impact on performance, especially for systems with dense k-point meshes.

For illustration, a ifft! call currently looks like this in the profiler:

fft_master

The 3 blue squares on the right are the actual FFT, and the blue square on the left is the kernel filling the array with zeros. All the rest, i.e. the majority of the runtime, is spend in this operation: f_real[Gvec_mapping] = f_fourier. Something similar happens for fft!.

This PR introduces explicit GPU kernels for the reshuffling of data during FFTs. The KernelAbstractions.jl package is used in order to remain vendor agnostic (already an indirect dependency of DFTK through GPU packages). The kernel themselves are very simple, but their performance is much better. The same ifft! call now looks like the following in the profiler:

fft_kernel

For this to be possible, some light code uglification is necessary. The Gvec_mapping field of a given K-point must be available on the GPU at all time, and transferring it to the device at each call would defeat the purpose of this PR. As a solution for this, I added a Gvec_mapping_gpu field to the K-point struct (could also be called Gvec_mapping_fft). I decided to duplicate the data rather than moving the original Gvec_mapping field to the device, because the latter is assumed to be on the CPU in most places in the code, and it's inverse mapping_inv is not in a format suitable for the GPU (and it makes sense that back and forth ops are stored in the same manner).

Finally, here are some timings (s) for the input show below:

GPU master this PR
A100 44 33
H100 51 40
Mi200 71 54
using DFTK
using PseudoPotentialData
using CUDA
setup_threading(;)

arch = DFTK.GPU(CuArray)

Ecut = 72.0
kgrid = [10, 10, 10] 
temp = 0.01
tol = 1.0e-8
maxiter = 10

a = 7.260
lattice = a * [[1. 0. 0.];
               [0. 1. 0.];
               [0. 0. 1.]]

Sr = ElementPsp(:Sr, PseudoFamily("dojo.nc.sr.pbe.v0_4_1.stringent.upf"))
V = ElementPsp(:V, PseudoFamily("dojo.nc.sr.pbe.v0_4_1.stringent.upf"))
O = ElementPsp(:O, PseudoFamily("dojo.nc.sr.pbe.v0_4_1.stringent.upf"))

atoms     = [Sr, V, O, O, O]
positions = [[0.5, 0.5, 0.5],
             [0.0, 0.0, 0.0],
             [0.5, 0.0, 0.0],
             [0.0, 0.5, 0.0],
             [0.0, 0.0, 0.5]]

model = model_DFT(lattice, atoms, positions; temperature=temp, functionals=PBE(),
                  smearing=DFTK.Smearing.Gaussian())

#warmup
basis  = PlaneWaveBasis(model; Ecut, kgrid, architecture=arch)
scfres = self_consistent_field(basis; tol=tol, maxiter=3);

#actual calculations
DFTK.reset_timer!(DFTK.timer)
basis  = PlaneWaveBasis(model; Ecut, kgrid, architecture=arch)
scfres = self_consistent_field(basis; tol=tol, maxiter=maxiter);
@show DFTK.timer

@antoine-levitt

Copy link
Copy Markdown
Member

Nice, but for large scale computations we might be better off just doing non variational computations on a cubic grid, which avoids this computation entirely.

@mfherbst

Copy link
Copy Markdown
Member

I think there is a sweet spot in between where you need variational data. Especially in data generation for ML consistency between energy and forces within a structure and across a dataset is quits important for example.

But we should really test "how bad" such inconsistencies would be for nonvariational.

@abussy

abussy commented May 20, 2025

Copy link
Copy Markdown
Collaborator Author

I've learnt to be wary of non variational calculations in my (more scientific) past, especially if gradients are involved.

Generally, I think that variational calculations should always be an option, and probably the default one. Then, we might as well do it efficiently.

@antoine-levitt

Copy link
Copy Markdown
Member

Careful about the word variational here. Non variational in the dftk context just means that the discretization is non variational, ie the energy is not necessarily above the true one. Also note that even the "variational" energy is not strictly variational, because the xc energy is approximated on a grid. The fact that the discretization is not variational doesn't mean that the discrete orbitals are not: they still minimize the discretized energy, and the hellmann feynman theorem is still valid. So I wouldn't expect problems like that.

@mfherbst

Copy link
Copy Markdown
Member

I'd need a more careful review, but I think given the performance improvement that's a reasonable amount of code to add, don't you agree @antoine-levitt ?

@antoine-levitt antoine-levitt left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like the kernel overload should just be in GPUArrays, no? (ie x[I]=y should do what you implemented) Can we fix it there?

Comment thread src/fft.jl Outdated
@mfherbst

Copy link
Copy Markdown
Member

I think kernelabstractions also has a cpu backend, so perhaps we can actually avoid special-casing the gpu code. I also have a few issues with the proposed code structure ... will review more later.

@abussy

abussy commented May 26, 2025

Copy link
Copy Markdown
Collaborator Author

KernelAbstractions does have a CPU backend, and it uses Julia threads for parallelization (can also use OMP). I tested it, and overall performance drops when using multiple threads (compared to master). My interpretation is that since these FFTs are mostly (only?) called within parallel_loop_over_range, there might be some thread overload.

@mfherbst

Copy link
Copy Markdown
Member

there might be some thread overload.

Calling it on just one thread is not easily feasible ? If not then let's not bother and let's go with your proposed structure.

@abussy

abussy commented May 27, 2025

Copy link
Copy Markdown
Collaborator Author

Calling it on just one thread is not easily feasible ? If not then let's not bother and let's go with your proposed structure.

I looked into it, and it's not obvious. The only solution I found also underperforms wrt to master when running with threads.

Sounds like the kernel overload should just be in GPUArrays, no? (ie x[I]=y should do what you implemented) Can we fix it there?

I'll look into it, this could be a nice feature with potential impact elsewhere in the code. This would also mean overloading x = y[I] and using f_fourier = f_real[Gvec_mapping] in the fft calls rather than the current f_fourier .= view(f_real, Gvec_mapping) (I'll need to make sure there is no performance impact).

Comment thread src/Kpoint.jl Outdated
Comment thread src/Kpoint.jl

# Construct the kpoint with coordinate equivalent_kpt.coordinate + ΔG.
# Equivalent to (but faster than) Kpoint(equivalent_kpt.coordinate + ΔG).
function construct_from_equivalent_kpt(fft_size, equivalent_kpt, coordinate, ΔG)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's quite a low-level function. Can't we instead pass architecture ?

Comment thread src/gpu/fft.jl Outdated
Comment thread src/fft.jl Outdated
@abussy

abussy commented Jun 5, 2025

Copy link
Copy Markdown
Collaborator Author
  • Removed KernelAbstractions.jl dependency and explicit kernels
  • Use @inbounds f_real[Gvec_mapping] = f_fourier for the iFFT
  • Use map!(i -> f_real[i], f_fourier, Gvec_mapping) for the FFT
  • Added parametric type for the Gvec_mapping_gpu field of Kpoint

The full performance of the explicit kernels is retained, while greatly simplifying the code.

@mfherbst

mfherbst commented Jun 5, 2025

Copy link
Copy Markdown
Member

Great, thank you !

@mfherbst mfherbst merged commit 858f73d into JuliaMolSim:master Jun 5, 2025
5 of 8 checks passed
niklasschmitz pushed a commit that referenced this pull request Nov 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants