Optimizing XC instantiation for GPUs by abussy · Pull Request #1262 · JuliaMolSim/DFTK.jl

abussy · 2026-02-20T17:08:22Z

This PR optimizes the instantiation of the XC term on the GPU, with a particularly large impact when Duals are involved (stress and response calculations). It follows the same design patterns as #1163.

Particular care was given to the hankel function in workarounds/forwarddiff_rules.jl to make it GPU-compatible. A side effect is that it no longer allocates, which also marginally (by a few percent) speeds up XC instantiation on the CPU.
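To illustrate the kind of change described above, here is a minimal sketch that assumes a trapezoidal radial quadrature and the order-0 spherical Bessel function. All names (`j0`, `radial_transform_alloc`, `radial_transform_noalloc`) are hypothetical and this is not the code in forwarddiff_rules.jl. It only contrasts an integrand built as a temporary array with the same sum accumulated in a scalar loop.

```julia
# Hypothetical sketch; not DFTK's hankel implementation.
# Spherical Bessel function of order 0, with the x -> 0 limit handled.
j0(x) = iszero(x) ? one(x) : sin(x) / x

# Allocating version: builds temporary arrays before integrating.
function radial_transform_alloc(r, f, p)
    integrand = r .^ 2 .* f .* j0.(p .* r)
    dr = diff(r)
    4π * sum(dr .* (integrand[1:end-1] .+ integrand[2:end]) ./ 2)
end

# Allocation-free version: accumulates the trapezoidal sum in a scalar loop.
function radial_transform_noalloc(r, f, p)
    g(i) = r[i]^2 * f[i] * j0(p * r[i])
    acc = zero(g(firstindex(r)))
    for i in firstindex(r):lastindex(r)-1
        acc += (r[i+1] - r[i]) * (g(i) + g(i+1)) / 2
    end
    4π * acc
end

# Toy radial grid and function (assumed data, for illustration only).
r = collect(range(0.0, 10.0; length=2001))
f = exp.(-r .^ 2)
```

Both functions compute the same quadrature, so `radial_transform_alloc(r, f, 1.5) ≈ radial_transform_noalloc(r, f, 1.5)` holds. The second form creates no intermediate arrays, which is the property that matters when such a function is evaluated for many values of `p`.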

For illustration, here are XC instantiation timings for the input below (stress of a 3x3x3 rattled aluminium supercell):

| GPU | master | this PR |
| --- | --- | --- |
| A100 | 65 s | 2 s |
| Mi250 | 66 s | 2 s |

using DFTK
using PseudoPotentialData
using CUDA
setup_threading()

arch = DFTK.GPU(CuArray)

Ecut = 50.0
kgrid = [1, 1, 1]
maxiter = 10
tol = 1.0e-10
temperature = 0.01

factor = 3
a = 3.8267
lattice = factor*a * [[0.0 1.0 1.0];
                      [1.0 0.0 1.0];
                      [1.0 1.0 0.0]]
Al = ElementPsp(:Al, PseudoFamily("dojo.nc.sr.pbe.v0_4_1.stringent.upf"))
atoms = [Al for i in 1:factor^3]
positions = Vector{Vector{Float64}}([])
for i = 1:factor, j = 1:factor, k=1:factor
   push!(positions, [i/factor, j/factor, k/factor])
end

# rattle a few atoms
positions[5] = positions[5] + [0.01, 0.0, -0.01]
positions[10] = positions[10] + [0.01, 0.0, -0.01]
positions[15] = positions[15] + [0.01, 0.0, -0.01]
positions[20] = positions[20] + [0.01, 0.0, -0.01]
positions[25] = positions[25] + [0.01, 0.0, -0.01]

model = model_DFT(lattice, atoms, positions; temperature=temperature,
                  functionals=PBE(), smearing=DFTK.Smearing.Gaussian())
                  
# warmup
basis  = PlaneWaveBasis(model; Ecut, kgrid, architecture=arch)
scfres = self_consistent_field(basis; maxiter=3, tol=tol);
forces = compute_forces(scfres)
stress = compute_stresses_cart(scfres)

# actual calculations
DFTK.reset_timer!(DFTK.timer)
basis  = PlaneWaveBasis(model; Ecut, kgrid, architecture=arch)
scfres = self_consistent_field(basis; maxiter=maxiter, is_converged=ScfConvergenceEnergy(tol))
forces = compute_forces(scfres)
stress = compute_stresses_cart(scfres)
@show DFTK.timer

mfherbst · 2026-02-25T16:51:50Z

This MPI failure we start to see a bit more consistently now. I'm a bit confused about it.

abussy · 2026-02-26T09:15:53Z

> This MPI failure we start to see a bit more consistently now. I'm a bit confused about it

I am also utterly confused. It certainly started happening after the extension of the MPI tests (#1248). It is hard to pin down, but I'll try and find out if the crash is systematically triggered by the same test or not.

abussy · 2026-03-10T10:46:03Z

I generalized the coexistence of scalar and vectorized functions to all Element types and all NormConservingPsp types. I introduced macros that define default vectorized functions from their scalar counterparts when there is no need for an optimized implementation.

I chose to implement the vectorized functions at the lowest level of Psp, e.g. in PspUpf.jl and PspHgh.jl rather than generalizing for abstract NormConservingPsp. This leaves the choice of scalar or vector by default to the individual pseudopotential implementation. At the NormConservingPsp level, it is just expected that all such pseudos will implement a scalar and vector version, but the details are abstracted. The documentation was adapted accordingly.
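A minimal sketch of what such a macro-generated default could look like, assuming a toy pseudopotential type. The names `ToyPsp`, `GaussianPsp`, `eval_local` and `@default_vectorized` are hypothetical and are not the macros from this PR. The macro defines a vectorized method that simply maps the scalar method over the input.

```julia
# Hypothetical names throughout; this is not DFTK's macro.
abstract type ToyPsp end
struct GaussianPsp <: ToyPsp
    width::Float64
end

# Scalar implementation: the only method written by hand in this sketch.
eval_local(psp::GaussianPsp, p::Real) = exp(-(psp.width * p)^2)

# Generate `fname(psp::psptype, ps)` for a vector `ps` by mapping the scalar method.
macro default_vectorized(fname, psptype)
    f = esc(fname)
    T = esc(psptype)
    quote
        function $f(psp::$T, ps::AbstractVector{<:Real})
            map(p -> $f(psp, p), ps)
        end
    end
end

@default_vectorized eval_local GaussianPsp

values = eval_local(GaussianPsp(1.0), [0.0, 1.0])
```

A type that needs an optimized vectorized implementation would skip the macro and define the vector method directly.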

Edit: concerning the failing MPI tests, it looks like the failure always happens in the same test, namely Anisotropic strain sensitivity using ForwardDiff. I have not been able to reproduce it locally yet, though.

mfherbst

I think I agree with this mostly. We should see if we can remove some code duplication at the element level by deferring the vector -> map over single elements dispatch fully towards the dispatches on PSP structs.

abussy · 2026-04-23T14:07:50Z

I reorganized a bit, and I was able to reduce code duplication in elements.jl. Unfortunately, that comes at the cost of extra complication in the generation of vectorized code (an extra loop over Element types, excluding ElementPsp).

mfherbst · 2026-06-03T12:27:04Z

@abussy I finally got around to giving this another look. My model was that this

abstract type Element end
function evaluate(x::Element, ps::AbstractVector)
    map(p -> evaluate(x, p), ps)
end

struct ElementPsp{T} <: Element
    inner::T
end
evaluate(x::ElementPsp, ps)                 = evaluate(x.inner, ps)
evaluate(x::ElementPsp, ps::AbstractVector) = evaluate(x.inner, ps::AbstractVector)       # <--- This resolves ambiguity

struct ZeroElement <: Element end
evaluate(::ZeroElement, p::Number) = 0

abstract type NormConservingPsp end
struct OnePsp   <: NormConservingPsp end
struct OtherPsp <: NormConservingPsp end

function evaluate(x::NormConservingPsp, ps::AbstractVector)
    map(p -> evaluate(x, p), ps)
end

evaluate(::OnePsp,   p::Number) = 1
evaluate(::OtherPsp, ps::AbstractVector) = copy(ps)
evaluate(::OtherPsp, p::Number) = p+1


@assert evaluate(ElementPsp(OnePsp()), [2, 4]) == [1, 1]
@assert evaluate(ElementPsp(OnePsp()), 2) == 1
@assert evaluate(ElementPsp(OtherPsp()), [2, 4]) == [2, 4]
@assert evaluate(ElementPsp(OtherPsp()), 2) == 3

works fine, but only after the highlighted line is added. A silicon calculation runs. Let's see what the tests say and then I'll also try it on a GPU.
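For reference, a reduced sketch of why the highlighted line is needed. The definitions are wrapped in a hypothetical module `WithoutFix` only to keep them separate from the snippet above. Without the `ElementPsp` method specialised on `AbstractVector`, a call with an `ElementPsp` and a vector matches both `evaluate(::Element, ::AbstractVector)` and `evaluate(::ElementPsp, ps)`, and neither is more specific than the other.

```julia
module WithoutFix   # hypothetical module name, used only to isolate the definitions

abstract type Element end
evaluate(x::Element, ps::AbstractVector) = map(p -> evaluate(x, p), ps)

struct ElementPsp{T} <: Element
    inner::T
end
# Only the untyped forwarding method; the AbstractVector one is left out on purpose.
evaluate(x::ElementPsp, ps) = evaluate(x.inner, ps)

struct OnePsp end
evaluate(::OnePsp, p::Number) = 1
evaluate(x::OnePsp, ps::AbstractVector) = map(p -> evaluate(x, p), ps)

end # module

# The vector call is ambiguous and throws a MethodError.
is_ambiguous = try
    WithoutFix.evaluate(WithoutFix.ElementPsp(WithoutFix.OnePsp()), [2, 4])
    false
catch err
    err isa MethodError
end

# The scalar call is unaffected, since only one method matches.
scalar_result = WithoutFix.evaluate(WithoutFix.ElementPsp(WithoutFix.OnePsp()), 2)
```

Adding a method that is more specific in both arguments, as in the highlighted line, gives the dispatcher a unique best match.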

mfherbst · 2026-06-03T15:09:38Z

Ok, I think there is another problem in the logic here. Now NormConservingPsp defines function pairs like:

  eval_psp_projector_fourier(psp::NormConservingPsp, i, l, p::AbstractVector) =
       eval_psp_projector_fourier(psp, i, l, norm(p))

  # Fallback vectorized implementation for non GPU-optimized code.
  function eval_psp_projector_fourier(psp::NormConservingPsp, i, l,
                                      ps::AbstractVector{T}) where {T <: Real}
      arch = architecture(ps)
      to_device(arch, map(p -> eval_psp_projector_fourier(psp, i, l, p), to_cpu(ps)))
  end

The idea of the first function was to take the norm for you, for convenience, i.e. you could just call the function on [1, 0, 0] and it would implicitly call the function with norm 1. But what we want is functionality where you can evaluate the projectors on a list of norms. These two interfaces together have the potential for a lot of confusion and wrong behaviour, and we should not have both of them.

Removing the versions which implicitly take the norm (which is what I propose) has the additional advantage that it should remove the method ambiguities and all the annoying macro magic could be removed also at the PSP level.
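A toy sketch of the confusion, assuming a hypothetical scalar function `toy_projector` in place of `eval_psp_projector_fourier`. With both interfaces defined, a real-valued vector dispatches to the more specific "list of norms" method, so a caller who passes a reciprocal-space vector and expects its norm to be taken silently gets one value per component instead.

```julia
using LinearAlgebra: norm

# Hypothetical stand-in for a projector evaluated at a scalar norm p.
toy_projector(p::Real) = exp(-p^2)

# Convenience interface: treat the argument as a vector and take its norm.
toy_projector(p::AbstractVector) = toy_projector(norm(p))

# Vectorized interface: treat the argument as a list of norms.
toy_projector(ps::AbstractVector{T}) where {T <: Real} = map(toy_projector, ps)

q = [1.0, 0.0, 0.0]

as_list_of_norms = toy_projector(q)        # dispatches to the vectorized method
as_single_norm   = toy_projector(norm(q))  # what the convenience method intended
```

Here `as_list_of_norms` has three entries while `as_single_norm` is a single number, even though the same `q` was the starting point in both cases. Keeping only the scalar and list-of-norms methods makes the caller take the norm explicitly.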

mfherbst · 2026-06-03T19:09:31Z

@abussy I have not checked the perf implications, but the current code runs on the GPU without any issues and looks reasonably clean.

Could you check that this completes the intention of your PR without regressions? If yes, I'd say we merge this.

abussy mentioned this pull request Feb 24, 2026

Optimize AtomicNonlocal term for the GPU #1265

Merged

mfherbst reviewed Feb 25, 2026

Comment thread src/pseudo/PspUpf.jl

Comment thread src/workarounds/forwarddiff_rules.jl Outdated

Comment thread src/workarounds/forwarddiff_rules.jl

Comment thread src/density_methods.jl Outdated

abussy mentioned this pull request Feb 26, 2026

Should PSP functions be vectorized by default? #1270

Open

abussy force-pushed the hankel branch from 34f8c3a to 65c5dbe Compare March 10, 2026 10:34

abussy commented Mar 10, 2026

Comment thread src/elements.jl Outdated

mfherbst reviewed Apr 22, 2026

Comment thread src/elements.jl Outdated

Comment thread src/elements.jl Outdated

Comment thread src/pseudo/NormConservingPsp.jl Outdated

mfherbst reviewed Apr 24, 2026

Comment thread src/elements.jl Outdated

Comment thread src/elements.jl Outdated

abussy added 5 commits April 27, 2026 09:30

Optimizing XC instantiation for GPUs

92708ef

Fix PspLinComb test and generic atomic_density()

af6c331

Uniformization of vectorized functions

c65c8ad

Reorganize macros

102079e

Rebased onto current master

3e8d219

abussy force-pushed the hankel branch from e7ec155 to 3e8d219 Compare April 27, 2026 09:22

abussy and others added 3 commits April 28, 2026 17:29

Fix tests

278fb6d

Small change to PspLinComb.jl

370266a

In a simplified test this works, let's see

c7e2f34

Refactor density methods

25643f6

mfherbst added 2 commits June 3, 2026 17:13

Fix tests

90b3786

Nuke macros

b1e0d69

mfherbst added 2 commits June 3, 2026 21:16

Merge branch 'master' into hankel

9c63cb2

Merge branch 'master' into hankel

8c26180
