Skip to content

Optimizing XC instantiation for GPUs#1262

Open
abussy wants to merge 13 commits into
JuliaMolSim:masterfrom
abussy:hankel
Open

Optimizing XC instantiation for GPUs#1262
abussy wants to merge 13 commits into
JuliaMolSim:masterfrom
abussy:hankel

Conversation

@abussy

@abussy abussy commented Feb 20, 2026

Copy link
Copy Markdown
Collaborator

This PR optimizes the instantiation of the XC term on the GPU, with a particularly large impact when Duals are involved (stress and response calculations). It follows the same design patterns as #1163.

Particular care was given to the hankel function in workarounds/forwarddiff_rules.jl, in order to make it GPU compatible. A side effect is the suppression of allocations, which also marginally (few %) speeds-up Xc instantiation on the CPU.

For illustration, here are XC instantiation timings for the input below (stress of a 3x3x3 rattled aluminium supercell):

master this PR
A100 65 s 2 s
Mi250 66 s 2 s
using DFTK
using PseudoPotentialData
using CUDA
setup_threading()

arch = DFTK.GPU(CuArray)

Ecut = 50.0
kgrid = [1, 1, 1]
maxiter = 10
tol = 1.0e-10
temperature = 0.01

factor = 3
a = 3.8267
lattice = factor*a * [[0.0 1.0 1.0];
                      [1.0 0.0 1.0];
                      [1.0 1.0 0.0]]
Al = ElementPsp(:Al, PseudoFamily("dojo.nc.sr.pbe.v0_4_1.stringent.upf"))
atoms = [Al for i in 1:factor^3]
positions = Vector{Vector{Float64}}([])
for i = 1:factor, j = 1:factor, k=1:factor
   push!(positions, [i/factor, j/factor, k/factor])
end

#rattle a few atoms                                                                                  
positions[5] = positions[5] + [0.01, 0.0, -0.01]
positions[10] = positions[10] + [0.01, 0.0, -0.01]
positions[15] = positions[15] + [0.01, 0.0, -0.01]
positions[20] = positions[20] + [0.01, 0.0, -0.01]
positions[25] = positions[25] + [0.01, 0.0, -0.01]

model = model_DFT(lattice, atoms, positions; temperature=temperature,
                  functionals=PBE(), smearing=DFTK.Smearing.Gaussian())
                  
# warmup
basis  = PlaneWaveBasis(model; Ecut, kgrid, architecture=arch)
scfres = self_consistent_field(basis; maxiter=3, tol=tol);
forces = compute_forces(scfres)
stress = compute_stresses_cart(scfres)

#actual calculations
DFTK.reset_timer!(DFTK.timer)
basis  = PlaneWaveBasis(model; Ecut, kgrid, architecture=arch)
scfres = self_consistent_field(basis; maxiter=maxiter, is_converged=ScfConvergenceEnergy(tol))
forces = compute_forces(scfres)
stress = compute_stresses_cart(scfres)
@show DFTK.timer

Comment thread src/pseudo/PspUpf.jl
Comment thread src/workarounds/forwarddiff_rules.jl Outdated
Comment thread src/workarounds/forwarddiff_rules.jl
Comment thread src/density_methods.jl Outdated
@mfherbst

Copy link
Copy Markdown
Member

This MPI failure we start to see a bit more consistently now. I'm a bit confused about it.

@abussy

abussy commented Feb 26, 2026

Copy link
Copy Markdown
Collaborator Author

This MPI failure we start to see a bit more consistently now. I'm a bit confused about it

I am also utterly confused. It certainly started happening after the extension of the MPI tests (#1248). It is hard to pin down, but I'll try and find out if the crash is systematically triggered by the same test or not.

@abussy

abussy commented Mar 10, 2026

Copy link
Copy Markdown
Collaborator Author

I generalized the dual existence of scalar and vectorized functions for all Element types, and all NormConservingPsp types. I introduced macros to define default vectorized functions based on their scalar counterpart, when there is no need for an optimized implementation.

I chose to implement the vectorized functions at the lowest level of Psp, e.g. in PspUpf.jl and PspHgh.jl rather than generalizing for abstract NormConservingPsp. This leaves the choice of scalar or vector by default to the individual pseudopotential implementation. At the NormConservingPsp level, it is just expected that all such pseudos will implement a scalar and vector version, but the details are abstracted. The documentation was adapted accordingly.

Edit: concerning the failing MPI tests, it looks like failure always happen in the same test, namely in Anisotropic strain sensitivity using ForwardDiff. I have been unable to reproduce it locally yet, though.

Comment thread src/elements.jl Outdated

@mfherbst mfherbst left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I agree with this mostly. We should see if we can remove some code duplication at the element level by deferring the vector -> map over single elements dispatch fully towards the dispatches on PSP structs.

Comment thread src/elements.jl Outdated
Comment thread src/elements.jl Outdated
Comment thread src/pseudo/NormConservingPsp.jl Outdated
@abussy

abussy commented Apr 23, 2026

Copy link
Copy Markdown
Collaborator Author

I reorganized a bit, and I was able to reduce code duplication in elements.jl. Unfortunately, that comes at the cost of extra complication in the generation of vectorized code (extra loop on Element type, excluding ElementPsp)

Comment thread src/elements.jl Outdated
Comment thread src/elements.jl Outdated
@mfherbst

mfherbst commented Jun 3, 2026

Copy link
Copy Markdown
Member

@abussy I finally got around to giving this another look. My model was that this

abstract type Element end                                                                          
function evaluate(x::Element, ps::AbstractVector)                                                  
    map(p -> evaluate(x, p), ps)                                                                   
end                                                                                                
                                                                                                   
struct ElementPsp{T} <: Element                                                                    
    inner::T                                                                                       
end                                                                                                
evaluate(x::ElementPsp, ps)                 = evaluate(x.inner, ps)                                
evaluate(x::ElementPsp, ps::AbstractVector) = evaluate(x.inner, ps::AbstractVector)       # <--- This resolves ambiguity
                                                                                                                   
struct ZeroElement <: Element end                                                                                  
evaluate(::ZeroElement, p::Number) = 0                                                                             
                                                                                                                   
abstract type NormConservingPsp end                                                                                
struct OnePsp   <: NormConservingPsp end                                                                           
struct OtherPsp <: NormConservingPsp end                                                                           
                                                                                                                   
function evaluate(x::NormConservingPsp, ps::AbstractVector)                                                        
    map(p -> evaluate(x, p), ps)                                                                                   
end                                                                                                                
                                                                                                                   
evaluate(::OnePsp,   p::Number) = 1                                                                                
evaluate(::OtherPsp, ps::AbstractVector) = copy(ps)                                                                
evaluate(::OtherPsp, p::Number) = p+1                                                                              
                                                                                                                   
                                                                                                                   
@assert evaluate(ElementPsp(OnePsp()), [2, 4]) == [1, 1]                                           
@assert evaluate(ElementPsp(OnePsp()), 2) == 1                                                     
@assert evaluate(ElementPsp(OtherPsp()), [2, 4]) == [2, 4]                                         
@assert evaluate(ElementPsp(OtherPsp()), 2) == 3  

works fine, but only after the highlighted line is added. A silicon calculation runs. Let's see what the tests say and then I'll also try it on a GPU.

@mfherbst

mfherbst commented Jun 3, 2026

Copy link
Copy Markdown
Member

Ok I think there is another problem in the logic here. Now NormConservingPsp defines functions pairs like:

  eval_psp_projector_fourier(psp::NormConservingPsp, i, l, p::AbstractVector) =                                                             
       eval_psp_projector_fourier(psp, i, l, norm(p))
                                                                                                   
  # Fallback vectorized implementation for non GPU-optimized code.                                 
  function eval_psp_projector_fourier(psp::NormConservingPsp, i, l,                                
                                      ps::AbstractVector{T}) where {T <: Real}                     
      arch = architecture(ps)                                                                      
      to_device(arch, map(p -> eval_psp_projector_fourier(psp, i, l, p), to_cpu(ps)))              
  end   

The idea of the first function was to take the norm for you, for convenience. i.e. you could just call the function on [1, 0, 0] and it would implicitly call the function with norm 1. But we want is to have functionality where you can evaluate the projectors on a list of norms. These two interfaces together have the potential for a lot of confusion and wrong behaviour and we should not have both of them.

Removing the versions which implicitly take the norm (which is what I propose) has the additional advantage that it should remove the method ambiguities and all the annoying macro magic could be removed also at the PSP level.

@mfherbst

mfherbst commented Jun 3, 2026

Copy link
Copy Markdown
Member

@abussy I have not checked the perf implications, but the current code runs on the GPU without any issues and looks reasonably clean.

Could you check this completes the intention of your PR without regression ? If yes, I'd say we merge this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants