GPU optimization of LOBPCG by abussy · Pull Request #1068 · JuliaMolSim/DFTK.jl

abussy · 2025-02-24T14:50:22Z

This PR is the result of a detailed profiling of the LOBPCG solver with NVIDIA's Nsight Systems. It allowed for the identification of various hot spots, where code is very slow during GPU runs.

In particular, there are many instances of explicit loops over matrix columns. This access pattern is not ideal, as the massive parallelism of the GPU is not fully exploited. Array operations on the whole matrix are far more efficient.

I measured speed-ups of the order of 30% on the whole LOBPCG iterative solver. Excluding the cost of the H x Psi product (not modified in this PR), the speed-ups reach 50%.

Unfortunately, this comes at the cost of some code readability. I left comments describing what is calculated when necessary.

Finally, I scrapped 2 loops using DFTK's custom threading (in ortho! X vs Y and ldiv! for the preconditioner). I made sure the effect is negligible on CPU runs ( tested with the default n_DFTK = n_blas thread option). It seems that simple BLAS threading on large array operations is quite efficient by itself.

antoine-levitt · 2025-02-24T16:16:52Z

Hm, I'm kind of split on this.

Looks like GPU style clashes with threading in a number of places. Threading has been mostly an annoyance, and for very limited results. I think I'm fine with just scrapping threading from DFTK altogether, and pointing people to GPU if they want to do large-scale.
It's pretty unfortunate if GPU forces us towards a matlab/numpy style. Those dot products are not very nice... I'd rather use things like eachcol, map, etc.

Also don't add comments at each place we use vector-style operations: we should converge on one uniform style and use it without comments.

mfherbst · 2025-02-26T15:22:00Z

Yeah same here. It's a shame this has such an impact on readability. From my point of view the key points are:

Avoid code duplication: I think it's better to have a bit of ugly code at selected places (with nice comments and a common convention of course) rather than a GPU and a CPU version with different code. Maybe for some primitives we use often (e.g. column-wise norms or outer products) we can invent functions that are easier to understand and hide the array syntax a little.
Avoid performance regressions on CPU: For small arrays this may not matter but we should test on non-trivial calculations if the extra temporaries don't end up killing us

abussy · 2025-03-24T16:55:15Z

I went back and worked on this. Here are the changes:

Single comment on GPU performance style at the beginning of lobpcg_hyper_impl.jl rather than inconsistently scattered GPU performance comments
Introduced a function to calculate norms of column vectors, since used multiple times
Modified the SCF guess interpolation in interpolation.jl to make it GPU efficient (following example from fft.jl). This significantly reduces the time spent in the first SCF step.
used the to_cpu function where possible

Is it clear that this way of doing is faster? Eg on CPU you're possibly allocating big arrays and losing out on cache locality?

I carefully tested this in the CPU, and indeed, there is a performance hit on CPU with large systems. I reverted this change back to the original, since the CPU hit is greater than the GPU gain. This is not a massive issue, since this operation is only done once per LOBPCG call (compared to norms, that can be calculated 1000s of times).

mfherbst

Not great, but within what we discussed looks like a good solution to me.

@antoine-levitt may also have some thoughts.

mfherbst · 2025-03-25T09:02:59Z

If there really turns out to be a bad CPU perf issue can we specialise for CPU maybe, but certainly I'd try to avoid that.

antoine-levitt

Thanks for the PR. Not very satisfactory, but pluses outweigh minuses, unfortunately...

antoine-levitt · 2025-03-25T12:22:54Z

Also careful in refactoring that lobpcg should be self-contained (not use too much stuff from other parts of DFTK) so that users can steal it for their own projects. We really should pull it off to its own project and depend on it. We had a plan to integrate it with either iterativesolvers or krylovkit but I think we should just give up on that and split it off.

mfherbst · 2025-03-25T14:53:52Z

Agree with trying to keep LOBPCG self-contained (and split off hopefully soonish) and I think a diagprod function would be good

abussy · 2025-03-25T15:09:10Z

Sure, then I'll define a local to_cpu function.

I've run additional tests, and I think we cannot escape some level of CPU/GPU code separation. Out of all the changes I have proposed, only the calculation of the norms is consistently faster on both GPU and CPU.

I think the best course of action is to define alternative GPU optimized code in DFTCUDAExt.jl (which should maybe become a directory), while keeping the easy to read Julian code elsewhere. When the time comes to split LOBPCG into its own project, it would inherit this CUDA extension as well. As much as possible, I would try to do this for low level functions.

mfherbst · 2025-03-26T12:59:13Z

Sounds very reasonable, thanks @abussy .

abussy · 2025-04-04T07:42:00Z

With this latest commit, I did the following:

moved performance critical code into low-level functions (e.g. compute_λ, diag_prod), written in a CPU and Human friendly manner
Changed the DFTKCUDAExt to a directory, in which the above function are re-implemented in a GPU optimized way. They will be called appropriately with multiple dispatch.

In the spirit of splitting LOBPCG from DFTK in the future, I also removed usage of DFTK's explicit threading. Note that I kept using DFTK's to_cpu function in order to avoid redefinition warnings. Since this is a one line function, it will be easy to move it to a LOBPCG module.

Finally, I kept the calculation of norms in the function columnwise_norms as an array operation, as it performs better than norm.(eachcol(X)), even on the CPU (marginally so, but still).

abussy · 2025-05-12T09:38:37Z

Can I get some feedback on this PR? I have a couple of other GPU optimization contributions that would benefit from this being merged first.

mfherbst

Some small comments, but mostly fine.

abussy · 2025-05-12T16:27:29Z

Implemented most of the points brought up by @mfherbst:

Moved help linear algebra functions into their own file in src/eigen/linalg.jl
Reintroduced DFTK thread parallelization in the preconditioner: my measurements show that this is faster than BLAS threading alone, and if it stays in DFTK, it is not a problem. Overall, this has a close to negligible impact on timings.
Wrote separate functions for 2- and 3-arguments diag_prod

I however did not change the PreconditionerTPA struct yet. Forcing the mean_kin field to be an AbstracVector{T} complicates the handling of the preconditioner application in the special case when there is only a default shift. Currently, we have

if P.mean_kin === nothing
    ldiv!(Y, Diagonal(P.kin .+ P.default_shift), R)
else

but to get the same with the above constraint, I do not find better than

if length(P.mean_kin) == 1 && P.mean_kin[1] == P.default_shift
    ldiv!(Y, Diagonal(P.kin .+ P.default_shift), R)
else

which looks very clunky to me.

mfherbst · 2025-05-13T06:41:29Z

I however did not change the PreconditionerTPA struct yet. Forcing the mean_kin field to be an AbstracVector{T} complicates the handling of the preconditioner application in the special case when there is only a default shift. Currently, we have

Yeah ok, but I'd still use type templates instead of the AbstractArray objects as antoine suggested.

abussy · 2025-05-13T13:53:22Z

Made diag_prod signature more like dot by taking a matrix rather than a vector
Parametrized type of kin and mean_kin in PreconditionerTPA

antoine-levitt

LGTM, thanks!

mfherbst · 2025-05-14T06:20:50Z

A general point that came to me: These GPU specific optimisations should be helpful for AMD as well, right? So perhaps we should not make the types CuArrays but GpuArrays (and thus not put the code into the extmodule)?

abussy · 2025-05-14T08:39:47Z

Indeed, these optimizations should be vendor agnostic. I have access to some Mi200 GPUs, I'll try to run there and confirm it works/performs. If so, I suggest creating a new subdirectory in src/common to put the generic GPU optimizations (maybe src/common/gpu_opts). Then the DFTKCUDAExt could also be reverted to a single file.

mfherbst · 2025-05-14T09:48:32Z

Yeah or src/common/linalg.jl or src/eigen/linalg.jl with both versions: CPU and GPU.

abussy · 2025-05-14T13:27:32Z

I switched to a more generic implementation using AbstractGPUArray, and moved the code back to DFTK's src. I also checked that it works both on NVIDIA and AMD GPUs.

Regarding the location, I put it in src/common/gpu_opts/lobpcg.jl, rather than in src/eigen/linalg.jl with the rest of the helper functions. My reasoning is that there will be more of these coming (e.g. symmetry in the next PR), and I think it is easier to navigate if all GPU specific implementations are in the same place.

mfherbst · 2025-05-14T13:56:15Z

Regarding the location, I put it in src/common/gpu_opts/lobpcg.jl, rather than in src/eigen/linalg.jl with the rest of the helper functions. My reasoning is that there will be more of these coming (e.g. symmetry in the next PR), and I think it is easier to navigate if all GPU specific implementations are in the same place.

Ok, I see. Then I'd not have this in common and rather use a separate subdirectory src/gpu with all these files inside. I would also move src/workarounds/gpu_arrays.jl in there. Maybe good to keep the naming between cpu and gpu consistent, i.e. if there is src/eigen/linalg.jl there should be src/gpu/linalg.jl with the respective GPU things ?

abussy · 2025-05-14T15:10:12Z

Maybe good to keep the naming between cpu and gpu consistent, i.e. if there is src/eigen/linalg.jl there should be src/gpu/linalg.jl with the respective GPU things ?

That would probably be the clearest solution. Only caveat is possible file proliferation: linalg.jl, lobpcg.jl and preconditioner.jl for this PR alone. I think it is ok, since writing GPU specific code is only done when truly necessary.

mfherbst · 2025-05-14T16:26:19Z

Only caveat is possible file proliferation

True, but to me the preconditioner and lobpcg GPU things make sense to just be in a file src/gpu/linalg.jl ... but whatever, probably not too important after all. I think what matters is that the GPU functions should not start to scatter around multiple locations in the code base, but are in one clearly identified file or folder.

Improves inlining and all accesses with `[ ]` involve an index check anyway.

Removed size checks: Functions will error out if sizes don't match.

mfherbst · 2025-05-15T08:33:11Z

I just removed the assertions for the size checks. Such functions are pretty low-level and either the called functions anyway perform such size checks or the size agreement is ensured by the surrounding algorithm. Such @assert insertions will in general only discourage the compiler from inlineing.

@abussy Thanks very much for the good work, laying the foundation for more GPU improvements in the future.

Technici4n · 2025-07-09T13:59:23Z

+@views function columnwise_dots(A::AbstractArray{T}, B::AbstractArray{T}) where {T}
+    [real(dot(A[:, i], B[:, i])) for i = 1:size(A, 2)]
+end
+
+# Returns a vector of real(dot(A[:, i], M, B[:, i])), for all columns of
+# A, B, and matrix M
+@views function columnwise_dots(A::AbstractArray{T}, M, B::AbstractArray{T}) where {T}
+    [real(dot(A[:, i], M, B[:, i])) for i = 1:size(A, 2)]
+end


What is the reasoning for the real calls here?

(asking because I don't see them in the GPU version - did I miss something?)

I think that's most likely a mistake. I assume it comes from the original preconditioner code:

DFTK.jl/src/eigen/preconditioners.jl

Lines 75 to 77 in 3c34d38

function precondprep!(P::PreconditionerTPA, X::AbstractArray)

P.mean_kin = [real(dot(x, Diagonal(P.kin), x)) for x in eachcol(X)]

end

But real is still taken there now, so I think it can go.

Do you want me to open an issue to remember?

I'll take care of it immediately

antoine-levitt reviewed Feb 24, 2025

View reviewed changes

Comment thread src/eigen/lobpcg_hyper_impl.jl Outdated

Comment thread src/eigen/lobpcg_hyper_impl.jl Outdated

Comment thread src/eigen/lobpcg_hyper_impl.jl Outdated

Comment thread src/eigen/lobpcg_hyper_impl.jl Outdated

Comment thread src/eigen/lobpcg_hyper_impl.jl Outdated

abussy added 2 commits March 24, 2025 17:44

GPU optimization of LOBPCG

ca57f7f

Refining LOBPCG GPU opt

a17f98e

abussy force-pushed the lobpcg branch from d6a6f96 to a17f98e Compare March 24, 2025 16:45

mfherbst reviewed Mar 25, 2025

View reviewed changes

antoine-levitt reviewed Mar 25, 2025

View reviewed changes

Move GPU optimized functions to DFTKCUDAExt

9aadeb6

mfherbst reviewed May 12, 2025

View reviewed changes

Comment thread src/eigen/preconditioners.jl Outdated

Comment thread src/eigen/preconditioners.jl Outdated

Comment thread src/eigen/preconditioners.jl Outdated

Comment thread src/eigen/preconditioners.jl

Suggested changes

67b964e

mfherbst reviewed May 13, 2025

View reviewed changes

Comment thread src/eigen/linalg.jl Outdated

Suggested changes

09e3f11

antoine-levitt reviewed May 13, 2025

View reviewed changes

Comment thread ext/DFTKCUDAExt/lobpcg.jl

Comment thread src/eigen/linalg.jl

Comment thread src/eigen/linalg.jl Outdated

abussy force-pushed the lobpcg branch from 7b44257 to 09e3f11 Compare May 14, 2025 12:44

Use AbstractGPUArray for generic GPU optimization

ee2d754

abussy and others added 3 commits May 15, 2025 10:24

Move all GPU optimizations to src/gpu

7ca41b5

Removed size checks

e627f8f

Improves inlining and all accesses with `[ ]` involve an index check anyway.

Update linalg.jl

1db3617

Removed size checks: Functions will error out if sizes don't match.

Merge branch 'master' into lobpcg

ec588b3

mfherbst enabled auto-merge (squash) May 15, 2025 08:33

mfherbst added 2 commits May 15, 2025 10:39

Merge branch 'master' into lobpcg

d19ac52

ups

e7c98cb

mfherbst merged commit 359996b into JuliaMolSim:master May 15, 2025
6 of 8 checks passed

Technici4n reviewed Jul 9, 2025

View reviewed changes

abussy mentioned this pull request Jul 9, 2025

Fix inconsistency between CPU and GPU in LOBPCG #1120

Merged

abussy mentioned this pull request Jul 31, 2025

Use columnwise_dots where applicable #1127

Merged

niklasschmitz pushed a commit that referenced this pull request Nov 17, 2025

GPU optimization of LOBPCG (#1068)

4e44e5b

	function precondprep!(P::PreconditionerTPA, X::AbstractArray)
	P.mean_kin = [real(dot(x, Diagonal(P.kin), x)) for x in eachcol(X)]
	end

Conversation

abussy commented Feb 24, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

antoine-levitt commented Feb 24, 2025

Uh oh!

mfherbst commented Feb 26, 2025

Uh oh!

abussy commented Mar 24, 2025

Uh oh!

mfherbst left a comment

Choose a reason for hiding this comment

Uh oh!

mfherbst commented Mar 25, 2025

Uh oh!

antoine-levitt left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

antoine-levitt commented Mar 25, 2025

Uh oh!

mfherbst commented Mar 25, 2025

Uh oh!

abussy commented Mar 25, 2025

Uh oh!

mfherbst commented Mar 26, 2025

Uh oh!

abussy commented Apr 4, 2025

Uh oh!

abussy commented May 12, 2025

Uh oh!

mfherbst left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

abussy commented May 12, 2025

Uh oh!

Uh oh!

mfherbst commented May 13, 2025

Uh oh!

abussy commented May 13, 2025

Uh oh!

antoine-levitt left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mfherbst commented May 14, 2025

Uh oh!

abussy commented May 14, 2025

Uh oh!

mfherbst commented May 14, 2025

Uh oh!

abussy commented May 14, 2025

Uh oh!

mfherbst commented May 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

abussy commented May 14, 2025

Uh oh!

mfherbst commented May 14, 2025

Uh oh!

mfherbst commented May 15, 2025

Uh oh!

Uh oh!

mfherbst commented May 14, 2025 •

edited

Loading

abussy Jul 9, 2025 •

edited

Loading