Skip to content

GPU optimization of LOBPCG#1068

Merged
mfherbst merged 12 commits into
JuliaMolSim:masterfrom
abussy:lobpcg
May 15, 2025
Merged

GPU optimization of LOBPCG#1068
mfherbst merged 12 commits into
JuliaMolSim:masterfrom
abussy:lobpcg

Conversation

@abussy

@abussy abussy commented Feb 24, 2025

Copy link
Copy Markdown
Collaborator

This PR is the result of a detailed profiling of the LOBPCG solver with NVIDIA's Nsight Systems. It allowed for the identification of various hot spots, where code is very slow during GPU runs.

In particular, there are many instances of explicit loops over matrix columns. This access pattern is not ideal, as the massive parallelism of the GPU is not fully exploited. Array operations on the whole matrix are far more efficient.

I measured speed-ups of the order of 30% on the whole LOBPCG iterative solver. Excluding the cost of the H x Psi product (not modified in this PR), the speed-ups reach 50%.

Unfortunately, this comes at the cost of some code readability. I left comments describing what is calculated when necessary.

Finally, I scrapped 2 loops using DFTK's custom threading (in ortho! X vs Y and ldiv! for the preconditioner). I made sure the effect is negligible on CPU runs ( tested with the default n_DFTK = n_blas thread option). It seems that simple BLAS threading on large array operations is quite efficient by itself.

Comment thread src/eigen/lobpcg_hyper_impl.jl Outdated
Comment thread src/eigen/lobpcg_hyper_impl.jl Outdated
Comment thread src/eigen/lobpcg_hyper_impl.jl Outdated
Comment thread src/eigen/lobpcg_hyper_impl.jl Outdated
Comment thread src/eigen/lobpcg_hyper_impl.jl Outdated
@antoine-levitt

Copy link
Copy Markdown
Member

Hm, I'm kind of split on this.

  • Looks like GPU style clashes with threading in a number of places. Threading has been mostly an annoyance, and for very limited results. I think I'm fine with just scrapping threading from DFTK altogether, and pointing people to GPU if they want to do large-scale.
  • It's pretty unfortunate if GPU forces us towards a matlab/numpy style. Those dot products are not very nice... I'd rather use things like eachcol, map, etc.

Also don't add comments at each place we use vector-style operations: we should converge on one uniform style and use it without comments.

@mfherbst

Copy link
Copy Markdown
Member

Yeah same here. It's a shame this has such an impact on readability. From my point of view the key points are:

  • Avoid code duplication: I think it's better to have a bit of ugly code at selected places (with nice comments and a common convention of course) rather than a GPU and a CPU version with different code. Maybe for some primitives we use often (e.g. column-wise norms or outer products) we can invent functions that are easier to understand and hide the array syntax a little.

  • Avoid performance regressions on CPU: For small arrays this may not matter but we should test on non-trivial calculations if the extra temporaries don't end up killing us

@abussy

abussy commented Mar 24, 2025

Copy link
Copy Markdown
Collaborator Author

I went back and worked on this. Here are the changes:

  • Single comment on GPU performance style at the beginning of lobpcg_hyper_impl.jl rather than inconsistently scattered GPU performance comments
  • Introduced a function to calculate norms of column vectors, since used multiple times
  • Modified the SCF guess interpolation in interpolation.jl to make it GPU efficient (following example from fft.jl). This significantly reduces the time spent in the first SCF step.
  • used the to_cpu function where possible

Is it clear that this way of doing is faster? Eg on CPU you're possibly allocating big arrays and losing out on cache locality?

I carefully tested this in the CPU, and indeed, there is a performance hit on CPU with large systems. I reverted this change back to the original, since the CPU hit is greater than the GPU gain. This is not a massive issue, since this operation is only done once per LOBPCG call (compared to norms, that can be calculated 1000s of times).

@mfherbst mfherbst left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not great, but within what we discussed looks like a good solution to me.

@antoine-levitt may also have some thoughts.

@mfherbst

Copy link
Copy Markdown
Member

If there really turns out to be a bad CPU perf issue can we specialise for CPU maybe, but certainly I'd try to avoid that.

@antoine-levitt antoine-levitt left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Not very satisfactory, but pluses outweigh minuses, unfortunately...

Comment thread src/eigen/lobpcg_hyper_impl.jl Outdated
Comment thread src/eigen/lobpcg_hyper_impl.jl Outdated
Comment thread src/eigen/lobpcg_hyper_impl.jl Outdated
Comment thread src/eigen/lobpcg_hyper_impl.jl Outdated
Comment thread src/eigen/lobpcg_hyper_impl.jl Outdated
Comment thread src/eigen/preconditioners.jl Outdated
Comment thread src/eigen/preconditioners.jl Outdated
Comment thread src/eigen/preconditioners.jl Outdated
@antoine-levitt

Copy link
Copy Markdown
Member

Also careful in refactoring that lobpcg should be self-contained (not use too much stuff from other parts of DFTK) so that users can steal it for their own projects. We really should pull it off to its own project and depend on it. We had a plan to integrate it with either iterativesolvers or krylovkit but I think we should just give up on that and split it off.

@mfherbst

Copy link
Copy Markdown
Member

Agree with trying to keep LOBPCG self-contained (and split off hopefully soonish) and I think a diagprod function would be good

@abussy

abussy commented Mar 25, 2025

Copy link
Copy Markdown
Collaborator Author

Sure, then I'll define a local to_cpu function.

I've run additional tests, and I think we cannot escape some level of CPU/GPU code separation. Out of all the changes I have proposed, only the calculation of the norms is consistently faster on both GPU and CPU.

I think the best course of action is to define alternative GPU optimized code in DFTCUDAExt.jl (which should maybe become a directory), while keeping the easy to read Julian code elsewhere. When the time comes to split LOBPCG into its own project, it would inherit this CUDA extension as well. As much as possible, I would try to do this for low level functions.

@mfherbst

Copy link
Copy Markdown
Member

Sounds very reasonable, thanks @abussy .

@abussy

abussy commented Apr 4, 2025

Copy link
Copy Markdown
Collaborator Author

With this latest commit, I did the following:

  • moved performance critical code into low-level functions (e.g. compute_λ, diag_prod), written in a CPU and Human friendly manner
  • Changed the DFTKCUDAExt to a directory, in which the above function are re-implemented in a GPU optimized way. They will be called appropriately with multiple dispatch.

In the spirit of splitting LOBPCG from DFTK in the future, I also removed usage of DFTK's explicit threading. Note that I kept using DFTK's to_cpu function in order to avoid redefinition warnings. Since this is a one line function, it will be easy to move it to a LOBPCG module.

Finally, I kept the calculation of norms in the function columnwise_norms as an array operation, as it performs better than norm.(eachcol(X)), even on the CPU (marginally so, but still).

@abussy

abussy commented May 12, 2025

Copy link
Copy Markdown
Collaborator Author

Can I get some feedback on this PR? I have a couple of other GPU optimization contributions that would benefit from this being merged first.

@mfherbst mfherbst left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small comments, but mostly fine.

Comment thread src/eigen/preconditioners.jl Outdated
Comment thread src/eigen/preconditioners.jl Outdated
Comment thread src/eigen/preconditioners.jl Outdated
Comment thread src/eigen/preconditioners.jl
@abussy

abussy commented May 12, 2025

Copy link
Copy Markdown
Collaborator Author

Implemented most of the points brought up by @mfherbst:

  • Moved help linear algebra functions into their own file in src/eigen/linalg.jl
  • Reintroduced DFTK thread parallelization in the preconditioner: my measurements show that this is faster than BLAS threading alone, and if it stays in DFTK, it is not a problem. Overall, this has a close to negligible impact on timings.
  • Wrote separate functions for 2- and 3-arguments diag_prod

I however did not change the PreconditionerTPA struct yet. Forcing the mean_kin field to be an AbstracVector{T} complicates the handling of the preconditioner application in the special case when there is only a default shift. Currently, we have

if P.mean_kin === nothing
    ldiv!(Y, Diagonal(P.kin .+ P.default_shift), R)
else

but to get the same with the above constraint, I do not find better than

if length(P.mean_kin) == 1 && P.mean_kin[1] == P.default_shift
    ldiv!(Y, Diagonal(P.kin .+ P.default_shift), R)
else

which looks very clunky to me.

Comment thread src/eigen/linalg.jl Outdated
@mfherbst

Copy link
Copy Markdown
Member

I however did not change the PreconditionerTPA struct yet. Forcing the mean_kin field to be an AbstracVector{T} complicates the handling of the preconditioner application in the special case when there is only a default shift. Currently, we have

Yeah ok, but I'd still use type templates instead of the AbstractArray objects as antoine suggested.

@abussy

abussy commented May 13, 2025

Copy link
Copy Markdown
Collaborator Author
  • Made diag_prod signature more like dot by taking a matrix rather than a vector
  • Parametrized type of kin and mean_kin in PreconditionerTPA

@antoine-levitt antoine-levitt left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Comment thread ext/DFTKCUDAExt/lobpcg.jl
Comment thread src/eigen/linalg.jl
Comment thread src/eigen/linalg.jl Outdated
@mfherbst

Copy link
Copy Markdown
Member

A general point that came to me: These GPU specific optimisations should be helpful for AMD as well, right? So perhaps we should not make the types CuArrays but GpuArrays (and thus not put the code into the extmodule)?

@abussy

abussy commented May 14, 2025

Copy link
Copy Markdown
Collaborator Author

Indeed, these optimizations should be vendor agnostic. I have access to some Mi200 GPUs, I'll try to run there and confirm it works/performs. If so, I suggest creating a new subdirectory in src/common to put the generic GPU optimizations (maybe src/common/gpu_opts). Then the DFTKCUDAExt could also be reverted to a single file.

@mfherbst

Copy link
Copy Markdown
Member

Yeah or src/common/linalg.jl or src/eigen/linalg.jl with both versions: CPU and GPU.

@abussy

abussy commented May 14, 2025

Copy link
Copy Markdown
Collaborator Author

I switched to a more generic implementation using AbstractGPUArray, and moved the code back to DFTK's src. I also checked that it works both on NVIDIA and AMD GPUs.

Regarding the location, I put it in src/common/gpu_opts/lobpcg.jl, rather than in src/eigen/linalg.jl with the rest of the helper functions. My reasoning is that there will be more of these coming (e.g. symmetry in the next PR), and I think it is easier to navigate if all GPU specific implementations are in the same place.

@mfherbst

mfherbst commented May 14, 2025

Copy link
Copy Markdown
Member

Regarding the location, I put it in src/common/gpu_opts/lobpcg.jl, rather than in src/eigen/linalg.jl with the rest of the helper functions. My reasoning is that there will be more of these coming (e.g. symmetry in the next PR), and I think it is easier to navigate if all GPU specific implementations are in the same place.

Ok, I see. Then I'd not have this in common and rather use a separate subdirectory src/gpu with all these files inside. I would also move src/workarounds/gpu_arrays.jl in there. Maybe good to keep the naming between cpu and gpu consistent, i.e. if there is src/eigen/linalg.jl there should be src/gpu/linalg.jl with the respective GPU things ?

@abussy

abussy commented May 14, 2025

Copy link
Copy Markdown
Collaborator Author

Maybe good to keep the naming between cpu and gpu consistent, i.e. if there is src/eigen/linalg.jl there should be src/gpu/linalg.jl with the respective GPU things ?

That would probably be the clearest solution. Only caveat is possible file proliferation: linalg.jl, lobpcg.jl and preconditioner.jl for this PR alone. I think it is ok, since writing GPU specific code is only done when truly necessary.

@mfherbst

Copy link
Copy Markdown
Member

Only caveat is possible file proliferation

True, but to me the preconditioner and lobpcg GPU things make sense to just be in a file src/gpu/linalg.jl ... but whatever, probably not too important after all. I think what matters is that the GPU functions should not start to scatter around multiple locations in the code base, but are in one clearly identified file or folder.

abussy and others added 3 commits May 15, 2025 10:24
Improves inlining and all accesses with `[ ]` involve an index check anyway.
Removed size checks: Functions will error out if sizes don't match.
@mfherbst

Copy link
Copy Markdown
Member

I just removed the assertions for the size checks. Such functions are pretty low-level and either the called functions anyway perform such size checks or the size agreement is ensured by the surrounding algorithm. Such @assert insertions will in general only discourage the compiler from inlineing.

@abussy Thanks very much for the good work, laying the foundation for more GPU improvements in the future.

@mfherbst mfherbst enabled auto-merge (squash) May 15, 2025 08:33
@mfherbst mfherbst merged commit 359996b into JuliaMolSim:master May 15, 2025
6 of 8 checks passed
Comment thread src/eigen/linalg.jl
Comment on lines +7 to +15
@views function columnwise_dots(A::AbstractArray{T}, B::AbstractArray{T}) where {T}
[real(dot(A[:, i], B[:, i])) for i = 1:size(A, 2)]
end

# Returns a vector of real(dot(A[:, i], M, B[:, i])), for all columns of
# A, B, and matrix M
@views function columnwise_dots(A::AbstractArray{T}, M, B::AbstractArray{T}) where {T}
[real(dot(A[:, i], M, B[:, i])) for i = 1:size(A, 2)]
end

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reasoning for the real calls here?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(asking because I don't see them in the GPU version - did I miss something?)

@abussy abussy Jul 9, 2025

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's most likely a mistake. I assume it comes from the original preconditioner code:

function precondprep!(P::PreconditionerTPA, X::AbstractArray)
P.mean_kin = [real(dot(x, Diagonal(P.kin), x)) for x in eachcol(X)]
end

But real is still taken there now, so I think it can go.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want me to open an issue to remember?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take care of it immediately

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants