GPU optimization of LOBPCG #1068
|
Hm, I'm kind of split on this.
Also don't add comments at each place we use vector-style operations: we should converge on one uniform style and use it without comments. |
|
Yeah same here. It's a shame this has such an impact on readability. From my point of view the key points are:
|
|
I went back and worked on this. Here are the changes:
I carefully tested this on the CPU, and there is indeed a performance hit on CPU for large systems. I reverted this change to the original, since the CPU hit is greater than the GPU gain. This is not a massive issue, since this operation is only done once per |
mfherbst
left a comment
Not great, but within what we discussed looks like a good solution to me.
@antoine-levitt may also have some thoughts.
|
If there really turns out to be a bad CPU perf issue, we could maybe specialise for CPU, but I'd certainly try to avoid that. |
antoine-levitt
left a comment
Thanks for the PR. Not very satisfactory, but pluses outweigh minuses, unfortunately...
|
Also, be careful when refactoring: lobpcg should stay self-contained (not use too much stuff from other parts of DFTK) so that users can steal it for their own projects. We really should pull it off into its own project and depend on it. We had a plan to integrate it with either IterativeSolvers or KrylovKit, but I think we should just give up on that and split it off. |
|
Agree with trying to keep LOBPCG self-contained (and split off hopefully soonish) and I think a |
|
Sure, then I'll define a local I've run additional tests, and I think we cannot escape some level of CPU/GPU code separation. Out of all the changes I have proposed, only the calculation of the norms is consistently faster on both GPU and CPU. I think the best course of action is to define alternative GPU optimized code in |
|
Sounds very reasonable, thanks @abussy . |
|
With this latest commit, I did the following:
In the spirit of splitting LOBPCG from DFTK in the future, I also removed usage of DFTK's explicit threading. Note that I kept using DFTK's Finally, I kept the calculation of norms in the function |
|
Can I get some feedback on this PR? I have a couple of other GPU optimization contributions that would benefit from this being merged first. |
mfherbst
left a comment
Some small comments, but mostly fine.
|
Implemented most of the points brought up by @mfherbst:
I however did not change the but to get the same with the above constraint, I do not find better than which looks very clunky to me. |
Yeah ok, but I'd still use type templates instead of the |
|
|
A general point that came to me: These GPU specific optimisations should be helpful for AMD as well, right? So perhaps we should not make the types CuArrays but GpuArrays (and thus not put the code into the extmodule)? |
|
Indeed, these optimizations should be vendor agnostic. I have access to some Mi200 GPUs, I'll try to run there and confirm it works/performs. If so, I suggest creating a new subdirectory in |
|
Yeah or src/common/linalg.jl or src/eigen/linalg.jl with both versions: CPU and GPU. |
|
I switched to a more generic implementation using Regarding the location, I put it in |
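As a rough illustration of the vendor-agnostic idea discussed above, the sketch below uses plain multiple dispatch: a generic `AbstractMatrix` method holds the column loop, and a method on an abstract "device" supertype holds the whole-array version. Everything here is hypothetical: `AbstractMockGPUMatrix`, `MockVendorMatrix`, `colnorms` and `code_path` are stand-ins that run on the CPU, not names from this PR. In real code the abstract supertype would presumably be something like `AbstractGPUArray` from GPUArrays.jl, which vendor-specific array types subtype.

```julia
# Hypothetical stand-in for an abstract device-array supertype.
abstract type AbstractMockGPUMatrix{T} <: AbstractMatrix{T} end

# Hypothetical "vendor" array type; it only wraps a CPU Matrix for illustration.
struct MockVendorMatrix{T} <: AbstractMockGPUMatrix{T}
    data::Matrix{T}
end
Base.size(A::MockVendorMatrix) = size(A.data)
Base.getindex(A::MockVendorMatrix, i::Int, j::Int) = A.data[i, j]

# Generic method: explicit loop over columns.
colnorms(X::AbstractMatrix) = [sqrt(sum(abs2, @view X[:, i])) for i in 1:size(X, 2)]

# Device method: one whole-array reduction along the first dimension.
# It is written against the abstract supertype, not a specific vendor type.
colnorms(X::AbstractMockGPUMatrix) = vec(sqrt.(sum(abs2, X; dims=1)))

# Helper that only reports which method dispatch would select.
code_path(::AbstractMatrix) = :column_loop
code_path(::AbstractMockGPUMatrix) = :whole_array

X = randn(ComplexF64, 50, 4)
Xd = MockVendorMatrix(X)
@assert colnorms(X) ≈ colnorms(Xd)
@assert code_path(X) == :column_loop
@assert code_path(Xd) == :whole_array
```

Because the specialised method targets the abstract supertype, any additional concrete array type that subtypes it would pick up the same code path without further changes.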
Ok, I see. Then I'd not have this in common and rather use a separate subdirectory |
That would probably be the clearest solution. Only caveat is possible file proliferation: |
True, but to me the preconditioner and lobpcg GPU things make sense to just be in a file |
Improves inlining and all accesses with `[ ]` involve an index check anyway.
Removed size checks: Functions will error out if sizes don't match.
|
I just removed the assertions for the size checks. Such functions are pretty low-level and either the called functions anyway perform such size checks or the size agreement is ensured by the surrounding algorithm. Such @abussy Thanks very much for the good work, laying the foundation for more GPU improvements in the future. |
@views function columnwise_dots(A::AbstractArray{T}, B::AbstractArray{T}) where {T}
    [real(dot(A[:, i], B[:, i])) for i = 1:size(A, 2)]
end

# Returns a vector of real(dot(A[:, i], M, B[:, i])), for all columns of
# A, B, and matrix M
@views function columnwise_dots(A::AbstractArray{T}, M, B::AbstractArray{T}) where {T}
    [real(dot(A[:, i], M, B[:, i])) for i = 1:size(A, 2)]
end
What is the reasoning for the real calls here?
(asking because I don't see them in the GPU version - did I miss something?)
I think that's most likely a mistake. I assume it comes from the original preconditioner code:
DFTK.jl/src/eigen/preconditioners.jl, lines 75 to 77 in 3c34d38
But `real` is still taken there now, so I think it can go.
Do you want me to open an issue to remember?
I'll take care of it immediately
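To make the `real` question in this thread concrete, here is a minimal sketch comparing a column loop with a whole-array formulation of the column-wise dot products. The names `columnwise_dots_loop` and `columnwise_dots_array` are hypothetical and chosen to avoid clashing with the function in the diff; this is not the GPU version from the PR. It only shows that the raw dot products are complex in general and have a vanishing imaginary part when both arguments are the same matrix.

```julia
using LinearAlgebra

# Loop form without `real`: one dot product per column.
@views function columnwise_dots_loop(A::AbstractMatrix, B::AbstractMatrix)
    [dot(A[:, i], B[:, i]) for i in 1:size(A, 2)]
end

# Whole-array form: elementwise conj(A) .* B, then one reduction over rows.
# `dot` conjugates its first argument, hence the `conj.` here.
function columnwise_dots_array(A::AbstractMatrix, B::AbstractMatrix)
    vec(sum(conj.(A) .* B; dims=1))
end

A = randn(ComplexF64, 100, 5)
B = randn(ComplexF64, 100, 5)

# Both formulations agree up to rounding.
@assert columnwise_dots_loop(A, B) ≈ columnwise_dots_array(A, B)

# With A == B the result is the squared column norm, so the imaginary
# part vanishes up to rounding and taking `real` only changes the element type.
@assert all(abs.(imag.(columnwise_dots_loop(A, A))) .< 1e-10)
```

For distinct `A` and `B` the imaginary part is generally nonzero, so whether `real` is appropriate depends on what the caller expects to receive.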
This PR is the result of a detailed profiling of the LOBPCG solver with NVIDIA's Nsight Systems. It allowed for the identification of various hot spots, where code is very slow during GPU runs.
In particular, there are many instances of explicit loops over matrix columns. This access pattern is not ideal, as the massive parallelism of the GPU is not fully exploited. Array operations on the whole matrix are far more efficient.
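As a small illustration of the access pattern described here, the sketch below computes column norms once with an explicit loop over columns and once with a single whole-array reduction. The function names are hypothetical and the code runs on plain CPU arrays; it is meant to show the shape of the rewrite, not the exact code of this PR.

```julia
using LinearAlgebra

# Column-by-column version: one small reduction per column.
function column_norms_loop(X::AbstractMatrix)
    [norm(@view X[:, i]) for i in 1:size(X, 2)]
end

# Whole-array version: one reduction along the first dimension,
# followed by a broadcast square root.
function column_norms_array(X::AbstractMatrix)
    vec(sqrt.(sum(abs2, X; dims=1)))
end

X = randn(ComplexF64, 200, 8)
@assert column_norms_loop(X) ≈ column_norms_array(X)
```

On a GPU array, the loop version would typically issue one small operation per column, while the whole-array version expresses the same work as a single reduction plus a broadcast.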
I measured speed-ups of the order of 30% on the whole LOBPCG iterative solver. Excluding the cost of the H x Psi product (not modified in this PR), the speed-ups reach 50%.
Unfortunately, this comes at the cost of some code readability. I left comments describing what is calculated when necessary.
Finally, I scrapped two loops using DFTK's custom threading (in `ortho! X vs Y` and `ldiv!` for the preconditioner). I made sure the effect is negligible on CPU runs (tested with the default `n_DFTK=n_blas` thread option). It seems that simple BLAS threading on large array operations is quite efficient by itself.