GPU optimized symmetry operations #1097
Changing the order of the loops reduces the number of kernel calls and large copies. As a result, performance is greatly improved:
Note that this introduces some level of clunkiness. For example, all data accessed within the |
mfherbst
left a comment
Nice. Your current implementation sparked another idea: fuse both functions. If it's too complicated, don't bother, but I think it should work.
@abussy Failing test is the |
There is something off with the way that test is set up. If I switch the order of these 2 lines, suddenly the test passes. The actual numbers also change if I do a warmup before the test (calling the same 2 lines without the My interpretation is that, since the In any case, that test is not well designed (that's on me). I'd suggest we get rid of it. |
The Yeah I agree the test does not really make sense here any more. Is it still worth it to test that |
I guess we can. However, it might be good to keep a trace of it, in case similarly slow iterations pop up in the future. The function itself does not crowd the code base too much.
I think that testing inlining cannot hurt. I modified the test to analyze the output of |
When simulating large systems with symmetries, the symmetrization of the density is extremely slow. This is because the array is first transferred to the CPU, before a double loop over symmetries and G vectors takes place. Compared with loops running on the GPU, this is slow, and the operation becomes a major bottleneck.
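To make the loop structure concrete, here is a toy sketch in plain Julia. It is not the DFTK code: the function names, the `perm` index table and the `weight` vector are all hypothetical stand-ins for the real symmetry data. It contrasts a symmetry-outer loop, which issues one array operation per symmetry, with an index-outer `map!`, which does all the work in a single call.

```julia
# Hypothetical toy model (not DFTK's data layout): symmetry s maps index i
# to perm[i, s] and contributes weight[s] * ρ[perm[i, s]] to the result.

# Symmetry-outer version: one broadcast per symmetry, each with a temporary
# gathered copy of ρ. On a GPU array this would mean one kernel per symmetry.
function accumulate_sym_outer!(out, ρ, perm, weight)
    fill!(out, zero(eltype(out)))
    for s in axes(perm, 2)
        out .+= weight[s] .* ρ[perm[:, s]]
    end
    out
end

# Index-outer version: a single map! over the output indices, with the loop
# over symmetries moved inside the closure.
function accumulate_index_outer!(out, ρ, perm, weight)
    map!(out, eachindex(out)) do i
        acc = zero(eltype(out))
        for s in axes(perm, 2)
            acc += weight[s] * ρ[perm[i, s]]
        end
        acc
    end
    out
end

ρ      = ComplexF64[1, 2, 3, 4]
perm   = [1 2; 2 3; 3 4; 4 1]   # column s holds the source index for each i
weight = [0.5, 0.5]

a = accumulate_sym_outer!(similar(ρ), ρ, perm, weight)
b = accumulate_index_outer!(similar(ρ), ρ, perm, weight)
```

Both functions return the same array here. The sketch only runs on CPU arrays as written. Whether the second form compiles to a single kernel on a given GPU backend depends on every captured array living on the device, which is an assumption rather than something this example checks.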
This PR introduces GPU-specific implementations for `accumulate_over_symmetries!` and `lowpass_for_symmetry!`, using the `map!` construct to run on the device, and thus saving data transfers. The loop itself is also greatly accelerated. Note that these new implementations do run on the CPU, but slower than the current solution.

For illustration, using the input below, I observe these timings (seconds):