I think we can probably simplify the GPU code quite a bit.

- [ ] Look at simplifying error handling, using the examples in the CUDA developer docs as a guide.
- [ ] Use streams to improve performance.
- [ ] Get cuFFT working with MPI and multi-GPU (hopefully we can do this if we have one rank orchestrate things).
- [ ] Allow Mhysa to specify which GPU each rank should connect to. That would allow us to do away with the use of MPS.
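
For the last item, one possible shape for the rank-to-GPU mapping is sketched below. This is only a sketch under assumptions: `pick_device` and the `requested` list are hypothetical names, not existing Mhysa code or options. In a real build the returned index would presumably be passed to `cudaSetDevice`, with the device count taken from `cudaGetDeviceCount` and the node-local rank obtained from something like `MPI_Comm_split_type` with `MPI_COMM_TYPE_SHARED`.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical helper: choose the GPU index for one MPI rank.
//   local_rank   - rank within the node (assumed to be computed elsewhere)
//   device_count - number of visible GPUs on the node
//   requested    - optional explicit mapping, requested[local_rank] = GPU index
//                  (an assumed user-facing option; empty means "no preference")
int pick_device(int local_rank, int device_count,
                const std::vector<int>& requested = {}) {
    if (device_count <= 0) {
        throw std::invalid_argument("no GPUs visible");
    }
    if (local_rank < 0) {
        throw std::invalid_argument("negative local rank");
    }

    int dev;
    if (!requested.empty()) {
        // Explicit mapping: every rank must have an entry, and it must be valid.
        if (local_rank >= static_cast<int>(requested.size())) {
            throw std::out_of_range("no GPU requested for this rank");
        }
        dev = requested[local_rank];
        if (dev < 0 || dev >= device_count) {
            throw std::out_of_range("requested GPU index does not exist");
        }
    } else {
        // Fallback: spread ranks over the GPUs round-robin.
        dev = local_rank % device_count;
    }

    assert(dev >= 0 && dev < device_count);
    return dev;  // the caller would then call cudaSetDevice(dev)
}
```

With an explicit mapping, several ranks only share a GPU if the user asks for it, which is the case where MPS would still matter. The round-robin fallback shares GPUs whenever there are more ranks per node than devices.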