[GSOC24] Addition of CUDA and GPU Acceleration to FGMRES Linear Solver in SU2 #2346
Conversation
Created Final Report
To learn the basics it's a good idea, but for large-scale projects I prefer using existing libraries where possible. It's fairly lightweight too, not a huge dependency like Trilinos or PETSc.
I can test on various GPUs (P100/V100/A100 and a 4070 Mobile), including single-node multi-GPU setups.
Is your idea to return to this project as part of GSoC, or is it just personal interest?
@pcarruscag No, I won't be applying to GSoC at any time in the future. It's just that I started this work and I would like to see it through. If anyone is selected for GPU Acceleration this time, I would be more than happy to collaborate with them to make sure the project receives as much help as it can.
29 Mar 2025 Commits

The commits made so far aim to provide more control over memory access and transfer, so that we can make further changes to the surrounding linear algebra functions which make up the FGMRES solver.

Streamlined Vector Storage on the GPU

Each vector definition now also creates its corresponding analog in GPU memory. This is done by allocating memory with cudaMalloc when the Initialize function of the CSysVector class is called. Vector data can therefore persist in GPU memory between calls to the Matrix Vector Product and other linear algebra functions. The previous implementation only allowed this data to persist during a single call of the Matrix Vector Product, so it had to be refreshed for each call. This mirrors how the matrix was already stored. Saving the matrix in pinned memory has also been removed because of its large size, as pointed out by earlier feedback. The device pointer can be accessed at any point using a new dedicated public member function (GetDevicePointer) of the CSysVector class.

Added Memory Transfer Control

As previously discussed, we needed more control over memory transfers between calls so that a larger share of the computation can be carried out on the GPU without repeated memory transfers. These transfers are now carried out by member functions with a built-in flag that decides whether the copy needs to be performed. The flag is true by default and does not need to be specified every time. Further changes are needed to actually use this flag to reduce the frequency of memory transfers - namely a variable that lets the inner loop of FGMRES tell the MatrixVectorProduct function when to switch the flag on or off. This will be added after I port the preconditioner.

Minor Change - Standardized the File Structure Slightly

Redundant .cuh header files are now gone.
I have added a GPUComms.cuh file so that any functions that need to be accessed by all CUDA files - like error checking - can be added there for future reference. I've also added GPUVector and GPUMatrix files, each containing the CUDA wrapper member functions for the CSysVector and CSysMatrix classes respectively. Please let me know if you notice any bugs or if my implementations can be improved in any way.
I'll now move forward and start porting the LU_SGS preconditioner.
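To illustrate the storage and transfer-control scheme described above, here is a minimal C++ sketch. Plain heap memory stands in for the CUDA allocations so the sketch is self-contained; the class and method names other than Initialize and GetDevicePointer (e.g. HtDTransfer) are illustrative, not the PR's actual signatures.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Sketch of the persistent-device-storage pattern: the device buffer is
// allocated once in Initialize and reused across linear algebra calls.
// Heap memory stands in for cudaMalloc/cudaMemcpy here.
class CSysVectorSketch {
  std::vector<double> host;  // host-side data
  double* d_vec = nullptr;   // device analog, allocated once

 public:
  void Initialize(std::size_t n, double val) {
    host.assign(n, val);
    if (d_vec == nullptr)
      d_vec = new double[n];  // stand-in for cudaMalloc(&d_vec, n * sizeof(double))
  }

  // The device pointer stays valid between calls, so other routines can
  // reuse it without re-uploading the data each time.
  double* GetDevicePointer() const { return d_vec; }

  // Copy host -> device only when the flag says so (true by default),
  // mirroring the transfer-control flag described above.
  void HtDTransfer(bool copy = true) {
    if (copy)  // stand-in for cudaMemcpy(..., cudaMemcpyHostToDevice)
      std::memcpy(d_vec, host.data(), host.size() * sizeof(double));
  }

  ~CSysVectorSketch() { delete[] d_vec; }
};
```

The key point is that a second Initialize (or a transfer with the flag off) does not reallocate or touch the device buffer, which is what lets the FGMRES inner loop skip redundant copies.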
Can you fix the merge conflicts so the regression tests can run in this PR?
It shows that merging is blocked (I guess a review is required?). Also, please let me know how you would like me to keep the fork synced with the original - through merge or rebase.
👍 address the minor comments + the version updates we just talked about and let's merge 🥳
PR ready to merge 🥳 @pcarruscag would you like to do the honours?
Proposed Changes
This is the modified version of the SU2 code that supports CUDA for the FGMRES solver and the use of NVBLAS. The main focus is offloading the Matrix Vector Product in the FGMRES solver to the GPU using CUDA kernels. This implementation shows promise, with marginally better run times (all benchmarks were carried out with GPU error checking switched off, and in debug mode to verify that the correct functions were being called).
The use of NVBLAS is secondary: while functionality has been added to make it usable, it is not activated because it does not give an appreciable increase in performance.
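For reference, the operation being offloaded is a sparse matrix-vector product. SU2's CSysMatrix uses a block-CSR layout, but the arithmetic per row is the same as in this minimal scalar-CSR sketch (the function name and signature here are illustrative, not SU2's):

```cpp
#include <cstddef>
#include <vector>

// Minimal CSR matrix-vector product y = A*x -- the computation the PR
// offloads to the GPU. In the CUDA version each row (or block row) is
// assigned to a thread; this scalar loop shows the same arithmetic.
std::vector<double> csr_matvec(const std::vector<int>& row_ptr,
                               const std::vector<int>& col_ind,
                               const std::vector<double>& val,
                               const std::vector<double>& x) {
  std::vector<double> y(row_ptr.size() - 1, 0.0);
  for (std::size_t i = 0; i + 1 < row_ptr.size(); ++i)   // one thread per row on the GPU
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
      y[i] += val[k] * x[col_ind[k]];
  return y;
}
```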
Compilation and Usage
Compile using the following Meson flag,
and activate the functions using the following config file option.
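The exact snippets are not shown inline here; as a rough sketch, the build and configuration steps look like the following. The option names (enable-cuda, ENABLE_CUDA) are hypothetical placeholders - the authoritative names are defined in the PR's meson_options.txt and config template.

```shell
# Hypothetical option names -- check the PR for the real ones.
./meson.py build -Denable-cuda=true
./ninja -C build install

# and in the SU2 configuration file:
# ENABLE_CUDA= YES
```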
NOTE ON IMPLEMENTATION
I've decided to go with a single version of the code where the CPU and GPU implementations co-exist in the same linear solver and can be disabled or switched using a combination of Meson and config file options. To this end I have defined three classes: an over-arching class named CExecutionPath with two child classes, CCpuExecution and CGpuExecution. These child classes contain the correct member function for each path - CPU or GPU execution.
All of this could also be achieved with an if statement that switches between the two - but that implementation would evaluate the branch on every call. With this design, once a Matrix Vector Product object is created, it immediately knows whether to use the CPU or GPU mode of execution.
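A minimal sketch of that class layout, with the member function reduced to a toy signature (the real classes in the PR carry the actual matrix-vector product arguments):

```cpp
#include <memory>
#include <string>

// Sketch of the execution-path design: the CPU/GPU decision is made once,
// at construction, instead of being re-tested on every call.
struct CExecutionPath {
  virtual std::string MatrixVectorProduct() const = 0;  // simplified signature
  virtual ~CExecutionPath() = default;
};

struct CCpuExecution : CExecutionPath {
  std::string MatrixVectorProduct() const override { return "cpu"; }
};

struct CGpuExecution : CExecutionPath {
  // In the real code this path launches the CUDA kernels.
  std::string MatrixVectorProduct() const override { return "gpu"; }
};

// The product object picks its path once; later calls dispatch through
// the vtable with no per-call if-statement.
std::unique_ptr<CExecutionPath> MakePath(bool useCuda) {
  if (useCuda) return std::make_unique<CGpuExecution>();
  return std::make_unique<CCpuExecution>();
}
```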
Recommendations to improve this implementation are most welcome.
PR Checklist
Warnings do appear (only at warning level 3), but they are all of the following type
Does the documentation for compiling with CUDA need to be added by forking the SU2 site repo and adding the relevant changes there, or do I need to contact someone to change things on the site itself?
Doxygen documentation and the config template are all updated.
Ran pre-commit run --all to format old commits.