[GSOC24] Addition of CUDA and GPU Acceleration to FGMRES Linear Solver in SU2 #2346


Merged: 49 commits, Jun 24, 2025

Conversation

@areenraj areenraj commented Aug 27, 2024

Proposed Changes

This is a modified version of the SU2 code that adds CUDA support to the FGMRES solver, along with optional use of NVBLAS. The main focus is offloading the matrix-vector product in the FGMRES solver to the GPU using CUDA kernels. This implementation shows promise, with marginally better run times (all benchmarks were carried out with GPU error checking switched off, and in debug mode to verify that the correct functions were being called).
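For context, the operation being offloaded is a sparse matrix-vector product. Below is a minimal plain-C++ sketch of the per-row computation; in the GPU version each CUDA thread would handle a row in parallel. The CSR-style layout and function name here are illustrative only, not SU2's actual CSysMatrix storage.

```cpp
#include <cassert>
#include <vector>

// Illustrative CSR-style sparse matrix-vector product y = A*x.
// In the CUDA kernel, each thread computes one row of y in parallel;
// here the outer loop is sequential so the sketch runs on the host.
std::vector<double> MatrixVectorProduct(const std::vector<int>& row_ptr,
                                        const std::vector<int>& col_ind,
                                        const std::vector<double>& val,
                                        const std::vector<double>& x) {
  std::vector<double> y(row_ptr.size() - 1, 0.0);
  for (std::size_t row = 0; row + 1 < row_ptr.size(); ++row) {
    double sum = 0.0;
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
      sum += val[k] * x[col_ind[k]];  // accumulate the nonzeros of this row
    y[row] = sum;
  }
  return y;
}
```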

The use of NVBLAS is secondary; while functionality has been added to make it usable, it is not activated because it does not yield an appreciable increase in performance.

Compilation and Usage

Compile using the following Meson flag:

```
-Denable-cuda=true
```

and activate the functions using the following config file option:

```
ENABLE_CUDA=YES
```

NOTE ON IMPLEMENTATION

I've decided to go with a single version of the code in which the CPU and GPU implementations co-exist in the same linear solver and can be switched or disabled through a combination of Meson and config file options. To support this, I have defined three classes: an over-arching class named CExecutionPath with two child classes, CCpuExecution and CGpuExecution. Each child class contains the member function for its execution path, CPU or GPU.

All of this could also be achieved with an if statement that switches between the two, but that implementation would evaluate the branch on every call. With the current design, once a matrix-vector product object is created, it immediately knows whether to use the CPU or GPU mode of execution.
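As a rough sketch of that class structure (with hypothetical, simplified interfaces; the actual SU2 classes take different arguments and the GPU path would launch CUDA kernels), the CPU/GPU decision is resolved once at construction rather than branching on every call:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Simplified sketch of the dispatch pattern described above: the
// CPU-vs-GPU branch is resolved once, when the product object is
// constructed, instead of being re-evaluated on every call.
struct CExecutionPath {
  virtual void MatVec(const std::vector<double>& x, std::vector<double>& y) = 0;
  virtual ~CExecutionPath() = default;
};

struct CCpuExecution : CExecutionPath {
  void MatVec(const std::vector<double>& x, std::vector<double>& y) override {
    y = x;  // stand-in for the CPU matrix-vector product
  }
};

struct CGpuExecution : CExecutionPath {
  void MatVec(const std::vector<double>& x, std::vector<double>& y) override {
    y = x;  // in the real code this would launch a CUDA kernel; stubbed here
  }
};

struct CMatrixVectorProduct {
  std::unique_ptr<CExecutionPath> path;
  explicit CMatrixVectorProduct(bool use_cuda)
      : path(use_cuda ? std::unique_ptr<CExecutionPath>(new CGpuExecution)
                      : std::unique_ptr<CExecutionPath>(new CCpuExecution)) {}
  void operator()(const std::vector<double>& x, std::vector<double>& y) {
    path->MatVec(x, y);  // virtual dispatch, no per-call if statement
  }
};
```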

Recommendations are most welcome to improve or make this implementation better

PR Checklist

Warnings do appear (only at level 3), but they are all of the following type:

style of line directive is a GCC extension

Should the documentation for compiling with CUDA be added by forking the SU2 site repo and submitting the relevant changes there, or do I need to contact someone to change things on the site itself?

Doxygen documentation and the config template are all updated.

  • I am submitting my contribution to the develop branch.
  • My contribution generates no new compiler warnings (try with --warnlevel=3 when using meson).
  • My contribution is commented and consistent with SU2 style (https://su2code.github.io/docs_v7/Style-Guide/).
  • I used the pre-commit hook to prevent dirty commits and used pre-commit run --all to format old commits.
  • I have added a test case that demonstrates my contribution, if necessary.
  • I have updated appropriate documentation (Tutorials, Docs Page, config_template.cpp), if necessary.

@kursatyurt
Contributor

To learn the basics it's a good idea, but for large-scale projects I prefer using existing libraries where possible. Those libraries generally exploit state-of-the-art techniques such as mixed-precision computing. A gaming GPU is not much faster than a good CPU in double precision, but it is far faster in single precision: most gaming GPUs have a 64:1 single-to-double throughput ratio, whereas server-class GPUs are closer to 2:1. When available, they also use vendor libraries like cuBLAS or hipBLAS. It is always nice when you only have to care about the connection and somebody else keeps the solver as performant as possible; in the future they will probably provide more and more solvers, and it will all just work automagically.

It is kind of lightweight too, not a huge dependency like Trilinos or PETSc.

I can test on various GPUs (P100/V100/A100 and a 4070 Mobile), on single-node multi-GPU setups, etc.

@areenraj
Author

@kursatyurt Hello, thank you so much for the lead.

Our initial scope mostly involved writing our own kernels, and I did explore some libraries at the start. I was planning on using CUSP as well, but my main concern was that it has not been updated for newer versions of the toolkit. cuSolver and cuBLAS do exist, but I chose to go ahead with a "simple" kernel implementation to have more control. I also felt that if I could keep the block size of the grid in optimal territory, the kernels could be just as fast as those options (please do correct me if my reading of the literature or the situation is incorrect).

I was not aware of Ginkgo; I will surely give it a go and try to produce some comparative results. I am currently very busy this month and will get back to working on the code with some delay.

Again, thank you for the lead!

@pcarruscag
Member

Is your idea to return to this project as part of GSOC or just personal interest?

@areenraj
Author

Is your idea to return to this project as part of GSOC or just personal interest?

@pcarruscag No, I won't be applying to GSoC any time in the future. It's just that I started this work and I would like to see it through. If anyone is selected for GPU acceleration this time, I would be more than happy to work and collaborate with them to make sure the project receives as much help as it can.

@areenraj
Author

29 Mar 2025 Commits

The commits currently made aim to provide more control over memory access and transfer so that we can make further changes to the surrounding linear algebra functions which make up the FGMRES Solver.

Streamlined Vector Storage in the GPU

Each vector definition now also creates its corresponding analog in GPU memory. This is done by allocating memory for it with cudaMalloc when the Initialize function of the CSysVector class is called. This allows vector data to be stored continuously in GPU memory between calls of the matrix-vector product and other linear algebra functions. The previous implementation only allowed this data to persist during a single call of the matrix-vector product, and it had to be refreshed for each call.

This mirrors how the matrix was already being stored. Saving the matrix in pinned memory has also been removed due to its large size, as pointed out in earlier feedback. The device pointer can be accessed at any point using a new dedicated public member function (GetDevicePointer) of the CSysVector class.
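A rough sketch of the ownership pattern described above (hypothetical, simplified; plain malloc/free stand in for cudaMalloc/cudaFree so the sketch runs on a host without a GPU, and the real CSysVector interface differs):

```cpp
#include <cassert>
#include <cstdlib>

// Simplified sketch of a host vector that owns a device-side analog.
// In the real CUDA code the buffer is allocated with cudaMalloc in
// Initialize() and released with cudaFree; malloc/free stand in here
// so the ownership pattern can be demonstrated without a GPU.
class VectorWithDeviceCopy {
  double* d_vec = nullptr;  // device buffer, persists between calls
  unsigned long size = 0;

 public:
  void Initialize(unsigned long n) {
    size = n;
    d_vec = static_cast<double*>(std::malloc(n * sizeof(double)));
    // real code: cudaMalloc(reinterpret_cast<void**>(&d_vec), n * sizeof(double));
  }
  double* GetDevicePointer() const { return d_vec; }  // accessible at any point
  ~VectorWithDeviceCopy() {
    std::free(d_vec);  // real code: cudaFree(d_vec);
  }
};
```

The point of the pattern is that the device allocation lives as long as the vector itself, so intermediate linear-algebra calls can reuse it instead of re-allocating.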

Added Memory Transfer Control

As previously discussed, we needed more control over memory transfers between calls so that a larger share of the computation can be carried out on the GPU without multiple transfers. These transfers are now performed by member functions with a built-in flag that decides whether the copy should be carried out. The flag defaults to true and does not need to be specified every time.

Further changes are necessary to actually use this flag to reduce the frequency of memory transfers, namely a variable that lets the inner loop of FGMRES tell the MatrixVectorProduct function when to switch the flag on or off. This will be added after I port the preconditioner.
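The transfer-control flag can be sketched as follows (hypothetical names; a host-side buffer and memcpy stand in for GPU memory and cudaMemcpy so the example runs without a GPU):

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Sketch of a host-to-device transfer guarded by a default-true flag.
// memcpy stands in for cudaMemcpy(..., cudaMemcpyHostToDevice); a
// caller that knows the data is already on the device passes
// copy=false to skip the redundant transfer between consecutive calls.
struct DeviceVector {
  std::vector<double> device;  // stand-in for GPU memory
  int copies_performed = 0;

  void HtDTransfer(const std::vector<double>& host, bool copy = true) {
    if (!copy) return;  // data already resident on the device: skip
    device.resize(host.size());
    std::memcpy(device.data(), host.data(), host.size() * sizeof(double));
    ++copies_performed;
  }
};
```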

Minor Change - Standardized File Structure Slightly

Redundant .cuh header files are now gone. I have added a GPUComms.cuh file so that any functions that need to be accessed by all CUDA files, such as error checking, can be added there for future reference. I've also added GPUVector and GPUMatrix files, each containing the CUDA wrapper member functions for the CSysVector and CSysMatrix classes respectively.

Please let me know if you notice any bugs or if my implementations can be improved in any way.

@areenraj
Author

I'll now move forward and start porting the LU_SGS Preconditioner

@pcarruscag
Member

Can you fix the merge conflicts so the regression tests can run in this PR?

@areenraj
Author

areenraj commented May 8, 2025

Can you fix the merge conflicts so the regression tests can run in this PR?

It shows that merging is blocked (I guess a review is required?). Also, please let me know how you wish for me to keep the fork synced with changes in the original, either through merge or rebase.

@areenraj areenraj added the GSoC Google Summer of Code label Jun 9, 2025
@pcarruscag
Member

👍 address the minor comments + the version updates we just talked about and let's merge 🥳

@areenraj
Author

PR Ready to merge 🥳

@pcarruscag would you like to do the honours?

@areenraj areenraj merged commit a0ab3da into su2code:develop Jun 24, 2025
35 checks passed