[GSOC24] Addition of CUDA and GPU Acceleration to FGMRES Linear Solver in SU2 #2346
base: develop
Conversation
Created Final Report
Nice work
gpuErrChk(cudaMemcpy((void*)(d_matrix), (void*)&matrix[0], (sizeof(ScalarType)*mat_size), cudaMemcpyHostToDevice));
gpuErrChk(cudaMemcpy((void*)(d_vec), (void*)&vec[0], (sizeof(ScalarType)*vec_size), cudaMemcpyHostToDevice));
gpuErrChk(cudaMemcpy((void*)(d_prod), (void*)&prod[0], (sizeof(ScalarType)*vec_size), cudaMemcpyHostToDevice));
You don't need to copy the product; you just need to memset it to 0.
Good catch, will add this. Thank you
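For reference, a minimal sketch of the suggested change, reusing the d_prod and vec_size names from the snippet above; the third copy can become a device-side memset, since zeroing the bytes of an IEEE floating-point buffer yields 0.0.

```cpp
// Replace the host-to-device copy of the product vector with a device-side clear.
gpuErrChk(cudaMemset((void*)d_prod, 0, sizeof(ScalarType) * vec_size));
```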
// Each thread handles one (row, j, k) entry of an nVar x nEqn block:
// the block's x dimension spans floor(1024/(nVar*nEqn)) rows, y and z span the block entries,
// and the grid is sized so all nPointDomain rows are covered.
double xDim = (double) 1024.0/(nVar*nEqn);
dim3 blockDim(floor(xDim), nVar, nEqn);
double gridx = (double) nPointDomain/xDim;
dim3 gridDim(ceil(gridx), 1, 1);
Can you document the choice of work distribution between blocks and threads?
// Loop over the nonzero blocks of row i: matrix_index points to the start of the
// (nVar x nEqn) block, vec_index to the matching segment of the vector.
for(int index = d_row_ptr[i]; index<d_row_ptr[i+1]; index++)
{
    int matrix_index = index * nVar * nEqn;
    int vec_index = d_col_ind[index] * nEqn;
    ...
    res += matrix[matrix_index + (j * nEqn + k)] * vec[vec_index + k];
}
Is this based on some publication? Did you experiment with other divisions of work?
For example, I see you are going for coalesced access to the matrix blocks, but this requires multiple reads of the same vector entries.
I haven't experimented with different implementations; I went with this one because it seemed optimal. It does access the same vector elements repeatedly while going through an entire row.
I haven't looked into publications yet, but you're right, there may be a better way to do this and I'll look into them. If you have any recommendations for improvements, please let me know.
Also, the current bottleneck is the memory copy between the CPU and GPU; the kernel launch itself is about 20 times faster than the repeated copies. Any insights on that would also be greatly appreciated.
Our current approach to circumvent this is to port not just single subroutines like the matrix-vector multiplication, but the entire Krylov solver loop - where it searches over all the search directions in the subspace - to the GPU. This would cut down on the repeated memory transfers.
You can do something like a flag in CSysMatrix that is set to true when the matrix is uploaded to GPU and set to false when the matrix changes (for example when we clear the matrix to write new blocks).
You can use pinned host memory to make the transfers faster.
You can try uploading the matrix in chunks and overlap the uploads with the CPU work of filling another chunk.
Ultimately, the issue of transferring the matrix only goes away by porting the entire code 😅
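As a rough illustration of the first suggestion (the class and member names here are hypothetical, not the actual CSysMatrix API), the idea is to track whether the device copy is still valid and skip redundant uploads:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Illustrative sketch of the "upload flag" idea; names are hypothetical.
class MatrixUploadCache {
 public:
  // Call whenever the host-side coefficients change
  // (e.g. when the matrix is cleared to write new blocks).
  void MarkHostModified() { deviceUpToDate = false; }

  // Copy the matrix to the GPU only when the device copy is stale.
  void UploadIfNeeded(const double* hostData, double* deviceData, size_t bytes) {
    if (deviceUpToDate) return;  // skip the redundant transfer
    cudaMemcpy(deviceData, hostData, bytes, cudaMemcpyHostToDevice);
    deviceUpToDate = true;
  }

 private:
  bool deviceUpToDate = false;
};
```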
Regarding recommendations I have to be a bit cryptic because of my current job, but the general goals are coalesced access, and avoid reading or writing the same global memory location more than once.
I read this paper before my current job: "Optimization of Block Sparse Matrix-Vector Multiplication on Shared-Memory Parallel Architectures".
But like you said, there is much more to gain by porting more linear algebra operations than to micro-optimize the multiplications.
Hmm, I'm a bit confused here. The whole purpose of my project and of the previous pull request has been to gradually move the FGMRES solver to the GPU.
Do you mean that we should just port the helper functions, like the Matrix Vector Product and the other linear algebra functions, and let the solver decide based on user-given parameters?
Also I'll try to catch a dev meeting so that we can maybe discuss how to move forward from here.
Correct, the same way the FGMRES doesn't need to know how to multiply matrices and vectors, it doesn't need to know anything about GPUs.
Got it, I will get on this and will also discuss some things I have in mind during the dev meeting.
I can't attend the dev meeting today. I suggest you open an issue; that way we can talk about the plan, and the GSOC candidate that may work on the project can also benefit from us writing everything down.
...This should increase performance (although I am still testing it).
Nsight shows the new kernel is twice as fast, although this doesn't mean much for overall performance since the bottleneck is still the memory transfer.
I could not follow the work here; I just watched the conference presentation. Since much work has been done already, have you considered using an existing linear algebra library such as https://github.com/ginkgo-project/ginkgo? AFAIK the only thing you need to do is copy the matrix into the Ginkgo format on the GPU. Ginkgo then provides an efficient, scalable solver that works not only on NVIDIA but also on AMD and Intel GPUs. GMRES and ILU preconditioners are available there, so it is pretty much ready to go for all problems.
@kursatyurt Hello, thank you so much for the lead. Our initial scope mostly involved writing our own kernels, and I did explore some libraries at the start - I was planning on using CUSP as well, but my main concern was that it is no longer updated for newer versions of the toolkit. cuSolver and cuBLAS do exist, but I chose to go ahead with a "simple" kernel implementation to have more control. I also felt that if I could keep the block size of the grid in optimal territory, it could be just as fast as those options (please do correct me if my reading of the literature or the situation was incorrect). I was not aware of Ginkgo; I will surely give it a go and try to produce some comparative results. I am currently very busy this month and will get to working on the code with some delay. Again, thank you for the lead!
Also, could you elaborate on what you meant by "I could not follow the work here"? If there is a specific doubt about the work, I would love to clarify it over Slack whenever I get the time.
I am not very familiar with the linear solver implementation in SU2 - it's not about the work itself.
Writing your own kernels is a good way to learn the basics, but for large-scale projects I prefer using existing libraries where possible. Ginkgo is fairly lightweight too, not a huge dependency like Trilinos or PETSc.
I can test on various GPUs (P100/V100/A100 and a 4070 Mobile), on single-node multi-GPU setups, etc.
Is your idea to return to this project as part of GSOC or just personal interest?
#if defined(HAVE_CUDA)
gpuErrChk(cudaMalloc((void**)(&d_row_ptr), (sizeof(row_ptr) * (nPointDomain + 1.0))));
gpuErrChk(cudaMalloc((void**)(&d_col_ind), (sizeof(col_ind) * nnz)));
gpuErrChk(cudaMallocHost((void**)(&matrix), (sizeof(ScalarType) * nnz * nVar * nEqn)));
What is the idea with this change?
This prevents CUDA memory functions like cudaMalloc from being compiled when the user doesn't have CUDA. The conditional block is only built when the user specifies that they want to compile with CUDA and has the toolkit along with nvcc installed. If I leave it as is, compiling with just gcc throws an error.
If there is a different/better way of doing this then please do let me know :D
I'm talking about cudaMallocHost instead of cudaMalloc
Oh, that was done to store the matrix in pinned memory, since transfers from it are faster.
But you removed d_matrix, so now the GPU is working directly from host memory which is probably slow.
Also if you read the docs for cudaMallocHost it's not advised to use large chunks of pinned memory, and the matrix is the largest block of memory in SU2...
> But you removed d_matrix, so now the GPU is working directly from host memory which is probably slow.

My reasoning was that when pinned memory is allocated on 64-bit OSes, the memory is also mapped into the GPU's address space, since the conditions for unified virtual addressing are met. This method proves to be faster than a regular memcpy when each address is only read/written once. In our case, the MatVec kernel only reads the matrix values once per cudaMemcpy, so I switched it out this way. If my understanding of this is wrong then do let me know.
But now that we will be working on porting the other linear algebra functions, it makes more sense to use the cudaMemcpy approach of transferring the pinned memory to the GPU, as the same variables will be accessed multiple times over the course of a single memory transfer.

> Also if you read the docs for cudaMallocHost it's not advised to use large chunks of pinned memory, and the matrix is the largest block of memory in SU2

I never ran into a problem with this, as the cases I ran didn't take a performance hit. But I can understand that with an increase in problem size the pinned memory could reduce the total available system memory and bottleneck the performance of the solver. Would you recommend using pinned memory for the input and output vectors then?
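For what it's worth, a minimal sketch of the explicit-copy path discussed here, under the assumption that the device buffer is reused across many kernel launches (all names are illustrative):

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Stage the matrix in a pinned host buffer, but keep a dedicated device copy so
// kernels read GPU memory instead of pulling host memory over PCIe on every access.
void UploadMatrix(double** h_staging, double** d_matrix, size_t bytes) {
  cudaMallocHost((void**)h_staging, bytes);  // page-locked: faster host-to-device transfers
  cudaMalloc((void**)d_matrix, bytes);       // persistent device copy
  // ... fill *h_staging on the CPU ...
  cudaMemcpy(*d_matrix, *h_staging, bytes, cudaMemcpyHostToDevice);
}
```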
@pcarruscag No, I won't be applying to GSoC any time in the future. It's just that I started this work and I would like to see it through. If anyone is selected for GPU acceleration this time then I would be more than happy to work and collaborate with them to make sure the project receives as much help as it can.
…ansfer and standardized file structure
29 Mar 2025 Commits
The commits currently made aim to provide more control over memory access and transfer so that we can make further changes to the surrounding linear algebra functions which make up the FGMRES solver.

Streamlined Vector Storage on the GPU
Each vector definition now also creates its corresponding analog in the GPU memory space. This is done by allocating memory for it with cudaMalloc when the Initialize function of the CSysVector class is called. This allows the vector data to persist in GPU memory between calls of the Matrix Vector Product and other linear algebra functions; the previous implementation only allowed this data to persist during a single call of the Matrix Vector Product, and it had to be refreshed for each call. This mirrors how the matrix was stored previously. Saving the matrix in pinned memory has also been removed due to its huge size, as pointed out by earlier feedback. The device pointer can be accessed at any point using a new dedicated public member function (GetDevicePointer) of the CSysVector class.

Added Memory Transfer Control
As previously discussed, we needed more control over the memory transfers between calls so that a larger share of the computation can be carried out on the GPU without multiple memory transfers. These transfers are now carried out by member functions with a built-in flag that decides whether the copy needs to be performed or not. The flag is set to true by default and does not need to be specified every time. Further changes are necessary to actually use this flag to decrease the frequency of memory transfers - namely a variable that lets the inner loop of FGMRES tell the MatrixVectorProduct function when to switch the flag on or off. This will be added after I port the preconditioner.

Minor Change - Standardized the File Structure Slightly
Redundant .cuh header files are now gone. I have added a GPUComms.cuh file so that any functions that need to be accessed by all CUDA files - like error checking - can be added there for future reference. I've also added GPUVector and GPUMatrix files, each containing the CUDA wrapping member functions for the CSysVector and CSysMatrix classes respectively.

Please let me know if you notice any bugs or if my implementations can be improved in any way.
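A rough illustration of the interface described above (a sketch only: GetDevicePointer, the HAVE_CUDA guard, and the default-true copy flag follow the description; the remaining signatures and the transfer-function name are approximate):

```cpp
#include <cuda_runtime.h>

template <class ScalarType>
class CSysVector {
 public:
  void Initialize(unsigned long nElmIn) {
    nElm = nElmIn;
    // ... existing host-side allocation ...
#ifdef HAVE_CUDA
    // Persistent device analog of the vector, created once and reused
    // across calls of the linear algebra routines.
    cudaMalloc((void**)&d_vec_val, nElm * sizeof(ScalarType));
#endif
  }

  // Access the device copy from anywhere (e.g. inside other GPU routines).
  ScalarType* GetDevicePointer() const { return d_vec_val; }

  // Hypothetical transfer member: the copy flag (true by default) lets callers
  // skip redundant transfers once more of the solver loop runs on the GPU.
  void CopyToDevice(bool copy = true) const {
#ifdef HAVE_CUDA
    if (copy) cudaMemcpy(d_vec_val, vec_val, nElm * sizeof(ScalarType), cudaMemcpyHostToDevice);
#endif
  }

 private:
  ScalarType* vec_val = nullptr;    // host data
  ScalarType* d_vec_val = nullptr;  // device data
  unsigned long nElm = 0;
};
```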
I'll now move forward and start porting the LU_SGS Preconditioner.
Rename to CSysVectorGPU or something similar; this implements functions of CSysVector, so it makes more sense to name it consistently with the CPU version.
Can you fix the merge conflicts so the regression tests can run in this PR?
std::cerr << "\nError in launching Matrix-Vector Product Function\n"; | ||
std::cerr << "ENABLE_CUDA is set to YES\n"; | ||
std::cerr << "Please compile with CUDA options enabled in Meson to access GPU Functions" << std::endl; |
SU2_MPI::Error(...
This is fixed. However, in the gpuErrChk function I have left it as fprintf, since the string has some placeholders in it. I could use a temporary character array along with sprintf to build the message for SU2_MPI::Error, but I am not sure whether allocating and de-allocating the temporary character array repeatedly would slow down performance, since gpuErrChk is called every time a CUDA API function is called.
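One possible shape for this (a sketch, not the PR's actual code): the buffer can live on the stack and is only formatted in the error path, so the per-call cost of gpuErrChk stays a single comparison. The SU2_MPI::Error signature with CURRENT_FUNCTION is assumed from how it is used elsewhere in SU2.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

#define gpuErrChk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, const char* file, int line) {
  if (code != cudaSuccess) {
    // Only formatted when an error actually occurs; no heap allocation involved.
    char msg[512];
    snprintf(msg, sizeof(msg), "CUDA error: %s (%s:%d)", cudaGetErrorString(code), file, line);
    // SU2_MPI::Error and CURRENT_FUNCTION come from SU2's own headers.
    SU2_MPI::Error(msg, CURRENT_FUNCTION);
  }
}
```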
void FGMRESMainLoop(std::vector<ScalarType> W, std::vector<ScalarType> Z, su2vector<ScalarType>& g,
                    su2vector<ScalarType>& sn, CSysVector<ScalarType>& cs, su2vector<ScalarType>& y,
                    su2vector<ScalarType>& H, int m, CGeometry* geometry, const CConfig* config) const;
This seems unused
if (vec_val == nullptr) vec_val = MemoryAllocation::aligned_alloc<ScalarType, true>(64, nElm * sizeof(ScalarType));
...
#if defined(HAVE_CUDA)
gpuErrChk(cudaMalloc((void**)(&d_vec_val), (sizeof(ScalarType) * nElm)));
gpuErrChk(cudaMemset((void*)d_vec_val, 0.0, (sizeof(ScalarType) * nElm)));
Similarly to how we have MemoryAllocation::aligned_alloc and free, please create wrappers for the cuda-specific functions so that we avoid #ifdef HAVE_CUDA in many places.
Done, please let me know if my handling of null pointers was accurate
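A hypothetical sketch of such a wrapper (names are illustrative, not the merged SU2 code, and it reuses the PR's gpuErrChk macro), including one way to handle the null device pointer in CPU-only builds:

```cpp
#include <cstddef>
#ifdef HAVE_CUDA
#include <cuda_runtime.h>
#endif

namespace GPUMemory {

// Allocate and zero a device buffer; returns nullptr when CUDA is not available,
// so callers must tolerate a null device pointer in CPU-only builds.
template <class T>
T* Allocate(size_t count) {
#ifdef HAVE_CUDA
  T* ptr = nullptr;
  gpuErrChk(cudaMalloc((void**)&ptr, count * sizeof(T)));
  gpuErrChk(cudaMemset(ptr, 0, count * sizeof(T)));
  return ptr;
#else
  (void)count;
  return nullptr;
#endif
}

// Free a device buffer if it exists and reset the pointer.
template <class T>
void Free(T*& ptr) {
#ifdef HAVE_CUDA
  if (ptr != nullptr) gpuErrChk(cudaFree(ptr));
#endif
  ptr = nullptr;
}

}  // namespace GPUMemory
```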
int row = (blockIdx.x * blockDim.x + threadIdx.x)/32;
int threadNo = threadIdx.x%32;
Define 32 (threads per warp) as a constant somewhere.
dim3 blockDim(1024,1,1);
double gridx = (double) nPointDomain/32.0;
gridx = double(ceil(gridx));
We have functions to round up integer division in the memory allocation toolbox.
Define 1024 as the default block size somewhere too.
I used a slight variation of the function in the allocation toolbox, with similar syntax.
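Something along these lines, for illustration (the names are mine; SU2's allocation toolbox provides its own rounding helper):

```cpp
constexpr int WARP_SIZE = 32;             // threads per warp
constexpr int DEFAULT_BLOCK_SIZE = 1024;  // default threads per block

// Integer ceiling division: how many groups of size `per` are needed to cover n items.
constexpr unsigned long roundUpDiv(unsigned long n, unsigned long per) {
  return (n + per - 1) / per;
}

// One warp per matrix row (as in the kernel above), so 32 rows per 1024-thread block:
// dim3 blockDim(DEFAULT_BLOCK_SIZE, 1, 1);
// dim3 gridDim(roundUpDiv(nPointDomain, DEFAULT_BLOCK_SIZE / WARP_SIZE), 1, 1);
```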
It shows that merging is blocked (I guess a review is required?). Also, please let me know how you would like me to keep the fork synced with changes in the original repository - either through merge or rebase.
Proposed Changes
This is a modified version of the SU2 code that supports CUDA for the FGMRES solver and the use of NVBLAS. The main focus is offloading the Matrix Vector Product in the FGMRES solver to the GPU using CUDA kernels. This implementation shows promise, with marginally better run times (all benchmarks were carried out with GPU error checking switched off and in debug mode to check that the correct functions were being called).
The use of NVBLAS is secondary; while functionality has been added to make it usable, it is not activated since it doesn't produce an appreciable increase in performance.
Compilation and Usage
Compile using the following Meson flag:
And activate the functions using the following config file option:
NOTE ON IMPLEMENTATION
I've decided to go with a single version of the code where the CPU and GPU implementations co-exist in the same linear solver and can be disabled or switched using a combination of Meson and config file options. This is why I have defined three classes: one overarching class named CExecutionPath with two child classes, CCpuExecution and CGpuExecution. These child classes contain the correct member function for each path - CPU or GPU execution.
All of this could also be achieved with an if statement that switches between the two, but that implementation would evaluate the branch on every call. In our case, once a Matrix Vector Product object is created, it immediately knows whether to use the CPU or GPU mode of execution.
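A rough sketch of this structure (simplified signatures, not the exact PR code); the path is selected once when the product object is constructed, so there is no per-call branch afterwards:

```cpp
#include <memory>

class CExecutionPath {
 public:
  virtual ~CExecutionPath() = default;
  // Each child provides the member function for its execution path.
  virtual void MatrixVectorProduct(/* matrix, vec, prod, geometry, config */) const = 0;
};

class CCpuExecution final : public CExecutionPath {
 public:
  void MatrixVectorProduct(/* ... */) const override { /* existing CPU routine */ }
};

class CGpuExecution final : public CExecutionPath {
 public:
  void MatrixVectorProduct(/* ... */) const override { /* launch the CUDA kernel */ }
};

// Chosen once, at construction of the matrix-vector product object:
// execPath = useCuda ? std::unique_ptr<CExecutionPath>(new CGpuExecution())
//                    : std::unique_ptr<CExecutionPath>(new CCpuExecution());
```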
Recommendations to improve this implementation are most welcome.
PR Checklist
Warnings do appear (only at warning level 3), but they are all of the following type:
Does the documentation for compiling with CUDA need to be added by forking the SU2 site repo and adding the relevant changes there, or do I need to contact someone to change things on the site itself?
Doxygen documentation and the config template are all updated.
pre-commit run --all was used to format old commits.