[GSOC24] Addition of CUDA and GPU Acceleration to FGMRES Linear Solver in SU2 #2346
base: develop
Conversation
Created Final Report
Nice work
gpuErrChk(cudaMemcpy((void*)(d_matrix), (void*)&matrix[0], (sizeof(ScalarType)*mat_size), cudaMemcpyHostToDevice));
gpuErrChk(cudaMemcpy((void*)(d_vec), (void*)&vec[0], (sizeof(ScalarType)*vec_size), cudaMemcpyHostToDevice));
gpuErrChk(cudaMemcpy((void*)(d_prod), (void*)&prod[0], (sizeof(ScalarType)*vec_size), cudaMemcpyHostToDevice));
You don't need to copy the product; you just need to memset it to 0.
Good catch, will add this. Thank you
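For reference, a minimal sketch of the suggested change, reusing the d_prod and vec_size names from the snippet above; the third copy can become a device-side memset, since zeroing the bytes of an IEEE floating-point buffer yields 0.0.

```cpp
// Replace the host-to-device copy of the product vector with a device-side clear.
gpuErrChk(cudaMemset((void*)d_prod, 0, sizeof(ScalarType) * vec_size));
```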
// Each thread handles one (row, j, k) entry of an nVar x nEqn block:
// the block's x dimension spans floor(1024/(nVar*nEqn)) rows, y and z span the block entries,
// and the grid is sized so all nPointDomain rows are covered.
double xDim = (double) 1024.0/(nVar*nEqn);
dim3 blockDim(floor(xDim), nVar, nEqn);
double gridx = (double) nPointDomain/xDim;
dim3 gridDim(ceil(gridx), 1, 1);
Can you document the choice of work distribution between blocks and threads?
// Loop over the nonzero blocks of row i: matrix_index points to the start of the
// (nVar x nEqn) block, vec_index to the matching segment of the vector.
for(int index = d_row_ptr[i]; index<d_row_ptr[i+1]; index++)
{
    int matrix_index = index * nVar * nEqn;
    int vec_index = d_col_ind[index] * nEqn;
    ...
    res += matrix[matrix_index + (j * nEqn + k)] * vec[vec_index + k];
}
Is this based on some publication? Did you experiment with other divisions of work?
For example, I see you are going for coalesced access to the matrix blocks, but this requires multiple reads of the same vector entries.
I haven't experimented with different implementations; I went with this one because it seemed optimal. It does access the same vector elements repeatedly while going through an entire row.
I haven't looked into publications yet, but you're right, there may be a better way to do this and I'll look into them. If you have any recommendations for improvements, please let me know.
Also, the current bottleneck is the memory copy between the CPU and GPU; the kernel launch itself is about 20 times faster than the repeated copies. Any insights on that would also be greatly appreciated.
Our current approach to circumvent this is to port not just single subroutines like the matrix-vector multiplication, but the entire Krylov solver loop - where it searches over all the search directions in the subspace - to the GPU. This would cut down on the repeated memory transfers.
You can do something like a flag in CSysMatrix that is set to true when the matrix is uploaded to GPU and set to false when the matrix changes (for example when we clear the matrix to write new blocks).
You can use pinned host memory to make the transfers faster.
You can try uploading the matrix in chunks and overlap the uploads with the CPU work of filling another chunk.
Ultimately, the issue of transferring the matrix only goes away by porting the entire code 😅
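As a rough illustration of the first suggestion (the class and member names here are hypothetical, not the actual CSysMatrix API), the idea is to track whether the device copy is still valid and skip redundant uploads:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Illustrative sketch of the "upload flag" idea; names are hypothetical.
class MatrixUploadCache {
 public:
  // Call whenever the host-side coefficients change
  // (e.g. when the matrix is cleared to write new blocks).
  void MarkHostModified() { deviceUpToDate = false; }

  // Copy the matrix to the GPU only when the device copy is stale.
  void UploadIfNeeded(const double* hostData, double* deviceData, size_t bytes) {
    if (deviceUpToDate) return;  // skip the redundant transfer
    cudaMemcpy(deviceData, hostData, bytes, cudaMemcpyHostToDevice);
    deviceUpToDate = true;
  }

 private:
  bool deviceUpToDate = false;
};
```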
Regarding recommendations I have to be a bit cryptic because of my current job, but the general goals are coalesced access, and avoid reading or writing the same global memory location more than once.
I read this paper before my current job: "Optimization of Block Sparse Matrix-Vector Multiplication on Shared-Memory Parallel Architectures".
But like you said, there is much more to gain by porting more linear algebra operations than to micro-optimize the multiplications.
Hmm, I'm a bit confused here. The whole purpose of my project and of the previous pull request has been to gradually move the FGMRES solver to the GPU.
Do you mean that we should just port the helper functions, like the Matrix Vector Product and the other linear algebra functions, and let the solver decide based on user-given parameters?
Also I'll try to catch a dev meeting so that we can maybe discuss how to move forward from here.
Correct, the same way the FGMRES doesn't need to know how to multiply matrices and vectors, it doesn't need to know anything about GPUs.
Got it, I will get on this and will also discuss some things I have in mind during the dev meeting.
I can't attend the dev meeting today. I suggest you open an issue; that way we can talk about the plan, and the GSOC candidate that may work on the project can also benefit from us writing everything down.
...This should increase performance (although I am still testing it).
Nsight shows the new kernel is twice as fast, although this doesn't mean much for overall performance since the bottleneck is still the memory transfer.
I could not follow the work here; I just watched the conference presentation. Since much work has been done already, have you considered using an existing linear algebra library such as https://github.com/ginkgo-project/ginkgo? AFAIK the only thing you need to do is copy the matrix into the Ginkgo format on the GPU. Ginkgo then provides an efficient, scalable solver that works not only on NVIDIA but also on AMD and Intel GPUs. GMRES and ILU preconditioners are available there, so it is pretty much ready to go for all problems.
@kursatyurt Hello, thank you so much for the lead. Our initial scope mostly involved writing our own kernels, and I did explore some libraries at the start - I was planning on using CUSP as well, but my main concern was that it is no longer updated for newer versions of the toolkit. cuSolver and cuBLAS do exist, but I chose to go ahead with a "simple" kernel implementation to have more control. I also felt that if I could keep the block size of the grid in optimal territory, it could be just as fast as those options (please do correct me if my reading of the literature or the situation was incorrect). I was not aware of Ginkgo; I will surely give it a go and try to produce some comparative results. I am currently very busy this month and will get to working on the code with some delay. Again, thank you for the lead!
Also, could you elaborate on what you meant by "I could not follow the work here"? If there is a specific doubt about the work, I would love to clarify it over Slack whenever I get the time.
I am not very familiar with the linear solver implementation in SU2 - it's not about the work itself.
Writing your own kernels is a good way to learn the basics, but for large-scale projects I prefer using existing libraries where possible. Ginkgo is fairly lightweight too, not a huge dependency like Trilinos or PETSc.
I can test on various GPUs (P100/V100/A100 and a 4070 Mobile), on single-node multi-GPU setups, etc.
Is your idea to return to this project as part of GSOC or just personal interest?
#if defined(HAVE_CUDA)
gpuErrChk(cudaMalloc((void**)(&d_row_ptr), (sizeof(row_ptr) * (nPointDomain + 1.0))));
gpuErrChk(cudaMalloc((void**)(&d_col_ind), (sizeof(col_ind) * nnz)));
gpuErrChk(cudaMallocHost((void**)(&matrix), (sizeof(ScalarType) * nnz * nVar * nEqn)));
What is the idea with this change?
This prevents CUDA memory functions like cudaMalloc from being compiled when the user doesn't have CUDA. The conditional block is only built when the user specifies that they want to compile with CUDA and has the toolkit along with nvcc installed. If I leave it as is, compiling with just gcc throws an error.
If there is a different/better way of doing this then please do let me know :D
I'm talking about cudaMallocHost instead of cudaMalloc
Oh, that was done to store the matrix in pinned memory, since transfers from it are faster.
But you removed d_matrix, so now the GPU is working directly from host memory which is probably slow.
Also if you read the docs for cudaMallocHost it's not advised to use large chunks of pinned memory, and the matrix is the largest block of memory in SU2...
> But you removed d_matrix, so now the GPU is working directly from host memory which is probably slow.

My reasoning was that when pinned memory is allocated on 64-bit OSes, the memory is also mapped into the GPU's address space, since the conditions for unified virtual addressing are met. This method proves to be faster than a regular memcpy when each address is only read/written once. In our case, the MatVec kernel only reads the matrix values once per cudaMemcpy, so I switched it out this way. If my understanding of this is wrong then do let me know.
But now that we will be working on porting the other linear algebra functions, it makes more sense to use the cudaMemcpy approach of transferring the pinned memory to the GPU, as the same variables will be accessed multiple times over the course of a single memory transfer.

> Also if you read the docs for cudaMallocHost it's not advised to use large chunks of pinned memory, and the matrix is the largest block of memory in SU2

I never ran into a problem with this, as the cases I ran didn't take a performance hit. But I can understand that with an increase in problem size the pinned memory could reduce the total available system memory and bottleneck the performance of the solver. Would you recommend using pinned memory for the input and output vectors then?
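For what it's worth, a minimal sketch of the explicit-copy path discussed here, under the assumption that the device buffer is reused across many kernel launches (all names are illustrative):

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Stage the matrix in a pinned host buffer, but keep a dedicated device copy so
// kernels read GPU memory instead of pulling host memory over PCIe on every access.
void UploadMatrix(double** h_staging, double** d_matrix, size_t bytes) {
  cudaMallocHost((void**)h_staging, bytes);  // page-locked: faster host-to-device transfers
  cudaMalloc((void**)d_matrix, bytes);       // persistent device copy
  // ... fill *h_staging on the CPU ...
  cudaMemcpy(*d_matrix, *h_staging, bytes, cudaMemcpyHostToDevice);
}
```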
@pcarruscag No, I won't be applying to GSoC any time in the future. It's just that I started this work and I would like to see it through. If anyone is selected for GPU acceleration this time then I would be more than happy to work and collaborate with them to make sure the project receives as much help as it can.
…ansfer and standardized file structure
29 Mar 2025 Commits
The commits currently made aim to provide more control over memory access and transfer so that we can make further changes to the surrounding linear algebra functions which make up the FGMRES solver.

Streamlined Vector Storage on the GPU
Each vector definition now also creates its corresponding analog in the GPU memory space. This is done by allocating memory for it with cudaMalloc when the Initialize function of the CSysVector class is called. This allows the vector data to persist in GPU memory between calls of the Matrix Vector Product and other linear algebra functions; the previous implementation only allowed this data to persist during a single call of the Matrix Vector Product, and it had to be refreshed for each call. This mirrors how the matrix was stored previously. Saving the matrix in pinned memory has also been removed due to its huge size, as pointed out by earlier feedback. The device pointer can be accessed at any point using a new dedicated public member function (GetDevicePointer) of the CSysVector class.

Added Memory Transfer Control
As previously discussed, we needed more control over the memory transfers between calls so that a larger share of the computation can be carried out on the GPU without multiple memory transfers. These transfers are now carried out by member functions with a built-in flag that decides whether the copy needs to be performed or not. The flag is set to true by default and does not need to be specified every time. Further changes are necessary to actually use this flag to decrease the frequency of memory transfers - namely a variable that lets the inner loop of FGMRES tell the MatrixVectorProduct function when to switch the flag on or off. This will be added after I port the preconditioner.

Minor Change - Standardized the File Structure Slightly
Redundant .cuh header files are now gone. I have added a GPUComms.cuh file so that any functions that need to be accessed by all CUDA files - like error checking - can be added there for future reference. I've also added GPUVector and GPUMatrix files, each containing the CUDA wrapping member functions for the CSysVector and CSysMatrix classes respectively.

Please let me know if you notice any bugs or if my implementations can be improved in any way.
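A rough illustration of the interface described above (a sketch only: GetDevicePointer, the HAVE_CUDA guard, and the default-true copy flag follow the description; the remaining signatures and the transfer-function name are approximate):

```cpp
#include <cuda_runtime.h>

template <class ScalarType>
class CSysVector {
 public:
  void Initialize(unsigned long nElmIn) {
    nElm = nElmIn;
    // ... existing host-side allocation ...
#ifdef HAVE_CUDA
    // Persistent device analog of the vector, created once and reused
    // across calls of the linear algebra routines.
    cudaMalloc((void**)&d_vec_val, nElm * sizeof(ScalarType));
#endif
  }

  // Access the device copy from anywhere (e.g. inside other GPU routines).
  ScalarType* GetDevicePointer() const { return d_vec_val; }

  // Hypothetical transfer member: the copy flag (true by default) lets callers
  // skip redundant transfers once more of the solver loop runs on the GPU.
  void CopyToDevice(bool copy = true) const {
#ifdef HAVE_CUDA
    if (copy) cudaMemcpy(d_vec_val, vec_val, nElm * sizeof(ScalarType), cudaMemcpyHostToDevice);
#endif
  }

 private:
  ScalarType* vec_val = nullptr;    // host data
  ScalarType* d_vec_val = nullptr;  // device data
  unsigned long nElm = 0;
};
```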
I'll now move forward and start porting the LU_SGS Preconditioner.
Rename to CSysVectorGPU or something similar; this implements functions of CSysVector, so it makes more sense to name it consistently with the CPU version.
Can you fix the merge conflicts so the regression tests can run in this PR?
std::cerr << "\nError in launching Matrix-Vector Product Function\n"; | ||
std::cerr << "ENABLE_CUDA is set to YES\n"; | ||
std::cerr << "Please compile with CUDA options enabled in Meson to access GPU Functions" << std::endl; |
SU2_MPI::Error(...
This is fixed. However, in the gpuErrChk function I have left it as fprintf, since the string has some placeholders in it. I could use a temporary character array along with sprintf to build the message for SU2_MPI::Error, but I am not sure whether allocating and de-allocating the temporary character array repeatedly would slow down performance, since gpuErrChk is called every time a CUDA API function is called.
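One possible shape for this (a sketch, not the PR's actual code): the buffer can live on the stack and is only formatted in the error path, so the per-call cost of gpuErrChk stays a single comparison. The SU2_MPI::Error signature with CURRENT_FUNCTION is assumed from how it is used elsewhere in SU2.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

#define gpuErrChk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, const char* file, int line) {
  if (code != cudaSuccess) {
    // Only formatted when an error actually occurs; no heap allocation involved.
    char msg[512];
    snprintf(msg, sizeof(msg), "CUDA error: %s (%s:%d)", cudaGetErrorString(code), file, line);
    // SU2_MPI::Error and CURRENT_FUNCTION come from SU2's own headers.
    SU2_MPI::Error(msg, CURRENT_FUNCTION);
  }
}
```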
void FGMRESMainLoop(std::vector<ScalarType> W, std::vector<ScalarType> Z, su2vector<ScalarType>& g,
                    su2vector<ScalarType>& sn, CSysVector<ScalarType>& cs, su2vector<ScalarType>& y,
                    su2vector<ScalarType>& H, int m, CGeometry* geometry, const CConfig* config) const;
This seems unused
if (vec_val == nullptr) vec_val = MemoryAllocation::aligned_alloc<ScalarType, true>(64, nElm * sizeof(ScalarType));
...
#if defined(HAVE_CUDA)
gpuErrChk(cudaMalloc((void**)(&d_vec_val), (sizeof(ScalarType) * nElm)));
gpuErrChk(cudaMemset((void*)d_vec_val, 0.0, (sizeof(ScalarType) * nElm)));
Similarly to how we have MemoryAllocation::aligned_alloc and free, please create wrappers for the cuda-specific functions so that we avoid #ifdef HAVE_CUDA in many places.
Done, please let me know if my handling of null pointers was accurate
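A hypothetical sketch of such a wrapper (names are illustrative, not the merged SU2 code, and it reuses the PR's gpuErrChk macro), including one way to handle the null device pointer in CPU-only builds:

```cpp
#include <cstddef>
#ifdef HAVE_CUDA
#include <cuda_runtime.h>
#endif

namespace GPUMemory {

// Allocate and zero a device buffer; returns nullptr when CUDA is not available,
// so callers must tolerate a null device pointer in CPU-only builds.
template <class T>
T* Allocate(size_t count) {
#ifdef HAVE_CUDA
  T* ptr = nullptr;
  gpuErrChk(cudaMalloc((void**)&ptr, count * sizeof(T)));
  gpuErrChk(cudaMemset(ptr, 0, count * sizeof(T)));
  return ptr;
#else
  (void)count;
  return nullptr;
#endif
}

// Free a device buffer if it exists and reset the pointer.
template <class T>
void Free(T*& ptr) {
#ifdef HAVE_CUDA
  if (ptr != nullptr) gpuErrChk(cudaFree(ptr));
#endif
  ptr = nullptr;
}

}  // namespace GPUMemory
```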
int row = (blockIdx.x * blockDim.x + threadIdx.x)/32;
int threadNo = threadIdx.x%32;
Define 32 (threads per warp) as a constant somewhere.
dim3 blockDim(1024,1,1);
double gridx = (double) nPointDomain/32.0;
gridx = double(ceil(gridx));
We have functions to round up integer division in the memory allocation toolbox.
Define 1024 as the default block size somewhere too.
I used a slight variation of the function in the allocation toolbox, with similar syntax.
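Something along these lines, for illustration (the names are mine; SU2's allocation toolbox provides its own rounding helper):

```cpp
constexpr int WARP_SIZE = 32;             // threads per warp
constexpr int DEFAULT_BLOCK_SIZE = 1024;  // default threads per block

// Integer ceiling division: how many groups of size `per` are needed to cover n items.
constexpr unsigned long roundUpDiv(unsigned long n, unsigned long per) {
  return (n + per - 1) / per;
}

// One warp per matrix row (as in the kernel above), so 32 rows per 1024-thread block:
// dim3 blockDim(DEFAULT_BLOCK_SIZE, 1, 1);
// dim3 gridDim(roundUpDiv(nPointDomain, DEFAULT_BLOCK_SIZE / WARP_SIZE), 1, 1);
```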
It shows that merging is blocked (I guess a review is required?). Also, please let me know how you would like me to keep the fork synced with changes in the original repository - either through merge or rebase.
Proposed Changes
This is a modified version of the SU2 code that supports CUDA for the FGMRES solver and the use of NVBLAS. The main focus is offloading the Matrix Vector Product in the FGMRES solver to the GPU using CUDA kernels. This implementation shows promise, with marginally better run times (all benchmarks were carried out with GPU error checking switched off and in debug mode to check that the correct functions were being called).
The use of NVBLAS is secondary; while functionality has been added to make it usable, it is not activated since it doesn't produce an appreciable increase in performance.
Compilation and Usage
Compile using the following Meson flag:
And activate the functions using the following config file option:
NOTE ON IMPLEMENTATION
I've decided to go with a single version of the code where the CPU and GPU implementations co-exist in the same linear solver and can be disabled or switched using a combination of Meson and config file options. This is why I have defined three classes: one overarching class named CExecutionPath with two child classes, CCpuExecution and CGpuExecution. These child classes contain the correct member function for each path - CPU or GPU execution.
All of this could also be achieved with an if statement that switches between the two, but that implementation would evaluate the branch on every call. In our case, once a Matrix Vector Product object is created, it immediately knows whether to use the CPU or GPU mode of execution.
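A rough sketch of this structure (simplified signatures, not the exact PR code); the path is selected once when the product object is constructed, so there is no per-call branch afterwards:

```cpp
#include <memory>

class CExecutionPath {
 public:
  virtual ~CExecutionPath() = default;
  // Each child provides the member function for its execution path.
  virtual void MatrixVectorProduct(/* matrix, vec, prod, geometry, config */) const = 0;
};

class CCpuExecution final : public CExecutionPath {
 public:
  void MatrixVectorProduct(/* ... */) const override { /* existing CPU routine */ }
};

class CGpuExecution final : public CExecutionPath {
 public:
  void MatrixVectorProduct(/* ... */) const override { /* launch the CUDA kernel */ }
};

// Chosen once, at construction of the matrix-vector product object:
// execPath = useCuda ? std::unique_ptr<CExecutionPath>(new CGpuExecution())
//                    : std::unique_ptr<CExecutionPath>(new CCpuExecution());
```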
Recommendations to improve this implementation are most welcome.
PR Checklist
Warnings do appear (only at warning level 3), but they are all of the following type:
Does the documentation for compiling with CUDA need to be added by forking the SU2 site repo and adding the relevant changes there, or do I need to contact someone to change things on the site itself?
Doxygen documentation and the config template are all updated.
pre-commit run --all was used to format old commits.