Add amrex::Gpu::freeAsync #4804
Conversation
/run-hpsf-gitlab-ci

GitLab CI has started at https://gitlab.spack.io/amrex/amrex/-/pipelines/1319922.

GitLab CI 1319922 finished with status: success. See details at https://gitlab.spack.io/amrex/amrex/-/pipelines/1319922.

Maybe this should be named asyncFree or freeAsync instead.

I like

For a case with very few particles, this helps to significantly improve the performance of hipace shiftSlippedParticles.

AMReX dev: 79 µs on the GPU.
Using this PR: 59 µs on the GPU.

The PR LGTM. I will merge it after the 25.12 release. @AlexanderSinn Something for the future: maybe we could use this to replace the CUDA stream-ordered memory allocator in the implementation of The_Async_Arena().

Eventually an allocAsync could be added that can use the memory in m_free_wait_list. This would have stricter usage requirements: mainly, the memory can only be accessed in stream order and not by the host. Additionally, we would need to get the capacity of an allocation from the arena and store it in m_free_wait_list. I think ultimately The_Async_Arena should use both freeAsync and allocAsync, so that loops that allocate and free a lot with no sync can reuse memory effectively. I think this might be very similar to what cudaMallocAsync does internally, just with a cross-platform implementation, without the overhead of calling the CUDA API, and using The_Arena as backing. Should I change The_Async_Arena to use freeAsync or add allocAsync first? Or do both at the same time?

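To illustrate the freeAsync/allocAsync idea from the comment above, here is a minimal single-stream sketch in plain C++. It is not AMReX code: the class and member names (StreamFreeList, allocAsync, freeAsync) are hypothetical, std::malloc stands in for The_Arena backing, and a size map stands in for querying the arena for an allocation's capacity. The key point is that a block freed earlier in stream order may be handed out again without a sync, because the new owner also only touches it in stream order, never from the host.

```cpp
#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of a stream-ordered free/alloc pair (not AMReX API).
class StreamFreeList {
public:
    void* allocAsync (std::size_t nbytes) {
        // Reuse a deferred-free block if one is large enough. On a single
        // stream, stream ordering guarantees the old work that used this
        // memory completes before any new work that reuses it.
        for (auto it = m_wait_list.begin(); it != m_wait_list.end(); ++it) {
            if (m_capacity.at(*it) >= nbytes) {
                void* p = *it;
                m_wait_list.erase(it);
                return p;
            }
        }
        void* p = std::malloc(nbytes); // stand-in for the backing arena
        m_capacity[p] = nbytes;        // stand-in for arena capacity lookup
        return p;
    }

    void freeAsync (void* mem) { m_wait_list.push_back(mem); }

    void streamSynchronize () {
        // After a sync the host may touch the memory again, so return
        // everything on the wait list to the backing allocator.
        for (void* p : m_wait_list) { m_capacity.erase(p); std::free(p); }
        m_wait_list.clear();
    }

    std::size_t pending () const { return m_wait_list.size(); }

    ~StreamFreeList () { streamSynchronize(); }

private:
    std::vector<void*> m_wait_list;
    std::unordered_map<void*, std::size_t> m_capacity;
};
```

In a loop that allocates and frees every iteration with no sync, allocAsync would keep recycling the same block from the wait list instead of going through the backing arena each time, which is the reuse behavior described above.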
Let's wait until this PR is merged. I am actually thinking of something simpler. Create a new Arena derived class for

Yes, this will be in a new PR. The_Async_Arena uses PArena, which is already separate. Should PArena be changed to use freeAsync, or kept as-is but without an interface? Do we need to keep track of the stream index at allocation? I would have just used whatever stream is active when the memory is freed. I can't think of a use case that would have different streams active between alloc and free, so I don't know which version is preferable.

I think we should leave PArena separate and create a new class. Yes, we probably don't even need a map storing the stream index during allocation; that would be even simpler. We could also consider, when CArena is about to run out of memory, trying to immediately free the memory freed by freeAsync.
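The two ideas in this comment — a new Arena-derived class that defers frees, plus an immediate drain of the wait list when the backing arena runs out of memory — can be sketched in self-contained C++. All names here (BudgetArena, DeferredFreeArena, drainWaitList) are illustrative, not AMReX API; a fixed-budget allocator stands in for CArena/The_Arena.

```cpp
#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Minimal arena interface standing in for amrex::Arena.
struct Arena {
    virtual ~Arena () = default;
    virtual void* alloc (std::size_t nbytes) = 0;
    virtual void free (void* mem) = 0;
};

// Backing arena with a fixed byte budget, standing in for CArena.
class BudgetArena : public Arena {
public:
    explicit BudgetArena (std::size_t budget) : m_left(budget) {}
    void* alloc (std::size_t nbytes) override {
        if (nbytes > m_left) return nullptr; // "out of memory"
        m_left -= nbytes;
        void* p = std::malloc(nbytes);
        m_size[p] = nbytes;
        return p;
    }
    void free (void* mem) override {
        m_left += m_size.at(mem);
        m_size.erase(mem);
        std::free(mem);
    }
private:
    std::size_t m_left;
    std::unordered_map<void*, std::size_t> m_size;
};

// Hypothetical Arena-derived class: free() only queues the pointer; the
// queue is drained on stream synchronization, or immediately when the
// backing arena would otherwise run out of memory.
class DeferredFreeArena : public Arena {
public:
    explicit DeferredFreeArena (Arena* backing) : m_backing(backing) {}
    void* alloc (std::size_t nbytes) override {
        void* p = m_backing->alloc(nbytes);
        if (p == nullptr) {   // backing arena out of memory:
            drainWaitList();  // release deferred frees right away and retry
            p = m_backing->alloc(nbytes);
        }
        return p;
    }
    void free (void* mem) override { m_wait_list.push_back(mem); }
    void streamSynchronize () { drainWaitList(); }
    std::size_t pending () const { return m_wait_list.size(); }
    ~DeferredFreeArena () { drainWaitList(); }
private:
    void drainWaitList () {
        for (void* p : m_wait_list) m_backing->free(p);
        m_wait_list.clear();
    }
    Arena* m_backing;
    std::vector<void*> m_wait_list;
};
```

Note that, as discussed above, no per-allocation stream index is tracked: the sketch implicitly uses whatever stream is current when free() is called.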


Summary
This PR adds the function amrex::Gpu::streamFree(Arena* arena, void* mem), which can be used to free memory the next time the current GPU stream is synchronized. This is based on #4432, but with the OMP-related complexity much reduced.
The interface is now opt-in and always available, instead of needing to be enabled using runtime parameters.
Additional background
Checklist
The proposed changes: