feat: GPU locking via atomic operation using POSIX shared memory#54
Conversation
f90/3D_MT/FWD_SP2/EMsolve3D.f90 f90/3D_MT/FWD_SP2/cudaFortMap.f90 f90/3D_MT/FWD_SP2/hipFortMap.f90 f90/3D_MT/FWD_SP2/kernel_c.cu f90/3D_MT/FWD_SP2/kernel_c.hip Replace the heuristic flag-based GPU device contention mechanism with a proper atomic lock using POSIX shared memory (shm_open/mmap). Multiple MPI ranks can now safely target and share the same GPU: - cf_hookDev() acquires GPU via CAS 0->1 on an atomic flag array - cf_releaseDev() releases the lock so other processes can attach - cf_cleanupLock() unmaps and unlinks shared memory on exit - Added corresponding Fortran C-bindings for both CUDA and HIP - Call cf_releaseDev() from EMsolve3D.f90 after solver completion also: - Renamed hipBLAS bindings to remove deprecated _v2 suffix - Renamed local sleep() to isleep() to avoid POSIX name collision
There was a problem hiding this comment.
Pull request overview
This PR replaces the prior heuristic GPU device “flag” contention approach with a cross-process GPU lock implemented via POSIX shared memory and atomic operations, and wires lock release into the Fortran solver flow to reduce multi-rank GPU contention issues.
Changes:
- Added a POSIX shared-memory based GPU lock (CAS 0→1) for CUDA and HIP device attachment.
- Added Fortran C-bindings for releasing the device lock (and a cleanup API), and release calls after solver completion.
- Updated HIPBLAS bindings to remove deprecated
_v2suffix for several routines; renamed localsleep()helper toisleep()to avoid POSIX name collision.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| f90/3D_MT/FWD_SP2/kernel_c.cu | Introduces CUDA shared-memory GPU lock and renames local sleep helper. |
| f90/3D_MT/FWD_SP2/kernel_c.hip | Introduces HIP shared-memory GPU lock and renames local sleep helper. |
| f90/3D_MT/FWD_SP2/EMsolve3D.f90 | Releases GPU lock after solver completion (CUDA/HIP paths). |
| f90/3D_MT/FWD_SP2/cudaFortMap.f90 | Adds Fortran interfaces for cf_releaseDev and cf_cleanupLock. |
| f90/3D_MT/FWD_SP2/hipFortMap.f90 | Updates HIPBLAS symbol bindings and adds interfaces for cf_releaseDev / cf_cleanupLock. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
| // else let it spin | ||
| sleep(0.02); | ||
| sleep(0.05); |
| // Ensure all flags are zero on first creation. Only one process does this. | ||
| static std::atomic<int> init_guard(0); | ||
| int expected = 0; | ||
| if (init_guard.compare_exchange_strong(expected, 1)) { | ||
| for (int i = 0; i < LOCK_MAX_DEVICES; i++) | ||
| g_lock->occupied[i].store(0, std::memory_order_relaxed); | ||
| } |
There was a problem hiding this comment.
I think Copilot is right here and that init_guard is per-process and so every process is writing over the lock. I think we would need to also use a shared memory map to lock it properly.
There was a problem hiding this comment.
indeed - I am trying to figure out how to implement a shared memory.
| static constexpr int LOCK_MAX_DEVICES = 64; | ||
|
|
||
| struct alignas(64) GpuLock { | ||
| std::atomic<int> occupied[LOCK_MAX_DEVICES]; | ||
| }; |
| // Atomic lock loop: CAS 0->1 to claim the GPU | ||
| while (true) | ||
| { | ||
| // see if this device is available | ||
| err = cudaMemGetInfo(&freeBytes, &totalBytes); | ||
| if (err != cudaSuccess) | ||
| { | ||
| printf("Failed to get usage for device %i: %s\n", dev_idx, | ||
| cudaGetErrorString(err)); | ||
| goto Error; | ||
| } | ||
| usedBytes = totalBytes - freeBytes; | ||
| memusage = 100.0*usedBytes/totalBytes; | ||
| // now get some device flags | ||
| err = cudaGetDeviceFlags(&flag); | ||
| if (err != cudaSuccess) | ||
| { | ||
| printf("Failed to get the device flags %i: %s\n", dev_idx, | ||
| cudaGetErrorString(err)); | ||
| } | ||
| // printf("Dev = %i, flag = %i \n", dev_idx, flag); | ||
| if (flag != cudaDeviceScheduleYield && memusage < 50) | ||
| // if (memusage < 50) | ||
| // if (flag != cudaDeviceScheduleYield) | ||
| { | ||
| // If the number of CUDA contexts is greater than the number of | ||
| // logical processors in the system, use Spin scheduling. | ||
| // Else use Yield scheduling. | ||
| err = cudaSetDeviceFlags(cudaDeviceScheduleYield); | ||
| int expected = 0; | ||
| if (g_lock->occupied[dev_idx].compare_exchange_strong( | ||
| expected, 1, std::memory_order_acq_rel)) | ||
| { |
There was a problem hiding this comment.
OK - this is easy to fix (same for the hip part).
| static std::atomic<int> init_guard(0); | ||
| int expected = 0; | ||
| if (init_guard.compare_exchange_strong(expected, 1)) { | ||
| for (int i = 0; i < LOCK_MAX_DEVICES; i++) | ||
| g_lock->occupied[i].store(0, std::memory_order_relaxed); | ||
| } |
| static constexpr int LOCK_MAX_DEVICES = 64; | ||
|
|
||
| struct alignas(64) GpuLock { | ||
| std::atomic<int> occupied[LOCK_MAX_DEVICES]; | ||
| }; |
| // now try to get affiliated device | ||
| while(true) | ||
| // Atomic lock loop: CAS 0->1 to claim the GPU | ||
| while (true) | ||
| { | ||
| // try to get some info for the device | ||
| err = hipMemGetInfo(&freeBytes, &totalBytes); | ||
| if (err != hipSuccess) | ||
| { | ||
| printf("Failed to get usage for device %i: %s\n", dev_idx, | ||
| hipGetErrorString(err)); | ||
| goto Error; | ||
| } | ||
| usedBytes = totalBytes - freeBytes; | ||
| memusage = 100.0*usedBytes/totalBytes; | ||
| // now get some device flags | ||
| err = hipGetDeviceFlags(&flag); | ||
| if (err != hipSuccess) | ||
| int expected = 0; | ||
| if (g_lock->occupied[dev_idx].compare_exchange_strong( | ||
| expected, 1, std::memory_order_acq_rel)) |
| ! Release GPU lock so other processes can hook on | ||
| #if defined(CUDA) || defined(HIP) | ||
| if (device_id >= 0) then | ||
| call cf_releaseDev(device_id) |
There was a problem hiding this comment.
Well, NO- resetting this (while finishing one solve task for a transmitter) will reset the shared lock for the other processes - will definitely cause some (undefined) problems.
f90/3D_MT/FWD_SP2/EMsolve3D.f90 f90/3D_MT/FWD_SP2/kernel_c.cu f90/3D_MT/FWD_SP2/kernel_c.hip - Race condition: process-local init guard Previously every MPI rank entered the loop may destroy locks held by other ranks and allowing simultaneous GPU attachment (for the first time). Now replaced per-process static std::atomic<int> init_guard(0) with a shared-memory initialized field inside GpuLock. The first process to CAS initialized 0-->1, subsequent processes should skip. - Undefined behavior: std::atomic on unmapped raw storage mmap() returns raw bytes; std::atomic<int> constructors were never run. Object lifetimes now properly begin via placement-new, coordinated by a single unsigned char constructed flag protected by __atomic_test_and_set (ACQ_REL). This flag has no lifetime requirements per the C++ standard, making the entire initialization sequence UB-free. Added #include <new>. - dev_idx not validated before lock access cf_hookDev() previously checked only dev_idx + 1 > nDevices. Negative indices (e.g. -1 → 0) and values exceeding LOCK_MAX_DEVICES were accepted, leading to OOB writes into the occupied[] array. Now validates: dev_idx < 0 || dev_idx >= nDevices || dev_idx >= LOCK_MAX_DEVICES before any lock or CUDA/HIP access. - cf_cleanupLock() never invoked in Added cf_cleanupLock() calls at both solver exit points (FWDsolve3D, FWDsolve3Dfg). The fine-grained path calls cleanup after the MPI barrier on rank 0 only (idempotent). Previously the POSIX shared memory segment /ModEM_gpu_lock persisted across program runs, causing possible problems to operate on stale :wqlock state.
MiCurry
left a comment
There was a problem hiding this comment.
Hope you don't mind me taking a peak and providing some comments.
Another suggestion would be to move the locking mechanism into a separate file (gpu_locking.h?) and then allowing the HIP and CUDA code can import that file and use the routines. That would eliminate the need to duplicate the code and make development easier for the locking routines.
Any thoughts on using pthread mutexes with PTHREAD_PROCESS_SHARED?
| const char* name = "/ModEM_gpu_lock"; | ||
|
|
||
| // Try to open existing shared memory segment first | ||
| int fd = shm_open(name, O_RDWR, 0666); |
There was a problem hiding this comment.
Should we use 0600 here instead of 0666? 0666 gives RW to users, group and others. Is group and other access necessary? Might be better to just do 0600 for RW for the user.
There was a problem hiding this comment.
Thanks for the comments! 0600 is surely better (just wrote 666 for some quick testing). Putting the lock mechanism into a separate header file is also a nice idea (currently just writing the same lines twice for both HIP and CUDA...). Useful, especially considering that we may want to expand to other interfaces. Not sure if INTEL cards will die out again - but there are some Chinese cards emerging, as well.
On the other hand, pthread mutexes should indeed be useful for inter-process synchronization - but we may not need to add more complexity to the C part (for the current stage).
|
|
||
| // Ensure all flags are zero on first creation. Only one process does this. | ||
| static std::atomic<int> init_guard(0); | ||
| int expected = 0; |
There was a problem hiding this comment.
What if we use C/C++ macros for 0 and 1 here?
#DEFINE DEVICE_IN_USE 1
#DEFINE DEVICE_FREE 0Then we could do:
if (init_guard.compare_exchange_strong(DEVICE_FREE, DEVICE_IN_USE)) {
for (int i = 0; i < LOCK_MAX_DEVICES; i++)
g_lock->occupied[i].store(DEVICE_FREE, std::memory_order_relaxed);
}
}There was a problem hiding this comment.
thanks @MiCurry indeed - that would add readability.
| // Ensure all flags are zero on first creation. Only one process does this. | ||
| static std::atomic<int> init_guard(0); | ||
| int expected = 0; | ||
| if (init_guard.compare_exchange_strong(expected, 1)) { | ||
| for (int i = 0; i < LOCK_MAX_DEVICES; i++) | ||
| g_lock->occupied[i].store(0, std::memory_order_relaxed); | ||
| } |
There was a problem hiding this comment.
I think Copilot is right here and that init_guard is per-process and so every process is writing over the lock. I think we would need to also use a shared memory map to lock it properly.
|
|
||
| g_lock = static_cast<GpuLock*>(mmap(nullptr, sizeof(GpuLock), | ||
| PROT_READ | PROT_WRITE, | ||
| MAP_SHARED, fd, 0)); |
There was a problem hiding this comment.
Re Copilot's comment here https://github.com/magnetotellurics/ModEM/pull/54/changes#r3209269177.
I think it's suggesting that we do something like:
void* shm_ptr = mmap(nullptr, sizeof(GpuLock),
PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
GpuLock* lock = new (shm_ptr) GpuLock();
lock->occupied.store(0, std::memory_order_relaxed);Otherwise I think the atomic isn't fully created and might be lacking it's internal fields for insuring locking? I could be wrong though 🤔
There was a problem hiding this comment.
You are right - I suppose per CPP standard the current code works (it seems) only because std::atomic has a trivial layout - But of course it's not guaranteed to... Now added flag (cnstr) for the code to coordinate an exactly-once construction of std::atomic. Yes, probably should not really rely on the undefined behaviors...
f90/3D_MT/FWD_SP2/EMsolve3D.f90 f90/3D_MT/FWD_SP2/kernel_c.cu f90/3D_MT/FWD_SP2/kernel_c.hip f90/3D_MT/FWD_SP2/gpu_lock.h Move the platform-independent shared-memory lock infrastructure (GpuLock struct, init_gpu_lock, cf_releaseDev, cf_cleanupLock) from kernel_c.cu and kernel_c.hip into the new `gpu_lock.h` header. Both CUDA and HIP kernels now include this header, eliminating some duplicated codes. Also fix the problem of the previous - New file: f90/3D_MT/FWD_SP2/gpu_lock.h (header-only lock impl) - kernel_c.cu: removed duplicated lock code; kept CUDA-specific cf_hookDev() - kernel_c.hip: removed duplicated lock code; kept HIP-specific cf_hookDev() - Removed redundant includes now provided by gpu_lock.h Also: - added a flag to coordinate an exactly-once construction of std::atomic - remove premature cf_cleanupLock() calls from EMsolve3D.f90. - Replaced magic numbers 0/1 with DEVICE_FREE / DEVICE_IN_USE macros - Restricted shm_open permissions from 0666 to 0600 (owner-only)
f90/3D_MT/FWD_SP2/gpu_lock.h - Replace __atomic_test_and_set with __atomic_exchange_n on `cnstr`, combining winner-selection and flag-setting into one operation. - cnstr now serves as both the raw-storage constructed flag and the cross-process init guard.
f90/3D_MT/FWD_SP2/EMsolve3D.f90
f90/3D_MT/FWD_SP2/cudaFortMap.f90
f90/3D_MT/FWD_SP2/hipFortMap.f90
f90/3D_MT/FWD_SP2/kernel_c.cu
f90/3D_MT/FWD_SP2/kernel_c.hip
Replace the heuristic flag-based GPU device contention mechanism with a (slightly) more proper atomic lock using POSIX shared memory (shm_open/mmap). The previous method has a potential risk of getting into a race condition (multiple processes hook onto a same GPU device at the same time). While this is fine with modern GPUs (have the 'MPS" feature and allow for multiple processes on one physical GPU), this could cause a problem when multiple processes allocate large chunks of VRAM at the same time, leading to out-of-memory errors.
For now, multiple MPI ranks should be able to safely target and share a same GPU:
also:
Tested on both CUDA (12.4) and HIP (7.2.0) environments.