feat: GPU locking via atomic operation using POSIX shared memory by dong-hao · Pull Request #54 · magnetotellurics/ModEM

dong-hao · 2026-05-07T07:52:44Z

f90/3D_MT/FWD_SP2/EMsolve3D.f90
f90/3D_MT/FWD_SP2/cudaFortMap.f90
f90/3D_MT/FWD_SP2/hipFortMap.f90
f90/3D_MT/FWD_SP2/kernel_c.cu
f90/3D_MT/FWD_SP2/kernel_c.hip

Replace the heuristic flag-based GPU device contention mechanism with a (slightly) more proper atomic lock using POSIX shared memory (shm_open/mmap). The previous method has a potential risk of getting into a race condition (multiple processes hook onto a same GPU device at the same time). While this is fine with modern GPUs (have the 'MPS" feature and allow for multiple processes on one physical GPU), this could cause a problem when multiple processes allocate large chunks of VRAM at the same time, leading to out-of-memory errors.

For now, multiple MPI ranks should be able to safely target and share a same GPU:

cf_hookDev() acquires GPU via CAS 0->1 on an atomic flag array
cf_releaseDev() releases the lock so other processes can attach
cf_cleanupLock() unmaps and unlinks shared memory on exit
Added corresponding Fortran C-bindings for both CUDA and HIP
Call cf_releaseDev() from EMsolve3D.f90 after solver completion

also:

Renamed hipBLAS bindings to remove deprecated _v2 suffix
Renamed local sleep() to isleep() to avoid POSIX name collision

Tested on both CUDA (12.4) and HIP (7.2.0) environments.

f90/3D_MT/FWD_SP2/EMsolve3D.f90 f90/3D_MT/FWD_SP2/cudaFortMap.f90 f90/3D_MT/FWD_SP2/hipFortMap.f90 f90/3D_MT/FWD_SP2/kernel_c.cu f90/3D_MT/FWD_SP2/kernel_c.hip Replace the heuristic flag-based GPU device contention mechanism with a proper atomic lock using POSIX shared memory (shm_open/mmap). Multiple MPI ranks can now safely target and share the same GPU: - cf_hookDev() acquires GPU via CAS 0->1 on an atomic flag array - cf_releaseDev() releases the lock so other processes can attach - cf_cleanupLock() unmaps and unlinks shared memory on exit - Added corresponding Fortran C-bindings for both CUDA and HIP - Call cf_releaseDev() from EMsolve3D.f90 after solver completion also: - Renamed hipBLAS bindings to remove deprecated _v2 suffix - Renamed local sleep() to isleep() to avoid POSIX name collision

Copilot

Pull request overview

This PR replaces the prior heuristic GPU device “flag” contention approach with a cross-process GPU lock implemented via POSIX shared memory and atomic operations, and wires lock release into the Fortran solver flow to reduce multi-rank GPU contention issues.

Changes:

Added a POSIX shared-memory based GPU lock (CAS 0→1) for CUDA and HIP device attachment.
Added Fortran C-bindings for releasing the device lock (and a cleanup API), and release calls after solver completion.
Updated HIPBLAS bindings to remove deprecated _v2 suffix for several routines; renamed local sleep() helper to isleep() to avoid POSIX name collision.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
f90/3D_MT/FWD_SP2/kernel_c.cu	Introduces CUDA shared-memory GPU lock and renames local sleep helper.
f90/3D_MT/FWD_SP2/kernel_c.hip	Introduces HIP shared-memory GPU lock and renames local sleep helper.
f90/3D_MT/FWD_SP2/EMsolve3D.f90	Releases GPU lock after solver completion (CUDA/HIP paths).
f90/3D_MT/FWD_SP2/cudaFortMap.f90	Adds Fortran interfaces for `cf_releaseDev` and `cf_cleanupLock`.
f90/3D_MT/FWD_SP2/hipFortMap.f90	Updates HIPBLAS symbol bindings and adds interfaces for `cf_releaseDev` / `cf_cleanupLock`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

        }
        // else let it spin
-        sleep(0.02);
+        sleep(0.05);


MiCurry · 2026-05-08T21:03:58Z

+    // Ensure all flags are zero on first creation. Only one process does this.
+    static std::atomic<int> init_guard(0);
+    int expected = 0;
+    if (init_guard.compare_exchange_strong(expected, 1)) {
+        for (int i = 0; i < LOCK_MAX_DEVICES; i++)
+            g_lock->occupied[i].store(0, std::memory_order_relaxed);
+    }


I think Copilot is right here and that init_guard is per-process and so every process is writing over the lock. I think we would need to also use a shared memory map to lock it properly.

indeed - I am trying to figure out how to implement a shared memory.

+static constexpr int LOCK_MAX_DEVICES = 64;
+
+struct alignas(64) GpuLock {
+    std::atomic<int> occupied[LOCK_MAX_DEVICES];
+};


dong-hao · 2026-05-09T05:40:13Z

+    // Atomic lock loop: CAS 0->1 to claim the GPU
+    while (true)
    {
-        // see if this device is available
-        err =  cudaMemGetInfo(&freeBytes, &totalBytes);
-        if (err != cudaSuccess)
-        {
-            printf("Failed to get usage for device %i: %s\n", dev_idx, 
-                cudaGetErrorString(err));
-            goto Error;
-        }
-        usedBytes = totalBytes - freeBytes;
-        memusage = 100.0*usedBytes/totalBytes;
-        // now get some device flags
-        err = cudaGetDeviceFlags(&flag);
-        if (err != cudaSuccess)
-        {
-            printf("Failed to get the device flags %i: %s\n", dev_idx, 
-               cudaGetErrorString(err));
-        }
-        // printf("Dev = %i, flag = %i \n", dev_idx, flag);
-        if (flag != cudaDeviceScheduleYield && memusage < 50)
-        // if (memusage < 50)
-        // if (flag != cudaDeviceScheduleYield)
-        {
-            // If the number of CUDA contexts is greater than the number of 
-            // logical processors in the system, use Spin scheduling. 
-            // Else use Yield scheduling.
-            err = cudaSetDeviceFlags(cudaDeviceScheduleYield);
+        int expected = 0;
+        if (g_lock->occupied[dev_idx].compare_exchange_strong(
+                expected, 1, std::memory_order_acq_rel))
+	{


OK - this is easy to fix (same for the hip part).

+    static std::atomic<int> init_guard(0);
+    int expected = 0;
+    if (init_guard.compare_exchange_strong(expected, 1)) {
+        for (int i = 0; i < LOCK_MAX_DEVICES; i++)
+            g_lock->occupied[i].store(0, std::memory_order_relaxed);
+    }


+static constexpr int LOCK_MAX_DEVICES = 64;
+
+struct alignas(64) GpuLock {
+    std::atomic<int> occupied[LOCK_MAX_DEVICES];
+};


    // now try to get affiliated device
-    while(true)
+    // Atomic lock loop: CAS 0->1 to claim the GPU
+    while (true)
    {
-        // try to get some info for the device
-        err =  hipMemGetInfo(&freeBytes, &totalBytes);
-        if (err != hipSuccess)
-        {
-            printf("Failed to get usage for device %i: %s\n", dev_idx, 
-                hipGetErrorString(err));
-            goto Error;
-        }
-        usedBytes = totalBytes - freeBytes;
-        memusage = 100.0*usedBytes/totalBytes;
-        // now get some device flags
-        err = hipGetDeviceFlags(&flag);
-        if (err != hipSuccess)
+        int expected = 0;
+        if (g_lock->occupied[dev_idx].compare_exchange_strong(
+                expected, 1, std::memory_order_acq_rel))


dong-hao · 2026-05-09T05:31:04Z

+    ! Release GPU lock so other processes can hook on
+#if defined(CUDA) || defined(HIP)
+    if (device_id >= 0) then
+        call cf_releaseDev(device_id)


Well, NO- resetting this (while finishing one solve task for a transmitter) will reset the shared lock for the other processes - will definitely cause some (undefined) problems.

f90/3D_MT/FWD_SP2/EMsolve3D.f90 f90/3D_MT/FWD_SP2/kernel_c.cu f90/3D_MT/FWD_SP2/kernel_c.hip - Race condition: process-local init guard Previously every MPI rank entered the loop may destroy locks held by other ranks and allowing simultaneous GPU attachment (for the first time). Now replaced per-process static std::atomic<int> init_guard(0) with a shared-memory initialized field inside GpuLock. The first process to CAS initialized 0-->1, subsequent processes should skip. - Undefined behavior: std::atomic on unmapped raw storage mmap() returns raw bytes; std::atomic<int> constructors were never run. Object lifetimes now properly begin via placement-new, coordinated by a single unsigned char constructed flag protected by __atomic_test_and_set (ACQ_REL). This flag has no lifetime requirements per the C++ standard, making the entire initialization sequence UB-free. Added #include <new>. - dev_idx not validated before lock access cf_hookDev() previously checked only dev_idx + 1 > nDevices. Negative indices (e.g. -1 → 0) and values exceeding LOCK_MAX_DEVICES were accepted, leading to OOB writes into the occupied[] array. Now validates: dev_idx < 0 || dev_idx >= nDevices || dev_idx >= LOCK_MAX_DEVICES before any lock or CUDA/HIP access. - cf_cleanupLock() never invoked in Added cf_cleanupLock() calls at both solver exit points (FWDsolve3D, FWDsolve3Dfg). The fine-grained path calls cleanup after the MPI barrier on rank 0 only (idempotent). Previously the POSIX shared memory segment /ModEM_gpu_lock persisted across program runs, causing possible problems to operate on stale :wqlock state.

MiCurry

Hope you don't mind me taking a peak and providing some comments.

Another suggestion would be to move the locking mechanism into a separate file (gpu_locking.h?) and then allowing the HIP and CUDA code can import that file and use the routines. That would eliminate the need to duplicate the code and make development easier for the locking routines.

Any thoughts on using pthread mutexes with PTHREAD_PROCESS_SHARED?

MiCurry · 2026-05-08T20:54:26Z

+    const char* name = "/ModEM_gpu_lock";
+
+    // Try to open existing shared memory segment first
+    int fd = shm_open(name, O_RDWR, 0666);


Should we use 0600 here instead of 0666? 0666 gives RW to users, group and others. Is group and other access necessary? Might be better to just do 0600 for RW for the user.

Thanks for the comments! 0600 is surely better (just wrote 666 for some quick testing). Putting the lock mechanism into a separate header file is also a nice idea (currently just writing the same lines twice for both HIP and CUDA...). Useful, especially considering that we may want to expand to other interfaces. Not sure if INTEL cards will die out again - but there are some Chinese cards emerging, as well.

On the other hand, pthread mutexes should indeed be useful for inter-process synchronization - but we may not need to add more complexity to the C part (for the current stage).

MiCurry · 2026-05-08T20:54:30Z

+
+    // Ensure all flags are zero on first creation. Only one process does this.
+    static std::atomic<int> init_guard(0);
+    int expected = 0;


What if we use C/C++ macros for 0 and 1 here?

#DEFINE DEVICE_IN_USE 1 #DEFINE DEVICE_FREE 0

Then we could do:

if (init_guard.compare_exchange_strong(DEVICE_FREE, DEVICE_IN_USE)) { for (int i = 0; i < LOCK_MAX_DEVICES; i++) g_lock->occupied[i].store(DEVICE_FREE, std::memory_order_relaxed); } }

thanks @MiCurry indeed - that would add readability.

MiCurry · 2026-05-08T21:03:58Z

+    // Ensure all flags are zero on first creation. Only one process does this.
+    static std::atomic<int> init_guard(0);
+    int expected = 0;
+    if (init_guard.compare_exchange_strong(expected, 1)) {
+        for (int i = 0; i < LOCK_MAX_DEVICES; i++)
+            g_lock->occupied[i].store(0, std::memory_order_relaxed);
+    }


I think Copilot is right here and that init_guard is per-process and so every process is writing over the lock. I think we would need to also use a shared memory map to lock it properly.

MiCurry · 2026-05-08T21:41:33Z

+
+    g_lock = static_cast<GpuLock*>(mmap(nullptr, sizeof(GpuLock),
+                                         PROT_READ | PROT_WRITE,
+                                         MAP_SHARED, fd, 0));


Re Copilot's comment here https://github.com/magnetotellurics/ModEM/pull/54/changes#r3209269177.

I think it's suggesting that we do something like:

void* shm_ptr = mmap(nullptr, sizeof(GpuLock), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); GpuLock* lock = new (shm_ptr) GpuLock(); lock->occupied.store(0, std::memory_order_relaxed);

Otherwise I think the atomic isn't fully created and might be lacking it's internal fields for insuring locking? I could be wrong though 🤔

You are right - I suppose per CPP standard the current code works (it seems) only because std::atomic has a trivial layout - But of course it's not guaranteed to... Now added flag (cnstr) for the code to coordinate an exactly-once construction of std::atomic. Yes, probably should not really rely on the undefined behaviors...

f90/3D_MT/FWD_SP2/EMsolve3D.f90 f90/3D_MT/FWD_SP2/kernel_c.cu f90/3D_MT/FWD_SP2/kernel_c.hip f90/3D_MT/FWD_SP2/gpu_lock.h Move the platform-independent shared-memory lock infrastructure (GpuLock struct, init_gpu_lock, cf_releaseDev, cf_cleanupLock) from kernel_c.cu and kernel_c.hip into the new `gpu_lock.h` header. Both CUDA and HIP kernels now include this header, eliminating some duplicated codes. Also fix the problem of the previous - New file: f90/3D_MT/FWD_SP2/gpu_lock.h (header-only lock impl) - kernel_c.cu: removed duplicated lock code; kept CUDA-specific cf_hookDev() - kernel_c.hip: removed duplicated lock code; kept HIP-specific cf_hookDev() - Removed redundant includes now provided by gpu_lock.h Also: - added a flag to coordinate an exactly-once construction of std::atomic - remove premature cf_cleanupLock() calls from EMsolve3D.f90. - Replaced magic numbers 0/1 with DEVICE_FREE / DEVICE_IN_USE macros - Restricted shm_open permissions from 0666 to 0600 (owner-only)

f90/3D_MT/FWD_SP2/gpu_lock.h - Replace __atomic_test_and_set with __atomic_exchange_n on `cnstr`, combining winner-selection and flag-setting into one operation. - cnstr now serves as both the raw-storage constructed flag and the cross-process init guard.

dong-hao added the enhancement New feature or request label May 7, 2026

dong-hao requested a review from Copilot May 8, 2026 14:12

Copilot started reviewing on behalf of dong-hao May 8, 2026 14:13 View session

Copilot AI reviewed May 8, 2026

View reviewed changes

MiCurry reviewed May 8, 2026

View reviewed changes

dong-hao added 2 commits May 9, 2026 17:30

dong-hao merged commit 44b47c3 into magnetotellurics:main May 19, 2026
1 check passed

Conversation

dong-hao commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MiCurry left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dong-hao May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dong-hao commented May 7, 2026 •

edited

Loading

dong-hao May 9, 2026 •

edited

Loading