|
35 | 35 |
|
36 | 36 |   [](https://h-dna.github.io/MPiSC/) |
37 | 37 |
|
38 | | -This project ports lock-free Multiple-Producer Single-Consumer (MPSC) queue algorithms from shared-memory to distributed systems using MPI-3 Remote Memory Access (RMA). |
39 | | - |
40 | 38 | ### Table of Contents |
41 | 39 |
|
42 | | -- [Objective](#objective) |
43 | | -- [Motivation](#motivation) |
44 | | -- [Approach](#approach) |
45 | | - - [Why MPI RMA?](#why-mpi-rma) |
46 | | - - [Why MPI-3 RMA?](#why-mpi-3-rma) |
47 | | - - [Hybrid MPI+MPI](#hybrid-mpimpi) |
48 | | - - [Hybrid MPI+MPI+C++11](#hybrid-mpimpic11) |
49 | | - - [Lock-Free MPI Porting](#lock-free-mpi-porting) |
50 | | -- [Literature Review](#literature-review) |
51 | | - - [Known Problems](#known-problems) |
52 | | - - [Trends](#trends) |
53 | | -- [Evaluation Strategy](#evaluation-strategy) |
54 | | - - [Correctness](#correctness) |
55 | | - - [Lock-Freedom](#lock-freedom) |
56 | | - - [Performance](#performance) |
57 | | - - [Scalability](#scalability) |
| 40 | +- [Abstract](#abstract) |
| 41 | +- [Motivation and Methodology](#motivation-and-methodology) |
| 42 | +- [Contributions](#contributions) |
| 43 | +- [Results](#results) |
58 | 44 | - [Related](#related) |
59 | 45 |
|
60 | | -### Objective |
61 | | - |
62 | | -- Survey shared-memory literature for lock-free, concurrent MPSC queue algorithms. |
63 | | -- Port candidate algorithms to distributed contexts using MPI-3 RMA. |
64 | | -- Optimize ports using MPI-3 SHM and the C++11 memory model. |
65 | | - |
66 | | -Target characteristics: |
67 | | - |
68 | | -| Dimension | Requirement | |
69 | | -| ------------------- | ----------------------- | |
70 | | -| Queue length | Fixed | |
71 | | -| Number of producers | Multiple | |
72 | | -| Number of consumers | Single | |
73 | | -| Operations | `enqueue`, `dequeue` | |
74 | | -| Progress guarantee | Lock-free | |
75 | | - |
76 | | -### Motivation |
77 | | - |
78 | | -Queues are fundamental to scheduling, event handling, and message buffering. Under high contention—such as multiple event sources writing simultaneously—a poorly designed queue becomes a scalability bottleneck. This holds for both shared-memory and distributed systems. |
79 | | - |
80 | | -Shared-memory research has produced efficient, scalable, lock-free queue algorithms. Distributed computing literature largely ignores these algorithms due to differing programming models. MPI-3 RMA bridges this gap by enabling one-sided communication that closely mirrors shared-memory semantics. This project investigates whether porting shared-memory algorithms via MPI-3 RMA yields competitive distributed queues. |
81 | | - |
82 | | -### Approach |
83 | | - |
84 | | -We port lock-free queue algorithms using MPI-3 RMA, then optimize with MPI SHM (hybrid MPI+MPI) and C++11 atomics for intra-node communication. |
85 | | - |
86 | | -#### Why MPI RMA? |
87 | | - |
88 | | -MPSC queues are *irregular* applications: |
89 | | - |
90 | | -- Memory access patterns are dynamic. |
91 | | -- Data locations are determined at runtime. |
92 | | - |
93 | | -Traditional two-sided communication (`MPI_Send`/`MPI_Recv`) requires the receiver to anticipate requests—impractical when access patterns are unknown. MPI RMA allows one-sided communication where the initiator specifies all parameters. |
94 | | - |
95 | | -#### Why MPI-3 RMA? |
96 | | - |
97 | | -MPI-3 introduces `MPI_Win_lock_all`, a non-collective operation for opening access epochs on process groups, enabling lock-free synchronization. |
98 | | - |
99 | | -#### Hybrid MPI+MPI |
| 46 | +### Abstract |
100 | 47 |
|
101 | | -Pure MPI ignores intra-node locality. MPI-3 SHM provides `MPI_Win_allocate_shared` for allocating shared memory windows among co-located processes. These windows use the unified memory model and can leverage both MPI and native synchronization. This exploits multi-core parallelism within nodes. |
| 48 | +Distributed applications built on the actor model or the fan-out/fan-in pattern require MPSC queues that are both performant and fault-tolerant. We address the absence of non-blocking distributed MPSC queues by adapting LTQueue, a wait-free shared-memory MPSC queue, to distributed environments using MPI-3 RMA. We introduce three novel **wait-free** distributed MPSC queues: **dLTQueue**, **Slotqueue**, and **dLTQueueV2**. Evaluation on SuperMUC-NG and CoolMUC-4 shows roughly 2x higher enqueue throughput than the existing AMQueue, at the cost of 3-10x lower dequeue throughput, while providing stronger fault tolerance. |
102 | 49 |
|
103 | | -#### Hybrid MPI+MPI+C++11 |
| 50 | +### Motivation and Methodology |
104 | 51 |
|
105 | | -C++11 atomics outperform MPI synchronization for intra-node communication. Using C++11 within shared memory windows optimizes the intra-node path. |
| 52 | +#### The Problem |
106 | 53 |
|
107 | | -#### Lock-Free MPI Porting |
| 54 | +MPSC queues are essential for **irregular applications** — programs with unpredictable, data-dependent memory access patterns: |
108 | 55 |
|
109 | | -MPI-3 RMA enables lock-free implementations: |
| 56 | +- **Actor model**: Each actor maintains a mailbox (MPSC queue) receiving messages from other actors |
| 57 | +- **Fan-out/fan-in**: Worker nodes enqueue results to an aggregation node for processing |
110 | 58 |
|
111 | | -- `MPI_Win_lock_all` / `MPI_Win_unlock_all` manage access epochs. |
112 | | -- MPI atomic operations (`MPI_Fetch_and_op`, `MPI_Compare_and_swap`) provide synchronization. |
| 59 | +These patterns demand queues that are both performant and fault-tolerant. A slow or crashed producer should not block the entire system. |
113 | 60 |
|
114 | | -### Literature Review |
| 61 | +#### Gap in the Literature |
115 | 62 |
|
116 | | -#### Known Problems |
| 63 | +The **shared-memory** literature offers several non-blocking MPSC queues: LTQueue, DQueue, WRLQueue, and Jiffy. However, our analysis finds critical flaws in most of them: |
117 | 64 |
|
118 | | -* **ABA problem** |
| 65 | +| Queue | Issue | |
| 66 | +|-------|-------| |
| 67 | +| DQueue | Incorrect ABA solution and unsafe memory reclamation | |
| 68 | +| WRLQueue | Actually **blocking** — dequeuer waits for all enqueuers | |
| 69 | +| Jiffy | Insufficient memory reclamation, not truly wait-free | |
| 70 | +| **LTQueue** | **Correct** — uses LL/SC for ABA, proper memory reclamation | |
119 | 71 |
|
120 | | -A pointer is reused after deallocation, causing a CAS to incorrectly succeed. |
| 73 | +The **distributed** literature offers only one MPSC queue: **AMQueue**. Despite claiming lock-freedom, it is actually **blocking**: the dequeuer must wait for all enqueuers to finish, so a single slow enqueuer halts the entire system. ([Confirmed by the original author](assets/amqueue-blocking-evidence.png)) |
121 | 73 |
|
122 | | -Solutions: Monotonic counters, hazard pointers. |
| 74 | +#### Our Approach |
123 | 75 |
|
124 | | -* **Safe memory reclamation** |
| 76 | +We adapt **LTQueue**, the only one of the four surveyed shared-memory MPSC queues that we found to be correct, to distributed environments using MPI-3 RMA one-sided communication. |
125 | 77 |
|
126 | | -Premature deallocation while other threads hold references. |
| 78 | +**Key challenge**: LTQueue relies on LL/SC (Load-Link/Store-Conditional) to solve the ABA problem, but LL/SC is unavailable in MPI. |
127 | 79 |
|
128 | | -Solutions: Hazard pointers, epoch-based reclamation. |
| 80 | +**Our solution**: Replace LL/SC with CAS + unique timestamps. Each value is tagged with a monotonically increasing version number, preventing ABA without LL/SC. |
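
The idea of replacing LL/SC with CAS plus a version tag can be sketched in shared-memory C++11. This is an illustration only, not the project's implementation: the 32/32-bit packing, the `Tagged` struct, and the helper names `pack`, `unpack`, and `versioned_cas` are all assumptions made for the example. In a distributed setting the same compare-and-swap would presumably be issued through an MPI atomic such as `MPI_Compare_and_swap` on a window. Note that a 32-bit tag can wrap around, so a real design has to size the tag accordingly.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical layout: high 32 bits hold the value, low 32 bits hold a
// monotonically increasing version tag.
struct Tagged {
    std::uint32_t value;
    std::uint32_t version;
};

inline std::uint64_t pack(Tagged t) {
    return (static_cast<std::uint64_t>(t.value) << 32) | t.version;
}

inline Tagged unpack(std::uint64_t w) {
    return Tagged{static_cast<std::uint32_t>(w >> 32),
                  static_cast<std::uint32_t>(w & 0xFFFFFFFFu)};
}

// Emulates a store-conditional: succeeds only if the word still holds the
// exact (value, version) pair that was read earlier, and bumps the version
// on success so that any older snapshot becomes invalid.
inline bool versioned_cas(std::atomic<std::uint64_t>& word, Tagged seen,
                          std::uint32_t new_value) {
    std::uint64_t expected = pack(seen);
    std::uint64_t desired = pack(Tagged{new_value, seen.version + 1});
    return word.compare_exchange_strong(expected, desired);
}
```

If a word changes from 7 to 9 and back to 7 between a read and the subsequent `versioned_cas`, a plain CAS on the value alone would succeed, whereas here the version has advanced by two and the stale update is rejected.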
129 | 81 |
|
130 | | -* **Empty queue contention** |
| 82 | +**Key techniques**: |
| 83 | +- **SPSC-per-enqueuer**: Each producer maintains a local queue, eliminating producer contention |
| 84 | +- **Unique timestamps**: Solves ABA via monotonic version numbers |
| 85 | +- **Double-refresh**: Bounds retries to two per node, ensuring wait-freedom |
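
The double-refresh technique listed above can be sketched as follows. This is a simplified shared-memory model under stated assumptions, not the project's code: each tree node is assumed to be a single 64-bit word holding a minimum timestamp and a version tag, and the names `Word`, `make_word`, `refresh`, and `propagate` are hypothetical. The point it shows is that propagation makes at most two CAS attempts and never loops. The usual argument is that if both attempts fail, some other process completed a refresh that began after the child was updated, so the parent already reflects the new value.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>

// Hypothetical node word: high 32 bits = minimum timestamp of the subtree,
// low 32 bits = version tag.
using Word = std::uint64_t;

inline Word make_word(std::uint32_t ts, std::uint32_t version) {
    return (static_cast<Word>(ts) << 32) | version;
}
inline std::uint32_t ts_of(Word w) { return static_cast<std::uint32_t>(w >> 32); }
inline std::uint32_t version_of(Word w) { return static_cast<std::uint32_t>(w); }

// One refresh attempt: snapshot the parent, recompute the minimum from the
// children, then try to install it with a bumped version.
inline bool refresh(std::atomic<Word>& parent,
                    const std::atomic<Word>& left,
                    const std::atomic<Word>& right) {
    Word seen = parent.load();
    std::uint32_t fresh = std::min(ts_of(left.load()), ts_of(right.load()));
    Word desired = make_word(fresh, version_of(seen) + 1);
    return parent.compare_exchange_strong(seen, desired);
}

// Double-refresh: a bounded number of steps (at most two attempts), which
// is what keeps the operation wait-free rather than merely lock-free.
inline void propagate(std::atomic<Word>& parent,
                      const std::atomic<Word>& left,
                      const std::atomic<Word>& right) {
    if (!refresh(parent, left, right)) {
        refresh(parent, left, right);
    }
}
```

A retry loop that spins until the CAS succeeds would be simpler to reason about, but it would only be lock-free, since an unlucky process could fail indefinitely.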
131 | 86 |
|
132 | | -Concurrent `enqueue` and `dequeue` on an empty queue can conflict. |
| 87 | +### Contributions |
133 | 88 |
|
134 | | -Solutions: Sentinel node to separate head and tail pointers. |
| 89 | +#### Findings |
135 | 90 |
|
136 | | -* **Intermediate state from slow processes** |
| 91 | +- **3 of 4** shared-memory MPSC queues (DQueue, WRLQueue, Jiffy) have correctness or progress issues |
| 92 | +- **AMQueue**, the only distributed MPSC queue, is blocking despite claims of lock-freedom |
| 93 | +- **LTQueue** is the only correct candidate for distributed adaptation |
137 | 94 |
|
138 | | -A delayed process may leave the queue in an inconsistent state mid-operation. |
| 95 | +#### Novel Algorithms |
139 | 96 |
|
140 | | -Solutions: Helping—other processes complete the pending operation. |
| 97 | +| Algorithm | Progress | Enqueue | Dequeue | |
| 98 | +|-----------|----------|---------|---------| |
| 99 | +| **dLTQueue** | Wait-free | O(log n) remote | O(log n) remote | |
| 100 | +| **Slotqueue** | Wait-free | O(1) remote | O(1) remote, O(n) local | |
| 101 | +| **dLTQueueV2** | Wait-free | O(1) remote | O(1) remote, O(log n) local | |
141 | 102 |
|
142 | | -* **Intermediate state from failed processes** |
| 103 | +All algorithms are **linearizable** with no dynamic memory allocation. |
143 | 104 |
|
144 | | -A crashed process may leave the queue permanently inconsistent. |
| 105 | +### Results |
145 | 106 |
|
146 | | -Solutions: Helping mechanisms that can complete any pending operation. |
| 107 | +Benchmarked on [SuperMUC-NG](https://doku.lrz.de/supermuc-ng-10745965.html) (6000+ nodes) and [CoolMUC-4](https://doku.lrz.de/coolmuc-4-10746415.html) (100+ nodes): |
147 | 108 |
|
148 | | -* **Help mechanism rationale** |
| 109 | +| Metric | Our Queues vs AMQueue | |
| 110 | +|--------|----------------------| |
| 111 | +| Enqueue throughput | **~2x higher** | |
| 112 | +| Dequeue throughput | 3-10x lower | |
| 113 | +| Fault tolerance | **Wait-free** (vs blocking) | |
149 | 114 |
|
150 | | -Multi-step operations can leave the queue in intermediate states. Rather than blocking until consistency is restored, processes detect and complete pending operations. Implementation: |
151 | | - |
152 | | -1. Detect intermediate state |
153 | | -2. Attempt completion via CAS |
154 | | - |
155 | | -A failed CAS indicates another process already helped; retry is unnecessary. |
156 | | - |
157 | | -#### Trends |
158 | | - |
159 | | -- Fast-path optimization |
160 | | - - Lock-free fast path with wait-free fallback |
161 | | - - Replace CAS with FAA or load/store where possible |
162 | | -- Contention reduction |
163 | | - - Per-producer local buffers |
164 | | - - Elimination and backoff (for MPMC) |
165 | | -- Cache-aware design |
166 | | - |
167 | | -### Evaluation Strategy |
168 | | - |
169 | | -We focus on the following criteria, in the order of decreasing importance: |
170 | | -* Correctness |
171 | | -* Lock-freedom |
172 | | -* Performance & Scalability |
173 | | - |
174 | | -#### Correctness |
175 | | - |
176 | | -- Linearizability |
177 | | -- ABA-freedom |
178 | | -- Safe memory reclamation |
179 | | - |
180 | | -#### Lock-Freedom |
181 | | - |
182 | | -No process may block system-wide progress. Note: lock-freedom depends on underlying primitives being lock-free on the target platform. |
183 | | - |
184 | | -#### Performance |
185 | | - |
186 | | -Minimize latency and maximize throughput for target workloads. |
187 | | - |
188 | | -#### Scalability |
189 | | - |
190 | | -Throughput should scale with process count. |
| 115 | +**Trade-off**: Stronger fault tolerance at the cost of dequeue performance. |
191 | 116 |
|
192 | 117 | ### Related |
193 | 118 |
|
194 | | -- [dLTQueue: A Non-Blocking Distributed-Memory Multi-Producer Single-Consumer Queue](https://www.researchgate.net/publication/395381301_dLTQueue_A_Non-Blocking_Distributed-Memory_Multi-Producer_Single-Consumer_Queue) |
195 | | -- [Slotqueue: A Wait-Free Distributed Multi-Producer Single-Consumer Queue with Constant Remote Operations](https://www.researchgate.net/publication/395448251_Slotqueue_A_Wait-Free_Distributed_Multi-Producer_Single-Consumer_Queue_with_Constant_Remote_Operations) |
| 119 | +1. **dLTQueue** - FDSE 2025 ([ResearchGate](https://www.researchgate.net/publication/395381301_dLTQueue_A_Non-Blocking_Distributed-Memory_Multi-Producer_Single-Consumer_Queue)) |
| 120 | +2. **Slotqueue** - NPC 2025 ([ResearchGate](https://www.researchgate.net/publication/395448251_Slotqueue_A_Wait-Free_Distributed_Multi-Producer_Single-Consumer_Queue_with_Constant_Remote_Operations)) |
| 121 | + |
| 122 | +[Full thesis](https://h-dna.github.io/MPiSC/) |
196 | 123 |
|
197 | 124 |
|
198 | 125 | --- |
199 | 126 |
|
200 | 127 | <div align="center"> |
201 | 128 | <p> |
202 | | - <small>Last build: Sun Jan 11 03:21:14 UTC 2026</small><br> |
| 129 | + <small>Last build: Sun Jan 11 17:10:09 UTC 2026</small><br> |
203 | 130 | <small>Generated by GitHub Actions • <a href="https://github.com/H-DNA/MPiSC/tree/main">View Source</a></small> |
204 | 131 | </p> |
205 | 132 | </div> |