Commit c01ce61

Deploy Typst PDFs to GitHub Pages [skip ci]
1 parent bba54fd commit c01ce61

2 files changed

Lines changed: 54 additions & 127 deletions

File tree

README.md

@@ -35,171 +35,98 @@
 ![MPiSC](https://img.shields.io/badge/MPiSC-blue?style=flat-square&logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyNCAyNCI+PHBhdGggZmlsbD0id2hpdGUiIGQ9Ik0xMiAyTDIgN2wxMCA1IDEwLTV6TTIgMTdsMTAgNSAxMC01TTIgMTJsMTAgNSAxMC01Ii8+PC9zdmc+) ![Status](https://img.shields.io/badge/status-complete-brightgreen) [![Thesis](https://img.shields.io/badge/thesis-h--dna.github.io-informational)](https://h-dna.github.io/MPiSC/)

-This project ports lock-free Multiple-Producer Single-Consumer (MPSC) queue algorithms from shared memory to distributed systems using MPI-3 Remote Memory Access (RMA).
-
 ### Table of Contents

-- [Objective](#objective)
-- [Motivation](#motivation)
-- [Approach](#approach)
-  - [Why MPI RMA?](#why-mpi-rma)
-  - [Why MPI-3 RMA?](#why-mpi-3-rma)
-  - [Hybrid MPI+MPI](#hybrid-mpimpi)
-  - [Hybrid MPI+MPI+C++11](#hybrid-mpimpic11)
-  - [Lock-Free MPI Porting](#lock-free-mpi-porting)
-- [Literature Review](#literature-review)
-  - [Known Problems](#known-problems)
-  - [Trends](#trends)
-- [Evaluation Strategy](#evaluation-strategy)
-  - [Correctness](#correctness)
-  - [Lock-Freedom](#lock-freedom)
-  - [Performance](#performance)
-  - [Scalability](#scalability)
+- [Abstract](#abstract)
+- [Motivation and Methodology](#motivation-and-methodology)
+- [Contributions](#contributions)
+- [Results](#results)
 - [Related](#related)

-### Objective
-
-- Survey the shared-memory literature for lock-free, concurrent MPSC queue algorithms.
-- Port candidate algorithms to distributed contexts using MPI-3 RMA.
-- Optimize the ports using MPI-3 SHM and the C++11 memory model.
-
-Target characteristics:
-
-| Dimension           | Requirement          |
-| ------------------- | -------------------- |
-| Queue length        | Fixed                |
-| Number of producers | Multiple             |
-| Number of consumers | Single               |
-| Operations          | `enqueue`, `dequeue` |
-| Progress guarantee  | Lock-free            |
-
-### Motivation
-
-Queues are fundamental to scheduling, event handling, and message buffering. Under high contention—such as multiple event sources writing simultaneously—a poorly designed queue becomes a scalability bottleneck. This holds for both shared-memory and distributed systems.
-
-Shared-memory research has produced efficient, scalable, lock-free queue algorithms. The distributed-computing literature largely ignores these algorithms due to differing programming models. MPI-3 RMA bridges this gap by enabling one-sided communication that closely mirrors shared-memory semantics. This project investigates whether porting shared-memory algorithms via MPI-3 RMA yields competitive distributed queues.
-
-### Approach
-
-We port lock-free queue algorithms using MPI-3 RMA, then optimize with MPI SHM (hybrid MPI+MPI) and C++11 atomics for intra-node communication.
-
-#### Why MPI RMA?
-
-MPSC queues are *irregular* applications:
-
-- Memory access patterns are dynamic.
-- Data locations are determined at runtime.
-
-Traditional two-sided communication (`MPI_Send`/`MPI_Recv`) requires the receiver to anticipate requests—impractical when access patterns are unknown. MPI RMA allows one-sided communication in which the initiator specifies all parameters.
-
-#### Why MPI-3 RMA?
-
-MPI-3 introduces `MPI_Win_lock_all`, a non-collective operation for opening access epochs on process groups, enabling lock-free synchronization.
-
-#### Hybrid MPI+MPI
+### Abstract

-Pure MPI ignores intra-node locality. MPI-3 SHM provides `MPI_Win_allocate_shared` for allocating shared-memory windows among co-located processes. These windows use the unified memory model and can leverage both MPI and native synchronization, exploiting multi-core parallelism within nodes.
+Distributed applications built on the actor model or the fan-out/fan-in pattern require MPSC queues that are both performant and fault-tolerant. We address the absence of non-blocking distributed MPSC queues by adapting LTQueue — a wait-free shared-memory MPSC queue — to distributed environments using MPI-3 RMA. We introduce three novel **wait-free** distributed MPSC queues: **dLTQueue**, **Slotqueue**, and **dLTQueueV2**. Evaluation on SuperMUC-NG and CoolMUC-4 shows ~2x better enqueue throughput than the existing AMQueue while providing stronger fault tolerance.

-#### Hybrid MPI+MPI+C++11
+### Motivation and Methodology

-C++11 atomics outperform MPI synchronization for intra-node communication. Using C++11 atomics within shared-memory windows optimizes the intra-node path.
+#### The Problem

-#### Lock-Free MPI Porting
+MPSC queues are essential for **irregular applications** — programs with unpredictable, data-dependent memory access patterns:

-MPI-3 RMA enables lock-free implementations:
+- **Actor model**: Each actor maintains a mailbox (an MPSC queue) that receives messages from other actors
+- **Fan-out/fan-in**: Worker nodes enqueue results to an aggregation node for processing

-- `MPI_Win_lock_all` / `MPI_Win_unlock_all` manage access epochs.
-- MPI atomic operations (`MPI_Fetch_and_op`, `MPI_Compare_and_swap`) provide synchronization.
+These patterns demand queues that are both performant and fault-tolerant: a slow or crashed producer should not block the entire system.

-### Literature Review
+#### Gap in the Literature

-#### Known Problems
+**Shared-memory** research offers several non-blocking MPSC queues: LTQueue, DQueue, WRLQueue, and Jiffy. However, our analysis reveals critical flaws in most:

-* **ABA problem**
+| Queue | Issue |
+|-------|-------|
+| DQueue | Incorrect ABA solution and unsafe memory reclamation |
+| WRLQueue | Actually **blocking** — the dequeuer waits for all enqueuers |
+| Jiffy | Insufficient memory reclamation, not truly wait-free |
+| **LTQueue** | **Correct** — uses LL/SC for ABA, proper memory reclamation |

-  A pointer is reused after deallocation, causing a CAS to incorrectly succeed.
+**Distributed** computing has only one MPSC queue: **AMQueue**. Despite claiming lock-freedom, it is actually **blocking** — the dequeuer must wait for all enqueuers to finish. A single slow enqueuer halts the entire system. ([Confirmed by the original author](assets/amqueue-blocking-evidence.png))

-  Solutions: Monotonic counters, hazard pointers.
+#### Our Approach

-* **Safe memory reclamation**
+We adapt **LTQueue** — the only correct queue of the four surveyed — to distributed environments using MPI-3 RMA one-sided communication.

-  Premature deallocation while other threads hold references.
+**Key challenge**: LTQueue relies on LL/SC (Load-Link/Store-Conditional) to solve the ABA problem, but LL/SC is unavailable in MPI.

-  Solutions: Hazard pointers, epoch-based reclamation.
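Hazard pointers, one of the reclamation schemes named above, can be illustrated in miniature. This is a hypothetical single-slot sketch, not the thesis code; real implementations keep one slot per thread and batch their reclamation scans:

```cpp
#include <atomic>
#include <cassert>

// Minimal single-slot hazard-pointer sketch (illustrative only).
std::atomic<int*> shared{nullptr};   // pointer that readers dereference
std::atomic<int*> hazard{nullptr};   // slot where a reader publishes it

int read_protected() {
    int* p;
    do {
        p = shared.load();
        hazard.store(p);             // announce: "I am about to use p"
    } while (shared.load() != p);    // re-validate: p must still be current
    int v = *p;                      // safe: reclaimers see the hazard slot
    hazard.store(nullptr);           // done; p may now be reclaimed
    return v;
}

// A reclaimer may free `old` only when no reader has published it.
bool safe_to_free(int* old) {
    return hazard.load() != old;
}
```

The re-validation loop is the crux: publishing the hazard and then re-reading `shared` closes the window in which a reclaimer could free `p` between the load and the announcement.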
+**Our solution**: Replace LL/SC with CAS + unique timestamps. Each value is tagged with a monotonically increasing version number, preventing ABA without LL/SC.

-* **Empty queue contention**
+**Key techniques**:
+- **SPSC-per-enqueuer**: Each producer maintains a local queue, eliminating producer contention
+- **Unique timestamps**: Solve ABA via monotonic version numbers
+- **Double-refresh**: Bounds retries to two per node, ensuring wait-freedom
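The CAS-plus-timestamp idea can be sketched with C++11 atomics. This is an illustrative sketch under assumed names and packing, not the thesis implementation: a 32-bit value and a 32-bit version share one 64-bit word, so every write is distinguishable even when the value repeats:

```cpp
#include <atomic>
#include <cstdint>
#include <cassert>

// Sketch: CAS on a version-tagged word instead of a raw value.
// A raw-value CAS is vulnerable to ABA when the same value reappears;
// the monotonically increasing version makes each write unique.
struct Tagged {
    uint32_t value;
    uint32_t version;  // unique, monotonically increasing tag
};

std::atomic<uint64_t> cell{0};

uint64_t pack(Tagged t)   { return (uint64_t)t.version << 32 | t.value; }
Tagged   unpack(uint64_t w) { return { (uint32_t)w, (uint32_t)(w >> 32) }; }

// Publish a new value; the bumped version guarantees that a concurrent
// CAS against the old word cannot be fooled by a repeated value.
bool write_tagged(uint32_t v) {
    uint64_t old = cell.load();
    Tagged t = unpack(old);
    uint64_t desired = pack({v, t.version + 1});
    return cell.compare_exchange_strong(old, desired);
}
```

In the distributed setting, the CAS would presumably be carried out with `MPI_Compare_and_swap` on a 64-bit word in an RMA window; the tagging logic is the same.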

-  Concurrent `enqueue` and `dequeue` on an empty queue can conflict.
+### Contributions

-  Solutions: Sentinel node to separate head and tail pointers.
+#### Findings

-* **Intermediate state from slow processes**
+- **3 of 4** shared-memory MPSC queues (DQueue, WRLQueue, Jiffy) have correctness or progress issues
+- **AMQueue**, the only distributed MPSC queue, is blocking despite claims of lock-freedom
+- **LTQueue** is the only correct candidate for distributed adaptation

-  A delayed process may leave the queue in an inconsistent state mid-operation.
+#### Novel Algorithms

-  Solutions: Helping—other processes complete the pending operation.
+| Algorithm | Progress | Enqueue | Dequeue |
+|-----------|----------|---------|---------|
+| **dLTQueue** | Wait-free | O(log n) remote | O(log n) remote |
+| **Slotqueue** | Wait-free | O(1) remote | O(1) remote, O(n) local |
+| **dLTQueueV2** | Wait-free | O(1) remote | O(1) remote, O(log n) local |

-* **Intermediate state from failed processes**
+All algorithms are **linearizable** with no dynamic memory allocation.
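The no-dynamic-allocation property pairs naturally with the SPSC-per-enqueuer design: each producer owns a fixed-capacity single-producer/single-consumer ring. A generic C++11 sketch of such a ring (illustrative only, with hypothetical names; not the thesis code):

```cpp
#include <atomic>
#include <cstddef>
#include <cassert>

// Fixed-capacity SPSC ring buffer: one producer advances `tail`,
// one consumer advances `head`. No locks, and no allocation after
// construction, matching the "no dynamic memory allocation" property.
template <typename T, size_t N>
class SpscRing {
    T buf[N];
    std::atomic<size_t> head{0}, tail{0};
public:
    bool enqueue(const T& x) {   // called by the single producer only
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N)
            return false;        // full: capacity is fixed
        buf[t % N] = x;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    bool dequeue(T& out) {       // called by the single consumer only
        size_t h = head.load(std::memory_order_relaxed);
        if (tail.load(std::memory_order_acquire) == h)
            return false;        // empty
        out = buf[h % N];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};
```

In the distributed adaptation, one would expect the ring's storage to live in an MPI RMA window so the single consumer can read remote producers' rings with one-sided operations; this sketch only shows the shared-memory shape of the idea.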

-  A crashed process may leave the queue permanently inconsistent.
+### Results

-  Solutions: Helping mechanisms that can complete any pending operation.
+Benchmarked on [SuperMUC-NG](https://doku.lrz.de/supermuc-ng-10745965.html) (6000+ nodes) and [CoolMUC-4](https://doku.lrz.de/coolmuc-4-10746415.html) (100+ nodes):

-* **Help mechanism rationale**
+| Metric | Our Queues vs AMQueue |
+|--------|----------------------|
+| Enqueue throughput | **~2x better** |
+| Dequeue throughput | 3-10x worse |
+| Fault tolerance | **Wait-free** (vs blocking) |

-  Multi-step operations can leave the queue in intermediate states. Rather than blocking until consistency is restored, processes detect and complete pending operations. Implementation:
-
-  1. Detect intermediate state
-  2. Attempt completion via CAS
-
-  A failed CAS indicates another process already helped; retry is unnecessary.
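The detect-and-complete steps above can be condensed into a toy C++11 sketch. The names and the single-word state are hypothetical simplifications; real queues use multi-word operation descriptors:

```cpp
#include <atomic>
#include <cassert>

// Toy model of the helping pattern: an operation that stalls after
// step 1 leaves a PENDING marker; any process that observes the marker
// may finish step 2 itself. CAS guarantees the completion is applied
// exactly once.
constexpr int PENDING = -1;
std::atomic<int> slot{0};

void begin_operation() {
    slot.store(PENDING);             // step 1: publish intermediate state
}

void help_complete(int result) {
    int expected = PENDING;
    // If the CAS fails, another process already helped; no retry needed.
    slot.compare_exchange_strong(expected, result);
}
```

The key property is that a failed CAS is good news, not an error: it proves the pending operation was already completed by someone else, so the helper simply moves on.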

-#### Trends
-
-- Fast-path optimization
-  - Lock-free fast path with wait-free fallback
-  - Replace CAS with FAA or load/store where possible
-- Contention reduction
-  - Per-producer local buffers
-  - Elimination and backoff (for MPMC)
-- Cache-aware design
-
-### Evaluation Strategy
-
-We focus on the following criteria, in order of decreasing importance:
-* Correctness
-* Lock-freedom
-* Performance & Scalability
-
-#### Correctness
-
-- Linearizability
-- ABA-freedom
-- Safe memory reclamation
-
-#### Lock-Freedom
-
-No process may block system-wide progress. Note: lock-freedom depends on the underlying primitives being lock-free on the target platform.
-
-#### Performance
-
-Minimize latency and maximize throughput for target workloads.
-
-#### Scalability
-
-Throughput should scale with process count.
+**Trade-off**: Stronger fault tolerance at the cost of dequeue performance.

 ### Related

-- [dLTQueue: A Non-Blocking Distributed-Memory Multi-Producer Single-Consumer Queue](https://www.researchgate.net/publication/395381301_dLTQueue_A_Non-Blocking_Distributed-Memory_Multi-Producer_Single-Consumer_Queue)
-- [Slotqueue: A Wait-Free Distributed Multi-Producer Single-Consumer Queue with Constant Remote Operations](https://www.researchgate.net/publication/395448251_Slotqueue_A_Wait-Free_Distributed_Multi-Producer_Single-Consumer_Queue_with_Constant_Remote_Operations)
+1. **dLTQueue** - FDSE 2025 ([ResearchGate](https://www.researchgate.net/publication/395381301_dLTQueue_A_Non-Blocking_Distributed-Memory_Multi-Producer_Single-Consumer_Queue))
+2. **Slotqueue** - NPC 2025 ([ResearchGate](https://www.researchgate.net/publication/395448251_Slotqueue_A_Wait-Free_Distributed_Multi-Producer_Single-Consumer_Queue_with_Constant_Remote_Operations))
+
+[Full thesis](https://h-dna.github.io/MPiSC/)


 ---

 <div align="center">
 <p>
-<small>Last build: Sun Jan 11 03:21:14 UTC 2026</small><br>
+<small>Last build: Sun Jan 11 17:10:09 UTC 2026</small><br>
 <small>Generated by GitHub Actions • <a href="https://github.com/H-DNA/MPiSC/tree/main">View Source</a></small>
 </p>
 </div>

report/main.pdf

0 Bytes
Binary file not shown.
