Description
This issue does not describe any single specific bug; rather, it tracks past and future problems related to the peculiarities of weak memory models.
Unlike x86_64, the aarch64 architecture makes it considerably harder to correctly synchronize reads from and writes to shared memory across different threads and CPUs. This is partly explained in the message of commit c953384.
There are three related but different basic aspects of the memory model:
- atomicity - whether a read or write executes as a single indivisible operation, or can be interrupted and observed partially complete
- memory order - the order in which operations surrounding the one in question are executed
- visibility - under what conditions changes made to memory by one thread or CPU become visible to another thread or CPU
There are three relevant flavors of memory order (the fourth, Release-Consume, is mostly of theoretical interest):
- Relaxed (weakest)
- Release-Acquire (in the middle)
- Sequentially-consistent (strongest)
Most of the "strong" and "weak" characteristics of a memory model pertain to the default memory order and visibility.
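
To illustrate the difference between the flavors, here is a classic store-buffering litmus test (a self-contained toy, not OSv code):

```cpp
#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join(); t2.join();
    // r1 == 0 && r2 == 0 is a permitted outcome with relaxed (and even with
    // release-acquire) ordering; only seq_cst on all four operations rules it out.
    printf("r1=%d r2=%d\n", r1, r2);
}
```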
On x86_64, reads and writes of all naturally aligned variables up to 8 bytes (u64 aligned at 8 bytes, u32 aligned at 4, etc.) are atomic. Therefore all atomic `load()` and `store()` operations, with the exception of `store()` with `std::memory_order_seq_cst`, require no special instructions (such as memory barriers). The compiler is still prohibited from generating machine code that violates the ordering constraints imposed by the memory order argument of `load()` and `store()`, and release-acquire ordering comes for free for the majority of operations.
On the weakly ordered aarch64, similarly to x86_64, reads and writes of properly aligned variables are atomic (for example, the regular `ldr` and `str` instructions that read from memory into a register and vice versa are atomic). But the default memory order is the relaxed one: the CPU is free to reorder the execution of surrounding instructions and to make their effects visible to other CPUs out of order. So, for example, release-acquire ordering requires special instructions like `ldar` and `stlr`, or extra memory barrier instructions.
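
The difference shows up directly in code generation. Roughly, with typical GCC/Clang output (instruction choices may vary by compiler version; treat the comments as an approximation):

```cpp
#include <atomic>

std::atomic<long> a{0};

long load_relaxed()  { return a.load(std::memory_order_relaxed); }    // x86_64: mov    aarch64: ldr
long load_acquire()  { return a.load(std::memory_order_acquire); }    // x86_64: mov    aarch64: ldar
void store_relaxed(long v) { a.store(v, std::memory_order_relaxed); } // x86_64: mov    aarch64: str
void store_release(long v) { a.store(v, std::memory_order_release); } // x86_64: mov    aarch64: stlr
void store_seq_cst(long v) { a.store(v, std::memory_order_seq_cst); } // x86_64: xchg (or mov+mfence)    aarch64: stlr
```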
One practical example of where the memory order matters is a thread `t1` waking a thread `t2` that waits for some condition involving a variable `v`, where `t1` changes `v` to make the condition true. If `t1` is running on a different CPU than `t2`, will `t2` see the changes to `v` made by `t1` after `t2` is woken up and gets to run? Are `sched::wake_impl()` and the other related code implemented correctly to guarantee this?
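
Stripped down to its essence, the pattern looks like the following sketch (hypothetical names, with the actual sleeping/waking machinery reduced to a spin):

```cpp
#include <atomic>
#include <cassert>

std::atomic<bool> condition{false};
int v = 0; // plain, non-atomic

// t1: make the condition true, then wake t2.
void waker() {
    v = 123;                                          // the data t2 will read
    condition.store(true, std::memory_order_release); // publish it
    // ...wake t2 here; the wake must not be reordered before the store
}

// t2: after being woken, re-check the condition.
void sleeper() {
    while (!condition.load(std::memory_order_acquire)) {
        // ...sleep until woken
    }
    assert(v == 123); // guaranteed only because of the release/acquire pair
}
```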
Commit c953384 addressed some of the issues related to this waking scenario.
Beyond `wake_impl()`, we also need to be sure that our lock-free mutex provides the acquire-release semantics described at https://en.cppreference.com/w/cpp/atomic/memory_order. This means that memory changes made between `lock` and `unlock` of a given mutex must be visible to other threads that proceed after waiting on the same mutex. The same applies to thread starting and `join()` operations.
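
For reference, the start/join guarantee mentioned above means the following self-contained example is race-free (this is standard C++ semantics, not OSv-specific code):

```cpp
#include <thread>
#include <cassert>

int shared = 0; // plain, non-atomic

int main() {
    shared = 1;              // happens-before the thread's start
    std::thread t([] {
        assert(shared == 1); // visible: thread creation synchronizes-with the new thread
        shared = 2;
    });
    t.join();                // the thread's completion synchronizes-with join()
    assert(shared == 2);     // so this write is visible, too
}
```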
A good example is the part of the mutex implementation that handles an uncontended lock and unlock:
```cpp
 24 void mutex::lock()
 25 {
 26     trace_mutex_lock(this);
 27
 28     sched::thread *current = sched::thread::current();
 29
 30     if (count.fetch_add(1, std::memory_order_acquire) == 0) {
 31         // Uncontended case (no other thread is holding the lock, and no
 32         // concurrent lock() attempts). We got the lock.
 33         // Setting count=1 already got us the lock; we set owner and depth
 34         // just for implementing a recursive mutex.
 35         owner.store(current, std::memory_order_relaxed);
 36         depth = 1;
 37         return;
 38     }
 39
 40     // If we're here the mutex was already locked, but we're implementing
 41     // a recursive mutex so it's possible the lock holder is us - in which
 42     // case we need to increment depth instead of waiting.
 43     if (owner.load(std::memory_order_relaxed) == current) {
 44         count.fetch_add(-1, std::memory_order_relaxed);
 45         ++depth;
 46         return;
 47     }
...
214 void mutex::unlock()
215 {
216     trace_mutex_unlock(this);
217
218     // We assume unlock() is only ever called when this thread is holding
219     // the lock. The following assertions don't seem to add any measurable
220     // performance penalty, so we leave them in.
221     assert(owner.load(std::memory_order_relaxed) == sched::thread::current());
222     assert(depth!=0);
223     if (--depth)
224         return; // recursive mutex still locked.
225
226     // When we return from unlock(), we will no longer be holding the lock.
227     // We can't leave owner==current, otherwise a later lock() in the same
228     // thread will think it's a recursive lock, while actually another thread
229     // is in the middle of acquiring the lock and has set count>0.
230     owner.store(nullptr, std::memory_order_relaxed);
231
232     // If there is no waiting lock(), we're done. This is the easy case :-)
233     if (count.fetch_add(-1, std::memory_order_release) == 1) {
234         return;
235     }
```
`owner` and `count` are different atomic variables, which is important because, per https://en.cppreference.com/w/cpp/atomic/memory_order, a synchronizes-with relationship can only be established by operations on the same atomic object. Should the `owner.store()` in line 35 be changed to `std::memory_order_release` to establish synchronizes-with with line 43, where the `owner.load()` would correspondingly change to `std::memory_order_acquire`, given that `lock()` may be called on two different CPUs? Should the `owner.store()` in line 230 be changed to use `std::memory_order_release`? Or maybe the `count.fetch_add(1, std::memory_order_acquire)` in line 30 and the `count.fetch_add(-1, std::memory_order_release)` in line 233 are enough to provide the necessary synchronization guarantees. But perhaps they should be changed to use `std::memory_order_acq_rel`, given that `fetch_add()` is a read-modify-write operation.
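
To make the `acq_rel` question concrete, here is a minimal, non-recursive sketch of the same count-based scheme with the stronger orderings applied (a hypothetical toy for evaluation, not a proposed patch to the OSv mutex; the contended path is reduced to a spin):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

class toy_mutex {
    std::atomic<int> count{0};
public:
    void lock() {
        // acq_rel on the read-modify-write: the acquire half makes the writes
        // released by the previous unlock() visible to this thread.
        while (count.fetch_add(1, std::memory_order_acq_rel) != 0) {
            count.fetch_add(-1, std::memory_order_relaxed); // undo, then back off
            while (count.load(std::memory_order_relaxed) != 0)
                ; // spin instead of sleeping, unlike the real mutex
        }
    }
    void unlock() {
        // The release half publishes this thread's writes to the next lock()
        // whose fetch_add observes count back at zero.
        count.fetch_add(-1, std::memory_order_acq_rel);
    }
};

toy_mutex m;
int shared = 0; // protected by m

int main() {
    auto work = [] {
        for (int i = 0; i < 100000; i++) {
            m.lock();
            ++shared;
            m.unlock();
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    assert(shared == 200000);
}
```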
Similarly, there are other places in the OSv code where we use atomic `load()` and `store()` with `std::memory_order_relaxed`, which is enough to make the relevant logic work on x86_64 but may need to be changed to `std::memory_order_acquire`, `std::memory_order_release`, or `std::memory_order_acq_rel` to make sure it works flawlessly on aarch64 as well.
Related issues:
- aarch64: virtio::register_mmio_devices() sometimes page faults on real hardware like RockPro64 or Odroid #1117
- aarch64: early boot hangs sometimes in SMP mode (#vCPUs >=2) #1123
- aarch64: the do_main_thread hangs in osv::rcu_flush() in SMP mode #1134
- aarch64: tst-rcu-hashtable sometimes crashes in SMP mode #1159
Related commits:
Some good reading besides the C++ Memory Model spec: