
Weak memory model issues #1353

Open

@wkozaczuk

Description

This does not describe any specific issue; rather, it is meant to track past and future problems related to weak-memory-model peculiarities.

Unlike x86_64, the aarch64 architecture poses a higher level of difficulty when synchronizing reads from and writes to shared memory across different threads and CPUs. This is somewhat explained in the message of commit c953384.

There are three related but different basic aspects of the memory model:

  • atomicity - whether a read or write can be interrupted, or observed as only partially executed
  • memory order - the order in which other operations are executed before and after the one in question
  • visibility - under what conditions changes made to memory by one thread or CPU become visible to another thread or CPU

There are three relevant flavors of memory order (the 4th, Release-Consume, is discouraged in practice and treated as Release-Acquire by compilers):

  • Relaxed (weakest)
  • Release-Acquire (in the middle)
  • Sequentially-consistent (strongest)

Most of the "strong" and "weak" characteristics of a memory model pertain to the default memory order and visibility.

On x86_64, reads and writes of all size-aligned variables up to 8 bytes (u64 aligned at 8 bytes, u32 aligned at 4, etc.) are atomic. Therefore all atomic load() and store() operations, with the exception of store() with std::memory_order_seq_cst, require no special instructions (such as memory barriers). The compiler is still prohibited from generating machine code that violates the ordering constraints imposed by the memory order argument of load() and store(), and release-acquire ordering comes automatically for the majority of operations.

On the weakly ordered aarch64, similarly to x86_64, reads and writes of properly aligned variables are atomic (for example, the regular ldr and str instructions that move data between memory and a register are atomic). But the default memory order is the relaxed one: the CPU is free to reorder the execution of surrounding instructions and to delay making their effects visible to other CPUs. So, for example, release-acquire ordering requires special instructions like ldar and stlr, or extra memory barrier instructions.
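As a rough illustration, here is a minimal sketch of how GCC and Clang typically compile these operations on the two architectures (the instruction mappings in the comments reflect common codegen and are not taken from OSv):

    #include <atomic>

    std::atomic<int> flag{0};

    int load_acquire() {
        // x86_64: a plain mov -- acquire ordering comes for free on a
        // strongly ordered CPU. aarch64: an ldar, which keeps later
        // accesses from being reordered before the load.
        return flag.load(std::memory_order_acquire);
    }

    void store_release(int v) {
        // x86_64: a plain mov. aarch64: an stlr, which keeps earlier
        // accesses from being reordered after the store.
        flag.store(v, std::memory_order_release);
    }

    void store_seq_cst(int v) {
        // The one case where x86_64 also needs a special instruction:
        // an xchg (or mov followed by mfence). aarch64: again an stlr.
        flag.store(v, std::memory_order_seq_cst);
    }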

One practical example of where memory order matters is waking a thread t2 that waits for some condition involving a variable v, by another thread t1 that changes v to make the condition true. If t1 is running on a different CPU than t2, will t2 see the changes to v made by t1 after t2 is woken up and gets to run? Are sched::wake_impl() and the other related code implemented correctly to guarantee this?

The commit c953384 addressed some of the issues related to the waking scenario.
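Reduced to its essence, this is the classic message-passing pattern. Below is a minimal standalone sketch of the ordering that must hold; this is not OSv's actual wake_impl(), which goes through the scheduler, and the names t1_waker/t2_waiter are ours:

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<bool> woken{false}; // stands in for the wake-up flag
    int v = 0;                      // the plain, non-atomic condition variable

    void t1_waker() {
        v = 42;                                       // make the condition true
        woken.store(true, std::memory_order_release); // "wake" t2: publishes v
    }

    void t2_waiter() {
        while (!woken.load(std::memory_order_acquire)) // pairs with the store above
            ;                                          // spin; a real waiter would sleep
        // The acquire load synchronizes-with the release store, so the write
        // to v happens-before this point -- even on aarch64:
        assert(v == 42);
    }

    int main() {
        std::thread waiter(t2_waiter), waker(t1_waker);
        waker.join();
        waiter.join();
    }

With std::memory_order_relaxed on both sides, the assert would be allowed to fail on aarch64, while in practice it would still pass on x86_64.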

Beyond wake_impl(), we also need to be sure that our lock-free mutex provides the acquire-release semantics described by https://en.cppreference.com/w/cpp/atomic/memory_order: memory changes made between lock() and unlock() of a given mutex must be visible to other threads that proceed after waiting on the same mutex. The same applies to thread starting and join() operations.

A good example is the part of the mutex implementation that handles an uncontended lock() and unlock():

 24 void mutex::lock()
 25 {
 26     trace_mutex_lock(this);
 27 
 28     sched::thread *current = sched::thread::current();
 29 
 30     if (count.fetch_add(1, std::memory_order_acquire) == 0) {
 31         // Uncontended case (no other thread is holding the lock, and no
 32         // concurrent lock() attempts). We got the lock.
 33         // Setting count=1 already got us the lock; we set owner and depth
 34         // just for implementing a recursive mutex.
 35         owner.store(current, std::memory_order_relaxed);
 36         depth = 1;
 37         return;
 38     }
 39 
 40     // If we're here the mutex was already locked, but we're implementing
 41     // a recursive mutex so it's possible the lock holder is us - in which
 42     // case we need to increment depth instead of waiting.
 43     if (owner.load(std::memory_order_relaxed) == current) {
 44         count.fetch_add(-1, std::memory_order_relaxed);
 45         ++depth;
 46         return;
 47     }
...
214 void mutex::unlock()
215 {
216     trace_mutex_unlock(this);
217 
218     // We assume unlock() is only ever called when this thread is holding
219     // the lock. The following assertions don't seem to add any measurable
220     // performance penalty, so we leave them in.
221     assert(owner.load(std::memory_order_relaxed) == sched::thread::current());
222     assert(depth!=0);
223     if (--depth)
224         return; // recursive mutex still locked.
225 
226     // When we return from unlock(), we will no longer be holding the lock.
227     // We can't leave owner==current, otherwise a later lock() in the same
228     // thread will think it's a recursive lock, while actually another thread
229     // is in the middle of acquiring the lock and has set count>0.
230     owner.store(nullptr, std::memory_order_relaxed);
231 
232     // If there is no waiting lock(), we're done. This is the easy case :-)
233     if (count.fetch_add(-1, std::memory_order_release) == 1) {
234         return;
235     }

The owner and count are different atomic variables, which is important: per https://en.cppreference.com/w/cpp/atomic/memory_order, a synchronizes-with relationship can only be established by operations on the same atomic object. Should the owner.store() in line 35 be changed to std::memory_order_release to establish a synchronizes-with relationship with line 43, where the owner.load() would correspondingly change to std::memory_order_acquire, given that lock() may be called on two different CPUs? Should the owner.store() in line 230 be changed to use std::memory_order_release? Or maybe the count.fetch_add(1, std::memory_order_acquire) in line 30 and the count.fetch_add(-1, std::memory_order_release) in line 233 are enough to provide the necessary synchronization guarantees. But perhaps they should be changed to use std::memory_order_acq_rel, given that fetch_add() is a read-modify-write operation.
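To make these questions concrete, here is a standalone sketch with all of the strengthened orderings applied at once. This is a hypothetical variant for discussion, not a proposed patch; whether each change is actually required is exactly the open question of this issue:

    #include <atomic>
    #include <cassert>

    struct toy_mutex {
        std::atomic<int> count{0};
        std::atomic<void*> owner{nullptr};
        unsigned depth = 0;

        void lock(void* current) {
            // acq_rel because fetch_add() is a read-modify-write: its acquire
            // half would pair with the release half of unlock()'s fetch_add().
            if (count.fetch_add(1, std::memory_order_acq_rel) == 0) {
                // release, so that an acquire load of owner on another cpu
                // can synchronize-with this store.
                owner.store(current, std::memory_order_release);
                depth = 1;
                return;
            }
            // acquire, pairing with the release store above in case the
            // owner was set on another cpu.
            if (owner.load(std::memory_order_acquire) == current) {
                count.fetch_add(-1, std::memory_order_relaxed);
                ++depth;
                return;
            }
            // ... contended path elided ...
        }

        void unlock(void* current) {
            assert(owner.load(std::memory_order_relaxed) == current);
            if (--depth)
                return; // recursive mutex still locked
            owner.store(nullptr, std::memory_order_release);
            if (count.fetch_add(-1, std::memory_order_acq_rel) == 1)
                return; // no waiting lock(), nothing more to do
            // ... waking a waiter elided ...
        }
    };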

Similarly, there are other places in the OSv code where we use atomic load() and store() with std::memory_order_relaxed, which is enough to make the relevant logic work on x86_64 but may need to be changed to std::memory_order_acquire, std::memory_order_release, or std::memory_order_acq_rel to make sure it works flawlessly on aarch64 as well.

Related issues:

Related commits:

Some good reading besides the C++ Memory Model spec:
