Description
This issue does not describe any single specific bug; rather, it tracks past and future problems related to the peculiarities of weak memory models.
Unlike x86_64, the aarch64 architecture makes it considerably harder to correctly synchronize reads from and writes to shared memory across different threads and CPUs. This is partly explained in the message of commit c953384.
There are three related but different basic aspects of the memory model:
- atomicity - whether a read or write executes as a single indivisible operation, or can be interrupted and observed partially complete
- memory order - the order in which operations surrounding the one in question are executed
- visibility - under what conditions changes made to memory by one thread or CPU become visible to another thread or CPU
There are three relevant flavors of memory order (the fourth, Release-Consume, is mostly of theoretical interest):
- Relaxed (weakest)
- Release-Acquire (in the middle)
- Sequentially-consistent (strongest)
Most of the "strong" and "weak" characteristics of a memory model pertain to the default memory order and visibility.
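
To illustrate the difference between the flavors, here is a classic store-buffering litmus test (a self-contained toy, not OSv code):

```cpp
#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join(); t2.join();
    // r1 == 0 && r2 == 0 is a permitted outcome with relaxed (and even with
    // release-acquire) ordering; only seq_cst on all four operations rules it out.
    printf("r1=%d r2=%d\n", r1, r2);
}
```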
On x86_64, reads and writes of all naturally aligned variables up to 8 bytes (u64 aligned at 8 bytes, u32 aligned at 4, etc.) are atomic. Therefore all atomic `load()` and `store()` operations, with the exception of `store()` with `std::memory_order_seq_cst`, require no special instructions (such as memory barriers). The compiler is still prohibited from generating machine code that violates the ordering constraints imposed by the memory order argument of `load()` and `store()`, and release-acquire ordering comes for free for the majority of operations.
On the weakly ordered aarch64, similarly to x86_64, reads and writes of properly aligned variables are atomic (for example, the regular `ldr` and `str` instructions that read from memory into a register and vice versa are atomic). But the default memory order is the relaxed one: the CPU is free to reorder the execution of surrounding instructions and to make their effects visible to other CPUs out of order. So, for example, release-acquire ordering requires special instructions like `ldar` and `stlr`, or extra memory barrier instructions.
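
The difference shows up directly in code generation. Roughly, with typical GCC/Clang output (instruction choices may vary by compiler version; treat the comments as an approximation):

```cpp
#include <atomic>

std::atomic<long> a{0};

long load_relaxed()  { return a.load(std::memory_order_relaxed); }    // x86_64: mov    aarch64: ldr
long load_acquire()  { return a.load(std::memory_order_acquire); }    // x86_64: mov    aarch64: ldar
void store_relaxed(long v) { a.store(v, std::memory_order_relaxed); } // x86_64: mov    aarch64: str
void store_release(long v) { a.store(v, std::memory_order_release); } // x86_64: mov    aarch64: stlr
void store_seq_cst(long v) { a.store(v, std::memory_order_seq_cst); } // x86_64: xchg (or mov+mfence)    aarch64: stlr
```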
One practical example of where the memory order matters is a thread `t1` waking a thread `t2` that waits for some condition involving a variable `v`, where `t1` changes `v` to make the condition true. If `t1` is running on a different CPU than `t2`, will `t2` see the changes to `v` made by `t1` after `t2` is woken up and gets to run? Are `sched::wake_impl()` and the other related code implemented correctly to guarantee this?
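
Stripped down to its essence, the pattern looks like the following sketch (hypothetical names, with the actual sleeping/waking machinery reduced to a spin):

```cpp
#include <atomic>
#include <cassert>

std::atomic<bool> condition{false};
int v = 0; // plain, non-atomic

// t1: make the condition true, then wake t2.
void waker() {
    v = 123;                                          // the data t2 will read
    condition.store(true, std::memory_order_release); // publish it
    // ...wake t2 here; the wake must not be reordered before the store
}

// t2: after being woken, re-check the condition.
void sleeper() {
    while (!condition.load(std::memory_order_acquire)) {
        // ...sleep until woken
    }
    assert(v == 123); // guaranteed only because of the release/acquire pair
}
```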
Commit c953384 addressed some of the issues related to this waking scenario.
Beyond `wake_impl()`, we also need to be sure that our lock-free mutex provides the acquire-release semantics described at https://en.cppreference.com/w/cpp/atomic/memory_order. This means that memory changes made between `lock` and `unlock` of a given mutex must be visible to other threads that proceed after waiting on the same mutex. The same applies to thread starting and `join()` operations.
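
For reference, the start/join guarantee mentioned above means the following self-contained example is race-free (this is standard C++ semantics, not OSv-specific code):

```cpp
#include <thread>
#include <cassert>

int shared = 0; // plain, non-atomic

int main() {
    shared = 1;              // happens-before the thread's start
    std::thread t([] {
        assert(shared == 1); // visible: thread creation synchronizes-with the new thread
        shared = 2;
    });
    t.join();                // the thread's completion synchronizes-with join()
    assert(shared == 2);     // so this write is visible, too
}
```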
A good example is the part of the mutex implementation that handles an uncontended lock and unlock:
```cpp
 24 void mutex::lock()
 25 {
 26     trace_mutex_lock(this);
 27
 28     sched::thread *current = sched::thread::current();
 29
 30     if (count.fetch_add(1, std::memory_order_acquire) == 0) {
 31         // Uncontended case (no other thread is holding the lock, and no
 32         // concurrent lock() attempts). We got the lock.
 33         // Setting count=1 already got us the lock; we set owner and depth
 34         // just for implementing a recursive mutex.
 35         owner.store(current, std::memory_order_relaxed);
 36         depth = 1;
 37         return;
 38     }
 39
 40     // If we're here the mutex was already locked, but we're implementing
 41     // a recursive mutex so it's possible the lock holder is us - in which
 42     // case we need to increment depth instead of waiting.
 43     if (owner.load(std::memory_order_relaxed) == current) {
 44         count.fetch_add(-1, std::memory_order_relaxed);
 45         ++depth;
 46         return;
 47     }
...
214 void mutex::unlock()
215 {
216     trace_mutex_unlock(this);
217
218     // We assume unlock() is only ever called when this thread is holding
219     // the lock. The following assertions don't seem to add any measurable
220     // performance penalty, so we leave them in.
221     assert(owner.load(std::memory_order_relaxed) == sched::thread::current());
222     assert(depth!=0);
223     if (--depth)
224         return; // recursive mutex still locked.
225
226     // When we return from unlock(), we will no longer be holding the lock.
227     // We can't leave owner==current, otherwise a later lock() in the same
228     // thread will think it's a recursive lock, while actually another thread
229     // is in the middle of acquiring the lock and has set count>0.
230     owner.store(nullptr, std::memory_order_relaxed);
231
232     // If there is no waiting lock(), we're done. This is the easy case :-)
233     if (count.fetch_add(-1, std::memory_order_release) == 1) {
234         return;
235     }
```
`owner` and `count` are different atomic variables, which is important because, per https://en.cppreference.com/w/cpp/atomic/memory_order, a synchronizes-with relationship can only be established by operations on the same atomic object. Should the `owner.store()` in line 35 be changed to `std::memory_order_release` to establish synchronizes-with with line 43, where the `owner.load()` would correspondingly change to `std::memory_order_acquire`, given that `lock()` may be called on two different CPUs? Should the `owner.store()` in line 230 be changed to use `std::memory_order_release`? Or maybe the `count.fetch_add(1, std::memory_order_acquire)` in line 30 and the `count.fetch_add(-1, std::memory_order_release)` in line 233 are enough to provide the necessary synchronization guarantees. But perhaps they should be changed to use `std::memory_order_acq_rel`, given that `fetch_add()` is a read-modify-write operation.
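
To make the `acq_rel` question concrete, here is a minimal, non-recursive sketch of the same count-based scheme with the stronger orderings applied (a hypothetical toy for evaluation, not a proposed patch to the OSv mutex; the contended path is reduced to a spin):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

class toy_mutex {
    std::atomic<int> count{0};
public:
    void lock() {
        // acq_rel on the read-modify-write: the acquire half makes the writes
        // released by the previous unlock() visible to this thread.
        while (count.fetch_add(1, std::memory_order_acq_rel) != 0) {
            count.fetch_add(-1, std::memory_order_relaxed); // undo, then back off
            while (count.load(std::memory_order_relaxed) != 0)
                ; // spin instead of sleeping, unlike the real mutex
        }
    }
    void unlock() {
        // The release half publishes this thread's writes to the next lock()
        // whose fetch_add observes count back at zero.
        count.fetch_add(-1, std::memory_order_acq_rel);
    }
};

toy_mutex m;
int shared = 0; // protected by m

int main() {
    auto work = [] {
        for (int i = 0; i < 100000; i++) {
            m.lock();
            ++shared;
            m.unlock();
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    assert(shared == 200000);
}
```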
Similarly, there are other places in the OSv code where we use atomic `load()` and `store()` with `std::memory_order_relaxed`, which is enough to make the relevant logic work on x86_64 but may need to be changed to `std::memory_order_acquire`, `std::memory_order_release`, or `std::memory_order_acq_rel` to make sure it works flawlessly on aarch64 as well.
Related issues:
- aarch64: virtio::register_mmio_devices() sometimes page faults on real hardware like RockPro64 or Odroid #1117
- aarch64: early boot hangs sometimes in SMP mode (#vCPUs >=2) #1123
- aarch64: the do_main_thread hangs in osv::rcu_flush() in SMP mode #1134
- aarch64: tst-rcu-hashtable sometimes crashes in SMP mode #1159
Related commits:
Some good reading besides the C++ Memory Model spec: