Fix potential data race: Lock all KV maps during `apply_changes`, even those which are read-only #6866

eddyashton · 2025-02-28T14:21:48Z

Taking another look at the still-failing tests in #6616, I managed to parse one of the consistent error stacks and I think this is the right fix.

Explanation

A pseudocode summary of this apply_changes function, which is taking a Tx-owned collection of local maybe-modified views over maps and applying them to the local persistent KV if they're still valid:

  bool any_writes = false;
  for (change: changes)
  {
    if (change.has_writes())
    {
      any_writes = true;
      change.map.lock();
    }
  }

  bool ok = true; // Becomes false if any prepare finds conflicts
  if (any_writes)
  {
    for (change: changes)
    {
      if (!change.prepare()) // Check for conflicts with the current KV state
      {
        ok = false;
        break;
      }
    }
  }

  if (ok)
  {
    for (change: changes)
    {
      change.commit(); // Apply the writes to the current KV state
    }
  }

  for (change: changes)
  {
    if (change.has_writes())
    {
      change.map.unlock();
    }
  }

Most of the ifs in this are about checking for writes. If we have a read-only transaction, then it can have no conflicts*, and we can skip the prepare() step entirely. Since we're not calling prepare() to read from the map, and commit() won't affect the map, we avoid locking these read-only views.

But there's a bug in this thinking - there's a difference between having a read-only view/changeset (change.has_writes()) and a read-only transaction (has_writes==false, all change.has_writes() == false). Writing transactions may contain read-only views! And in those transactions, we still need to run the prepare() check for the read-only views (my write to table A could depend on reading foo from table B, so we need to confirm that B still contains foo).

We previously did call prepare() for all changes, but without locking the read-only changes. That means we're reading the underlying Map memory without a lock, and racing with any other transactions which may be writing to the same Map.

This PR simply makes the lock() (and unlock()) unconditional - we lock all maps. There may be a marginally narrower use of this where we continue to avoid the locks for read-only transactions, but we can't know that until we've iterated over all changes, and there's extra complication from the track_read_version flag. I think this would be a minor optimisation, not worth the complexity.

* If you run a read-only transaction, and read a bunch of state at seqno=15, then even if the KV has been updated to store to seqno=20 by the time you get here, you have no problem - your transaction reports that it read at seqno=15, and that state is still valid.

Traces

Dumping some TSAN trace walk-through here for posterity.

Full trace is here, but 1400 lines of verbose call-stack: tsan_trace.txt

There are 5 data-race warnings. The last 4 are very similar, the first looks a little different but I think has the same root cause. Trimming to the relevant bits of the second:

WARNING: ThreadSanitizer: data race (pid=52264)
  Read of size 8 at 0x7b1c0000b0c8 by thread T4 (mutexes: write M0):
    #0 std::__shared_ptr<champ::SubNodes<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>, (__gnu_cxx::_Lock_policy)2>::get() const /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:1296:16 (liblogging.virtual.so+0xf901a9) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #1 std::__shared_ptr_access<champ::SubNodes<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get() const /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:993:66 (liblogging.virtual.so+0xf90165) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #2 std::__shared_ptr_access<champ::SubNodes<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>, (__gnu_cxx::_Lock_policy)2, false, false>::operator->() const /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:987:9 (liblogging.virtual.so+0xf8ff25) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #3 champ::Map<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>::get(llvm_vecsmall::SmallVector<unsigned char, 8u> const&) const /data/src/3.CCF/build.tsan/../src/ds/champ_map.h:424:16 (liblogging.virtual.so+0xf95741) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #4 ccf::kv::untyped::Map::HandleCommitter::prepare(bool) /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:170:40 (liblogging.virtual.so+0xf944f1) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #5 ccf::kv::apply_changes(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, ccf::kv::MapChanges, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const, ccf::kv::MapChanges>>>&, std::function<std::tuple<unsigned long, unsigned long> (bool)>, std::vector<std::unique_ptr<ccf::kv::ConsensusHook, std::default_delete<ccf::kv::ConsensusHook>>, std::allocator<std::unique_ptr<ccf::kv::ConsensusHook, std::default_delete<ccf::kv::ConsensusHook>>>>&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::shared_ptr<ccf::kv::AbstractMap>, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const, std::shared_ptr<ccf::kv::AbstractMap>>>> const&, std::optional<unsigned long> const&, bool, bool, std::optional<unsigned long> const&) /data/src/3.CCF/build.tsan/../src/kv/apply_changes.h:98:26 (liblogging.virtual.so+0xebfe64) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #6 ccf::kv::CommittableTx::commit(ccf::ClaimsDigest const&, bool, std::function<std::tuple<unsigned long, unsigned long> (bool)>, std::function<void (std::vector<unsigned char, std::allocator<unsigned char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&)>) /data/src/3.CCF/build.tsan/../src/kv/committable_tx.h:163:16 (liblogging.virtual.so+0x108179c) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    ...

  Previous write of size 8 at 0x7b1c0000b0c8 by thread T2 (mutexes: write M1, write M2, write M3, write M4, write M5):
    #0 std::__shared_ptr<champ::SubNodes<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>, (__gnu_cxx::_Lock_policy)2>::__shared_ptr(std::__shared_ptr<champ::SubNodes<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>, (__gnu_cxx::_Lock_policy)2>&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:1162:9 (liblogging.virtual.so+0xf8b681) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #1 std::shared_ptr<champ::SubNodes<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>>::shared_ptr(std::shared_ptr<champ::SubNodes<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>>&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr.h:305:9 (liblogging.virtual.so+0xf8b615) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #2 champ::Map<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>::Map(champ::Map<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>&&) /data/src/3.CCF/build.tsan/../src/ds/champ_map.h:384:9 (liblogging.virtual.so+0xf8b559) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #3 ccf::kv::untyped::LocalCommit::LocalCommit(unsigned long, champ::Map<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>&&, std::map<llvm_vecsmall::SmallVector<unsigned char, 8u>, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::less<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::allocator<std::pair<llvm_vecsmall::SmallVector<unsigned char, 8u> const, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>>>> const&) /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:26:7 (liblogging.virtual.so+0xf8b4a1) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #4 ccf::kv::untyped::LocalCommit* ccf::kv::untyped::Roll::create_new_local_commit<unsigned long&, champ::Map<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>, std::map<llvm_vecsmall::SmallVector<unsigned char, 8u>, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::less<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::allocator<std::pair<llvm_vecsmall::SmallVector<unsigned char, 8u> const, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>>>>&>(unsigned long&, champ::Map<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>&&, std::map<llvm_vecsmall::SmallVector<unsigned char, 8u>, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::less<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::allocator<std::pair<llvm_vecsmall::SmallVector<unsigned char, 8u> const, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>>>>&) /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:62:17 (liblogging.virtual.so+0xf97126) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #5 ccf::kv::untyped::Map::HandleCommitter::commit(unsigned long, bool, bool) /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:272:50 (liblogging.virtual.so+0xf95441) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #6 ccf::kv::apply_changes(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, ccf::kv::MapChanges, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const, ccf::kv::MapChanges>>>&, std::function<std::tuple<unsigned long, unsigned long> (bool)>, std::vector<std::unique_ptr<ccf::kv::ConsensusHook, std::default_delete<ccf::kv::ConsensusHook>>, std::allocator<std::unique_ptr<ccf::kv::ConsensusHook, std::default_delete<ccf::kv::ConsensusHook>>>>&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::shared_ptr<ccf::kv::AbstractMap>, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const, std::shared_ptr<ccf::kv::AbstractMap>>>> const&, std::optional<unsigned long> const&, bool, bool, std::optional<unsigned long> const&) /data/src/3.CCF/build.tsan/../src/kv/apply_changes.h:157:23 (liblogging.virtual.so+0xec03d9) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #7 ccf::kv::ReservedTx::commit_reserved() /data/src/3.CCF/build.tsan/../src/kv/committable_tx.h:427:16 (liblogging.virtual.so+0x13af49a) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    ...

  Location is heap block of size 104 at 0x7b1c0000b0c0 allocated by thread T2:
    #0 operator new(unsigned long) <null> (cchost+0xe7f96) (BuildId: 027e3c092a05647758acc60152d08eb86ddc7117)
    #1 ccf::kv::untyped::LocalCommit* ccf::kv::untyped::Roll::create_new_local_commit<unsigned long&, champ::Map<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>, std::map<llvm_vecsmall::SmallVector<unsigned char, 8u>, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::less<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::allocator<std::pair<llvm_vecsmall::SmallVector<unsigned char, 8u> const, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>>>>&>(unsigned long&, champ::Map<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>&&, std::map<llvm_vecsmall::SmallVector<unsigned char, 8u>, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::less<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::allocator<std::pair<llvm_vecsmall::SmallVector<unsigned char, 8u> const, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>>>>&) /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:57:13 (liblogging.virtual.so+0xf97096) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    ...

  Mutex M0 (0x7b40000084c8) created at:
    #0 pthread_mutex_lock <null> (cchost+0x8632a) (BuildId: 027e3c092a05647758acc60152d08eb86ddc7117)
    #1 __gthread_mutex_lock(pthread_mutex_t*) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749:12 (liblogging.virtual.so+0xd32213) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #2 std::mutex::lock() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:100:17 (liblogging.virtual.so+0xd84cb5) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #3 ccf::kv::untyped::Map::lock() /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:778:10 (liblogging.virtual.so+0xf89bbc) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #4 ccf::kv::untyped::Map::create_change_set(unsigned long, bool) /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:799:7 (liblogging.virtual.so+0xfbaa47) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #5 ccf::kv::BaseTx::get_map_and_change_set_by_name(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) /data/src/3.CCF/build.tsan/../src/kv/tx.cpp:95:20 (liblogging.virtual.so+0x18671e2) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    ...

  Mutex M1 (0x7b4800001350) created at:
    #0 pthread_mutex_trylock <null> (cchost+0x6a0ca) (BuildId: 027e3c092a05647758acc60152d08eb86ddc7117)
    #1 __gthread_mutex_trylock(pthread_mutex_t*) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:758:12 (liblogging.virtual.so+0xebf533) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #2 std::mutex::try_lock() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:111:15 (liblogging.virtual.so+0xf36445) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #3 std::unique_lock<std::mutex>::try_lock() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_lock.h:153:27 (liblogging.virtual.so+0x13ad085) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #4 ccf::HashedTxHistory<ccf::MerkleTreeHistory>::start_signature_emit_timer()::'lambda'(std::unique_ptr<threading::Tmsg<ccf::HashedTxHistory<ccf::MerkleTreeHistory>::start_signature_emit_timer()::EmitSigMsg>, std::default_delete<threading::Tmsg<ccf::HashedTxHistory<ccf::MerkleTreeHistory>::start_signature_emit_timer()::EmitSigMsg>>>)::operator()(std::unique_ptr<threading::Tmsg<ccf::HashedTxHistory<ccf::MerkleTreeHistory>::start_signature_emit_timer()::EmitSigMsg>, std::default_delete<threading::Tmsg<ccf::HashedTxHistory<ccf::MerkleTreeHistory>::start_signature_emit_timer()::EmitSigMsg>>>) const /data/src/3.CCF/build.tsan/../src/node/history.h:664:22 (liblogging.virtual.so+0x13b5695) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    ...

  Mutex M2 (0x7b50000026b0) created at:
    #0 pthread_mutex_lock <null> (cchost+0x8632a) (BuildId: 027e3c092a05647758acc60152d08eb86ddc7117)
    #1 __gthread_mutex_lock(pthread_mutex_t*) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749:12 (liblogging.virtual.so+0xd32213) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #2 std::mutex::lock() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:100:17 (liblogging.virtual.so+0xd84cb5) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #3 std::lock_guard<std::mutex>::lock_guard(std::mutex&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:229:19 (liblogging.virtual.so+0xd84bd9) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #4 ccf::kv::Store::commit(ccf::kv::TxID const&, std::unique_ptr<ccf::kv::PendingTx, std::default_delete<ccf::kv::PendingTx>>, bool) /data/src/3.CCF/build.tsan/../src/kv/store.h:885:40 (liblogging.virtual.so+0xf5d1ea) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    ...

  Mutex M3 (0x7b4000008dc8) created at:
    #0 pthread_mutex_lock <null> (cchost+0x8632a) (BuildId: 027e3c092a05647758acc60152d08eb86ddc7117)
    #1 __gthread_mutex_lock(pthread_mutex_t*) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749:12 (liblogging.virtual.so+0xd32213) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #2 std::mutex::lock() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:100:17 (liblogging.virtual.so+0xd84cb5) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #3 ccf::kv::untyped::Map::lock() /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:778:10 (liblogging.virtual.so+0xf89bbc) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #4 ccf::kv::untyped::Map::create_change_set(unsigned long, bool) /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:799:7 (liblogging.virtual.so+0xfbaa47) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #5 ccf::kv::BaseTx::get_map_and_change_set_by_name(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) /data/src/3.CCF/build.tsan/../src/kv/tx.cpp:95:20 (liblogging.virtual.so+0x18671e2) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    ...

  Mutex M4 (0x7b40000085c8) created at:
    #0 pthread_mutex_lock <null> (cchost+0x8632a) (BuildId: 027e3c092a05647758acc60152d08eb86ddc7117)
    #1 __gthread_mutex_lock(pthread_mutex_t*) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749:12 (liblogging.virtual.so+0xd32213) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #2 std::mutex::lock() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:100:17 (liblogging.virtual.so+0xd84cb5) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #3 ccf::kv::untyped::Map::lock() /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:778:10 (liblogging.virtual.so+0xf89bbc) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #4 ccf::kv::untyped::Map::create_change_set(unsigned long, bool) /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:799:7 (liblogging.virtual.so+0xfbaa47) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #5 ccf::kv::BaseTx::get_map_and_change_set_by_name(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) /data/src/3.CCF/build.tsan/../src/kv/tx.cpp:95:20 (liblogging.virtual.so+0x18671e2) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    ...

  Mutex M5 (0x7b4000008cc8) created at:
    #0 pthread_mutex_lock <null> (cchost+0x8632a) (BuildId: 027e3c092a05647758acc60152d08eb86ddc7117)
    #1 __gthread_mutex_lock(pthread_mutex_t*) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749:12 (liblogging.virtual.so+0xd32213) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #2 std::mutex::lock() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:100:17 (liblogging.virtual.so+0xd84cb5) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #3 ccf::kv::untyped::Map::lock() /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:778:10 (liblogging.virtual.so+0xf89bbc) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #4 ccf::kv::untyped::Map::create_change_set(unsigned long, bool) /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:799:7 (liblogging.virtual.so+0xfbaa47) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #5 ccf::kv::BaseTx::get_map_and_change_set_by_name(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) /data/src/3.CCF/build.tsan/../src/kv/tx.cpp:95:20 (liblogging.virtual.so+0x18671e2) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    ...

Mutexes M0, M3, M4, and M5 are mutexes within each Map (created within create_change_set when that Map was first accessed). Thread T2 is going through the signature commit path, so also holds store/sig related mutexes M1 and M2.

We have a data race accessing some champ::SubNodes fields, both under the call-stack of this apply_changes function. Both threads hold some locks, but none of the same, so they race. The commit() side must have some writes, as it's creating new entries, so it should have locked all of the maps it's handling, and be strictly ordered against any nearby prepare()s from other transactions? But not for a map it considered read-only!

eddyashton · 2025-02-28T14:24:41Z

Validation of this fix: I'm running several e2e tests locally which were consistently finding these races before, and are green with this change. It looks (early doors) like the remaining TSAN failures are only timeouts (for tests where we try too much in parallel), rather than TSAN warnings/errors.

EDIT: The claim this was the last data race was too optimistic - #6616 has found at least one more around ledger file access, and likely a handful of others.

src/kv/apply_changes.h

Lock all maps during apply, even those which are read-only

975f405

eddyashton requested a review from a team as a code owner February 28, 2025 14:21

eddyashton added the run-long-test Run Long Test job label Feb 28, 2025

maxtropets reviewed Feb 28, 2025

View reviewed changes

src/kv/apply_changes.h Outdated Show resolved Hide resolved

Simplify

a324e36

maxtropets approved these changes Feb 28, 2025

View reviewed changes

achamayou reviewed Feb 28, 2025

View reviewed changes

src/kv/apply_changes.h Show resolved Hide resolved

achamayou approved these changes Feb 28, 2025

View reviewed changes

eddyashton mentioned this pull request Feb 28, 2025

Remove track_read_versions #6868

Closed

achamayou added this pull request to the merge queue Feb 28, 2025

Merged via the queue into microsoft:main with commit da4d63e Feb 28, 2025
21 checks passed

achamayou deleted the fix_data_race_in_apply_changes branch February 28, 2025 15:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix potential data race: Lock all KV maps during `apply_changes`, even those which are read-only #6866

Fix potential data race: Lock all KV maps during `apply_changes`, even those which are read-only #6866

Uh oh!

eddyashton commented Feb 28, 2025 •

edited

Loading

Uh oh!

eddyashton commented Feb 28, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Fix potential data race: Lock all KV maps during apply_changes, even those which are read-only #6866

Fix potential data race: Lock all KV maps during apply_changes, even those which are read-only #6866

Uh oh!

Conversation

eddyashton commented Feb 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Explanation

Traces

Uh oh!

eddyashton commented Feb 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Fix potential data race: Lock all KV maps during `apply_changes`, even those which are read-only #6866

Fix potential data race: Lock all KV maps during `apply_changes`, even those which are read-only #6866

eddyashton commented Feb 28, 2025 •

edited

Loading

eddyashton commented Feb 28, 2025 •

edited

Loading