Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix potential data race: Lock all KV maps during apply_changes, even those which are read-only #6866

Merged
merged 2 commits into from
Feb 28, 2025

Conversation

eddyashton
Copy link
Member

@eddyashton eddyashton commented Feb 28, 2025

Taking another look at the still-failing tests in #6616, I managed to parse one of the consistent error stacks and I think this is the right fix.

Explanation

A pseudocode summary of this apply_changes function, which is taking a Tx-owned collection of local maybe-modified views over maps and applying them to the local persistent KV if they're still valid:

  bool any_writes = false;
  for (change: changes)
  {
    if (change.has_writes())
    {
      any_writes = true;
      change.map.lock();
    }
  }

  bool ok = true; // Becomes false if any prepare finds conflicts
  if (any_writes)
  {
    for (change: changes)
    {
      if (!change.prepare()) // Check for conflicts with the current KV state
      {
        ok = false;
        break;
      }
    }
  }

  if (ok)
  {
    for (change: changes)
    {
      change.commit(); // Apply the writes to the current KV state
    }
  }

  for (change: changes)
  {
    if (change.has_writes())
    {
      change.map.unlock();
    }
  }

Most of the ifs in this are about checking for writes. If we have a read-only transaction, then it can have no conflicts*, and we can skip the prepare() step entirely. Since we're not calling prepare() to read from the map, and commit() won't affect the map, we avoid locking these read-only views.

But there's a bug in this thinking - there's a difference between having a read-only view/changeset (change.has_writes()) and a read-only transaction (has_writes==false, all change.has_writes() == false). Writing transactions may contain read-only views! And in those transactions, we still need to run the prepare() check for the read-only views (my write to table A could depend on reading foo from table B, so we need to confirm that B still contains foo).

We previously did call prepare() for all changes, but without locking the read-only changes. That means we're reading the underlying Map memory without a lock, and racing with any other transactions which may be writing to the same Map.

This PR simply makes the lock() (and unlock()) unconditional - we lock all maps. There may be a marginally narrower use of this where we continue to avoid the locks for read-only transactions, but we can't know that until we've iterated over all changes, and there's extra complication from the track_read_version flag. I think this would be a minor optimisation, not worth the complexity.

* If you run a read-only transaction, and read a bunch of state at seqno=15, then even if the KV has been updated to store to seqno=20 by the time you get here, you have no problem - your transaction reports that it read at seqno=15, and that state is still valid.

Traces

Dumping some TSAN trace walk-through here for posterity.

Full trace is here, but 1400 lines of verbose call-stack: tsan_trace.txt

There are 5 data-race warnings. The last 4 are very similar, the first looks a little different but I think has the same root cause. Trimming to the relevant bits of the second:

WARNING: ThreadSanitizer: data race (pid=52264)
  Read of size 8 at 0x7b1c0000b0c8 by thread T4 (mutexes: write M0):
    #0 std::__shared_ptr<champ::SubNodes<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>, (__gnu_cxx::_Lock_policy)2>::get() const /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:1296:16 (liblogging.virtual.so+0xf901a9) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #1 std::__shared_ptr_access<champ::SubNodes<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get() const /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:993:66 (liblogging.virtual.so+0xf90165) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #2 std::__shared_ptr_access<champ::SubNodes<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>, (__gnu_cxx::_Lock_policy)2, false, false>::operator->() const /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:987:9 (liblogging.virtual.so+0xf8ff25) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #3 champ::Map<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>::get(llvm_vecsmall::SmallVector<unsigned char, 8u> const&) const /data/src/3.CCF/build.tsan/../src/ds/champ_map.h:424:16 (liblogging.virtual.so+0xf95741) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #4 ccf::kv::untyped::Map::HandleCommitter::prepare(bool) /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:170:40 (liblogging.virtual.so+0xf944f1) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #5 ccf::kv::apply_changes(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, ccf::kv::MapChanges, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const, ccf::kv::MapChanges>>>&, std::function<std::tuple<unsigned long, unsigned long> (bool)>, std::vector<std::unique_ptr<ccf::kv::ConsensusHook, std::default_delete<ccf::kv::ConsensusHook>>, std::allocator<std::unique_ptr<ccf::kv::ConsensusHook, std::default_delete<ccf::kv::ConsensusHook>>>>&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::shared_ptr<ccf::kv::AbstractMap>, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const, std::shared_ptr<ccf::kv::AbstractMap>>>> const&, std::optional<unsigned long> const&, bool, bool, std::optional<unsigned long> const&) /data/src/3.CCF/build.tsan/../src/kv/apply_changes.h:98:26 (liblogging.virtual.so+0xebfe64) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #6 ccf::kv::CommittableTx::commit(ccf::ClaimsDigest const&, bool, std::function<std::tuple<unsigned long, unsigned long> (bool)>, std::function<void (std::vector<unsigned char, std::allocator<unsigned char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&)>) /data/src/3.CCF/build.tsan/../src/kv/committable_tx.h:163:16 (liblogging.virtual.so+0x108179c) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    ...

  Previous write of size 8 at 0x7b1c0000b0c8 by thread T2 (mutexes: write M1, write M2, write M3, write M4, write M5):
    #0 std::__shared_ptr<champ::SubNodes<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>, (__gnu_cxx::_Lock_policy)2>::__shared_ptr(std::__shared_ptr<champ::SubNodes<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>, (__gnu_cxx::_Lock_policy)2>&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:1162:9 (liblogging.virtual.so+0xf8b681) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #1 std::shared_ptr<champ::SubNodes<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>>::shared_ptr(std::shared_ptr<champ::SubNodes<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>>&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr.h:305:9 (liblogging.virtual.so+0xf8b615) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #2 champ::Map<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>::Map(champ::Map<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>&&) /data/src/3.CCF/build.tsan/../src/ds/champ_map.h:384:9 (liblogging.virtual.so+0xf8b559) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #3 ccf::kv::untyped::LocalCommit::LocalCommit(unsigned long, champ::Map<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>&&, std::map<llvm_vecsmall::SmallVector<unsigned char, 8u>, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::less<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::allocator<std::pair<llvm_vecsmall::SmallVector<unsigned char, 8u> const, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>>>> const&) /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:26:7 (liblogging.virtual.so+0xf8b4a1) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #4 ccf::kv::untyped::LocalCommit* ccf::kv::untyped::Roll::create_new_local_commit<unsigned long&, champ::Map<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>, std::map<llvm_vecsmall::SmallVector<unsigned char, 8u>, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::less<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::allocator<std::pair<llvm_vecsmall::SmallVector<unsigned char, 8u> const, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>>>>&>(unsigned long&, champ::Map<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>&&, std::map<llvm_vecsmall::SmallVector<unsigned char, 8u>, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::less<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::allocator<std::pair<llvm_vecsmall::SmallVector<unsigned char, 8u> const, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>>>>&) /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:62:17 (liblogging.virtual.so+0xf97126) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #5 ccf::kv::untyped::Map::HandleCommitter::commit(unsigned long, bool, bool) /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:272:50 (liblogging.virtual.so+0xf95441) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #6 ccf::kv::apply_changes(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, ccf::kv::MapChanges, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const, ccf::kv::MapChanges>>>&, std::function<std::tuple<unsigned long, unsigned long> (bool)>, std::vector<std::unique_ptr<ccf::kv::ConsensusHook, std::default_delete<ccf::kv::ConsensusHook>>, std::allocator<std::unique_ptr<ccf::kv::ConsensusHook, std::default_delete<ccf::kv::ConsensusHook>>>>&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::shared_ptr<ccf::kv::AbstractMap>, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const, std::shared_ptr<ccf::kv::AbstractMap>>>> const&, std::optional<unsigned long> const&, bool, bool, std::optional<unsigned long> const&) /data/src/3.CCF/build.tsan/../src/kv/apply_changes.h:157:23 (liblogging.virtual.so+0xec03d9) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #7 ccf::kv::ReservedTx::commit_reserved() /data/src/3.CCF/build.tsan/../src/kv/committable_tx.h:427:16 (liblogging.virtual.so+0x13af49a) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    ...

  Location is heap block of size 104 at 0x7b1c0000b0c0 allocated by thread T2:
    #0 operator new(unsigned long) <null> (cchost+0xe7f96) (BuildId: 027e3c092a05647758acc60152d08eb86ddc7117)
    #1 ccf::kv::untyped::LocalCommit* ccf::kv::untyped::Roll::create_new_local_commit<unsigned long&, champ::Map<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>, std::map<llvm_vecsmall::SmallVector<unsigned char, 8u>, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::less<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::allocator<std::pair<llvm_vecsmall::SmallVector<unsigned char, 8u> const, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>>>>&>(unsigned long&, champ::Map<llvm_vecsmall::SmallVector<unsigned char, 8u>, ccf::kv::VersionV<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::hash<llvm_vecsmall::SmallVector<unsigned char, 8u>>>&&, std::map<llvm_vecsmall::SmallVector<unsigned char, 8u>, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::less<llvm_vecsmall::SmallVector<unsigned char, 8u>>, std::allocator<std::pair<llvm_vecsmall::SmallVector<unsigned char, 8u> const, std::optional<llvm_vecsmall::SmallVector<unsigned char, 8u>>>>>&) /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:57:13 (liblogging.virtual.so+0xf97096) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    ...

  Mutex M0 (0x7b40000084c8) created at:
    #0 pthread_mutex_lock <null> (cchost+0x8632a) (BuildId: 027e3c092a05647758acc60152d08eb86ddc7117)
    #1 __gthread_mutex_lock(pthread_mutex_t*) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749:12 (liblogging.virtual.so+0xd32213) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #2 std::mutex::lock() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:100:17 (liblogging.virtual.so+0xd84cb5) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #3 ccf::kv::untyped::Map::lock() /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:778:10 (liblogging.virtual.so+0xf89bbc) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #4 ccf::kv::untyped::Map::create_change_set(unsigned long, bool) /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:799:7 (liblogging.virtual.so+0xfbaa47) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #5 ccf::kv::BaseTx::get_map_and_change_set_by_name(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) /data/src/3.CCF/build.tsan/../src/kv/tx.cpp:95:20 (liblogging.virtual.so+0x18671e2) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    ...

  Mutex M1 (0x7b4800001350) created at:
    #0 pthread_mutex_trylock <null> (cchost+0x6a0ca) (BuildId: 027e3c092a05647758acc60152d08eb86ddc7117)
    #1 __gthread_mutex_trylock(pthread_mutex_t*) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:758:12 (liblogging.virtual.so+0xebf533) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #2 std::mutex::try_lock() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:111:15 (liblogging.virtual.so+0xf36445) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #3 std::unique_lock<std::mutex>::try_lock() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_lock.h:153:27 (liblogging.virtual.so+0x13ad085) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #4 ccf::HashedTxHistory<ccf::MerkleTreeHistory>::start_signature_emit_timer()::'lambda'(std::unique_ptr<threading::Tmsg<ccf::HashedTxHistory<ccf::MerkleTreeHistory>::start_signature_emit_timer()::EmitSigMsg>, std::default_delete<threading::Tmsg<ccf::HashedTxHistory<ccf::MerkleTreeHistory>::start_signature_emit_timer()::EmitSigMsg>>>)::operator()(std::unique_ptr<threading::Tmsg<ccf::HashedTxHistory<ccf::MerkleTreeHistory>::start_signature_emit_timer()::EmitSigMsg>, std::default_delete<threading::Tmsg<ccf::HashedTxHistory<ccf::MerkleTreeHistory>::start_signature_emit_timer()::EmitSigMsg>>>) const /data/src/3.CCF/build.tsan/../src/node/history.h:664:22 (liblogging.virtual.so+0x13b5695) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    ...

  Mutex M2 (0x7b50000026b0) created at:
    #0 pthread_mutex_lock <null> (cchost+0x8632a) (BuildId: 027e3c092a05647758acc60152d08eb86ddc7117)
    #1 __gthread_mutex_lock(pthread_mutex_t*) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749:12 (liblogging.virtual.so+0xd32213) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #2 std::mutex::lock() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:100:17 (liblogging.virtual.so+0xd84cb5) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #3 std::lock_guard<std::mutex>::lock_guard(std::mutex&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:229:19 (liblogging.virtual.so+0xd84bd9) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #4 ccf::kv::Store::commit(ccf::kv::TxID const&, std::unique_ptr<ccf::kv::PendingTx, std::default_delete<ccf::kv::PendingTx>>, bool) /data/src/3.CCF/build.tsan/../src/kv/store.h:885:40 (liblogging.virtual.so+0xf5d1ea) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    ...

  Mutex M3 (0x7b4000008dc8) created at:
    #0 pthread_mutex_lock <null> (cchost+0x8632a) (BuildId: 027e3c092a05647758acc60152d08eb86ddc7117)
    #1 __gthread_mutex_lock(pthread_mutex_t*) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749:12 (liblogging.virtual.so+0xd32213) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #2 std::mutex::lock() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:100:17 (liblogging.virtual.so+0xd84cb5) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #3 ccf::kv::untyped::Map::lock() /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:778:10 (liblogging.virtual.so+0xf89bbc) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #4 ccf::kv::untyped::Map::create_change_set(unsigned long, bool) /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:799:7 (liblogging.virtual.so+0xfbaa47) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #5 ccf::kv::BaseTx::get_map_and_change_set_by_name(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) /data/src/3.CCF/build.tsan/../src/kv/tx.cpp:95:20 (liblogging.virtual.so+0x18671e2) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    ...

  Mutex M4 (0x7b40000085c8) created at:
    #0 pthread_mutex_lock <null> (cchost+0x8632a) (BuildId: 027e3c092a05647758acc60152d08eb86ddc7117)
    #1 __gthread_mutex_lock(pthread_mutex_t*) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749:12 (liblogging.virtual.so+0xd32213) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #2 std::mutex::lock() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:100:17 (liblogging.virtual.so+0xd84cb5) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #3 ccf::kv::untyped::Map::lock() /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:778:10 (liblogging.virtual.so+0xf89bbc) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #4 ccf::kv::untyped::Map::create_change_set(unsigned long, bool) /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:799:7 (liblogging.virtual.so+0xfbaa47) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #5 ccf::kv::BaseTx::get_map_and_change_set_by_name(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) /data/src/3.CCF/build.tsan/../src/kv/tx.cpp:95:20 (liblogging.virtual.so+0x18671e2) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    ...

  Mutex M5 (0x7b4000008cc8) created at:
    #0 pthread_mutex_lock <null> (cchost+0x8632a) (BuildId: 027e3c092a05647758acc60152d08eb86ddc7117)
    #1 __gthread_mutex_lock(pthread_mutex_t*) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749:12 (liblogging.virtual.so+0xd32213) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #2 std::mutex::lock() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:100:17 (liblogging.virtual.so+0xd84cb5) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #3 ccf::kv::untyped::Map::lock() /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:778:10 (liblogging.virtual.so+0xf89bbc) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #4 ccf::kv::untyped::Map::create_change_set(unsigned long, bool) /data/src/3.CCF/build.tsan/../src/kv/untyped_map.h:799:7 (liblogging.virtual.so+0xfbaa47) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    #5 ccf::kv::BaseTx::get_map_and_change_set_by_name(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) /data/src/3.CCF/build.tsan/../src/kv/tx.cpp:95:20 (liblogging.virtual.so+0x18671e2) (BuildId: ea2a7594f9ba9e423b20f800ddfd0cf17460b982)
    ...

Mutexes M0, M3, M4, and M5 are mutexes within each Map (created within create_change_set when that Map was first accessed). Thread T2 is going through the signature commit path, so also holds store/sig related mutexes M1 and M2.

We have a data race accessing some champ::SubNodes fields, both under the call-stack of this apply_changes function. Both threads hold some locks, but none of the same, so they race. The commit() side must have some writes, as it's creating new entries, so it should have locked all of the maps it's handling, and be strictly ordered against any nearby prepare()s from other transactions? But not for a map it considered read-only!

@eddyashton eddyashton requested a review from a team as a code owner February 28, 2025 14:21
@eddyashton
Copy link
Member Author

eddyashton commented Feb 28, 2025

Validation of this fix: I'm running several e2e tests locally which were consistently finding these races before, and are green with this change. It looks (early doors) like the remaining TSAN failures are only timeouts (for tests where we try too much in parallel), rather than TSAN warnings/errors.

EDIT: The claim this was the last data race was too optimistic - #6616 has found at least one more around ledger file access, and likely a handful of others.

@eddyashton eddyashton added the run-long-test Run Long Test job label Feb 28, 2025
@achamayou achamayou added this pull request to the merge queue Feb 28, 2025
Merged via the queue into microsoft:main with commit da4d63e Feb 28, 2025
21 checks passed
@achamayou achamayou deleted the fix_data_race_in_apply_changes branch February 28, 2025 15:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run-long-test Run Long Test job
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants