[Fix](fragment) avoid query-ctx map clear self-deadlock when stop FragmentMgr#62954

Open
linrrzqqq wants to merge 2 commits into apache:master from linrrzqqq:fix-query_ctx-deadlock

Conversation

@linrrzqqq
Collaborator

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Previously, ConcurrentContextMap::clear() destroyed map values while holding the shard mutex. For _query_ctx_map_delay_delete, releasing the last QueryContext reference can run QueryContext::~QueryContext(), which calls FragmentMgr::remove_query_context() and re-enters the same delay-delete map through erase(). During BE shutdown, that re-entrant erase() tries to lock a mutex the thread already holds, which can throw std::system_error with "Resource deadlock avoided" and terminate the process:

terminate called after throwing an instance of 'std::system_error'
  what():  Resource deadlock avoided
*** Query id: 0-0 ***
*** is nereids: 0 ***
*** tablet id: 0 ***
*** Aborted at 1777288253 (unix time) try "date -d @1777288253" if you are using GNU date ***
*** Current BE git commitID: f060f176c99 ***
*** SIGABRT unknown detail explain (@0x7a53) received by PID 31315 (TID 31315 OR 0x7f6d057be200) from PID 31315; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at ../src/common/signal_handler.h:420
 1# 0x00007F6D05BD7420 in /lib/x86_64-linux-gnu/libpthread.so.0
 2# raise at ../sysdeps/unix/sysv/linux/raise.c:51
 3# abort at /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:81
 4# __gnu_cxx::__verbose_terminate_handler() [clone .cold] in /mnt/ssd01/pipline/OpenSourceDoris/clusterEnv/P0/Cluster0/be/lib/doris_be
 5# __cxxabiv1::__terminate(void (*)()) in /mnt/ssd01/pipline/OpenSourceDoris/clusterEnv/P0/Cluster0/be/lib/doris_be
 6# 0x00005608D3BC2AB9 in /mnt/ssd01/pipline/OpenSourceDoris/clusterEnv/P0/Cluster0/be/lib/doris_be
 7# 0x00005608A68DBCFE in /mnt/ssd01/pipline/OpenSourceDoris/clusterEnv/P0/Cluster0/be/lib/doris_be
 8# 0x00005608AAF679AE in /mnt/ssd01/pipline/OpenSourceDoris/clusterEnv/P0/Cluster0/be/lib/doris_be
 9# std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release_last_use() in /mnt/ssd01/pipline/OpenSourceDoris/clusterEnv/P0/Cluster0/be/lib/doris_be
10# decltype (std::allocator_traits<std::allocator<std::pair<doris::TUniqueId const, std::shared_ptr<doris::QueryContext> > > >::destroy({parm#2}, {parm#3})) phmap::allocator_traits<std::allocator<std::pair<doris::TUniqueId const, std::shared_ptr<doris::QueryContext> > > >::destroy_impl<std::allocator<std::pair<doris::TUniqueId const, std::shared_ptr<doris::QueryContext> > >, std::pair<doris::TUniqueId const, std::shared_ptr<doris::QueryContext> > >(int, std::allocator<std::pair<doris::TUniqueId const, std::shared_ptr<doris::QueryContext> > >&, std::pair<doris::TUniqueId const, std::shared_ptr<doris::QueryContext> >*) at /var/local/thirdparty/installed/include/parallel_hashmap/phmap_base.h:1542
11# phmap::priv::raw_hash_set<phmap::priv::FlatHashMapPolicy<doris::TUniqueId, std::shared_ptr<doris::QueryContext> >, phmap::Hash<doris::TUniqueId>, phmap::EqualTo<doris::TUniqueId>, std::allocator<std::pair<doris::TUniqueId const, std::shared_ptr<doris::QueryContext> > > >::clear() at /var/local/thirdparty/installed/include/parallel_hashmap/phmap.h:1279
12# doris::ConcurrentContextMap<doris::TUniqueId, std::shared_ptr<doris::QueryContext>, doris::QueryContext>::clear() at /root/doris/be/build_ASAN/../src/runtime/fragment_mgr.cpp:311
13# doris::FragmentMgr::stop() at /root/doris/be/build_ASAN/../src/runtime/fragment_mgr.cpp:348
14# doris::ExecEnv::destroy() at /root/doris/be/build_ASAN/../src/runtime/exec_env_init.cpp:819
15# main in /mnt/ssd01/pipline/OpenSourceDoris/clusterEnv/P0/Cluster0/be/lib/doris_be
16# __libc_start_main at ../csu/libc-start.c:342
17# _start in /mnt/ssd01/pipline/OpenSourceDoris/clusterEnv/P0/Cluster0/be/lib/doris_be

Call chain that leads to the deadlock:

FragmentMgr::stop()
  -> _query_ctx_map_delay_delete.clear()
      -> unique_lock(unique_query_id_lock)
      -> map.clear()
          -> QueryContext::~QueryContext()
              -> FragmentMgr::remove_query_context(query_id)
                  -> _query_ctx_map_delay_delete.erase(query_id)
                      -> unique_lock(unique_query_id_lock)        <- self-deadlock: this thread already holds the lock
                      -> map.erase(query_id)
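
The idea of moving value destruction outside the lock can be sketched in isolation. This is a minimal illustration, not the code in this PR: `Registry`, `Ctx`, and `destroyed_after_clear` are hypothetical stand-ins, a single `std::mutex` replaces the sharded map, and ids are assumed to be unique. The value's destructor calls back into `erase()`, mirroring the QueryContext destructor path in the call chain above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <utility>

class Registry;

// Hypothetical value type whose destructor re-enters its owning registry.
struct Ctx {
    Ctx(int id, Registry* owner) : id(id), owner(owner) {}
    ~Ctx();
    int id;
    Registry* owner;
};

class Registry {
public:
    ~Registry() { clear(); }  // make sure no Ctx outlives the registry

    void insert(int id) {
        auto ctx = std::make_shared<Ctx>(id, this);
        std::lock_guard<std::mutex> guard(_mutex);
        _map.try_emplace(id, std::move(ctx));
    }

    void erase(int id) {
        std::shared_ptr<Ctx> doomed;  // declared first, so destroyed after `guard`
        std::lock_guard<std::mutex> guard(_mutex);
        auto it = _map.find(id);
        if (it == _map.end()) return;
        doomed = std::move(it->second);
        _map.erase(it);
    }

    void clear() {
        std::unordered_map<int, std::shared_ptr<Ctx>> doomed;
        {
            std::lock_guard<std::mutex> guard(_mutex);
            doomed.swap(_map);  // only the cheap swap happens under the lock
        }
        doomed.clear();  // destructors run here, with _mutex released
    }

    std::size_t size() {
        std::lock_guard<std::mutex> guard(_mutex);
        return _map.size();
    }

    std::atomic<int> destroyed{0};

private:
    std::mutex _mutex;
    std::unordered_map<int, std::shared_ptr<Ctx>> _map;
};

// Re-entrant callback: safe only because clear()/erase() drop the lock first.
Ctx::~Ctx() {
    owner->erase(id);
    owner->destroyed.fetch_add(1);
}

// Inserts n contexts, clears the registry, and reports how many were destroyed.
int destroyed_after_clear(int n) {
    Registry registry;
    for (int id = 0; id < n; ++id) registry.insert(id);
    registry.clear();
    assert(registry.size() == 0);
    return registry.destroyed.load();
}
```

Calling `destroyed_after_clear(3)` exercises the re-entrant path three times. If `clear()` instead called `_map.clear()` while holding `_mutex`, each destructor would try to lock a mutex its own thread already owns.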

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@linrrzqqq
Collaborator Author

run buildall

1 similar comment
@wyxxxcat
Collaborator

run buildall

@linrrzqqq
Collaborator Author

/review

Contributor

@github-actions github-actions Bot left a comment


Automated code review summary:

No blocking issues found in the reviewed PR diff. Existing inline review context was empty, and the user focus file provided no additional focus points.

Critical checkpoint conclusions:

  • Goal and test: the PR fixes the shutdown self-deadlock where _query_ctx_map_delay_delete.clear() can release the last QueryContext reference while holding the shard mutex, and the added death test covers that reentrant destructor path.
  • Scope: the change is small and focused on moving value destruction outside the map shard lock.
  • Concurrency: the lock scope is reduced; the reentrant QueryContext::~QueryContext() -> FragmentMgr::remove_query_context() -> delay-delete-map erase() path no longer attempts to reacquire the same shard mutex while destruction is in progress. No new lock ordering issue was identified.
  • Lifecycle: the non-intuitive lifecycle is the QueryContext destructor callback into FragmentMgr; the new swap-then-clear approach handles that lifecycle safely for the affected map.
  • Configuration, compatibility, persistence, and FE/BE protocol: not applicable; no config, storage format, edit log, or RPC/thrift compatibility changes were introduced.
  • Parallel code paths: ConcurrentContextMap::clear() is generic, so the reduced lock scope applies consistently to the existing clear users; no missed equivalent clear path was found in this PR.
  • Tests: the PR adds a BE death test for the regression. I did not run the full BE UT locally in this review runner; GitHub/TeamCity status checks are the source of verification.
  • Observability: no new observability appears necessary for this shutdown deadlock fix.
  • Performance: the extra local flat_hash_map per shard is bounded by the existing shard count and avoids holding the mutex during potentially expensive destructors, which is preferable on this path.
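
The concurrency point in this list can be made observable without actually deadlocking. The sketch below is a hypothetical, single-threaded probe (`ProbeLock`, `Value`, and `count_reentries` are illustrative names, not Doris code): instead of blocking, the lock counts every attempt by the owning thread to take it again, so the old ordering (destroy under lock) and the swap-then-clear ordering can be compared side by side.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical probe lock: counts re-entry by the owning thread instead of blocking.
class ProbeLock {
public:
    // Returns false (and records it) when the caller already holds the lock.
    bool acquire() {
        if (_owner.load() == std::this_thread::get_id()) {
            ++reentries;
            return false;
        }
        _mutex.lock();
        _owner.store(std::this_thread::get_id());
        return true;
    }

    void release() {
        _owner.store(std::thread::id());
        _mutex.unlock();
    }

    int reentries = 0;  // only touched by the lock owner in this demo

private:
    std::mutex _mutex;
    std::atomic<std::thread::id> _owner{std::thread::id()};
};

// Value whose destructor runs a callback, standing in for a re-entrant destructor.
struct Value {
    std::function<void()> on_destroy;
    ~Value() {
        if (on_destroy) on_destroy();
    }
};

using Map = std::map<int, std::shared_ptr<Value>>;

// Clears a three-entry map with the chosen ordering and returns how many
// times a destructor tried to re-acquire a lock its thread already held.
int count_reentries(bool destroy_under_lock) {
    ProbeLock lock;
    Map map;
    auto erase = [&](int id) {
        if (!lock.acquire()) return;  // this is where a real mutex would self-deadlock
        map.erase(id);
        lock.release();
    };
    for (int id = 0; id < 3; ++id) {
        auto value = std::make_shared<Value>();
        value->on_destroy = [&erase, id] { erase(id); };
        map.emplace(id, std::move(value));
    }

    if (destroy_under_lock) {
        lock.acquire();
        map.clear();  // old ordering: destructors run while the lock is held
        lock.release();
    } else {
        Map doomed;
        lock.acquire();
        doomed.swap(map);
        lock.release();
        doomed.clear();  // new ordering: destructors run after release
    }
    return lock.reentries;
}
```

With these assumptions, `count_reentries(true)` reports one re-entry per stored value, while `count_reentries(false)` reports none, which is the property the review attributes to the reduced lock scope.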

Focus points: no additional user-provided review focus was present.

@linrrzqqq
Collaborator Author

run buildall

@github-actions github-actions Bot added the "approved" label (indicates a PR has been approved by one committer) on Apr 30, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

Labels

approved (indicates a PR has been approved by one committer), reviewed

3 participants