Skip to content

Commit 261d62f

Browse files
RamyaGuruclaude
andauthored
#89 - Make Manager::shutdown() idempotent to prevent post-finalize log crash (#90)
* #89 - Make Manager::shutdown() idempotent to prevent post-finalize log crash Every Manager backend's shutdown() begins with a DAQIRI_LOG_INFO call, and is invoked twice during the typical bench lifecycle: 1. Explicitly from main() via daqiri::shutdown(). 2. Again from the manager's destructor (or, for SocketMgr in RoCE mode, a destructor cascade into RdmaMgr::shutdown()) during C++ __cxa_finalize. By the time the destructor cascade fires, spdlog's default logger -- a function-local static created lazily on the first DAQIRI_LOG_INFO -- has already been destroyed. The DAQIRI_LOG_INFO at the top of the second shutdown() call then crashes inside spdlog::sink_it_. Repro on DGX Spark: daqiri_bench_rdma --mode both against examples/daqiri_bench_rdma_tx_rx_spark.yaml segfaults immediately after the legitimate shutdown completes. Backtrace shows __cxa_finalize -> ~SocketMgr -> SocketMgr::shutdown -> RdmaMgr::shutdown -> daqiri::log_formatted_message -> spdlog::logger::log_ -> spdlog::logger::sink_it_ -> SIGSEGV. Fix: short-circuit shutdown() on subsequent calls by returning early when initialized_ is false. Applied symmetrically to RdmaMgr, DpdkMgr, and SocketMgr -- the log-first body-second pattern is identical in all three. DpdkMgr's existing num_init reference-counted body is preserved; the guard only activates after the body has cleared initialized_ in the final shutdown. Verified by repeated daqiri_bench_rdma --mode both runs from a bash-parent shell. Pre-fix: SIGSEGV 100% reproducible. Post-fix: clean exit 0, all destructor markers run in order. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: rgurunathan <rgurunathan@nvidia.com> * #89 - Tighten SocketMgr::shutdown() guard to preserve init-failure cleanup Greptile review of the original idempotency commit flagged the secondary `if (!initialized_ && !running_.load()) { return; }` in SocketMgr::shutdown() as having a dead `!initialized_` clause, since the new top-of-function guard `if (!initialized_) { return; }` already covers that state. Investigation surfaced the deeper concern: the top guard was too aggressive. SocketMgr::initialize() sets initialized_=false and running_=true before running setup, then sets initialized_=true on success. If setup_tcp_endpoint or setup_udp_endpoint throws after spawning an accept_thread or io_thread, the catch-block shutdown() call entered with initialized_=false and running_=true. Under the original top guard the cleanup body was skipped, leaving the worker threads joinable on the EndpointState — the destructor cascade would then std::terminate on an unjoined std::thread. Tighten the top guard to require both flags cleared. The post-shutdown re-entry from __cxa_finalize still fires (both flags cleared at the end of the body) while the init-failure cleanup path (running_=true) falls through and joins its threads. The pre-existing secondary check is now fully redundant and removed. DpdkMgr and RdmaMgr keep the simpler `if (!initialized_) { return; }` — neither has an init-failure shutdown() caller, so the asymmetry is intentional and isolated to the manager whose initialize() partially spawns threads before setting initialized_=true. Verified manually with both the existing DPDK / socket-udp / socket-tcp normal-shutdown smokes and a new 2-endpoint UDP init-failure repro (malformed remote IP on endpoint 2 → parse_ipv4_addr throws after endpoint 1's io_thread is spawned): rc=1, no SIGSEGV / SIGABRT / "terminate called" in stderr. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: rgurunathan <rgurunathan@nvidia.com> --------- Signed-off-by: rgurunathan <rgurunathan@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent fcdbaf9 commit 261d62f

3 files changed

Lines changed: 26 additions & 2 deletions

File tree

src/managers/dpdk/daqiri_dpdk_mgr.cpp

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4403,6 +4403,11 @@ Status DpdkMgr::send_tx_burst(BurstParams* burst) {
44034403
}
44044404

44054405
void DpdkMgr::shutdown() {
4406+
// Idempotency guard: shutdown() may be invoked a second time via ~DpdkMgr
4407+
// during C++ __cxa_finalize, by which point the spdlog default logger has
4408+
// already been destroyed and any DAQIRI_LOG_INFO here crashes inside
4409+
// spdlog::sink_it_. Skip the body (and the log calls) if already torn down.
4410+
if (!initialized_) { return; }
44064411
DAQIRI_LOG_INFO("daqiri DPDK manager shutdown called {}", num_init);
44074412

44084413
if (--num_init == 0) {

src/managers/rdma/daqiri_rdma_mgr.cpp

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1512,6 +1512,12 @@ void RdmaMgr::initialize() {
15121512
}
15131513

15141514
void RdmaMgr::shutdown() {
1515+
// Idempotency guard: shutdown() runs explicitly from the caller AND again
1516+
// from ~RdmaMgr / ~SocketMgr during C++ __cxa_finalize. By the second call
1517+
// the spdlog default logger (a function-local static created lazily on the
1518+
// first DAQIRI_LOG_INFO) has already been destroyed, so any logging here
1519+
// crashes inside spdlog::sink_it_. Skip the whole body on subsequent calls.
1520+
if (!initialized_) { return; }
15151521
DAQIRI_LOG_INFO("RDMA manager shutting down");
15161522
rdma_force_quit.store(true);
15171523

src/managers/socket/daqiri_socket_mgr.cpp

Lines changed: 15 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -466,6 +466,21 @@ void SocketMgr::clear_rx_queues() {
466466
}
467467

468468
void SocketMgr::shutdown() {
469+
// Idempotency guard: shutdown() may be invoked a second time via
470+
// ~SocketMgr during C++ __cxa_finalize, after the spdlog default logger
471+
// has been destroyed. Any DAQIRI_LOG_INFO from the cascade (here, the
472+
// RoCE branch, or the manager method this delegates to) would then crash
473+
// inside spdlog::sink_it_. Skip the whole body on subsequent calls.
474+
//
475+
// The guard checks BOTH flags because initialize() sets initialized_=false
476+
// and running_=true before running setup, then sets initialized_=true on
477+
// success. If setup throws, the catch block calls shutdown() with
478+
// initialized_=false and running_=true to clean up any threads/sockets
479+
// that were spawned partway. Guarding on initialized_ alone would skip
480+
// that cleanup. Both flags are only cleared together after a successful
481+
// shutdown() body, so the post-shutdown re-entry from __cxa_finalize is
482+
// the only case where the guard fires.
483+
if (!initialized_ && !running_.load()) { return; }
469484
if (is_roce_protocol()) {
470485
#if DAQIRI_MGR_RDMA
471486
if (roce_mgr_ != nullptr) {
@@ -476,8 +491,6 @@ void SocketMgr::shutdown() {
476491
return;
477492
}
478493

479-
if (!initialized_ && !running_.load()) { return; }
480-
481494
running_.store(false);
482495

483496
for (auto& ep : endpoints_) {

0 commit comments

Comments
 (0)