Commit 261d62f
* #89 - Make Manager::shutdown() idempotent to prevent post-finalize log crash
Every Manager backend's shutdown() begins with a DAQIRI_LOG_INFO call,
and is invoked twice during the typical bench lifecycle:
1. Explicitly from main() via daqiri::shutdown().
2. Again from the manager's destructor (or, for SocketMgr in RoCE mode,
a destructor cascade into RdmaMgr::shutdown()) during C++
__cxa_finalize.
By the time the destructor cascade fires, spdlog's default logger -- a
function-local static created lazily on the first DAQIRI_LOG_INFO -- has
already been destroyed. The DAQIRI_LOG_INFO at the top of the second
shutdown() call then crashes inside spdlog::sink_it_.
Repro on DGX Spark: daqiri_bench_rdma --mode both against
examples/daqiri_bench_rdma_tx_rx_spark.yaml segfaults immediately after
the legitimate shutdown completes. Backtrace shows
__cxa_finalize -> ~SocketMgr -> SocketMgr::shutdown ->
RdmaMgr::shutdown -> daqiri::log_formatted_message ->
spdlog::logger::log_ -> spdlog::logger::sink_it_ -> SIGSEGV.
Fix: short-circuit shutdown() on subsequent calls by returning early
when initialized_ is false. Applied symmetrically to RdmaMgr, DpdkMgr,
and SocketMgr -- the log-first body-second pattern is identical in all
three. DpdkMgr's existing num_init reference-counted body is preserved;
the guard only activates after the body has cleared initialized_ in
the final shutdown.
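
The early-return guard described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the DAQIRI source: `SketchMgr`, `body_runs()`, the `bool` return value, and `shutdown_twice_body_runs()` are hypothetical names invented here, and `std::cout` stands in for DAQIRI_LOG_INFO.

```cpp
#include <iostream>

// Hypothetical stand-in for a Manager backend; not the DAQIRI class.
class SketchMgr {
 public:
  void initialize() { initialized_ = true; }

  // Returns true if the shutdown body ran, false if the guard short-circuited.
  bool shutdown() {
    if (!initialized_) {
      return false;  // Guard comes before any logging, so a late call logs nothing.
    }
    std::cout << "shutting down\n";  // Stand-in for DAQIRI_LOG_INFO.
    ++body_runs_;
    initialized_ = false;  // Cleared at the end so the next call returns early.
    return true;
  }

  int body_runs() const { return body_runs_; }

  // Mirrors the destructor-driven second call described in the message.
  ~SketchMgr() { shutdown(); }

 private:
  bool initialized_ = false;
  int body_runs_ = 0;
};

// Calls shutdown() twice, as main() and then a destructor cascade would,
// and reports how many times the body actually executed.
int shutdown_twice_body_runs() {
  SketchMgr mgr;
  mgr.initialize();
  mgr.shutdown();
  mgr.shutdown();
  return mgr.body_runs();
}
```

The point of the sketch is ordering: the guard sits above the log call, so a call arriving after the logger is gone never reaches it.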
Verified by repeated daqiri_bench_rdma --mode both runs from a
bash-parent shell. Pre-fix: SIGSEGV 100% reproducible. Post-fix: clean
exit 0, all destructor markers run in order.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
* #89 - Tighten SocketMgr::shutdown() guard to preserve init-failure cleanup
Greptile review of the original idempotency commit flagged the secondary
`if (!initialized_ && !running_.load()) { return; }` in SocketMgr::shutdown()
as having a dead `!initialized_` clause, since the new top-of-function guard
`if (!initialized_) { return; }` already covers that state. Investigation
surfaced the deeper concern: the top guard was too aggressive.
SocketMgr::initialize() sets initialized_=false and running_=true before
running setup, then sets initialized_=true on success. If setup_tcp_endpoint
or setup_udp_endpoint throws after spawning an accept_thread or io_thread,
the catch-block shutdown() call enters with initialized_=false and
running_=true. Under the original top guard the cleanup body was skipped,
leaving the worker threads joinable on the EndpointState. The destructor
cascade would then destroy a still-joinable std::thread, which calls
std::terminate.
Tighten the top guard to return early only when both flags are cleared. The
guard still short-circuits the post-shutdown re-entry from __cxa_finalize
(both flags are cleared at the end of the body), while the init-failure
cleanup path (running_=true) falls through and joins its threads. The
pre-existing secondary check is now fully redundant and has been removed.
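
A minimal sketch of the two-flag guard and the init-failure path it protects, assuming a single worker thread. Only `initialized_` and `running_` come from the commit message; `SketchSocketMgr`, `worker_`, `fail_after_spawn`, and the helper functions are hypothetical and exist only for illustration.

```cpp
#include <atomic>
#include <chrono>
#include <stdexcept>
#include <thread>

// Hypothetical stand-in for SocketMgr; not the DAQIRI class.
class SketchSocketMgr {
 public:
  // Returns false if setup failed; cleanup has already run in that case.
  bool initialize(bool fail_after_spawn) {
    initialized_ = false;
    running_.store(true);
    try {
      // The worker is spawned before setup completes.
      worker_ = std::thread([this] {
        while (running_.load()) {
          std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
      });
      if (fail_after_spawn) {
        throw std::runtime_error("simulated setup failure");
      }
      initialized_ = true;
      return true;
    } catch (const std::exception&) {
      shutdown();  // Enters with initialized_=false and running_=true.
      return false;
    }
  }

  void shutdown() {
    // Tightened guard: return early only when both flags are cleared.
    if (!initialized_ && !running_.load()) {
      return;
    }
    running_.store(false);
    if (worker_.joinable()) {
      worker_.join();
    }
    initialized_ = false;
    ++body_runs_;
  }

  bool worker_joinable() const { return worker_.joinable(); }
  int body_runs() const { return body_runs_; }

  ~SketchSocketMgr() { shutdown(); }

 private:
  bool initialized_ = false;
  std::atomic<bool> running_{false};
  std::thread worker_;
  int body_runs_ = 0;
};

// True if a failed initialize() ran the cleanup body once and joined the worker.
bool init_failure_joins_worker() {
  SketchSocketMgr mgr;
  bool ok = mgr.initialize(/*fail_after_spawn=*/true);
  return !ok && !mgr.worker_joinable() && mgr.body_runs() == 1;
}

// Number of times the body runs across two explicit shutdown() calls.
int normal_path_body_runs() {
  SketchSocketMgr mgr;
  mgr.initialize(/*fail_after_spawn=*/false);
  mgr.shutdown();
  mgr.shutdown();
  return mgr.body_runs();
}
```

With a guard of only `if (!initialized_) { return; }`, the catch-block call in this sketch would return immediately and the destructor would then destroy a joinable `std::thread`.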
DpdkMgr and RdmaMgr keep the simpler `if (!initialized_) { return; }` —
neither has an init-failure shutdown() caller, so the asymmetry is
intentional and isolated to the manager whose initialize() partially
spawns threads before setting initialized_=true.
Verified manually with both the existing DPDK / socket-udp / socket-tcp
normal-shutdown smoke tests and a new 2-endpoint UDP init-failure repro
(malformed remote IP on endpoint 2 → parse_ipv4_addr throws after
endpoint 1's io_thread is spawned): rc=1, no SIGSEGV / SIGABRT /
"terminate called" in stderr.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
---------
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent fcdbaf9 commit 261d62f
3 files changed
Lines changed: 26 additions & 2 deletions
Per-file hunks (file names and changed line contents not captured):

| File | Lines added (new numbering) | Lines removed (original numbering) |
|---|---|---|
| 1 | 4406-4410 | none |
| 2 | 1515-1520 | none |
| 3 | 469-483 | 479-480 |