dpdk-worker3 segfaulting on io-engine nodes #1876

@mattlqx

Description

Describe the bug
I have a 3-node cluster that I recently upgraded to the 4.3.0 release. I also did some cleanup of stale diskpools and replicas in etcd and added some new diskpools. I'm not sure which of these is the cause, but two of my io-engine pods now crash frequently, and I can see segfaults for dpdk-worker3 processes in the kernel messages.

2025-06-15T21:13:09.676348+00:00 rosey kernel: dpdk-worker3[671951]: segfault at 5b8 ip 00007d5a291b13a4 sp 00007d5a137fbc18 error 4 in libc.so.6[7d5a29148000+15c000] likely on CPU 3 (core 4, socket 0)
2025-06-15T21:30:29.525325+00:00 rosey kernel: dpdk-worker3[701548]: segfault at e8 ip 00006339f3463ffe sp 00007274df7fbb70 error 4 in io-engine[6339f22cb000+1242000] likely on CPU 3 (core 4, socket 0)
2025-06-15T21:48:19.732350+00:00 rosey kernel: dpdk-worker3[730617]: segfault at e8 ip 00005e22561e4ffe sp 00007a6913ffcb70 error 4 in io-engine[5e225504c000+1242000] likely on CPU 3 (core 4, socket 0)
2025-06-15T22:00:28.440310+00:00 rosey kernel: dpdk-worker3[760210]: segfault at 18 ip 000063dc602b0bce sp 000070aa71622bc0 error 4 in io-engine[63dc5f11c000+1242000] likely on CPU 3 (core 4, socket 0)
2025-06-16T13:21:02.272536+00:00 rosey kernel: dpdk-worker3[1561562]: segfault at e8 ip 00005d166445affe sp 00007c628b7fbb70 error 4 in io-engine[5d16632c2000+1242000] likely on CPU 3 (core 4, socket 0)
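
For what it's worth, the small fault addresses (0x18, 0xe8, 0x5b8) look like reads through a null pointer plus a struct-field offset rather than random wild pointers ("error 4" is a user-mode read of an unmapped page). A minimal Rust sketch of that pattern, using a purely hypothetical offset and not the actual io-engine layout:

```rust
// Purely illustrative: reading a field at offset 0xe8 from a null base pointer
// faults at address 0xe8, which the kernel then logs as
// "segfault at e8 ... error 4" (error 4 = user-mode read of an unmapped page).
fn main() {
    // Hypothetical field offset, matching one of the kernel log lines above.
    let field_addr = 0xe8 as *const u32;
    // This volatile read faults at address 0xe8.
    let _value = unsafe { std::ptr::read_volatile(field_addr) };
}
```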

Here's also the tail of the log from one of the pods that crashed:

[2025-06-16T16:38:21.794244627+00:00 DEBUG io_engine::bdev::nvmx::controller:controller.rs:708] detaching NVMe controller self.name="192.168.143.27:8420/nqn.2019-05.io.openebs:2bb3e6fa-14b7-4d42-8ba1-d4193374e3ean1"
[2025-06-16T16:38:21.794261091+00:00 ERROR mayastor::spdk:nvme_tcp.c:2176] Failed to flush tqpair=0x5e18b4dce6f0 (9): Bad file descriptor   
[2025-06-16T16:38:21.794272191+00:00 ERROR mayastor::spdk:nvme_fabric.c:214] Failed to send Property Get fabrics command   
[2025-06-16T16:38:21.794276201+00:00 ERROR mayastor::spdk:nvme_ctrlr.c:1249] [nqn.2019-05.io.openebs:2bb3e6fa-14b7-4d42-8ba1-d4193374e3ea] Failed to read the CC register   
[2025-06-16T16:38:21.794294917+00:00  INFO io_engine::bdev::nvmx::controller:controller.rs:715] NVMe controller successfully detached self.name="192.168.143.27:8420/nqn.2019-05.io.openebs:2bb3e6fa-14b7-4d42-8ba1-d4193374e3ean1"
[2025-06-16T16:38:21.794303307+00:00 DEBUG io_engine::bdev::nvmx::controller_inner:controller_inner.rs:70] 192.168.143.27:8420/nqn.2019-05.io.openebs:2bb3e6fa-14b7-4d42-8ba1-d4193374e3ean1 dropping TimeoutConfig
[2025-06-16T16:38:21.794327685+00:00 ERROR io_engine::bdev::nvmx::controller:controller.rs:792] process adminq: A?H??^??^@?: ctrl failed: false, error: ENXIO: No such device or address
[2025-06-16T16:38:21.794350428+00:00  INFO io_engine::bdev::nvmx::controller:controller.rs:798] dispatching nexus fault and retire: A?H??^??^@?
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic
   3: core::option::unwrap_failed
   4: io_engine::bdev::nvmx::controller::NvmeControllerInner::new::{{closure}}
   5: spdk_rs::poller::inner_poller_cb
   6: thread_poll
   7: spdk_thread_poll
   8: io_engine::core::reactor::Reactor::poll_once
   9: io_engine::core::reactor::Reactor::poll_reactor
  10: io_engine::core::reactor::Reactor::poll
  11: eal_thread_loop
  12: eal_worker_thread_loop
  13: start_thread
  14: clone3
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread '<unnamed>' panicked at core/src/panicking.rs:221:5:
panic in a function that cannot unwind
stack backtrace:
   0:     0x5e18a4d3298c - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h304520fd6a30aa07
   1:     0x5e18a42d620b - core::fmt::write::hf5713710ce10ff22
   2:     0x5e18a4cfc722 - std::io::Write::write_fmt::hda708db57927dacf
   3:     0x5e18a4d39c96 - std::panicking::default_hook::{{closure}}::he1ad87607d0c11c5
   4:     0x5e18a4d3abdb - std::panicking::rust_panic_with_hook::had2118629c312a4a
   5:     0x5e18a4d3a682 - std::panicking::begin_panic_handler::{{closure}}::h7fa5985d111bafa2
   6:     0x5e18a4d3a619 - std::sys::backtrace::__rust_end_short_backtrace::h704d151dbefa09c5
   7:     0x5e18a4d3a604 - rust_begin_unwind
   8:     0x5e18a42dd284 - core::panicking::panic_nounwind_fmt::hc0ae93930ea8f76c
   9:     0x5e18a42dd2e5 - core::panicking::panic_nounwind::h9f485ff9b02bac75
  10:     0x5e18a42dd2a0 - core::panicking::panic_cannot_unwind::hea865182d7ce50af
  11:     0x5e18a4463dd4 - io_engine::bdev::nvmx::controller::NvmeControllerInner::new::{{closure}}::hba023daa35c6c67f
  12:     0x5e18a4447e48 - spdk_rs::poller::inner_poller_cb::hb52f9bac7c27ff23
  13:     0x5e18a5182b1d - thread_poll
  14:     0x5e18a518394f - spdk_thread_poll
  15:     0x5e18a45e634c - io_engine::core::reactor::Reactor::poll_once::h54fa92eac4aa48d0
  16:     0x5e18a4645371 - io_engine::core::reactor::Reactor::poll_reactor::h87eb72f03aba2253
  17:     0x5e18a4644d10 - io_engine::core::reactor::Reactor::poll::h3da10718b44758c3
  18:     0x5e18a4fd314e - eal_thread_loop
  19:     0x5e18a4fe7459 - eal_worker_thread_loop
  20:     0x7145128c0272 - start_thread
  21:     0x71451293bdec - clone3
  22:                0x0 - <unknown>
thread caused non-unwinding panic. aborting.
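
Reading the two traces together: frames 3-4 of the first backtrace show an unwrap() of a None inside the NvmeControllerInner::new closure, and that closure runs from an SPDK poller callback (spdk_rs::poller::inner_poller_cb), which as far as I can tell is an extern "C" entry point and so can't unwind. That would explain why the panic escalates to the "panic in a function that cannot unwind" abort instead of being a contained failure. A minimal sketch of that failure mode (names are made up; this is not the actual spdk_rs/io-engine code):

```rust
use std::ffi::c_void;

// Stand-in for an SPDK poller callback: extern "C", so Rust must not unwind out of it.
extern "C" fn poller_cb(_ctx: *mut c_void) -> i32 {
    // Stand-in for the Option that turned out to be None (frames 3-4 above).
    let maybe_controller: Option<u32> = None;
    // unwrap() panics; because this frame is extern "C" ("cannot unwind"), the
    // panic escalates to an abort: "panic in a function that cannot unwind".
    let _controller = maybe_controller.unwrap();
    0
}

fn main() {
    poller_cb(std::ptr::null_mut());
}
```

So each of these panics aborts the whole io-engine process rather than failing a single operation, which matches how often the pods restart.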

This is happening on two of the three nodes. The affected hosts are also panicking from time to time.

I'd love some assistance in troubleshooting this.

**OS info (please complete the following information):**

  • Distro: Ubuntu 24.04
  • Kernel version: 6.8.0-60-generic
  • MayaStor revision or container image: docker.io/openebs/mayastor-io-engine:v2.9.0
