**Describe the bug**
I have a 3-node cluster that I recently upgraded to the 4.3.0 release. I also did some cleanup of stale diskpools and replicas in etcd and added some new diskpools. I'm not sure which of these is the cause, but two of my io-engine pods now crash frequently, and I can see segfaults for the dpdk-worker3 processes in the kernel messages:
```
2025-06-15T21:13:09.676348+00:00 rosey kernel: dpdk-worker3[671951]: segfault at 5b8 ip 00007d5a291b13a4 sp 00007d5a137fbc18 error 4 in libc.so.6[7d5a29148000+15c000] likely on CPU 3 (core 4, socket 0)
2025-06-15T21:30:29.525325+00:00 rosey kernel: dpdk-worker3[701548]: segfault at e8 ip 00006339f3463ffe sp 00007274df7fbb70 error 4 in io-engine[6339f22cb000+1242000] likely on CPU 3 (core 4, socket 0)
2025-06-15T21:48:19.732350+00:00 rosey kernel: dpdk-worker3[730617]: segfault at e8 ip 00005e22561e4ffe sp 00007a6913ffcb70 error 4 in io-engine[5e225504c000+1242000] likely on CPU 3 (core 4, socket 0)
2025-06-15T22:00:28.440310+00:00 rosey kernel: dpdk-worker3[760210]: segfault at 18 ip 000063dc602b0bce sp 000070aa71622bc0 error 4 in io-engine[63dc5f11c000+1242000] likely on CPU 3 (core 4, socket 0)
2025-06-16T13:21:02.272536+00:00 rosey kernel: dpdk-worker3[1561562]: segfault at e8 ip 00005d166445affe sp 00007c628b7fbb70 error 4 in io-engine[5d16632c2000+1242000] likely on CPU 3 (core 4, socket 0)
```
Here's also the tail of the log from a pod that crashed:
```
[2025-06-16T16:38:21.794244627+00:00 DEBUG io_engine::bdev::nvmx::controller:controller.rs:708] detaching NVMe controller self.name="192.168.143.27:8420/nqn.2019-05.io.openebs:2bb3e6fa-14b7-4d42-8ba1-d4193374e3ean1"
[2025-06-16T16:38:21.794261091+00:00 ERROR mayastor::spdk:nvme_tcp.c:2176] Failed to flush tqpair=0x5e18b4dce6f0 (9): Bad file descriptor
[2025-06-16T16:38:21.794272191+00:00 ERROR mayastor::spdk:nvme_fabric.c:214] Failed to send Property Get fabrics command
[2025-06-16T16:38:21.794276201+00:00 ERROR mayastor::spdk:nvme_ctrlr.c:1249] [nqn.2019-05.io.openebs:2bb3e6fa-14b7-4d42-8ba1-d4193374e3ea] Failed to read the CC register
[2025-06-16T16:38:21.794294917+00:00 INFO io_engine::bdev::nvmx::controller:controller.rs:715] NVMe controller successfully detached self.name="192.168.143.27:8420/nqn.2019-05.io.openebs:2bb3e6fa-14b7-4d42-8ba1-d4193374e3ean1"
[2025-06-16T16:38:21.794303307+00:00 DEBUG io_engine::bdev::nvmx::controller_inner:controller_inner.rs:70] 192.168.143.27:8420/nqn.2019-05.io.openebs:2bb3e6fa-14b7-4d42-8ba1-d4193374e3ean1 dropping TimeoutConfig
[2025-06-16T16:38:21.794327685+00:00 ERROR io_engine::bdev::nvmx::controller:controller.rs:792] process adminq: A?H??^??^@?: ctrl failed: false, error: ENXIO: No such device or address
[2025-06-16T16:38:21.794350428+00:00 INFO io_engine::bdev::nvmx::controller:controller.rs:798] dispatching nexus fault and retire: A?H??^??^@?
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::panicking::panic
3: core::option::unwrap_failed
4: io_engine::bdev::nvmx::controller::NvmeControllerInner::new::{{closure}}
5: spdk_rs::poller::inner_poller_cb
6: thread_poll
7: spdk_thread_poll
8: io_engine::core::reactor::Reactor::poll_once
9: io_engine::core::reactor::Reactor::poll_reactor
10: io_engine::core::reactor::Reactor::poll
11: eal_thread_loop
12: eal_worker_thread_loop
13: start_thread
14: clone3
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread '<unnamed>' panicked at core/src/panicking.rs:221:5:
panic in a function that cannot unwind
stack backtrace:
0: 0x5e18a4d3298c - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h304520fd6a30aa07
1: 0x5e18a42d620b - core::fmt::write::hf5713710ce10ff22
2: 0x5e18a4cfc722 - std::io::Write::write_fmt::hda708db57927dacf
3: 0x5e18a4d39c96 - std::panicking::default_hook::{{closure}}::he1ad87607d0c11c5
4: 0x5e18a4d3abdb - std::panicking::rust_panic_with_hook::had2118629c312a4a
5: 0x5e18a4d3a682 - std::panicking::begin_panic_handler::{{closure}}::h7fa5985d111bafa2
6: 0x5e18a4d3a619 - std::sys::backtrace::__rust_end_short_backtrace::h704d151dbefa09c5
7: 0x5e18a4d3a604 - rust_begin_unwind
8: 0x5e18a42dd284 - core::panicking::panic_nounwind_fmt::hc0ae93930ea8f76c
9: 0x5e18a42dd2e5 - core::panicking::panic_nounwind::h9f485ff9b02bac75
10: 0x5e18a42dd2a0 - core::panicking::panic_cannot_unwind::hea865182d7ce50af
11: 0x5e18a4463dd4 - io_engine::bdev::nvmx::controller::NvmeControllerInner::new::{{closure}}::hba023daa35c6c67f
12: 0x5e18a4447e48 - spdk_rs::poller::inner_poller_cb::hb52f9bac7c27ff23
13: 0x5e18a5182b1d - thread_poll
14: 0x5e18a518394f - spdk_thread_poll
15: 0x5e18a45e634c - io_engine::core::reactor::Reactor::poll_once::h54fa92eac4aa48d0
16: 0x5e18a4645371 - io_engine::core::reactor::Reactor::poll_reactor::h87eb72f03aba2253
17: 0x5e18a4644d10 - io_engine::core::reactor::Reactor::poll::h3da10718b44758c3
18: 0x5e18a4fd314e - eal_thread_loop
19: 0x5e18a4fe7459 - eal_worker_thread_loop
20: 0x7145128c0272 - start_thread
21: 0x71451293bdec - clone3
22: 0x0 - <unknown>
thread caused non-unwinding panic. aborting.
```
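For what it's worth, the symbolic frames above (core::option::unwrap_failed → io_engine::bdev::nvmx::controller::NvmeControllerInner::new::{{closure}} → spdk_rs::poller::inner_poller_cb) look like an Option being unwrapped inside an SPDK poller callback. A panic cannot unwind across the extern "C" FFI boundary, so the runtime aborts the whole process instead, which matches the "panic in a function that cannot unwind" / "thread caused non-unwinding panic. aborting." lines. Below is a minimal, self-contained sketch of that failure mode; the names (lookup_controller, poller_cb) are hypothetical and this is not the actual io-engine code:

```rust
// Illustrative only: an Option::unwrap() that hits None inside an
// extern "C" poller callback. Panics cannot unwind out of an extern "C"
// function, so the Rust runtime aborts the process instead of unwinding.
struct Controller;

// Hypothetical lookup that returns None once the controller is gone
// (e.g. already detached), standing in for whatever the real closure
// expected to still be present.
fn lookup_controller(_name: &str) -> Option<Controller> {
    None
}

// Shape of an SPDK-style poller callback: a plain C function pointer.
extern "C" fn poller_cb(_ctx: *mut std::ffi::c_void) -> i32 {
    // unwrap of None -> core::option::unwrap_failed -> panic starts,
    // then "panic in a function that cannot unwind" -> abort().
    let _ctrl = lookup_controller("nqn.2019-05.io.openebs:example").unwrap();
    0
}

fn main() {
    // Calling the callback directly just to demonstrate the abort path.
    poller_cb(std::ptr::null_mut());
}
```

On a recent Rust toolchain this prints the original unwrap panic, then the "panic in a function that cannot unwind" message, and aborts, i.e. the same sequence as in the log tail above. The practical consequence is that whatever leaves that value unexpectedly empty during controller detach/retire takes down the whole io-engine process rather than failing a single operation.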
This is happening on two of the three nodes. The affected hosts are also panicking from time to time.
I'd love some assistance in troubleshooting this.
**OS info (please complete the following information):**
- Distro: Ubuntu 24.04
- Kernel version: 6.8.0-60-generic
- MayaStor revision or container image: docker.io/openebs/mayastor-io-engine:v2.9.0