Skip to content

tst-pin.cc hangs forever in sleep() or crashes after enough attempts #1357

Open
@wkozaczuk

Description

@wkozaczuk

After enough runs (7-9K, but in one instance, it was 42 times) on my fastest ARM machine - Mac mini M2, the tst-pin.cc would hang like this:


(gdb) osv info threads --timers
  id  address             name            cpu  status         vruntime  total_cpu_time location
   1 (0x400040771040) reclaimer       cpu0 waiting     4.78214e-25             833 condvar::wait(lockfree::mutex*, sched::timer*) at core/condvar.cc:53
   2 (0x400040857040) page_pool_l2    cpu0 waiting     2.04507e-22          309708 memory::page_pool::l2::fill_thread() at core/mempool.cc:1525
   3 (0x400040882040) itimer-real     cpu? unstarted             0               0 ?? at arch/aarch64/entry.S:135
   4 (0x4000408a4040) itimer-virt     cpu? unstarted             0               0 ?? at arch/aarch64/entry.S:135
   5 (0x4000408e7040) balancer0       cpu0 waiting      1.6479e-23           18625 ?? at arch/aarch64/entry.S:135
    timers: 0.107022875
   6 (0x400040909040) rcu0            cpu0 waiting     8.27181e-25            1417 osv::rcu::cpu_quiescent_state_thread::do_work() at core/rcu.cc:184
   7 (0x40004092e040) page_pool_l1_0  cpu0 waiting     1.15598e-22          110247 memory::page_pool::l1::fill_thread() at core/mempool.cc:1410
   8 (0x400040950040) percpu0         cpu0 waiting      2.1972e-25             375 workman::call_of_duty() at core/percpu-worker.cc:92
   9 (0x400040973040) async_worker0   cpu0 waiting     2.45569e-25             416 async::async_worker::run() at core/async.cc:177
  10 (0x400040996040) idle0           cpu0 running             inf  17085277590378 std::_Function_handler<void(), sched::cpu::init_idle_thread()::<lambda()> >::_M_invoke(const std::_Any_data &) at /usr/include/c++/14/bits/std_function.h:290
  11 (0x4000409b8040) idle1           cpu1 running             inf  17086932350426 std::_Function_handler<void(), sched::cpu::init_idle_thread()::<lambda()> >::_M_invoke(const std::_Any_data &) at /usr/include/c++/14/bits/std_function.h:290
  12 (0x400040a01040) idle2           cpu2 running             inf  17086912847839 std::_Function_handler<void(), sched::cpu::init_idle_thread()::<lambda()> >::_M_invoke(const std::_Any_data &) at /usr/include/c++/14/bits/std_function.h:290
  13 (0x400040a4a040) idle3           cpu3 running             inf  17086897666638 std::_Function_handler<void(), sched::cpu::init_idle_thread()::<lambda()> >::_M_invoke(const std::_Any_data &) at /usr/include/c++/14/bits/std_function.h:290
  14 (0x400040a93040) >init           cpu0 waiting     6.15216e-24             708 std::_Function_handler<void(), sched::thread::reaper::reaper()::<lambda()> >::_M_invoke(const std::_Any_data &) at /usr/include/c++/14/bits/std_function.h:290
  15 (0x400040aba040) rcu3            cpu3 waiting     6.44942e-24            1625 osv::rcu::cpu_quiescent_state_thread::do_work() at core/rcu.cc:184
  16 (0x400040add040) thread taskq    cpu0 waiting     6.29433e-24             958 synch_port::_msleep(void*, mtx*, int, char const*, int) at bsd/porting/synch.cc:102
  17 (0x400040af5040) page_pool_l1_3  cpu3 waiting      6.3848e-24            1500 memory::page_pool::l1::fill_thread() at core/mempool.cc:1407
  18 (0x400040b06040) callout         cpu0 waiting     3.74299e-23            4791 _callout_thread() at bsd/porting/callout.cc:138
    timers: 0.207748666
  19 (0x400040b3a040) percpu3         cpu3 waiting     6.24263e-24            1250 workman::call_of_duty() at core/percpu-worker.cc:92
  20 (0x400040b5d040) async_worker3   cpu3 waiting     6.17801e-24            1125 async::async_worker::run() at core/async.cc:177
  21 (0x400040b84040) rcu1            cpu1 waiting     6.34603e-24            1417 osv::rcu::cpu_quiescent_state_thread::do_work() at core/rcu.cc:184
  22 (0x400040bae040) page_pool_l1_1  cpu1 waiting     2.33508e-21         3944959 memory::page_pool::l1::fill_thread() at core/mempool.cc:1410
  23 (0x400040bd4040) percpu1         cpu1 waiting     6.11338e-24            1000 workman::call_of_duty() at core/percpu-worker.cc:92
  24 (0x400040bfc040) async_worker1   cpu1 waiting     6.16508e-24            1083 async::async_worker::run() at core/async.cc:177
  25 (0x400040c2e040) rcu2            cpu2 waiting     5.82904e-24             500 osv::rcu::cpu_quiescent_state_thread::do_work() at core/rcu.cc:184
  26 (0x400040c56040) page_pool_l1_2  cpu2 waiting     8.11542e-23          105832 memory::page_pool::l1::fill_thread() at core/mempool.cc:1410
  27 (0x400040c78040) percpu2         cpu2 waiting     6.17801e-24            1083 workman::call_of_duty() at core/percpu-worker.cc:92
  28 (0x400040d3f040) async_worker2   cpu2 waiting     6.21678e-24            1167 async::async_worker::run() at core/async.cc:177
  29 (0x4000412a4040) netisr          cpu0 waiting     1.28988e-23            1999 netisr_osv_thread_wrapper(void (*)(void*), void*) at bsd/sys/net/netisr1.cc:26
  30 (0x400041355040) >init           cpu0 waiting     3.01762e-21         5137084 sched::thread::join() at core/sched.cc:1565
	joining on <optimized out>
  31 (0x400041371040) virtio-net-rx   cpu0 waiting     3.42685e-27          338039 virtio::virtio_driver::wait_for_queue(virtio::vring*, bool (virtio::vring::*)() const) at drivers/virtio.cc:159
  32 (0x4000413a4040) virtio-tx-0     cpu0 waiting      2.9737e-30             500 osv::xmitter<virtio::net::txq, 4096u, std::function<bool ()>, boost::iterators::function_output_iterator<osv::xmitter_functor<virtio::net::txq> > >::poll_until() at include/osv/percpu_xmit.hh:401
  33 (0x4000413d7040) virtio-tx-1     cpu1 waiting     1.05974e-29            1875 osv::xmitter<virtio::net::txq, 4096u, std::function<bool ()>, boost::iterators::function_output_iterator<o--Type <RET> for more, q to quit, c to continue without paging--c
sv::xmitter_functor<virtio::net::txq> > >::poll_until() at include/osv/percpu_xmit.hh:401
  34 (0x40004140a040) virtio-tx-2     cpu2 waiting     1.15012e-29            2000 osv::xmitter<virtio::net::txq, 4096u, std::function<bool ()>, boost::iterators::function_output_iterator<osv::xmitter_functor<virtio::net::txq> > >::poll_until() at include/osv/percpu_xmit.hh:401
  35 (0x40004143d040) virtio-tx-3     cpu3 waiting     6.33219e-30            1125 osv::xmitter<virtio::net::txq, 4096u, std::function<bool ()>, boost::iterators::function_output_iterator<osv::xmitter_functor<virtio::net::txq> > >::poll_until() at include/osv/percpu_xmit.hh:401
  36 (0x40004146f040) virtio-blk      cpu0 waiting     4.61412e-24            7958 virtio::virtio_driver::wait_for_queue(virtio::vring*, bool (virtio::vring::*)() const) at drivers/virtio.cc:159
  37 (0x400041711040) virtio-rng      cpu0 waiting     1.47342e-24            2499 condvar::wait(lockfree::mutex*, sched::timer*) at core/condvar.cc:53
  38 (0x400041736040) pl011-input     cpu0 waiting     1.82755e-23           30834 console::LineDiscipline::read_poll(console::console_driver*) at drivers/line-discipline.cc:86
  39 (0x400041781040) rand_harvestq   cpu0 waiting     8.65955e-25            1458 synch_port::_msleep(void*, mtx*, int, char const*, int) at bsd/porting/synch.cc:102
    timers: 0.112718500
  40 (0x4000417b1040) dhcp            cpu0 waiting      6.4197e-23          108126 dhcp::dhcp_worker::dhcp_worker_fn() at core/dhcp.cc:830
  41 (0x400041809040) /tests/tst-pin. cpu0 waiting     5.59126e-23           82792 main(int, char**) at /usr/include/c++/14/bits/chrono.h:573
    timers: 0.022137000
  44 (0x400041823040) >/tests/tst-pin cpu1 terminated  1.76604e-24            3000 ?? at arch/aarch64/entry.S:135
  46 (0x400041845040) >/tests/tst-pin cpu0 waiting     8.02762e-20       100013125 condvar::wait(lockfree::mutex*, sched::timer*) at core/condvar.cc:53
Number of threads: 43

(gdb) p 'sched::cpus'[0]->timers._list._non_empty_buckets
$1 = std::bitset = {[35] = 1, [36] = 1, [38] = 1}
(gdb) p 'sched::cpus'[1]->timers._list._non_empty_buckets
$2 = std::bitset = {[36] = 1}
(gdb) p 'sched::cpus'[1]->timers._last
$3 = osv::clock::uptime time_point = { 17087496162125ns }
(gdb) p 'sched::cpus'[0]->timers._last
$4 = osv::clock::uptime time_point = { 18658549ns }
(gdb) p 'sched::cpus'[2]->timers._last
$5 = osv::clock::uptime time_point = { 17087495597291ns }
(gdb) p 'sched::cpus'[3]->timers._last
$6 = osv::clock::uptime time_point = { 17087495608666ns }
(gdb) p 'sched::cpus'[0]->timers._list._last
$7 = 13595583
(gdb) p 'sched::cpus'[0]->timers._list._next
$8 = 18658549

(gdb) osv thread 41
(gdb) bt
#0  reschedule_from_interrupt () at arch/aarch64/sched.S:72
#1  0x0000000fc0244d9c in sched::cpu::schedule () at core/sched.cc:238
#2  0x0000000fc0246bd8 in sched::thread::wait (this=0x0) at core/sched.cc:1425
#3  sched::thread::do_wait_until<sched::noninterruptible, sched::thread::dummy_lock, sched::thread::sleep_impl(sched::timer&)::<lambda()> > (mtx=<synthetic pointer>..., pred=...)
    at include/osv/sched.hh:1250
#4  sched::thread::wait_until<sched::thread::sleep_impl(sched::timer&)::<lambda()> > (pred=...) at include/osv/sched.hh:1261
#5  sched::thread::sleep_impl (t=...) at core/sched.cc:1625
#6  0x0000100000002a74 in sched::thread::sleep<long, std::ratio<1l, 1000l> > (duration=duration@entry=std::chrono::duration = { 1ms })
    at /home/wkozaczuk/projects/osv-master/include/osv/sched.hh:1450
#7  0x0000100000001884 in main (argc=<optimized out>, argv=<optimized out>) at /usr/include/c++/14/bits/chrono.h:573
#8  0x0000000fc0263988 in osv::application::run_main (this=0x600040edea10) at core/app.cc:467
#9  0x0000000fc0263b58 in operator() (__closure=0x0, app=<optimized out>) at core/app.cc:248
#10 _FUN () at core/app.cc:250
#11 0x0000000fc029a484 in operator() (__closure=<optimized out>) at libc/pthread.cc:119
#12 std::__invoke_impl<void, pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()>&> (__f=...)
    at /usr/include/c++/14/bits/invoke.h:61
#13 std::__invoke_r<void, pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()>&> (__fn=...)
    at /usr/include/c++/14/bits/invoke.h:150
#14 std::_Function_handler<void(), pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/14/bits/std_function.h:290
#15 0x0000000fc02494c4 in sched::thread::main (this=0x400041809040) at core/sched.cc:1419
#16 sched::thread_main_c (t=0x400041809040) at arch/aarch64/arch-switch.hh:189
#17 0x0000000fc0173788 in thread_main () at arch/aarch64/entry.S:135

(gdb) osv thread 44
(gdb) bt
#0  reschedule_from_interrupt () at arch/aarch64/sched.S:72
#1  0x0000000fc0247e5c in sched::cpu::schedule () at include/osv/sched.hh:1516
#2  sched::thread::complete (this=0x400040784738, this@entry=0x400041823040) at core/sched.cc:1518
#3  0x0000000fc02494cc in sched::thread_main_c (t=0x400041823040) at arch/aarch64/arch-switch.hh:190
#4  0x0000000fc0173788 in thread_main () at arch/aarch64/entry.S:135

(gdb) osv thread 46
(gdb) bt
#0  reschedule_from_interrupt () at arch/aarch64/sched.S:72
#1  0x0000000fc0244d9c in sched::cpu::schedule () at core/sched.cc:238
#2  0x0000000fc0246584 in sched::thread::wait (this=this@entry=0x400041845040) at core/sched.cc:1425
#3  0x0000000fc0226828 in sched::thread::do_wait_until<sched::noninterruptible, sched::thread::dummy_lock, waiter::wait(sched::timer*) const::{lambda()#1}>(sched::thread::dummy_lock&, waiter::wait(sched::timer*) const::{lambda()#1}) (mtx=<synthetic pointer>..., pred=...) at include/osv/sched.hh:1250
#4  sched::thread::wait_until<waiter::wait(sched::timer*) const::{lambda()#1}>(waiter::wait(sched::timer*) const::{lambda()#1}) (pred=...) at include/osv/sched.hh:1261
#5  waiter::wait (this=0x400040783040, tmr=0x400040784738) at include/osv/wait_record.hh:46
#6  condvar::wait (this=0x4000408e4cf0, user_mutex=0x600040770890, tmr=0x400040784738, tmr@entry=0x0) at core/condvar.cc:53
#7  0x00001000000028c0 in condvar::wait (this=<optimized out>, user_mutex=..., tmr=0x0) at /home/wkozaczuk/projects/osv-master/include/osv/condvar.h:226
#8  operator() (__closure=0x600040ed25a0) at /home/wkozaczuk/projects/osv-master/tests/tst-pin.cc:99
#9  std::__invoke_impl<void, main(int, char**)::<lambda()>&> (__f=...) at /usr/include/c++/14/bits/invoke.h:61
#10 std::__invoke_r<void, main(int, char**)::<lambda()>&> (__fn=...) at /usr/include/c++/14/bits/invoke.h:150
#11 std::_Function_handler<void(), main(int, char**)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/14/bits/std_function.h:290
#12 0x0000000fc02494c4 in sched::thread::main (this=0x400041845040) at core/sched.cc:1419
#13 sched::thread_main_c (t=0x400041845040) at arch/aarch64/arch-switch.hh:189
#14 0x0000000fc0173788 in thread_main () at arch/aarch64/entry.S:135

From what I could tell, the main thread 41 would get stuck in sleep(1ms) - line 105, where thread 46 would wait on condvar. So the real issue here is thread 41 getting stuck instead of sleeping only for 1 ms.

On a different ARM machine (in the middle as far as the performance goes), the test would crash instead, after 54k runs, like this:

Booted up in 48.16 ms
Cmdline:  /tests/tst-pin.so
Assertion failed: sched::preemptable() (arch/aarch64/mmu.cc: page_fault: 104)

[backtrace]
0x0000000fc00e0530 <__assert_fail+32>
0x0000000fc0170a1c <page_fault+396>
0x0000000fc01702ac <???+-1072233812>
0x0000000fc02492a0 <sched::timer_list::fired()+64>
0x0000000fc0171700 <interrupt_table::invoke_interrupt(unsigned int)+208>
0x0000000fc01718c4 <interrupt+192>
0x0000000fc0170700 <???+-1072232704>
0x0000000fc0248134 <sched::cpu::idle()+52>
0x0000000fc024815c <???+-1071349412>
0x0000000fc0249250 <thread_main_c+32>
0x0000000fc016f784 <???+-1072236668>

(gdb) bt
#0  processor::halt_no_interrupts () at arch/aarch64/processor.hh:56
#1  arch::halt_no_interrupts () at arch/aarch64/arch.hh:64
#2  osv::halt () at arch/aarch64/power.cc:24
#3  0x0000000fc00e0508 in abort (fmt=fmt@entry=0xfc04ad548 "Assertion failed: %s (%s: %s: %d)\n") at runtime.cc:146
#4  0x0000000fc00e0534 in __assert_fail (expr=<optimized out>, file=<optimized out>, line=<optimized out>, func=<optimized out>) at runtime.cc:154
#5  0x0000000fc0170a20 in page_fault (ef=0x4000409b6c30) at arch/aarch64/mmu.cc:104
#6  <signal handler called>
#7  0x0000000fc024c160 in boost::intrusive::list_node_traits<void*>::set_next (next=0x0, n=0x0) at /usr/include/boost/intrusive/detail/list_node.hpp:66
#8  boost::intrusive::circular_list_algorithms<boost::intrusive::list_node_traits<void*> >::unlink (this_node=<optimized out>)
    at /usr/include/boost/intrusive/circular_list_algorithms.hpp:154
#9  boost::intrusive::list_impl<boost::intrusive::mhtraits<sched::timer_base, boost::intrusive::list_member_hook<>, &sched::timer_base::hook>, unsigned long, true, void>::pop_front_and_dispose<boost::intrusive::detail::null_disposer>(boost::intrusive::detail::null_disposer) (disposer=..., this=<optimized out>) at /usr/include/boost/intrusive/list.hpp:355
#10 boost::intrusive::list_impl<boost::intrusive::mhtraits<sched::timer_base, boost::intrusive::list_member_hook<>, &sched::timer_base::hook>, unsigned long, true, void>::pop_front() (
    this=<optimized out>) at /usr/include/boost/intrusive/list.hpp:338
#11 timer_set<sched::timer_base, &sched::timer_base::hook, osv::clock::uptime>::expire (this=0x4000407950c8, now=osv::clock::uptime time_point = { 52220875ns })
    at include/osv/timer-set.hh:180
#12 0x0000000fc02492a4 in sched::timer_list::fired (this=0x4000407950c0) at core/sched.cc:1668
#13 0x0000000fc0171704 in std::function<void ()>::operator()() const (this=<optimized out>) at /usr/include/c++/13/bits/std_function.h:591
#14 interrupt_table::invoke_interrupt (this=this@entry=0xfc07020c8 <idt>, id=id@entry=27) at arch/aarch64/exceptions.cc:146
#15 0x0000000fc01718c8 in interrupt (frame=<optimized out>) at arch/aarch64/exceptions.cc:172
#16 <signal handler called>
#17 sched::cpu::do_idle (this=0x400040794040) at /usr/include/boost/intrusive/detail/size_holder.hpp:36
#18 0x00004000409a7040 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

(gdb) osv info threads
  id  address             name            cpu  status         vruntime  total_cpu_time location
   1 (0x400040782040) reclaimer       cpu0 waiting     4.03251e-24            6416 condvar::wait(lockfree::mutex*, sched::timer*) at core/condvar.cc:53
   2 (0x400040868040) page_pool_l2    cpu0 waiting     2.62303e-21         3791376 memory::page_pool::l2::fill_thread() at core/mempool.cc:1525
   3 (0x400040893040) itimer-real     cpu? unstarted             0               0 ?? at arch/aarch64/entry.S:135
   4 (0x4000408b5040) itimer-virt     cpu? unstarted             0               0 ?? at arch/aarch64/entry.S:135
   5 (0x4000408f8040) balancer0       cpu0 waiting     6.82812e-23           98875 ?? at arch/aarch64/entry.S:135
   6 (0x40004091a040) rcu0            cpu0 waiting     9.70645e-24           15459 osv::rcu::cpu_quiescent_state_thread::do_work() at core/rcu.cc:184
   7 (0x40004093f040) page_pool_l1_0  cpu0 waiting     6.92777e-22         1013250 memory::page_pool::l1::fill_thread() at core/mempool.cc:1410
   8 (0x400040961040) percpu0         cpu0 waiting     4.58827e-24            7292 workman::call_of_duty() at core/percpu-worker.cc:92
   9 (0x400040984040) async_worker0   cpu0 waiting     4.95016e-24            7875 async::async_worker::run() at core/async.cc:177
  10 (0x4000409a7040) idle0           cpu0 running             inf         4190087 std::_Function_handler<void(), sched::cpu::init_idle_thread()::<lambda()> >::_M_invoke(const std::_Any_data &) at /usr/include/c++/13/bits/std_function.h:290
  11 (0x4000409c9040) idle1           cpu1 running             inf        26929586 std::_Function_handler<void(), sched::cpu::init_idle_thread()::<lambda()> >::_M_invoke(const std::_Any_data &) at /usr/include/c++/13/bits/std_function.h:290
  12 (0x400040a12040) idle2           cpu2 running             inf        16272078 std::_Function_handler<void(), sched::cpu::init_idle_thread()::<lambda()> >::_M_invoke(const std::_Any_data &) at /usr/include/c++/13/bits/std_function.h:290
  13 (0x400040a5b040) idle3           cpu3 running             inf        18641292 std::_Function_handler<void(), sched::cpu::init_idle_thread()::<lambda()> >::_M_invoke(const std::_Any_data &) at /usr/include/c++/13/bits/std_function.h:290
  14 (0x400040aa4040) rcu3            cpu3 waiting     1.20846e-23           11666 osv::rcu::cpu_quiescent_state_thread::do_work() at core/rcu.cc:184
  15 (0x400040ac9040) >init           cpu0 waiting      1.2188e-23            9334 std::_Function_handler<void(), sched::thread::reaper::reaper()::<lambda()> >::_M_invoke(const std::_Any_data &) at /usr/include/c++/13/bits/std_function.h:290
  16 (0x400040ada040) page_pool_l1_3  cpu3 waiting     1.22397e-23           11958 memory::page_pool::l1::fill_thread() at core/mempool.cc:1407
  17 (0x400040b0e040) percpu3         cpu3 waiting     1.04819e-23            8750 workman::call_of_duty() at core/percpu-worker.cc:92
  18 (0x400040b35040) async_worker3   cpu3 waiting      1.0469e-23            8750 async::async_worker::run() at core/async.cc:177
  19 (0x400040b58040) thread taskq    cpu0 waiting     1.64273e-23           16042 synch_port::_msleep(void*, mtx*, int, char const*, int) at bsd/porting/synch.cc:102
  20 (0x400040b6e040) callout         cpu0 waiting     5.55891e-23           25375 _callout_thread() at bsd/porting/callout.cc:138
  21 (0x400040b92040) rcu1            cpu1 waiting     1.50314e-23           16917 osv::rcu::cpu_quiescent_state_thread::do_work() at core/rcu.cc:184
  22 (0x400040bbd040) page_pool_l1_1  cpu1 waiting     1.25628e-23           11959 memory::page_pool::l1::fill_thread() at core/mempool.cc:1410
  23 (0x400040be3040) percpu1         cpu1 waiting     9.97787e-24            7875 workman::call_of_duty() at core/percpu-worker.cc:92
  24 (0x400040c09040) async_worker1   cpu1 waiting     1.01588e-23            8167 async::async_worker::run() at core/async.cc:177
  25 (0x400040c40040) rcu2            cpu2 waiting     1.27696e-23           12833 osv::rcu::cpu_quiescent_state_thread::do_work() at core/rcu.cc:184
  26 (0x400040c68040) page_pool_l1_2  cpu2 waiting     5.32834e-22          921376 memory::page_pool::l1::fill_thread() at core/mempool.cc:1410
  27 (0x400040c8e040) percpu2         cpu2 waiting     1.08438e-23            8750 workman::call_of_duty() at core/percpu-worker.cc:92
  28 (0x4000412b3040) async_worker2   cpu2 waiting     1.06758e-23            8458 async::async_worker::run() at core/async.cc:177
  29 (0x400041315040) netisr          cpu0 waiting     2.43243e-23           15750 netisr_osv_thread_wrapper(void (*)(void*), void*) at bsd/sys/net/netisr1.cc:26
  30 (0x4000413c6040) >init           cpu0 waiting     1.00402e-20        14702916 sched::thread::join() at core/sched.cc:1565
	joining on <optimized out>
  31 (0x4000413e2040) virtio-net-rx   cpu0 waiting     2.46995e-28           25374 virtio::virtio_driver::wait_for_queue(virtio::vring*, bool (virtio::vring::*)() const) at drivers/virtio.cc:159
  32 (0x400041415040) virtio-tx-0     cpu0 waiting     1.33768e-28            9333 osv::xmitter<virtio::net::txq, 4096u, std::function<bool ()>, boost::iterators::function_output_iterator<osv::xmitter_functor<virtio::net::txq> > >::poll_until() at include/osv/percpu_xmit.hh:401
  33 (0x400041448040) virtio-tx-1     cpu1 waiting     1.50057e-28           24500 osv::xmitter<virtio::net::txq, 4096u, std::function<bool ()>, boost::iterators::function_output_iterator<osv::xmitter_functor<virtio::net::txq> > >::poll_until() at include/osv/percpu_xmit.hh:401
  34 (0x40004147b040) virtio-tx-2     cpu2 waiting     1.10765e-28           18084 osv::xmitter<virtio::net::txq, 4096u, std::function<bool ()>, boost::iterators::function_output_iterator<osv::xmitter_functor<virtio::net::txq> > >::poll_until() at include/osv/percpu_xmit.hh:401
  35 (0x4000414ae040) virtio-tx-3     cpu3 waiting     1.14384e-28           18666 osv::xmitter<virtio::net::txq, 4096u, std::function<bool ()>, boost::iterators::function_output_iterator<o--Type <RET> for more, q to quit, c to continue without paging--c
sv::xmitter_functor<virtio::net::txq> > >::poll_until() at include/osv/percpu_xmit.hh:401
  36 (0x4000414e0040) virtio-blk      cpu0 waiting     2.71548e-23           28582 virtio::virtio_driver::wait_for_queue(virtio::vring*, bool (virtio::vring::*)() const) at drivers/virtio.cc:159
  37 (0x400041502040) virtio-rng      cpu0 waiting     3.81666e-23           24208 condvar::wait(lockfree::mutex*, sched::timer*) at core/condvar.cc:53
  38 (0x400041527040) pl011-input     cpu0 waiting     1.70994e-23           14292 console::LineDiscipline::read_poll(console::console_driver*) at drivers/line-discipline.cc:86
  39 (0x400041572040) rand_harvestq   cpu0 waiting     1.50573e-23           11375 synch_port::_msleep(void*, mtx*, int, char const*, int) at bsd/porting/synch.cc:102
  40 (0x4000415a2040) dhcp            cpu0 waiting     1.72739e-22          234208 dhcp::dhcp_worker::dhcp_worker_fn() at core/dhcp.cc:830
  41 (0x400041875040) /tests/tst-pin. cpu0 waking      7.98084e-23          104709 main(int, char**) at /usr/include/c++/13/bits/stl_vector.h:1126
  43 (0x40004188f040) >/tests/tst-pin cpu1 terminated -4.11962e-22            2042 ?? at arch/aarch64/entry.S:135
Number of threads: 42

It seems like a corrupt boost intrusive list under one of the buckets in the osv::timer_set::expire() - line 180.

In another instance on the same machine, it would crash after 15k runs, a little bit differently:

Cmdline:  /tests/tst-pin.so
Assertion failed: !node_algorithms::inited(this->priv_value_traits().to_node_ptr(value)) (/usr/include/boost/intrusive/list.hpp: iterator_to: 1310)

[backtrace]
0x0000000fc00e0530 <__assert_fail+32>
0x0000000fc02442f4 <???+-1071365388>
0x0000000fc0247394 <sched::timer_base::cancel()+336>
0x0000000fc024c918 <sched::cpu::schedule_next_thread(bool, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)+728>
0x0000000fc0220474 <???+-1071512460>
0x0000000fc0247bb4 <sched::thread::complete()+340>
0x0000000fc0249258 <thread_main_c+40>
0x0000000fc016f784 <???+-1072236668>

I am unsure if these crashes are related or different, but given that they happen when running tst-pin.cc I put them under the same issue.

On my 3rd arm machine (the slowest one), the test would never hang or crash after 49K and ~90K runs.

I have not tried to run this test enough times on x86_64 machine, so who knows if it is arm specific.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions