Describe the bug
We see two types of crashes (hardfaults) during reconnection.
1. Hardfault in the publishing thread.
>>> bt
#0 exception_common () at NuttX/nuttx/arch/arm/src/armv7-m/gnu/arm_exception.S:144
#1 <signal handler called>
#2 0x08121bdc in _z_vec_get (v=0x30002798, i=0) at zenoh-pico/src/collections/vec.c:114
#3 0x081370b4 in _z_iosli_vec_get (v=0x30002798, pos=0) at zenoh-pico/include/zenoh-pico/protocol/iobuf.h:68
#4 0x0813781e in _z_wbuf_get_iosli (wbf=0x30002798, idx=0) at zenoh-pico/src/protocol/iobuf.c:287
#5 0x08137978 in _z_wbuf_write (wbf=0x30002798, b=37 '%') at zenoh-pico/src/protocol/iobuf.c:384
#6 0x08135fbc in _z_transport_message_encode (wbf=0x30002798, msg=0x30010b7c) at zenoh-pico/src/protocol/codec/transport.c:587
#7 0x0812ac90 in _z_transport_tx_send_n_msg_inner (ztc=0x30002728, n_msg=0x30010d38, reliability=Z_RELIABILITY_RELIABLE, peers=0x0) at zenoh-pico/src/transport/common/tx.c:227
#8 0x0812ae6c in _z_transport_tx_send_n_msg (ztc=0x30002728, n_msg=0x30010d38, reliability=Z_RELIABILITY_RELIABLE, cong_ctrl=Z_CONGESTION_CONTROL_BLOCK, peers=0x0) at zenoh-pico/src/transport/common/tx.c:294
#9 0x0812b168 in _z_send_n_msg (zn=0x30002710, z_msg=0x30010d38, reliability=Z_RELIABILITY_RELIABLE, cong_ctrl=Z_CONGESTION_CONTROL_BLOCK, peer=0x0) at zenoh-pico/src/transport/common/tx.c:473
#10 0x08122c58 in _z_write (zn=0x30002710, keyexpr=..., payload=..., encoding=0x30010fd4, kind=Z_SAMPLE_KIND_PUT, cong_ctrl=Z_CONGESTION_CONTROL_BLOCK, priority=Z_PRIORITY_INTERACTIVE_HIGH, is_express=false, timestamp=0x0, attachment=..., reliability=Z_RELIABILITY_RELIABLE, source_info=0x0) at zenoh-pico/src/net/primitives.c:246
#11 0x0811fbb2 in z_publisher_put (pub=0x2407c480, payload=0x30011028, options=0x0) at zenoh-pico/src/api/api.c:1135
The hardfault happens because v->_val == 0x0 at the point where v->_val[i] is dereferenced.
2. Hardfault in the read task.
>>> bt
exception_common@0x080202a8 (nuttx/arch/arm/src/armv7-m/gnu/arm_exception.S:144)
<signal handler called>@0xffffffe9 (Unknown Source:0)
file_socket@0x08036f14 (nuttx/fs/socket/socket.c:192)
sockfd_socket@0x08036f38 (nuttx/fs/socket/socket.c:209)
recvfrom@0x0803dc10 (nuttx/net/socket/recvfrom.c:207)
_z_read_udp_unicast@0x081315ac (zenoh-pico/src/system/unix/network.c:386)
_z_f_link_udp_read_socket@0x08133cf6 (zenoh-pico/src/link/unicast/udp.c:175)
_z_link_socket_recv_zbuf@0x08132e62 (zenoh-pico/src/link/link.c:169)
_z_unicast_client_read@0x0812f90e (zenoh-pico/src/transport/unicast/read.c:128)
_zp_unicast_read_task@0x0812f98a (zenoh-pico/src/transport/unicast/read.c:353)
pthread_startup@0x08030d8e (nuttx/libs/libc/pthread/pthread_create.c:59)
The hardfault happens because the inode has already been cleared (f_inode == 0x0) at the point where f_inode->i_flags is dereferenced.
Workaround
For now, we avoid hardfault 1) by synchronizing our publisher threads with _z_common_transport_clear,
and hardfault 2) by using pthread_join instead of pthread_detach for the read task during reconnection clean-up.
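
The two workarounds can be sketched roughly as follows. This is a minimal, self-contained illustration and not our actual patch or zenoh-pico code: `demo_transport_t` and all `demo_*` functions are hypothetical stand-ins, the "socket" is a placeholder integer, and the reader loop only polls a flag where the real task would call recvfrom with a timeout.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a transport; not zenoh-pico's real layout. */
typedef struct {
    pthread_mutex_t tx_lock; /* guards buf, len and cap */
    uint8_t *buf;            /* NULL once the transport has been cleared */
    size_t len;
    size_t cap;
    pthread_t reader;
    atomic_bool reader_run;  /* set to false to ask the reader to exit */
    int sock;                /* placeholder descriptor, -1 once "closed" */
} demo_transport_t;

/* Reader thread: the real task would block in recvfrom() with a timeout. */
static void *demo_read_task(void *arg) {
    demo_transport_t *t = arg;
    while (atomic_load(&t->reader_run)) {
        /* t->sock stays valid here: clean-up closes it only after joining. */
        sched_yield();
    }
    return NULL;
}

/* Allocates the buffer and starts the reader. Returns 0 on success. */
int demo_transport_open(demo_transport_t *t, size_t cap) {
    t->buf = malloc(cap);
    if (t->buf == NULL) {
        return -1;
    }
    t->len = 0;
    t->cap = cap;
    t->sock = 3; /* arbitrary placeholder value */
    pthread_mutex_init(&t->tx_lock, NULL);
    atomic_init(&t->reader_run, true);
    if (pthread_create(&t->reader, NULL, demo_read_task, t) != 0) {
        pthread_mutex_destroy(&t->tx_lock);
        free(t->buf);
        t->buf = NULL;
        return -1;
    }
    return 0;
}

/* Publisher path: checks and writes the buffer under the same lock that
 * clean-up takes, so it can never index a buffer that was just freed. */
int demo_publish(demo_transport_t *t, uint8_t b) {
    int ret = -1;
    pthread_mutex_lock(&t->tx_lock);
    if (t->buf != NULL && t->len < t->cap) {
        t->buf[t->len++] = b;
        ret = 0;
    }
    pthread_mutex_unlock(&t->tx_lock);
    return ret;
}

/* Reconnection clean-up. Call once per successful demo_transport_open(). */
void demo_transport_clear(demo_transport_t *t) {
    /* Workaround 2: stop and join the reader before releasing its socket. */
    atomic_store(&t->reader_run, false);
    pthread_join(t->reader, NULL);
    t->sock = -1;

    /* Workaround 1: free the write buffer only while holding the tx lock. */
    pthread_mutex_lock(&t->tx_lock);
    free(t->buf);
    t->buf = NULL;
    t->len = 0;
    t->cap = 0;
    pthread_mutex_unlock(&t->tx_lock);
}
```

In this sketch the mutex is deliberately left initialized after clean-up, so publisher threads that are still running get an error return from `demo_publish` instead of touching freed memory.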
To reproduce
1. Publish and receive data on a zenoh pico client with Z_FEATURE_AUTO_RECONNECT enabled.
2. Disconnect from zenohd (e.g. by restarting the router process or unplugging the ethernet cable).
3. Repeat step 2 until you trigger a hardfault.
Note:
This is not always reproducible. It happens in roughly 1 out of 15 attempts.
We run several publishing threads.
The problem is easier to reproduce if the publishing threads have higher priority than the lease task.
System info
- STM32H7
- Zenoh Pico (1.4.0) on NuttX (Unix)
- Configuration: Client mode with Z_FEATURE_AUTO_RECONNECT enabled in UDP unicast.