
[Bug] Crash of client during clean-up in the reconnection mechanism #1033

@sebglatz

Describe the bug

We see two types of crashes (hardfaults) during reconnection.

  1. Hardfault in the publishing thread.
>>> bt
#0  exception_common () at NuttX/nuttx/arch/arm/src/armv7-m/gnu/arm_exception.S:144
#1  <signal handler called>
#2  0x08121bdc in _z_vec_get (v=0x30002798, i=0) at zenoh-pico/src/collections/vec.c:114
#3  0x081370b4 in _z_iosli_vec_get (v=0x30002798, pos=0) at zenoh-pico/include/zenoh-pico/protocol/iobuf.h:68
#4  0x0813781e in _z_wbuf_get_iosli (wbf=0x30002798, idx=0) at zenoh-pico/src/protocol/iobuf.c:287
#5  0x08137978 in _z_wbuf_write (wbf=0x30002798, b=37 '%') at zenoh-pico/src/protocol/iobuf.c:384
#6  0x08135fbc in _z_transport_message_encode (wbf=0x30002798, msg=0x30010b7c) at zenoh-pico/src/protocol/codec/transport.c:587
#7  0x0812ac90 in _z_transport_tx_send_n_msg_inner (ztc=0x30002728, n_msg=0x30010d38, reliability=Z_RELIABILITY_RELIABLE, peers=0x0) at zenoh-pico/src/transport/common/tx.c:227
#8  0x0812ae6c in _z_transport_tx_send_n_msg (ztc=0x30002728, n_msg=0x30010d38, reliability=Z_RELIABILITY_RELIABLE, cong_ctrl=Z_CONGESTION_CONTROL_BLOCK, peers=0x0) at zenoh-pico/src/transport/common/tx.c:294
#9  0x0812b168 in _z_send_n_msg (zn=0x30002710, z_msg=0x30010d38, reliability=Z_RELIABILITY_RELIABLE, cong_ctrl=Z_CONGESTION_CONTROL_BLOCK, peer=0x0) at zenoh-pico/src/transport/common/tx.c:473
#10 0x08122c58 in _z_write (zn=0x30002710, keyexpr=..., payload=..., encoding=0x30010fd4, kind=Z_SAMPLE_KIND_PUT, cong_ctrl=Z_CONGESTION_CONTROL_BLOCK, priority=Z_PRIORITY_INTERACTIVE_HIGH, is_express=false, timestamp=0x0, attachment=..., reliability=Z_RELIABILITY_RELIABLE, source_info=0x0) at zenoh-pico/src/net/primitives.c:246
#11 0x0811fbb2 in z_publisher_put (pub=0x2407c480, payload=0x30011028, options=0x0) at zenoh-pico/src/api/api.c:1135

The hardfault occurs in _z_vec_get because v->_val is NULL (0x0) when v->_val[i] is evaluated.
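The state that frame #2 runs into can be pictured with a minimal, self-contained sketch. The `demo_vec_t` type and all `demo_vec_*` functions are hypothetical stand-ins, not zenoh-pico code. The sketch assumes that the clean-up path frees the vector storage and leaves `_val` set to NULL while a publisher still holds a pointer to the write buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a pointer vector; not the zenoh-pico type. */
typedef struct {
    size_t _capacity;
    size_t _len;
    void **_val;
} demo_vec_t;

/* Allocate storage for `capacity` element pointers. Returns 0 on success. */
int demo_vec_init(demo_vec_t *v, size_t capacity) {
    v->_val = calloc(capacity, sizeof(void *));
    v->_capacity = (v->_val != NULL) ? capacity : 0;
    v->_len = 0;
    return (v->_val != NULL) ? 0 : -1;
}

/* Append one element pointer. Returns -1 if the vector is cleared or full. */
int demo_vec_append(demo_vec_t *v, void *e) {
    if (v->_val == NULL || v->_len == v->_capacity) return -1;
    v->_val[v->_len++] = e;
    return 0;
}

/* What the clean-up path is ASSUMED to do: free storage, leave _val NULL. */
void demo_vec_clear(demo_vec_t *v) {
    free(v->_val);
    v->_val = NULL;
    v->_capacity = 0;
    v->_len = 0;
}

/* Checked getter: returns NULL instead of dereferencing cleared storage. */
void *demo_vec_get_checked(const demo_vec_t *v, size_t i) {
    if (v == NULL || v->_val == NULL || i >= v->_len) return NULL;
    return v->_val[i];
}
```

A check like `demo_vec_get_checked` only narrows the window. The clean-up can still run between the check and the dereference, so the publisher and the clean-up would still need to be serialized.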

  2. Hardfault in the read task.
>>> bt
exception_common@0x080202a8 (nuttx/arch/arm/src/armv7-m/gnu/arm_exception.S:144)
<signal handler called>@0xffffffe9 (Unknown Source:0)
file_socket@0x08036f14 (nuttx/fs/socket/socket.c:192)
sockfd_socket@0x08036f38 (nuttx/fs/socket/socket.c:209)
recvfrom@0x0803dc10 (nuttx/net/socket/recvfrom.c:207)
_z_read_udp_unicast@0x081315ac (zenoh-pico/src/system/unix/network.c:386)
_z_f_link_udp_read_socket@0x08133cf6 (zenoh-pico/src/link/unicast/udp.c:175)
_z_link_socket_recv_zbuf@0x08132e62 (zenoh-pico/src/link/link.c:169)
_z_unicast_client_read@0x0812f90e (zenoh-pico/src/transport/unicast/read.c:128)
_zp_unicast_read_task@0x0812f98a (zenoh-pico/src/transport/unicast/read.c:353)
pthread_startup@0x08030d8e (nuttx/libs/libc/pthread/pthread_create.c:59)

The hardfault occurs because the socket's inode has already been cleared (f_inode == 0x0) when f_inode->i_flags is evaluated.

Workaround

For now, we avoid hardfault 1 by synchronizing our publisher threads with _z_common_transport_clear,
and hardfault 2 by calling pthread_join on the read task instead of pthread_detach during the reconnection clean-up.
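Both parts of the workaround can be sketched in a few lines of portable C. Everything named `demo_*` is hypothetical and only illustrates the idea; it is not the zenoh-pico implementation or our exact patch. The sketch assumes the read task can observe a stop request, for example because its socket read has a timeout.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transport state; names are illustrative, not zenoh-pico's. */
typedef struct {
    pthread_mutex_t tx_lock;  /* serializes publishers against clean-up */
    bool tx_ready;            /* false once the write buffer is released */
    int sent;                 /* number of successful sends */
    pthread_t reader;         /* read task handle, joined during clean-up */
    atomic_bool reader_stop;  /* stop request for the read task */
    bool sock_open;           /* stands in for a valid socket descriptor */
} demo_transport_t;

/* Read task: polls the "socket" until asked to stop. */
static void *demo_read_task(void *arg) {
    demo_transport_t *t = arg;
    while (!atomic_load(&t->reader_stop)) {
        assert(t->sock_open);  /* stands in for recvfrom() on a valid fd */
        sched_yield();
    }
    return NULL;
}

/* Set up the state and start the read task. Returns 0 on success. */
int demo_transport_open(demo_transport_t *t) {
    pthread_mutex_init(&t->tx_lock, NULL);
    t->tx_ready = true;
    t->sent = 0;
    t->sock_open = true;
    atomic_init(&t->reader_stop, false);
    return pthread_create(&t->reader, NULL, demo_read_task, t);
}

/* Publisher path: touch the write buffer only while holding tx_lock. */
int demo_publish(demo_transport_t *t) {
    int ret = -1;
    pthread_mutex_lock(&t->tx_lock);
    if (t->tx_ready) {
        t->sent++;  /* stands in for encoding and sending a message */
        ret = 0;
    }
    pthread_mutex_unlock(&t->tx_lock);
    return ret;
}

/* Clean-up: join the reader BEFORE closing the socket, and release the
 * write buffer only under the same lock the publishers take. */
void demo_transport_clear(demo_transport_t *t) {
    atomic_store(&t->reader_stop, true);
    pthread_join(t->reader, NULL);  /* the reader has exited after this */
    t->sock_open = false;           /* only now is closing the socket safe */

    pthread_mutex_lock(&t->tx_lock);
    t->tx_ready = false;            /* publishers now fail instead of faulting */
    pthread_mutex_unlock(&t->tx_lock);
}
```

With pthread_detach, the clean-up cannot know when the read task has left its receive call, so the socket may be closed underneath it. Joining makes that ordering explicit, at the cost of blocking the clean-up until the read task returns.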

To reproduce

  1. Publish and receive data on a zenoh-pico client with Z_FEATURE_AUTO_RECONNECT enabled.
  2. Disconnect from zenohd (e.g. by restarting the router process or unplugging the Ethernet cable).
  3. Repeat step 2 until a hardfault is triggered.

Note:
The crash is not always reproducible. It occurs in roughly 1 out of 15 reconnections.
We run several publishing threads.
The problem is easier to reproduce if the publishing threads have a higher priority than the lease task.

System info

  • STM32H7
  • Zenoh Pico (1.4.0) on NuttX (Unix)
  • Configuration: Client mode over UDP unicast with Z_FEATURE_AUTO_RECONNECT enabled.

Labels: bug