@steils commented Dec 16, 2025

On Zephyr, _z_task_init() assigns preallocated pthread stacks from thread_stack_area[]. During repeated link resets/reconnects, zenoh-pico recreates the read/lease threads.
The previous implementation selected the stack with an ever-increasing thread_index++, which eventually indexed past the end of thread_stack_area[], corrupting memory and causing crashes.
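
The failure mode can be illustrated with a small standalone sketch. It is not the zenoh-pico source: STACK_SLOTS, THREADS_PER_LINK, and starts_before_overflow() are hypothetical names, and the array size of 4 is an assumption chosen only to make the arithmetic concrete.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_SLOTS 4      /* hypothetical size of thread_stack_area[] */
#define THREADS_PER_LINK 2 /* read task + lease task */

static_assert(STACK_SLOTS >= THREADS_PER_LINK, "pool must fit one link");

/* Counts how many link starts (initial connect plus reconnects) fit
 * before an index that only ever grows would step past the array. */
static int starts_before_overflow(void) {
    size_t thread_index = 0; /* never reset, mirroring thread_index++ */
    int starts = 0;
    for (;;) {
        if (thread_index + THREADS_PER_LINK > STACK_SLOTS) {
            return starts; /* the next start would index out of bounds */
        }
        thread_index += THREADS_PER_LINK;
        starts++;
    }
}
```

Under these assumptions the initial connection and one reconnect fit, and the second reconnect would index past the array, which is in line with the crash appearing only after several cable pulls.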

Replace thread_index++ with a stack-slot pool. When attr == NULL, pick a free slot in thread_stack_area[], assign it with pthread_attr_setstack(), and start the thread.
The slot is now released by a thread-specific data key destructor, which runs when the thread exits.
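
A minimal sketch of the pool-plus-destructor idea, written against plain POSIX threads rather than copied from the patch. All names here (POOL_SIZE, slot_acquire, slot_release, run_worker_once, and so on) are hypothetical, and the pthread_attr_setstack() call is left out so the example stays portable; a real implementation would point the attribute at thread_stack_area[slot] before creating the thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define POOL_SIZE 4 /* stands in for CONFIG_ZENOH_PICO_ZEPHYR_THREADS_NUM */

static bool slot_used[POOL_SIZE];
static pthread_mutex_t slot_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_key_t slot_key;
static pthread_once_t slot_key_once = PTHREAD_ONCE_INIT;

/* Take the first free slot; -1 means the pool is exhausted. */
static int slot_acquire(void) {
    int found = -1;
    pthread_mutex_lock(&slot_lock);
    for (int i = 0; i < POOL_SIZE; i++) {
        if (!slot_used[i]) {
            slot_used[i] = true;
            found = i;
            break;
        }
    }
    pthread_mutex_unlock(&slot_lock);
    return found;
}

static void slot_release(int slot) {
    assert(slot >= 0 && slot < POOL_SIZE);
    pthread_mutex_lock(&slot_lock);
    slot_used[slot] = false;
    pthread_mutex_unlock(&slot_lock);
}

/* Key destructor: called in the exiting thread if its value is non-NULL. */
static void slot_destructor(void *value) {
    slot_release((int)(intptr_t)value - 1);
}

static void slot_key_init(void) {
    (void)pthread_key_create(&slot_key, slot_destructor);
}

/* Thread entry: remembers its slot, stored as slot + 1 so slot 0 is not NULL. */
static void *worker(void *arg) {
    pthread_setspecific(slot_key, arg);
    return NULL;
}

/* Runs one worker on a pooled slot and waits for it; returns the slot or -1. */
static int run_worker_once(void) {
    pthread_once(&slot_key_once, slot_key_init);
    int slot = slot_acquire();
    if (slot < 0) {
        return -1;
    }
    pthread_t tid;
    if (pthread_create(&tid, NULL, worker, (void *)(intptr_t)(slot + 1)) != 0) {
        slot_release(slot); /* thread never started, so give the slot back */
        return -1;
    }
    pthread_join(tid, NULL);
    return slot;
}
```

Because the destructor frees the slot as the thread exits, calling run_worker_once() repeatedly keeps reusing slot 0 instead of consuming a new index each time, which is the property the reconnect path needs.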

Add CONFIG_ZENOH_PICO_ZEPHYR_THREADS_NUM (default 4) to configure the stack pool size.
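
As a rough sizing aid, the sketch below shows one common C pattern for giving such an option a default and checking it at compile time. Whether the patch defines the option through Kconfig or a plain macro is not stated here, so the #ifndef fallback, TASKS_PER_SESSION, and max_concurrent_sessions() are assumptions for illustration only.

```c
#include <assert.h>

/* Fallback used only when the build system did not set the option. */
#ifndef CONFIG_ZENOH_PICO_ZEPHYR_THREADS_NUM
#define CONFIG_ZENOH_PICO_ZEPHYR_THREADS_NUM 4
#endif

#define TASKS_PER_SESSION 2 /* read task + lease task (illustrative) */

static_assert(CONFIG_ZENOH_PICO_ZEPHYR_THREADS_NUM >= TASKS_PER_SESSION,
              "pool must hold at least one read task and one lease task");

/* Number of sessions whose read and lease tasks fit in the pool at once. */
static int max_concurrent_sessions(void) {
    return CONFIG_ZENOH_PICO_ZEPHYR_THREADS_NUM / TASKS_PER_SESSION;
}
```

With the default of 4, the pool covers two sessions' worth of read and lease tasks under this assumption; applications that start additional tasks without their own attr would need a larger value.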

Fixes: #1064

Reproducing

  1. Create a new Zephyr project as described in the zenoh-pico README, with the following main.c:
#include <stdio.h>
#include <string.h>
#include <zenoh-pico.h>
#include <unistd.h>

#define MODE "client"
#define LOCATOR "tcp/192.168.11.1:7447"

#define KEY_SUB "demo/example/h753/sub"

static void sub_handler(z_loaned_sample_t *s, void *arg) {
    (void)arg;
    z_view_string_t k;
    z_keyexpr_as_view_string(z_sample_keyexpr(s), &k);
    z_owned_string_t v;
    z_bytes_to_string(z_sample_payload(s), &v);
    printf("[sub] %.*s = %.*s\n",
           (int)z_string_len(z_loan(k)), z_string_data(z_loan(k)),
           (int)z_string_len(z_loan(v)), z_string_data(z_loan(v)));
    z_drop(z_move(v));
}

int main(void) {
    printf("zenoh-pico reconnection reproduction start\n");

    z_owned_config_t cfg;
    z_config_default(&cfg);
    zp_config_insert(z_loan_mut(cfg), Z_CONFIG_MODE_KEY, MODE);
    if (strlen(LOCATOR) > 0) {
        zp_config_insert(z_loan_mut(cfg), Z_CONFIG_CONNECT_KEY, LOCATOR);
    }

    z_owned_session_t sess;
    if (z_open(&sess, z_move(cfg), NULL) < 0) {
        printf("Unable to open session\n");
        return -1;
    }
    printf("Session opened\n");

    zp_start_read_task(z_loan_mut(sess), NULL);
    zp_start_lease_task(z_loan_mut(sess), NULL);

    z_view_keyexpr_t ke_sub;
    z_view_keyexpr_from_str_unchecked(&ke_sub, KEY_SUB);
    z_owned_closure_sample_t sub_cb;
    z_closure(&sub_cb, sub_handler, NULL, NULL);
    z_owned_subscriber_t sub;
    if (z_declare_subscriber(z_loan(sess), &sub, z_loan(ke_sub), z_move(sub_cb), NULL) < 0) {
        printf("Unable to declare subscriber\n");
        return -2;
    }
    printf("Subscriber declared on %s\n", KEY_SUB);

    for (int tick = 0;; ++tick) {
        printf("alive tick=%d\n", tick);
        sleep(1);
    }
    return 0;
}
  2. Connect the board and run:
pio run
pio run -t upload
  3. Verify messages arrive:
    [sub] demo/example/h753/sub = ...

  4. Reproduce reconnection:

    • unplug Ethernet cable for ~3-5 seconds
    • plug it back in
    • repeat 4-5 times

Expected result before the fix (fail)

After several reconnects, the firmware crashes because of a corrupted stack, as described in issue #1064.

Expected result (pass)

  • No crashes
  • The board keeps printing alive tick=...
  • After each reconnect, the subscriber resumes receiving [sub] ... messages.
  • With -DZENOH_LOG_DEBUG, you may also see zenoh-pico debug logs; there should be no "slot OOM" errors.

🏷️ Label-Based Checklist

Based on the labels applied to this PR, please complete these additional requirements:

Labels: bug

🐛 Bug Fix Requirements

Since this PR is labeled as a bug fix, please ensure:

  • Root cause documented - Explain what caused the bug in the PR description
  • Reproduction test added - Test that fails on main branch without the fix
  • Test passes with fix - The reproduction test passes with your changes
  • Regression prevention - Test will catch if this bug reoccurs in the future
  • Fix is minimal - Changes are focused only on fixing the bug
  • Related bugs checked - Verified no similar bugs exist in related code

Why this matters: Bugs without tests often reoccur.

Instructions:

  1. Check off items as you complete them (change - [ ] to - [x])
  2. The PR checklist CI will verify these are completed

This checklist updates automatically when labels change, but preserves your checked boxes.

@steils added the bug label Dec 16, 2025
@steils requested review from gmartin82 and sashacmc December 16, 2025 14:06
Development

Successfully merging this pull request may close these issues.

[Bug] STM32 nucleo board is getting USAGE FAULT Error when ethernet connection is restored
