Skip to content

mDNS: Multiple race conditions, use-after-free, and memory leaks in search/browse lifecycle (IDFGH-17257) #1012

@jonsmirl

Description

@jonsmirl

Summary

We discovered multiple race conditions, use-after-free bugs, and memory leaks in the mDNS component (v1.9.1) while running a Matter controller (connectedhomeip) that manages 200 devices simultaneously. These bugs cause crashes, permanent memory leaks, and eventual action queue starvation under sustained load.

Test environment: ESP32-C61 (RISC-V, single-core), ESP-IDF v5.5, connectedhomeip Matter SDK. The Matter SDK uses mdns_browse_new() for PTR service discovery and mdns_query_async_new() / mdns_query_async_delete() for SRV/TXT/AAAA resolution — generating hundreds of concurrent async queries.

Patched code: https://github.com/jonsmirl/esp-protocols/tree/fix/mdns-race-conditions (diff against mdns-v1.9.1 tag — all changes in components/mdns/ only)

After applying these fixes, the system has been running continuously for multiple days at 200 devices with zero memory drift (internal RAM pinned at 42KB, PSRAM at 783KB).


Bug 1: Use-After-Free in search_once Linked List

Severity: Crash
Time to reproduce: ~46–113 seconds under load

Root cause: The 100ms timer in _mdns_search_run() sets s->state = SEARCH_OFF and then queues ACTION_SEARCH_END. Between setting the state and the END action being processed, the CHIP SDK (running on another task) calls mdns_query_async_delete(), which sees state == SEARCH_OFF and frees the search struct. But the search is still in the search_once linked list. When the timer fires again, _mdns_search_run() traverses the list and dereferences freed memory (heap-poisoned to 0xfefefefe).

Fix: Add _mdns_search_is_in_list() helper. All code paths (ACTION_SEARCH_SEND, ACTION_SEARCH_END, _mdns_search_finish_done()) now verify the search pointer is still in the linked list before accessing it. Added pointer validation (esp_ptr_internal/esp_ptr_external_ram) to detect corruption early.


Bug 2: Search Leak → Permanent Action Queue Starvation

Severity: Permanent failure
Time to reproduce: ~10.4 hours under load

Root cause: In _mdns_search_run(), when _mdns_send_search_action(ACTION_SEARCH_END, s) fails because the action queue is full, the code reverts to s->state = SEARCH_RUNNING. This means the search is never finished — it stays in the search_once linked list permanently. Every 100ms timer tick, it re-attempts to queue the END action, which fails again (the queue is congested from all the leaked searches). Eventually the action queue is permanently full and no mDNS operations work.

// BEFORE (buggy):
if (_mdns_send_search_action(ACTION_SEARCH_END, s) != ESP_OK) {
    s->state = SEARCH_RUNNING;  // <-- leaked forever!
}

Fix: Call _mdns_search_finish(s) directly instead of reverting. We already hold MDNS_SERVICE_LOCK, so this is safe. The search is properly removed from the list and freed.


Bug 3: Delete-Before-ADD Race (Use-After-Free)

Severity: Crash
Time to reproduce: ~46–74 seconds under load

Root cause: mdns_query_async_new() queues ACTION_SEARCH_ADD and returns the search handle to the caller. If the caller's error path calls mdns_query_async_delete() before the ADD action is processed by the mDNS task, the search is freed. When the mDNS task later processes the ADD action, it adds the freed pointer to the linked list. The next timer tick crashes trying to access done_semaphore at 0xfefefefe (heap poison pattern).

This is a real pattern in connectedhomeip: the CHIP platform layer calls mdns_query_async_new(), then immediately checks for errors and may call mdns_query_async_delete() before the mDNS task processes the queued ADD.

Fix: mdns_query_async_delete() now handles three cases:

  1. In list: detach + free immediately (normal case, search was running)
  2. Not in list, was already SEARCH_OFF: search was finished by _mdns_search_finish() — free it now (fixes Bug 4 below)
  3. Not in list, wasn't SEARCH_OFF: ADD is pending in queue — set state = SEARCH_OFF but don't free. The ADD handler checks state and frees instead of adding.

Bug 4: Finished-Search Semaphore Leak (92 bytes per query)

Severity: Steady memory leak
Time to reproduce: Immediate, ~6 allocs per 30 seconds

Root cause: _mdns_search_finish() detaches the search from the linked list and calls the notifier. When mdns_query_async_delete() runs later, it can't find the search in the list. The old code assumed "not in list" means "ADD is pending" and skipped the free. But "not in list" can also mean "already finished and detached." The search struct (including its 92-byte FreeRTOS semaphore) was leaked.

This affected ALL mDNS queries — both the collector's PTR rebrowse and CHIP's SRV/TXT/AAAA resolves.

Fix: Check was_already_off = (search->state == SEARCH_OFF) before setting it. If the search was already SEARCH_OFF and not in the list, it was finished — free it. (See Bug 3 fix, case 2.)


Bug 5: Lock Leak in mdns_service_txt_set_for_host()

Severity: Deadlock

Root cause: Early return ESP_ERR_NO_MEM when _mdns_allocate_txt() fails doesn't release MDNS_SERVICE_LOCK. Next mDNS operation deadlocks.

// BEFORE (buggy):
if (!new_txt) {
    return ESP_ERR_NO_MEM;  // <-- lock still held!
}

// AFTER:
if (!new_txt) {
    ret = ESP_ERR_NO_MEM;
    goto err;  // err: label releases lock
}

Bug 6: Results Ownership Double-Free

Severity: Crash (heap corruption)

Root cause: mdns_query_async_get_results() returns a pointer to search->result but doesn't clear the search's copy. If the caller frees the results and then calls mdns_query_async_delete(), the same results list is freed again via _mdns_search_free().

Fix: After returning results to the caller, set search->result = NULL and search->num_results = 0 to transfer ownership.


Bug 7: _mdns_free_action() Fallthrough Causes Double-Free

Severity: Crash (on shutdown or queue drain)

Root cause: ACTION_SEARCH_ADD falls through to ACTION_SEARCH_SEND and ACTION_SEARCH_END in _mdns_free_action(). This means:

  • ADD actions free their search (correct — ADD owns the search before it's in the list)
  • SEND/END actions also try to free their search (wrong — the search may have already been freed by mdns_query_async_delete())

Fix: Separate ADD from SEND/END:

  • ACTION_SEARCH_ADD: Free results + search (ADD owns it before adding to list)
  • ACTION_SEARCH_SEND / ACTION_SEARCH_END: Don't free — these reference searches in the linked list that will be freed by the normal lifecycle

Bug 8: ACTION_BROWSE_SYNC Queue Overflow Under Load

Severity: Functional failure (missed browse notifications)

Root cause: Every parsed mDNS response that matches a browse allocates a mdns_browse_sync_t, fills it with results, and queues ACTION_BROWSE_SYNC. With 200 devices continuously advertising, this adds significant pressure to the action queue. When the queue is full, the sync is dropped silently and the browse results are lost.

Fix: Replace ACTION_BROWSE_SYNC with inline notification at end-of-packet. After parsing all answers in mdns_parse_packet(), call browse->notifier() directly instead of queuing an action. This eliminates the extra allocation, queue pressure, and the mdns_browse_sync_t / mdns_browse_result_sync_t data structures entirely.


Additional Improvements (in the same patch)

Action Queue Batch Drain

On single-core systems (ESP32-C3, C61), each MDNS_SERVICE_LOCK / UNLOCK cycle is a context switch opportunity. Processing one action per lock acquisition creates a bottleneck. The fix batches up to 8 actions per lock hold, improving drain rate during congestion.

RX Packet Pre-Filter (mdns_networking_lwip.c)

A lock-free pre-filter in the lwIP receive callback (_mdns_rx_is_relevant()) drops response packets that don't contain any service names matching our browses, searches, or registered services. On a busy network with many mDNS advertisers (printers, Chromecasts, AirPlay, etc.), this dramatically reduces action queue pressure. Query packets always pass through.

RX Queue Drop Logging

_mdns_send_rx_action() now logs dropped packets with rate limiting, making queue congestion issues visible.

Dead Code Removal

Removed ~370 lines of #if 0 dead code: _mdns_browse_result_add_srv(), _mdns_browse_result_add_ip(), _mdns_browse_result_add_txt(), _mdns_browse_sync(), _mdns_sync_browse_result_link_free(), _mdns_copy_address_in_previous_result(), associated forward declarations, ACTION_BROWSE_SYNC enum value, mdns_browse_sync_t / mdns_browse_result_sync_t structs, and browse->result field from mdns_browse_t.


Testing

We have been torture-testing these fixes on an ESP32-C61 running a Matter controller with 200 simultaneous Matter devices (simulated via connectedhomeip example apps on Linux). The test setup generates:

  • Continuous mdns_browse_new() for _matter._tcp PTR records
  • Hundreds of concurrent mdns_query_async_new() / mdns_query_async_delete() cycles for SRV, TXT, and AAAA resolution
  • Continuous subscription liveness timeouts and reconnections (~1 every 17 seconds)

Results after applying fixes:

  • 6+ hours continuous operation: Zero crashes, zero memory leaks
  • Multi-day runs: Internal RAM pinned at 42,427 bytes free (zero drift), PSRAM at ~783KB free (zero drift)
  • All 200 devices: Discovered, resolved, and subscribed successfully
  • Recovery: Occasional UDP buffer exhaustion recovers automatically within seconds

Before fixes: Crashes within 46–113 seconds, permanent mDNS failure at ~10.4 hours, steady 92-byte leak per query.


How to Review

The branch https://github.com/jonsmirl/esp-protocols/tree/fix/mdns-race-conditions contains a single commit on top of the mdns-v1.9.1 tag. The diff is clean:

components/mdns/include/mdns.h                 |  17 +-
components/mdns/mdns.c                         | 697 ++++++---
components/mdns/mdns_console.c                 |   6 +-
components/mdns/mdns_networking_lwip.c          | 128 ++
components/mdns/private_include/mdns_private.h  |  15 -
5 files changed, 383 insertions(+), 480 deletions(-)

Note: The patch also includes references to an mdns_cache module (used in our application for indexed device lookup). These cache calls can be ignored or stubbed — the core race condition fixes are independent.

I'm happy to help port these fixes to the refactored modular codebase if that would be useful.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions