Summary
We discovered multiple race conditions, use-after-free bugs, and memory leaks in the mDNS component (v1.9.1) while running a Matter controller (connectedhomeip) that manages 200 devices simultaneously. These bugs cause crashes, permanent memory leaks, and eventual action queue starvation under sustained load.
Test environment: ESP32-C61 (RISC-V, single-core), ESP-IDF v5.5, connectedhomeip Matter SDK. The Matter SDK uses mdns_browse_new() for PTR service discovery and mdns_query_async_new() / mdns_query_async_delete() for SRV/TXT/AAAA resolution — generating hundreds of concurrent async queries.
Patched code: https://github.com/jonsmirl/esp-protocols/tree/fix/mdns-race-conditions (diff against mdns-v1.9.1 tag — all changes in components/mdns/ only)
After applying these fixes, the system has been running continuously for multiple days at 200 devices with zero memory drift (internal RAM pinned at 42KB, PSRAM at 783KB).
Bug 1: Use-After-Free in search_once Linked List
Severity: Crash
Time to reproduce: ~46–113 seconds under load
Root cause: The 100ms timer in _mdns_search_run() sets s->state = SEARCH_OFF and then queues ACTION_SEARCH_END. Between setting the state and the END action being processed, the CHIP SDK (running on another task) calls mdns_query_async_delete(), which sees state == SEARCH_OFF and frees the search struct. But the search is still in the search_once linked list. When the timer fires again, _mdns_search_run() traverses the list and dereferences freed memory (heap-poisoned to 0xfefefefe).
Fix: Add _mdns_search_is_in_list() helper. All code paths (ACTION_SEARCH_SEND, ACTION_SEARCH_END, _mdns_search_finish_done()) now verify the search pointer is still in the linked list before accessing it. Added pointer validation (esp_ptr_internal/esp_ptr_external_ram) to detect corruption early.
Bug 2: Search Leak → Permanent Action Queue Starvation
Severity: Permanent failure
Time to reproduce: ~10.4 hours under load
Root cause: In _mdns_search_run(), when _mdns_send_search_action(ACTION_SEARCH_END, s) fails because the action queue is full, the code reverts to s->state = SEARCH_RUNNING. This means the search is never finished — it stays in the search_once linked list permanently. Every 100ms timer tick, it re-attempts to queue the END action, which fails again (the queue is congested from all the leaked searches). Eventually the action queue is permanently full and no mDNS operations work.
// BEFORE (buggy):
if (_mdns_send_search_action(ACTION_SEARCH_END, s) != ESP_OK) {
s->state = SEARCH_RUNNING; // <-- leaked forever!
}
Fix: Call _mdns_search_finish(s) directly instead of reverting. We already hold MDNS_SERVICE_LOCK, so this is safe. The search is properly removed from the list and freed.
Bug 3: Delete-Before-ADD Race (Use-After-Free)
Severity: Crash
Time to reproduce: ~46–74 seconds under load
Root cause: mdns_query_async_new() queues ACTION_SEARCH_ADD and returns the search handle to the caller. If the caller's error path calls mdns_query_async_delete() before the ADD action is processed by the mDNS task, the search is freed. When the mDNS task later processes the ADD action, it adds the freed pointer to the linked list. The next timer tick crashes trying to access done_semaphore at 0xfefefefe (heap poison pattern).
This is a real pattern in connectedhomeip: the CHIP platform layer calls mdns_query_async_new(), then immediately checks for errors and may call mdns_query_async_delete() before the mDNS task processes the queued ADD.
Fix: mdns_query_async_delete() now handles three cases:
- In list: detach + free immediately (normal case, search was running)
- Not in list, was already SEARCH_OFF: search was finished by
_mdns_search_finish() — free it now (fixes Bug 4 below)
- Not in list, wasn't SEARCH_OFF: ADD is pending in queue — set
state = SEARCH_OFF but don't free. The ADD handler checks state and frees instead of adding.
Bug 4: Finished-Search Semaphore Leak (92 bytes per query)
Severity: Steady memory leak
Time to reproduce: Immediate, ~6 allocs per 30 seconds
Root cause: _mdns_search_finish() detaches the search from the linked list and calls the notifier. When mdns_query_async_delete() runs later, it can't find the search in the list. The old code assumed "not in list" means "ADD is pending" and skipped the free. But "not in list" can also mean "already finished and detached." The search struct (including its 92-byte FreeRTOS semaphore) was leaked.
This affected ALL mDNS queries — both the collector's PTR rebrowse and CHIP's SRV/TXT/AAAA resolves.
Fix: Check was_already_off = (search->state == SEARCH_OFF) before setting it. If the search was already SEARCH_OFF and not in the list, it was finished — free it. (See Bug 3 fix, case 2.)
Bug 5: Lock Leak in mdns_service_txt_set_for_host()
Severity: Deadlock
Root cause: Early return ESP_ERR_NO_MEM when _mdns_allocate_txt() fails doesn't release MDNS_SERVICE_LOCK. Next mDNS operation deadlocks.
// BEFORE (buggy):
if (!new_txt) {
return ESP_ERR_NO_MEM; // <-- lock still held!
}
// AFTER:
if (!new_txt) {
ret = ESP_ERR_NO_MEM;
goto err; // err: label releases lock
}
Bug 6: Results Ownership Double-Free
Severity: Crash (heap corruption)
Root cause: mdns_query_async_get_results() returns a pointer to search->result but doesn't clear the search's copy. If the caller frees the results and then calls mdns_query_async_delete(), the same results list is freed again via _mdns_search_free().
Fix: After returning results to the caller, set search->result = NULL and search->num_results = 0 to transfer ownership.
Bug 7: _mdns_free_action() Fallthrough Causes Double-Free
Severity: Crash (on shutdown or queue drain)
Root cause: ACTION_SEARCH_ADD falls through to ACTION_SEARCH_SEND and ACTION_SEARCH_END in _mdns_free_action(). This means:
- ADD actions free their search (correct — ADD owns the search before it's in the list)
- SEND/END actions also try to free their search (wrong — the search may have already been freed by
mdns_query_async_delete())
Fix: Separate ADD from SEND/END:
ACTION_SEARCH_ADD: Free results + search (ADD owns it before adding to list)
ACTION_SEARCH_SEND / ACTION_SEARCH_END: Don't free — these reference searches in the linked list that will be freed by the normal lifecycle
Bug 8: ACTION_BROWSE_SYNC Queue Overflow Under Load
Severity: Functional failure (missed browse notifications)
Root cause: Every parsed mDNS response that matches a browse allocates a mdns_browse_sync_t, fills it with results, and queues ACTION_BROWSE_SYNC. With 200 devices continuously advertising, this adds significant pressure to the action queue. When the queue is full, the sync is dropped silently and the browse results are lost.
Fix: Replace ACTION_BROWSE_SYNC with inline notification at end-of-packet. After parsing all answers in mdns_parse_packet(), call browse->notifier() directly instead of queuing an action. This eliminates the extra allocation, queue pressure, and the mdns_browse_sync_t / mdns_browse_result_sync_t data structures entirely.
Additional Improvements (in the same patch)
Action Queue Batch Drain
On single-core systems (ESP32-C3, C61), each MDNS_SERVICE_LOCK / UNLOCK cycle is a context switch opportunity. Processing one action per lock acquisition creates a bottleneck. The fix batches up to 8 actions per lock hold, improving drain rate during congestion.
RX Packet Pre-Filter (mdns_networking_lwip.c)
A lock-free pre-filter in the lwIP receive callback (_mdns_rx_is_relevant()) drops response packets that don't contain any service names matching our browses, searches, or registered services. On a busy network with many mDNS advertisers (printers, Chromecasts, AirPlay, etc.), this dramatically reduces action queue pressure. Query packets always pass through.
RX Queue Drop Logging
_mdns_send_rx_action() now logs dropped packets with rate limiting, making queue congestion issues visible.
Dead Code Removal
Removed ~370 lines of #if 0 dead code: _mdns_browse_result_add_srv(), _mdns_browse_result_add_ip(), _mdns_browse_result_add_txt(), _mdns_browse_sync(), _mdns_sync_browse_result_link_free(), _mdns_copy_address_in_previous_result(), associated forward declarations, ACTION_BROWSE_SYNC enum value, mdns_browse_sync_t / mdns_browse_result_sync_t structs, and browse->result field from mdns_browse_t.
Testing
We have been torture-testing these fixes on an ESP32-C61 running a Matter controller with 200 simultaneous Matter devices (simulated via connectedhomeip example apps on Linux). The test setup generates:
- Continuous
mdns_browse_new() for _matter._tcp PTR records
- Hundreds of concurrent
mdns_query_async_new() / mdns_query_async_delete() cycles for SRV, TXT, and AAAA resolution
- Continuous subscription liveness timeouts and reconnections (~1 every 17 seconds)
Results after applying fixes:
- 6+ hours continuous operation: Zero crashes, zero memory leaks
- Multi-day runs: Internal RAM pinned at 42,427 bytes free (zero drift), PSRAM at ~783KB free (zero drift)
- All 200 devices: Discovered, resolved, and subscribed successfully
- Recovery: Occasional UDP buffer exhaustion recovers automatically within seconds
Before fixes: Crashes within 46–113 seconds, permanent mDNS failure at ~10.4 hours, steady 92-byte leak per query.
How to Review
The branch https://github.com/jonsmirl/esp-protocols/tree/fix/mdns-race-conditions contains a single commit on top of the mdns-v1.9.1 tag. The diff is clean:
components/mdns/include/mdns.h | 17 +-
components/mdns/mdns.c | 697 ++++++---
components/mdns/mdns_console.c | 6 +-
components/mdns/mdns_networking_lwip.c | 128 ++
components/mdns/private_include/mdns_private.h | 15 -
5 files changed, 383 insertions(+), 480 deletions(-)
Note: The patch also includes references to an mdns_cache module (used in our application for indexed device lookup). These cache calls can be ignored or stubbed — the core race condition fixes are independent.
I'm happy to help port these fixes to the refactored modular codebase if that would be useful.
Summary
We discovered multiple race conditions, use-after-free bugs, and memory leaks in the mDNS component (v1.9.1) while running a Matter controller (connectedhomeip) that manages 200 devices simultaneously. These bugs cause crashes, permanent memory leaks, and eventual action queue starvation under sustained load.
Test environment: ESP32-C61 (RISC-V, single-core), ESP-IDF v5.5, connectedhomeip Matter SDK. The Matter SDK uses
mdns_browse_new()for PTR service discovery andmdns_query_async_new()/mdns_query_async_delete()for SRV/TXT/AAAA resolution — generating hundreds of concurrent async queries.Patched code: https://github.com/jonsmirl/esp-protocols/tree/fix/mdns-race-conditions (diff against
mdns-v1.9.1tag — all changes incomponents/mdns/only)After applying these fixes, the system has been running continuously for multiple days at 200 devices with zero memory drift (internal RAM pinned at 42KB, PSRAM at 783KB).
Bug 1: Use-After-Free in
search_onceLinked ListSeverity: Crash
Time to reproduce: ~46–113 seconds under load
Root cause: The 100ms timer in
_mdns_search_run()setss->state = SEARCH_OFFand then queuesACTION_SEARCH_END. Between setting the state and the END action being processed, the CHIP SDK (running on another task) callsmdns_query_async_delete(), which seesstate == SEARCH_OFFand frees the search struct. But the search is still in thesearch_oncelinked list. When the timer fires again,_mdns_search_run()traverses the list and dereferences freed memory (heap-poisoned to0xfefefefe).Fix: Add
_mdns_search_is_in_list()helper. All code paths (ACTION_SEARCH_SEND,ACTION_SEARCH_END,_mdns_search_finish_done()) now verify the search pointer is still in the linked list before accessing it. Added pointer validation (esp_ptr_internal/esp_ptr_external_ram) to detect corruption early.Bug 2: Search Leak → Permanent Action Queue Starvation
Severity: Permanent failure
Time to reproduce: ~10.4 hours under load
Root cause: In
_mdns_search_run(), when_mdns_send_search_action(ACTION_SEARCH_END, s)fails because the action queue is full, the code reverts tos->state = SEARCH_RUNNING. This means the search is never finished — it stays in thesearch_oncelinked list permanently. Every 100ms timer tick, it re-attempts to queue the END action, which fails again (the queue is congested from all the leaked searches). Eventually the action queue is permanently full and no mDNS operations work.Fix: Call
_mdns_search_finish(s)directly instead of reverting. We already holdMDNS_SERVICE_LOCK, so this is safe. The search is properly removed from the list and freed.Bug 3: Delete-Before-ADD Race (Use-After-Free)
Severity: Crash
Time to reproduce: ~46–74 seconds under load
Root cause:
mdns_query_async_new()queuesACTION_SEARCH_ADDand returns the search handle to the caller. If the caller's error path callsmdns_query_async_delete()before the ADD action is processed by the mDNS task, the search is freed. When the mDNS task later processes the ADD action, it adds the freed pointer to the linked list. The next timer tick crashes trying to accessdone_semaphoreat0xfefefefe(heap poison pattern).This is a real pattern in connectedhomeip: the CHIP platform layer calls
mdns_query_async_new(), then immediately checks for errors and may callmdns_query_async_delete()before the mDNS task processes the queued ADD.Fix:
mdns_query_async_delete()now handles three cases:_mdns_search_finish()— free it now (fixes Bug 4 below)state = SEARCH_OFFbut don't free. The ADD handler checks state and frees instead of adding.Bug 4: Finished-Search Semaphore Leak (92 bytes per query)
Severity: Steady memory leak
Time to reproduce: Immediate, ~6 allocs per 30 seconds
Root cause:
_mdns_search_finish()detaches the search from the linked list and calls the notifier. Whenmdns_query_async_delete()runs later, it can't find the search in the list. The old code assumed "not in list" means "ADD is pending" and skipped the free. But "not in list" can also mean "already finished and detached." The search struct (including its 92-byte FreeRTOS semaphore) was leaked.This affected ALL mDNS queries — both the collector's PTR rebrowse and CHIP's SRV/TXT/AAAA resolves.
Fix: Check
was_already_off = (search->state == SEARCH_OFF)before setting it. If the search was already SEARCH_OFF and not in the list, it was finished — free it. (See Bug 3 fix, case 2.)Bug 5: Lock Leak in
mdns_service_txt_set_for_host()Severity: Deadlock
Root cause: Early
return ESP_ERR_NO_MEMwhen_mdns_allocate_txt()fails doesn't releaseMDNS_SERVICE_LOCK. Next mDNS operation deadlocks.Bug 6: Results Ownership Double-Free
Severity: Crash (heap corruption)
Root cause:
mdns_query_async_get_results()returns a pointer tosearch->resultbut doesn't clear the search's copy. If the caller frees the results and then callsmdns_query_async_delete(), the same results list is freed again via_mdns_search_free().Fix: After returning results to the caller, set
search->result = NULLandsearch->num_results = 0to transfer ownership.Bug 7:
_mdns_free_action()Fallthrough Causes Double-FreeSeverity: Crash (on shutdown or queue drain)
Root cause:
ACTION_SEARCH_ADDfalls through toACTION_SEARCH_SENDandACTION_SEARCH_ENDin_mdns_free_action(). This means:mdns_query_async_delete())Fix: Separate ADD from SEND/END:
ACTION_SEARCH_ADD: Free results + search (ADD owns it before adding to list)ACTION_SEARCH_SEND/ACTION_SEARCH_END: Don't free — these reference searches in the linked list that will be freed by the normal lifecycleBug 8:
ACTION_BROWSE_SYNCQueue Overflow Under LoadSeverity: Functional failure (missed browse notifications)
Root cause: Every parsed mDNS response that matches a browse allocates a
mdns_browse_sync_t, fills it with results, and queuesACTION_BROWSE_SYNC. With 200 devices continuously advertising, this adds significant pressure to the action queue. When the queue is full, the sync is dropped silently and the browse results are lost.Fix: Replace
ACTION_BROWSE_SYNCwith inline notification at end-of-packet. After parsing all answers inmdns_parse_packet(), callbrowse->notifier()directly instead of queuing an action. This eliminates the extra allocation, queue pressure, and themdns_browse_sync_t/mdns_browse_result_sync_tdata structures entirely.Additional Improvements (in the same patch)
Action Queue Batch Drain
On single-core systems (ESP32-C3, C61), each
MDNS_SERVICE_LOCK/UNLOCKcycle is a context switch opportunity. Processing one action per lock acquisition creates a bottleneck. The fix batches up to 8 actions per lock hold, improving drain rate during congestion.RX Packet Pre-Filter (
mdns_networking_lwip.c)A lock-free pre-filter in the lwIP receive callback (
_mdns_rx_is_relevant()) drops response packets that don't contain any service names matching our browses, searches, or registered services. On a busy network with many mDNS advertisers (printers, Chromecasts, AirPlay, etc.), this dramatically reduces action queue pressure. Query packets always pass through.RX Queue Drop Logging
_mdns_send_rx_action()now logs dropped packets with rate limiting, making queue congestion issues visible.Dead Code Removal
Removed ~370 lines of
#if 0dead code:_mdns_browse_result_add_srv(),_mdns_browse_result_add_ip(),_mdns_browse_result_add_txt(),_mdns_browse_sync(),_mdns_sync_browse_result_link_free(),_mdns_copy_address_in_previous_result(), associated forward declarations,ACTION_BROWSE_SYNCenum value,mdns_browse_sync_t/mdns_browse_result_sync_tstructs, andbrowse->resultfield frommdns_browse_t.Testing
We have been torture-testing these fixes on an ESP32-C61 running a Matter controller with 200 simultaneous Matter devices (simulated via connectedhomeip example apps on Linux). The test setup generates:
mdns_browse_new()for_matter._tcpPTR recordsmdns_query_async_new()/mdns_query_async_delete()cycles for SRV, TXT, and AAAA resolutionResults after applying fixes:
Before fixes: Crashes within 46–113 seconds, permanent mDNS failure at ~10.4 hours, steady 92-byte leak per query.
How to Review
The branch https://github.com/jonsmirl/esp-protocols/tree/fix/mdns-race-conditions contains a single commit on top of the
mdns-v1.9.1tag. The diff is clean:Note: The patch also includes references to an
mdns_cachemodule (used in our application for indexed device lookup). These cache calls can be ignored or stubbed — the core race condition fixes are independent.I'm happy to help port these fixes to the refactored modular codebase if that would be useful.