Crash in xTaskRemoveFromEventList (IDFGH-14155) #14957

@jimmyw

Description

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v5.2.2

Operating System used.

Linux

How did you build your project?

Command line with CMake

If you are using Windows, please specify command line type.

None

What is the expected behavior?

I expect it not to crash

What is the actual behavior?

In production, across 1000X devices, I get 30 seemingly random crash reports every day, all with a similar pattern.

Steps to reproduce.

There is no specific way to reproduce it; in fact, I have never seen it reproduced myself.

Build or installation Logs.

Example output from crash #7870

Stack trace from tiT task:

==================== CURRENT THREAD STACK =====================
#0  0x40381018 in xTaskRemoveFromEventList (pxEventList=<optimized out>) at /opt/esp/idf/components/freertos/FreeRTOS-Kernel/tasks.c:3869
#1  0x4037e906 in xQueueGenericSend (xQueue=0x3fcd976c, pvItemToQueue=0x0, xTicksToWait=<optimized out>, xCopyPosition=0) at /opt/esp/idf/components/freertos/FreeRTOS-Kernel/queue.c:993
#2  0x420ab7dd in sys_sem_signal (sem=0x3c2cfc1c) at /opt/esp/idf/components/lwip/port/freertos/sys_arch.c:134
#3  0x420accbe in lwip_netconn_do_dns_found (name=0x3c2b8438 <dns_table+36> "", ipaddr=0x3c2b8418 <dns_table+4>, arg=0x3fcb772c <s_mqtt_client_stack+5172>) at /opt/esp/idf/components/lwip/lwip/src/api/api_msg.c:2215
#4  0x42096a64 in dns_call_found (idx=0 '\000', addr=0x3c2b8418 <dns_table+4>) at /opt/esp/idf/components/lwip/lwip/src/core/dns.c:1026
#5  0x42096b48 in dns_correct_response (idx=0 '\000', ttl=60) at /opt/esp/idf/components/lwip/lwip/src/core/dns.c:1224
#6  0x42097784 in dns_recv (arg=<optimized out>, pcb=<optimized out>, p=0x3c2de9d4, addr=<optimized out>, port=<optimized out>) at /opt/esp/idf/components/lwip/lwip/src/core/dns.c:1371
#7  0x4209e531 in udp_input (p=0x3c2de9d4, inp=0x3fcf7558) at /opt/esp/idf/components/lwip/lwip/src/core/udp.c:404
#8  0x420a1d49 in ip4_input (p=0x3c2de9d4, inp=0x3fcf7558) at /opt/esp/idf/components/lwip/lwip/src/core/ipv4/ip4.c:746
#9  0x420a7e99 in ethernet_input (p=0x3c2de9d4, netif=0x3fcf7558) at /opt/esp/idf/components/lwip/lwip/src/netif/ethernet.c:186
#10 0x42096318 in tcpip_thread_handle_msg (msg=0x3c2d08fc) at /opt/esp/idf/components/lwip/lwip/src/api/tcpip.c:174
#11 0x42096378 in tcpip_thread (arg=0x0) at /opt/esp/idf/components/lwip/lwip/src/api/tcpip.c:148
#12 0x4037ee08 in vPortTaskWrapper (pxCode=0x4209634c <tcpip_thread>, pvParameters=0x0) at /opt/esp/idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:134

Stack trace from the task that issued the DNS request:

==================== THREAD 18 (TCB: 0x3fcb7af8, name: 'mqtt_task') =====================
#0  0x400559e0 in ?? ()
#1  0x4037f0b5 in vPortClearInterruptMaskFromISR (prev_level=<optimized out>) at /opt/esp/idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/include/freertos/portmacro.h:564
#2  vPortExitCritical (mux=0x3fcd97c0) at /opt/esp/idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:504
#3  0x4037ec50 in xQueueSemaphoreTake (xQueue=0x3fcd976c, xTicksToWait=<optimized out>) at /opt/esp/idf/components/freertos/FreeRTOS-Kernel/queue.c:1853
#4  0x420ab80c in sys_arch_sem_wait (sem=0x3c2cfc1c, timeout=0) at /opt/esp/idf/components/lwip/port/freertos/sys_arch.c:165
#5  0x42096464 in tcpip_send_msg_wait_sem (fn=0x420ae05c <lwip_netconn_do_gethostbyname>, apimsg=0x3fcb772c <s_mqtt_client_stack+5172>, sem=0x3c2cfc1c) at /opt/esp/idf/components/lwip/lwip/src/api/tcpip.c:456
#6  0x420acc50 in netconn_gethostbyname_addrtype (name=0x3c2cfa5c <error: Cannot access memory at address 0x3c2cfa5c>, addr=0x3fcb7778 <s_mqtt_client_stack+5248>, dns_addrtype=2 '\002') at /opt/esp/idf/components/lwip/lwip/src/api/api_lib.c:1325
#7  0x420936e4 in lwip_getaddrinfo (nodename=0x3c2cfa5c <error: Cannot access memory at address 0x3c2cfa5c>, servname=0x0, hints=0x3fcb77d0 <s_mqtt_client_stack+5336>, res=0x3fcb77cc <s_mqtt_client_stack+5332>) at /opt/esp/idf/components/lwip/lwip/src/api/netdb.c:345
#8  0x420be50c in getaddrinfo (res=0x3fcb77cc <s_mqtt_client_stack+5332>, hints=0x3fcb77d0 <s_mqtt_client_stack+5336>, servname=0x0, nodename=0x3c2cfa5c <error: Cannot access memory at address 0x3c2cfa5c>) at /opt/esp/idf/components/lwip/include/lwip/netdb.h:23
#9  esp_tls_hostname_to_fd (host=0x3c2cfbb4 <error: Cannot access memory at address 0x3c2cfbb4>, hostlen=45, port=8883, addr_family=<optimized out>, address=0x3fcb7854 <s_mqtt_client_stack+5468>, fd=0x3fcb782c <s_mqtt_client_stack+5428>) at /opt/esp/idf/components/esp-tls/esp_tls.c:203
#10 0x420be905 in tcp_connect (host=0x3c2cfbb4 <error: Cannot access memory at address 0x3c2cfbb4>, hostlen=45, port=8883, cfg=0x3c2cf6ac, error_handle=0x3c2cf728, sockfd=0x3c2e81fc) at /opt/esp/idf/components/esp-tls/esp_tls.c:340
#11 0x420bec1e in esp_tls_low_level_conn (hostname=0x3c2cfbb4 <error: Cannot access memory at address 0x3c2cfbb4>, hostlen=45, port=8883, cfg=0x3c2cf6ac, tls=0x3c2e79a0) at /opt/esp/idf/components/esp-tls/esp_tls.c:440
#12 0x420befa0 in esp_tls_conn_new_sync (hostname=0x3c2cfbb4 <error: Cannot access memory at address 0x3c2cfbb4>, hostlen=45, port=8883, cfg=0x3c2cf6ac, tls=0x3c2e79a0) at /opt/esp/idf/components/esp-tls/esp_tls.c:528
#13 0x420c0975 in ssl_connect (t=0x3c2cfa08, host=0x3c2cfbb4 <error: Cannot access memory at address 0x3c2cfbb4>, port=8883, timeout_ms=10000) at /opt/esp/idf/components/tcp_transport/transport_ssl.c:111
#14 0x4213df70 in esp_transport_connect (t=0x3c2cfa08, host=0x3c2cfbb4 <error: Cannot access memory at address 0x3c2cfbb4>, port=8883, timeout_ms=10000) at /opt/esp/idf/components/tcp_transport/transport.c:123
#15 0x42057d3a in esp_mqtt_task (pv=0x3c2e5f84) at /opt/esp/idf/components/mqtt/esp-mqtt/mqtt_client.c:1629
#16 0x4037ee08 in vPortTaskWrapper (pxCode=0x42057c28 <esp_mqtt_task>, pvParameters=0x3c2e5f84) at /opt/esp/idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:134

Register setup:

================== CURRENT THREAD REGISTERS ===================
exccause       0x1c (LoadProhibitedCause)
excvaddr       0x4
epc1           0x40044290
epc2           0x42062d88
epc3           0x0
epc4           0x0
epc5           0x0
epc6           0x0
eps2           0x60a20
eps3           0x0
eps4           0x0
eps5           0x0
eps6           0x0
pc             0x40381018          0x40381018 <xTaskRemoveFromEventList+180>
a0             0x8037e906          -2143819514
a1             0x3fcc1fb0          1070342064
a2             0x3fcb7b10          1070299920
a3             0x0                 0
a4             0x0                 0
a5             0x1                 1
a6             0x1                 1
a7             0x3fcb7af8          1070299896
a8             0x0                 0
a9             0x3fca5678          1070225016
a10            0x3fca3b44          1070218052
a11            0xffffffff          -1
a12            0x60b23             396067
a13            0x60b23             396067
a14            0x1592              5522
a15            0xabab              43947
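As a sanity check that the register dump and the backtrace describe the same call, the return address can be recovered from a0. This is a minimal sketch, assuming the Xtensa windowed ABI, where the top two bits of a0 carry the call window size and the top two bits of the real return address are taken from the current pc. The helper name `xtensa_return_addr` is hypothetical and not part of ESP-IDF.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rebuild the caller's return address from a0.
 * Assumption: windowed ABI, so bits 31..30 of a0 encode the window
 * increment and must be replaced by bits 31..30 of the current pc. */
static uint32_t xtensa_return_addr(uint32_t a0, uint32_t pc)
{
    return (a0 & 0x3fffffffu) | (pc & 0xc0000000u);
}
```

With a0 = 0x8037e906 and pc = 0x40381018 from the dump above, this gives 0x4037e906, which is the address shown for frame #1 (xQueueGenericSend) in the tiT stack trace.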

Code executed

queue.c:

BaseType_t xQueueGenericSend( QueueHandle_t xQueue,
                              const void * const pvItemToQueue,
                              TickType_t xTicksToWait,
                              const BaseType_t xCopyPosition )
....
                        if( listLIST_IS_EMPTY( &( pxQueue->xTasksWaitingToReceive ) ) == pdFALSE )
                        {
                            if( xTaskRemoveFromEventList( &( pxQueue->xTasksWaitingToReceive ) ) != pdFALSE )

tasks.c:

    BaseType_t xTaskRemoveFromEventList( const List_t * const pxEventList )
....
                if( taskCAN_BE_SCHEDULED( pxUnblockedTCB ) == pdTRUE )
                {
                    listREMOVE_ITEM( &( pxUnblockedTCB->xStateListItem ) ); <----- HERE
                    prvAddTaskToReadyList( pxUnblockedTCB );

list.h:

#define listREMOVE_ITEM( pxItemToRemove ) \
    {                                     \
        List_t * const pxList = ( pxItemToRemove )->pxContainer;    <-- NULL     \
                                                                                 \
        ( pxItemToRemove )->pxNext->pxPrevious = ( pxItemToRemove )->pxPrevious; \
        ( pxItemToRemove )->pxPrevious->pxNext = ( pxItemToRemove )->pxNext;     \
CRASH-> if( pxList->pxIndex == ( pxItemToRemove ) )                              \
        {                                                                        \
            pxList->pxIndex = ( pxItemToRemove )->pxPrevious;                    \
        }                                                                        \
                                                                                 \
        ( pxItemToRemove )->pxContainer = NULL;                                  \
        ( pxList->uxNumberOfItems )--;                                           \
    }
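To make the failure mode concrete, here is a small host-side model of the macro. It is only a sketch: the `mock_` types and functions are hypothetical stand-ins, not the FreeRTOS ones, and the removal checks the container first instead of faulting. One way pxContainer can be NULL is that the item was already removed once; I do not know whether that is what happens on the devices.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_list;

/* Hypothetical, simplified list item (no xItemValue / pvOwner). */
typedef struct mock_item {
    struct mock_item *pxNext;
    struct mock_item *pxPrevious;
    struct mock_list *pxContainer;
} mock_item_t;

/* Hypothetical, simplified list with an embedded end marker. */
typedef struct mock_list {
    size_t uxNumberOfItems;
    mock_item_t *pxIndex;
    mock_item_t xListEnd;
} mock_list_t;

static void mock_list_init(mock_list_t *l)
{
    l->uxNumberOfItems = 0;
    l->pxIndex = &l->xListEnd;
    l->xListEnd.pxNext = &l->xListEnd;
    l->xListEnd.pxPrevious = &l->xListEnd;
    l->xListEnd.pxContainer = NULL;
}

/* Insert the item just before pxIndex, i.e. at the end of the walk order. */
static void mock_list_insert_end(mock_list_t *l, mock_item_t *it)
{
    mock_item_t *idx = l->pxIndex;
    it->pxNext = idx;
    it->pxPrevious = idx->pxPrevious;
    idx->pxPrevious->pxNext = it;
    idx->pxPrevious = it;
    it->pxContainer = l;
    l->uxNumberOfItems++;
}

/* Same steps as listREMOVE_ITEM, except that a NULL container is
 * reported to the caller. The real macro has no such check and would
 * read pxList->pxIndex through the NULL pointer at this point. */
static bool mock_list_remove_checked(mock_item_t *it)
{
    mock_list_t *l = it->pxContainer;
    if (l == NULL) {
        return false;
    }
    it->pxNext->pxPrevious = it->pxPrevious;
    it->pxPrevious->pxNext = it->pxNext;
    if (l->pxIndex == it) {
        l->pxIndex = it->pxPrevious;
    }
    it->pxContainer = NULL;
    l->uxNumberOfItems--;
    return true;
}
```

In this model, inserting an item and removing it twice makes the second call return false, because the first removal already cleared pxContainer. In the unchecked macro that second removal would be the faulting load.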

It is clearly some kind of race condition: pxContainer is NULL, so evaluating pxList->pxIndex reads from address NULL + 0x4 (which matches excvaddr 0x4 in the register dump), and that read raises LoadProhibitedCause.
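The 0x4 offset can be checked with a layout sketch. This assumes a 32-bit target and that the list data integrity check values are disabled, so that uxNumberOfItems is the first member of List_t and pxIndex the second. The `layout_` names are hypothetical, and pointers are modelled as uint32_t so the result is the same on a 64-bit host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit mirror of MiniListItem_t (assumed layout). */
typedef struct {
    uint32_t xItemValue;
    uint32_t pxNext;
    uint32_t pxPrevious;
} layout_mini_item32_t;

/* Hypothetical 32-bit mirror of List_t (assumed layout, no
 * integrity check values). */
typedef struct {
    uint32_t uxNumberOfItems;      /* offset 0 */
    uint32_t pxIndex;              /* offset 4 */
    layout_mini_item32_t xListEnd; /* offset 8 */
} layout_list32_t;

/* Address that would be loaded when evaluating pxList->pxIndex. */
static uint32_t layout_px_index_load_addr(uint32_t pxList)
{
    return pxList + (uint32_t)offsetof(layout_list32_t, pxIndex);
}
```

Under these assumptions, `layout_px_index_load_addr(0)` is 0x4, the same value as excvaddr in the crash report.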

This stack trace is not unique; the same crash happens from other call sites of xTaskRemoveFromEventList.

Is there anyone that recognizes this, or can give me guidance on where to go from here?

The issue here is that the crash is so rare that I need 1000 devices to trigger it. All my attempts to trigger it locally have failed, so I can't just try the master version or randomly change things.



### More Information.

_No response_
