Description
Answers checklist.
- I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
- I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
- I have searched the issue tracker for a similar issue and not found a similar issue.
IDF version.
v5.2.2
Operating System used.
Linux
How did you build your project?
Command line with CMake
If you are using Windows, please specify command line type.
None
What is the expected behavior?
I expect it not to crash
What is the actual behavior?
In production with 1000X devices, I get 30 seemingly random crash reports every day, all with a similar pattern.
Steps to reproduce.
There is no specific way to reproduce it; in fact, I have never seen it reproduced myself.
Build or installation Logs.
Example output from crash #7870
Stack trace from tiT task:
==================== CURRENT THREAD STACK =====================
#0 0x40381018 in xTaskRemoveFromEventList (pxEventList=<optimized out>) at /opt/esp/idf/components/freertos/FreeRTOS-Kernel/tasks.c:3869
#1 0x4037e906 in xQueueGenericSend (xQueue=0x3fcd976c, pvItemToQueue=0x0, xTicksToWait=<optimized out>, xCopyPosition=0) at /opt/esp/idf/components/freertos/FreeRTOS-Kernel/queue.c:993
#2 0x420ab7dd in sys_sem_signal (sem=0x3c2cfc1c) at /opt/esp/idf/components/lwip/port/freertos/sys_arch.c:134
#3 0x420accbe in lwip_netconn_do_dns_found (name=0x3c2b8438 <dns_table+36> "", ipaddr=0x3c2b8418 <dns_table+4>, arg=0x3fcb772c <s_mqtt_client_stack+5172>) at /opt/esp/idf/components/lwip/lwip/src/api/api_msg.c:2215
#4 0x42096a64 in dns_call_found (idx=0 '\000', addr=0x3c2b8418 <dns_table+4>) at /opt/esp/idf/components/lwip/lwip/src/core/dns.c:1026
#5 0x42096b48 in dns_correct_response (idx=0 '\000', ttl=60) at /opt/esp/idf/components/lwip/lwip/src/core/dns.c:1224
#6 0x42097784 in dns_recv (arg=<optimized out>, pcb=<optimized out>, p=0x3c2de9d4, addr=<optimized out>, port=<optimized out>) at /opt/esp/idf/components/lwip/lwip/src/core/dns.c:1371
#7 0x4209e531 in udp_input (p=0x3c2de9d4, inp=0x3fcf7558) at /opt/esp/idf/components/lwip/lwip/src/core/udp.c:404
#8 0x420a1d49 in ip4_input (p=0x3c2de9d4, inp=0x3fcf7558) at /opt/esp/idf/components/lwip/lwip/src/core/ipv4/ip4.c:746
#9 0x420a7e99 in ethernet_input (p=0x3c2de9d4, netif=0x3fcf7558) at /opt/esp/idf/components/lwip/lwip/src/netif/ethernet.c:186
#10 0x42096318 in tcpip_thread_handle_msg (msg=0x3c2d08fc) at /opt/esp/idf/components/lwip/lwip/src/api/tcpip.c:174
#11 0x42096378 in tcpip_thread (arg=0x0) at /opt/esp/idf/components/lwip/lwip/src/api/tcpip.c:148
#12 0x4037ee08 in vPortTaskWrapper (pxCode=0x4209634c <tcpip_thread>, pvParameters=0x0) at /opt/esp/idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:134
Stack from the task that issues the DNS request:
==================== THREAD 18 (TCB: 0x3fcb7af8, name: 'mqtt_task') =====================
#0 0x400559e0 in ?? ()
#1 0x4037f0b5 in vPortClearInterruptMaskFromISR (prev_level=<optimized out>) at /opt/esp/idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/include/freertos/portmacro.h:564
#2 vPortExitCritical (mux=0x3fcd97c0) at /opt/esp/idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:504
#3 0x4037ec50 in xQueueSemaphoreTake (xQueue=0x3fcd976c, xTicksToWait=<optimized out>) at /opt/esp/idf/components/freertos/FreeRTOS-Kernel/queue.c:1853
#4 0x420ab80c in sys_arch_sem_wait (sem=0x3c2cfc1c, timeout=0) at /opt/esp/idf/components/lwip/port/freertos/sys_arch.c:165
#5 0x42096464 in tcpip_send_msg_wait_sem (fn=0x420ae05c <lwip_netconn_do_gethostbyname>, apimsg=0x3fcb772c <s_mqtt_client_stack+5172>, sem=0x3c2cfc1c) at /opt/esp/idf/components/lwip/lwip/src/api/tcpip.c:456
#6 0x420acc50 in netconn_gethostbyname_addrtype (name=0x3c2cfa5c <error: Cannot access memory at address 0x3c2cfa5c>, addr=0x3fcb7778 <s_mqtt_client_stack+5248>, dns_addrtype=2 '\002') at /opt/esp/idf/components/lwip/lwip/src/api/api_lib.c:1325
#7 0x420936e4 in lwip_getaddrinfo (nodename=0x3c2cfa5c <error: Cannot access memory at address 0x3c2cfa5c>, servname=0x0, hints=0x3fcb77d0 <s_mqtt_client_stack+5336>, res=0x3fcb77cc <s_mqtt_client_stack+5332>) at /opt/esp/idf/components/lwip/lwip/src/api/netdb.c:345
#8 0x420be50c in getaddrinfo (res=0x3fcb77cc <s_mqtt_client_stack+5332>, hints=0x3fcb77d0 <s_mqtt_client_stack+5336>, servname=0x0, nodename=0x3c2cfa5c <error: Cannot access memory at address 0x3c2cfa5c>) at /opt/esp/idf/components/lwip/include/lwip/netdb.h:23
#9 esp_tls_hostname_to_fd (host=0x3c2cfbb4 <error: Cannot access memory at address 0x3c2cfbb4>, hostlen=45, port=8883, addr_family=<optimized out>, address=0x3fcb7854 <s_mqtt_client_stack+5468>, fd=0x3fcb782c <s_mqtt_client_stack+5428>) at /opt/esp/idf/components/esp-tls/esp_tls.c:203
#10 0x420be905 in tcp_connect (host=0x3c2cfbb4 <error: Cannot access memory at address 0x3c2cfbb4>, hostlen=45, port=8883, cfg=0x3c2cf6ac, error_handle=0x3c2cf728, sockfd=0x3c2e81fc) at /opt/esp/idf/components/esp-tls/esp_tls.c:340
#11 0x420bec1e in esp_tls_low_level_conn (hostname=0x3c2cfbb4 <error: Cannot access memory at address 0x3c2cfbb4>, hostlen=45, port=8883, cfg=0x3c2cf6ac, tls=0x3c2e79a0) at /opt/esp/idf/components/esp-tls/esp_tls.c:440
#12 0x420befa0 in esp_tls_conn_new_sync (hostname=0x3c2cfbb4 <error: Cannot access memory at address 0x3c2cfbb4>, hostlen=45, port=8883, cfg=0x3c2cf6ac, tls=0x3c2e79a0) at /opt/esp/idf/components/esp-tls/esp_tls.c:528
#13 0x420c0975 in ssl_connect (t=0x3c2cfa08, host=0x3c2cfbb4 <error: Cannot access memory at address 0x3c2cfbb4>, port=8883, timeout_ms=10000) at /opt/esp/idf/components/tcp_transport/transport_ssl.c:111
#14 0x4213df70 in esp_transport_connect (t=0x3c2cfa08, host=0x3c2cfbb4 <error: Cannot access memory at address 0x3c2cfbb4>, port=8883, timeout_ms=10000) at /opt/esp/idf/components/tcp_transport/transport.c:123
#15 0x42057d3a in esp_mqtt_task (pv=0x3c2e5f84) at /opt/esp/idf/components/mqtt/esp-mqtt/mqtt_client.c:1629
#16 0x4037ee08 in vPortTaskWrapper (pxCode=0x42057c28 <esp_mqtt_task>, pvParameters=0x3c2e5f84) at /opt/esp/idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:134
Register setup:
================== CURRENT THREAD REGISTERS ===================
exccause 0x1c (LoadProhibitedCause)
excvaddr 0x4
epc1 0x40044290
epc2 0x42062d88
epc3 0x0
epc4 0x0
epc5 0x0
epc6 0x0
eps2 0x60a20
eps3 0x0
eps4 0x0
eps5 0x0
eps6 0x0
pc 0x40381018 0x40381018 <xTaskRemoveFromEventList+180>
a0 0x8037e906 -2143819514
a1 0x3fcc1fb0 1070342064
a2 0x3fcb7b10 1070299920
a3 0x0 0
a4 0x0 0
a5 0x1 1
a6 0x1 1
a7 0x3fcb7af8 1070299896
a8 0x0 0
a9 0x3fca5678 1070225016
a10 0x3fca3b44 1070218052
a11 0xffffffff -1
a12 0x60b23 396067
a13 0x60b23 396067
a14 0x1592 5522
a15 0xabab 43947
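As a cross-check on the dump above, `a0` should match frame #1 of the current thread's backtrace. This is a minimal sketch of that arithmetic, assuming the Xtensa windowed ABI, where the top two bits of `a0` hold the window increment written by the call instruction and the real top two bits of the return address come from the current `pc`. The function names are hypothetical and not part of ESP-IDF.

```c
#include <assert.h>
#include <stdint.h>

/* Recover the return address from a0 under the Xtensa windowed ABI:
 * keep the low 30 bits of a0 and take bits 31..30 from the current PC. */
uint32_t xtensa_return_address(uint32_t a0, uint32_t pc)
{
    return (a0 & 0x3fffffffu) | (pc & 0xc0000000u);
}

/* Window increment stored in bits 31..30 of a0:
 * 1 for CALL4, 2 for CALL8, 3 for CALL12. */
uint32_t xtensa_call_increment(uint32_t a0)
{
    return a0 >> 30;
}

/* Values taken from the register dump in this report. */
void check_dump_against_backtrace(void)
{
    const uint32_t a0 = 0x8037e906u;
    const uint32_t pc = 0x40381018u;

    /* Frame #1 (xQueueGenericSend) is listed at 0x4037e906. */
    assert(xtensa_return_address(a0, pc) == 0x4037e906u);
    assert(xtensa_call_increment(a0) == 2u);
}
```

With the values from this crash, the decoded return address equals the `xQueueGenericSend` frame, so the register dump and the backtrace agree on where `xTaskRemoveFromEventList` was called from.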
Code executed
queue.c:
BaseType_t xQueueGenericSend( QueueHandle_t xQueue,
                              const void * const pvItemToQueue,
                              TickType_t xTicksToWait,
                              const BaseType_t xCopyPosition )
....
    if( listLIST_IS_EMPTY( &( pxQueue->xTasksWaitingToReceive ) ) == pdFALSE )
    {
        if( xTaskRemoveFromEventList( &( pxQueue->xTasksWaitingToReceive ) ) != pdFALSE )
tasks.c:
BaseType_t xTaskRemoveFromEventList( const List_t * const pxEventList )
....
    if( taskCAN_BE_SCHEDULED( pxUnblockedTCB ) == pdTRUE )
    {
        listREMOVE_ITEM( &( pxUnblockedTCB->xStateListItem ) ); <----- HERE
        prvAddTaskToReadyList( pxUnblockedTCB );
list.h:
#define listREMOVE_ITEM( pxItemToRemove ) \
{ \
List_t * const pxList = ( pxItemToRemove )->pxContainer; <-- NULL \
\
( pxItemToRemove )->pxNext->pxPrevious = ( pxItemToRemove )->pxPrevious; \
( pxItemToRemove )->pxPrevious->pxNext = ( pxItemToRemove )->pxNext; \
CRASH-> if( pxList->pxIndex == ( pxItemToRemove ) ) \
{ \
pxList->pxIndex = ( pxItemToRemove )->pxPrevious; \
} \
\
( pxItemToRemove )->pxContainer = NULL; \
( pxList->uxNumberOfItems )--; \
}
It's clearly some kind of race condition: pxContainer is NULL, so dereferencing pxList->pxIndex reads address NULL + 0x4 (matching excvaddr 0x4 in the register dump), which raises the LoadProhibited exception.
This stack trace is not unique; the same crash happens from other call sites of xTaskRemoveFromEventList.
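The NULL + 0x4 arithmetic and the failure mode can be modelled on a host machine. The sketch below is not the FreeRTOS source: every `Model*` and `model_*` name is hypothetical. It assumes a 32-bit target with the list data-integrity check values disabled, so that `pxIndex` directly follows `uxNumberOfItems`. It also uses a double removal as one possible way to end up with a NULL `pxContainer`; whether that is what happens in the field is not established here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Layout model of List_t on a 32-bit target. Pointers are stored as
 * uint32_t so offsetof() gives the target offset even on a 64-bit host.
 * Assumption: no integrity check value precedes uxNumberOfItems. */
typedef struct {
    uint32_t uxNumberOfItems; /* UBaseType_t on the target */
    uint32_t pxIndex;         /* ListItem_t * on the target */
} ModelList32;

/* Address that `pxList->pxIndex` reads for a given pxContainer value. */
uint32_t model_fault_address(uint32_t pxContainer)
{
    return pxContainer + (uint32_t)offsetof(ModelList32, pxIndex);
}

/* Host-side model with real pointers, used to exercise the removal steps. */
struct ModelList;
typedef struct ModelItem {
    struct ModelItem *pxNext;
    struct ModelItem *pxPrevious;
    struct ModelList *pxContainer;
} ModelItem;

typedef struct ModelList {
    size_t uxNumberOfItems;
    ModelItem *pxIndex;
    ModelItem xListEnd;
} ModelList;

void model_list_init(ModelList *list)
{
    list->uxNumberOfItems = 0;
    list->pxIndex = &list->xListEnd;
    list->xListEnd.pxNext = &list->xListEnd;
    list->xListEnd.pxPrevious = &list->xListEnd;
    list->xListEnd.pxContainer = NULL;
}

void model_list_insert_end(ModelList *list, ModelItem *item)
{
    ModelItem *index = list->pxIndex;
    item->pxNext = index;
    item->pxPrevious = index->pxPrevious;
    index->pxPrevious->pxNext = item;
    index->pxPrevious = item;
    item->pxContainer = list;
    list->uxNumberOfItems++;
}

/* Same unlink steps as listREMOVE_ITEM, but checks the container first.
 * The real macro has no such check and faults on pxList->pxIndex. */
bool model_list_remove_checked(ModelItem *item)
{
    ModelList *list = item->pxContainer;
    if (list == NULL) {
        return false;
    }
    item->pxNext->pxPrevious = item->pxPrevious;
    item->pxPrevious->pxNext = item->pxNext;
    if (list->pxIndex == item) {
        list->pxIndex = item->pxPrevious;
    }
    item->pxContainer = NULL;
    list->uxNumberOfItems--;
    return true;
}

/* Removing the same item twice: the second call finds pxContainer == NULL. */
void model_demo(void)
{
    ModelList list;
    ModelItem item;

    model_list_init(&list);
    model_list_insert_end(&list, &item);
    assert(model_list_remove_checked(&item));   /* first removal succeeds */
    assert(list.uxNumberOfItems == 0);
    assert(!model_list_remove_checked(&item));  /* second removal is refused */
    assert(model_fault_address(0u) == 4u);      /* NULL + 0x4, as in excvaddr */
}
```

The model only shows that any path which removes a state list item twice, or removes one that was never inserted, reproduces the observed fault address. It does not identify which path in the firmware does so.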
Is there anyone that recognizes this, or can give me guidance on where to go from here?
The difficulty is that the crash is so rare that I need 1000 devices to trigger it. All my attempts to reproduce it locally have failed, so I can't simply try the master branch or change things at random.
### More Information.
_No response_