Network Related Crash on Long Running MQTT connections #263

Open
@Spinnaker-design

Description

I am seeing a crash on the Portenta C33 after an MQTT client has been running for a long time (~15 minutes). The crash happens while the sketch is inside a delay() call, and the fault itself is in lwip_task in CNetIf.cpp. It looks like a memory management issue in the networking code.

We are using an SSL Client and certificates for our server authentication.

Activity

added labels on Feb 14, 2024:
type: imperfection (Perceived defect in any part of project)
topic: code (Related to content of the project itself)
Spinnaker-design

Spinnaker-design commented on Feb 14, 2024

@Spinnaker-design
Author

Here is the call stack for the crash:

_free_r@0x00060a0a (/_free_r.dbgasm:51)
__gnu_cxx::new_allocator<CMsg>::deallocate@0x0005ae5e (/Users/kylevisner/.platformio/packages/toolchain-gccarmnoneeabi@1.70201.0/arm-none-eabi/include/c++/7.2.1/ext/new_allocator.h:125)
std::allocator_traits<std::allocator<CMsg> >::deallocate@0x0005ae5e (/Users/kylevisner/.platformio/packages/toolchain-gccarmnoneeabi@1.70201.0/arm-none-eabi/include/c++/7.2.1/bits/alloc_traits.h:462)
std::_Deque_base<CMsg, std::allocator<CMsg> >::_M_deallocate_node@0x0005ae5e (/Users/kylevisner/.platformio/packages/toolchain-gccarmnoneeabi@1.70201.0/arm-none-eabi/include/c++/7.2.1/bits/stl_deque.h:609)
std::_Deque_base<CMsg, std::allocator<CMsg> >::_M_destroy_nodes@0x0005ae5e (/Users/kylevisner/.platformio/packages/toolchain-gccarmnoneeabi@1.70201.0/arm-none-eabi/include/c++/7.2.1/bits/stl_deque.h:743)
std::_Deque_base<CMsg, std::allocator<CMsg> >::~_Deque_base@0x0005ae74 (/Users/kylevisner/.platformio/packages/toolchain-gccarmnoneeabi@1.70201.0/arm-none-eabi/include/c++/7.2.1/bits/stl_deque.h:665)
std::deque<CMsg, std::allocator<CMsg> >::~deque@0x0005b1c4 (/Users/kylevisner/.platformio/packages/toolchain-gccarmnoneeabi@1.70201.0/arm-none-eabi/include/c++/7.2.1/bits/stl_deque.h:1045)
std::queue<CMsg, std::deque<CMsg, std::allocator<CMsg> > >::~queue@0x0005b1c4 (/Users/kylevisner/.platformio/packages/toolchain-gccarmnoneeabi@1.70201.0/arm-none-eabi/include/c++/7.2.1/bits/stl_queue.h:96)
CEspCom::clearToEspQueue@0x0005b1c4 (/CEspCom::clearToEspQueue.dbgasm:109)
esp_host_there_are_data_to_be_tx@0x0005a6e4 (/esp_host_there_are_data_to_be_tx.dbgasm:12)
esp_host_spi_transaction@0x0005a6f8 (/esp_host_spi_transaction.dbgasm:5)
esp_host_perform_spi_communication@0x0005a73e (/esp_host_perform_spi_communication.dbgasm:7)
CEspControl::communicateWithEsp@0x00058ed8 (/CEspControl::communicateWithEsp.dbgasm:10)
CLwipIf::lwip_task@0x0004c0a8 (/CLwipIf::lwip_task.dbgasm:30)
CLwipIf::timer_cb@0x0004c10a (/CLwipIf::timer_cb.dbgasm:4)
r_gpt_call_callback@0x0002e174 (Unknown Source:1719)
<signal handler called>@0xffffffe9 (Unknown Source:0)
bsp_prv_software_delay_loop@0x0002f864 (/bsp_prv_software_delay_loop.dbgasm:1)
delay@0x00023c0a (/delay.dbgasm:4)
SSLClient::read@0x0001f628 (/SSLClient::read.dbgasm:8)
SSLClient::connected@0x0001f5b8 (/SSLClient::connected.dbgasm:10)
andreagilardoni

andreagilardoni commented on Feb 15, 2024

@andreagilardoni
Contributor

Thanks for your report. I got the same error while working on #234. In that PR I am trying to deal with all the network-related issues, for the time being on Ethernet and WiFi. I will try to address this issue in that PR.

Spinnaker-design

Spinnaker-design commented on Feb 15, 2024

@Spinnaker-design
Author

Thanks, @andreagilardoni. Is there a workaround in the meantime to unblock us until that PR is done?

andreagilardoni

andreagilardoni commented on Feb 15, 2024

@andreagilardoni
Contributor

You can try using my PR and disabling the timer inside the network stack:

  • Take the example here as a reference
  • Comment out this line
  • Call CLwipIf::getInstance().task() inside the loop() function
  • Design your application to avoid blocking calls as much as possible

Any kind of feedback on this work is appreciated.
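
The last two points of the workaround above can be sketched roughly as follows. This is not code from the PR: NetStackStub, fakeNowMs, publishes and kPublishIntervalMs are hypothetical names, and the stub's task() only marks where the network stack would be serviced from loop(). The millis() stand-in is there so the sketch compiles off-target; on the board it comes from the Arduino core.

```cpp
#include <cstdint>

// Host-side stand-ins (assumptions, not part of the core or the PR).
static uint32_t fakeNowMs = 0;
static uint32_t millis() { return fakeNowMs; }

struct NetStackStub {
    uint32_t polls = 0;
    void task() { ++polls; }  // placeholder for servicing the network stack
};

static NetStackStub net;
static uint32_t publishes = 0;
static uint32_t lastPublishMs = 0;
static const uint32_t kPublishIntervalMs = 1000;

void setup() {
    lastPublishMs = millis();
}

void loop() {
    // Service the network stack on every pass, from the main context,
    // instead of from a timer interrupt.
    net.task();

    // An elapsed-time check replaces delay(1000), so loop() returns quickly.
    // Unsigned subtraction stays correct when millis() wraps around.
    if (millis() - lastPublishMs >= kPublishIntervalMs) {
        lastPublishMs += kPublishIntervalMs;
        ++publishes;  // placeholder for periodic work, e.g. an MQTT publish
    }
}
```

The point of the pattern is that nothing in loop() waits: periodic work is gated on elapsed time, so the network task gets called often even when the application has nothing to do.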

Spinnaker-design

Spinnaker-design commented on Feb 20, 2024

@Spinnaker-design
Author

@andreagilardoni I was able to build with your PR. Two items:

  • If you comment out line 30 of CNetIf.h, you'll get a build error.
  • If you attempt to build it with CLwipIf::getInstance().task(), you'll get the following error:

Compilation error: 'class CLwipIf' has no member named 'task'

zsnave

zsnave commented on Jun 19, 2024

@zsnave

Well, after many weeks of wireless networking problems on the C33 platform, it looks like no fixes are coming anytime soon. On our system we even "disable" networking after power-on (and brief use to access NTP), but the networking still causes a system hang after many hours of running (rare but fatal). It appears that the class destructors are not doing something correctly, since fragments of "WiFi" functionality are left operating after disconnection/shutdown. I think the advertisements for the Arduino C33 should NOT list networking, since it doesn't work correctly yet.

jeremypy972

jeremypy972 commented on Nov 21, 2024

@jeremypy972

Hello,
Have you found a way to fix this issue? It is very annoying.

Jérémy


Participants

@andreagilardoni, @per1234, @zsnave, @Spinnaker-design, @jeremypy972

Network Related Crash on Long Running MQTT connections · Issue #263 · arduino/ArduinoCore-renesas