-
Notifications
You must be signed in to change notification settings - Fork 87
Description
The Provision example currently crashes (with pretty high probability):
https://github.com/toitware/toit-provision/blob/53b6c211e3ea34de6ca090c3533e95cf164a3cc6/examples/ble-provision.toit
Use https://play.google.com/store/apps/details?id=com.espressif.provble on Android to test with that example.
[jaguar] INFO: program 12fc88d3-3617-b983-1f34-41ad35e7c4e4 started
For a QR code, open the following URL in a browser:
https://espressif.github.io/esp-jumpstart/qrcode.html?data={"ver":"v1","name":"PROV_6377b0","transport":"ble","username":wifiprov,"pop":abcd1234}
period-ms: 120
Heap report @ out of memory in primitive 3:4:
┌───────────┬──────────┬─────────────────────────────────────────────────────┐
│ Bytes │ Count │ Type │
├───────────┼──────────┼─────────────────────────────────────────────────────┤
│ 8 │ 1 │ external byte array │
│ 53248 │ 15 │ toit processes │
│ 16384 │ 4 │ system 0 c5a862de-48ea-7d8d-fe94-43b66a0d649a │
│ 20480 │ 5 │ current 1 a777996d-4f76-abbb-a2f6-0ae36a60bea4 │
│ 16384 │ 4 │ other 2 12fc88d3-3617-b983-1f34-41ad35e7c4e4 │
│ 16384 │ 1 │ heap metadata │
│ 4096 │ 1 │ spare new-space │
│ 13440 │ 46 │ lwip │
│ 7656 │ 600 │ heap overhead │
│ 1968 │ 28 │ event source │
│ 46600 │ 222 │ thread/other │
│ 30984 │ 25 │ thread/spawn │
│ 51216 │ 168 │ untagged │
│ 11616 │ 50 │ wifi │
└───────────┴──────────┴─────────────────────────────────────────────────────┘
Total: 237216 bytes in 555 allocations (93%), largest free 12k, total free 19k
[jaguar] WARN: running Jaguar failed due to 'MALLOC_FAILED' (1/3)
[jaguar.http] INFO: running Jaguar device 'dark-cabinet' (id: '6db5548c-fc75-400b-860e-ec2869aa4d4a') on 'http://192.168.88.117:9000'
Heap report @ out of memory in primitive 3:4:
┌───────────┬──────────┬─────────────────────────────────────────────────────┐
│ Bytes │ Count │ Type │
├───────────┼──────────┼─────────────────────────────────────────────────────┤
│ 8 │ 1 │ external byte array │
│ 57344 │ 16 │ toit processes │
│ 20480 │ 5 │ system 0 c5a862de-48ea-7d8d-fe94-43b66a0d649a │
│ 20480 │ 5 │ current 1 a777996d-4f76-abbb-a2f6-0ae36a60bea4 │
│ 16384 │ 4 │ other 2 12fc88d3-3617-b983-1f34-41ad35e7c4e4 │
│ 16384 │ 1 │ heap metadata │
│ 4096 │ 1 │ spare new-space │
│ 13440 │ 46 │ lwip │
│ 7728 │ 609 │ heap overhead │
│ 1968 │ 28 │ event source │
│ 46800 │ 228 │ thread/other │
│ 30984 │ 25 │ thread/spawn │
│ 51216 │ 168 │ untagged │
│ 11616 │ 50 │ wifi │
└───────────┴──────────┴─────────────────────────────────────────────────────┘
Total: 241584 bytes in 562 allocations (95%), largest free 12k, total free 15k
[jaguar] WARN: running Jaguar failed due to 'MALLOC_FAILED' (2/3)
[jaguar.http] INFO: running Jaguar device 'dark-cabinet' (id: '6db5548c-fc75-400b-860e-ec2869aa4d4a') on 'http://192.168.88.117:9000'
Heap report @ out of memory in primitive 3:4:
┌───────────┬──────────┬─────────────────────────────────────────────────────┐
│ Bytes │ Count │ Type │
├───────────┼──────────┼─────────────────────────────────────────────────────┤
│ 8 │ 1 │ external byte array │
│ 57344 │ 16 │ toit processes │
│ 20480 │ 5 │ system 0 c5a862de-48ea-7d8d-fe94-43b66a0d649a │
│ 20480 │ 5 │ current 1 a777996d-4f76-abbb-a2f6-0ae36a60bea4 │
│ 16384 │ 4 │ other 2 12fc88d3-3617-b983-1f34-41ad35e7c4e4 │
│ 16384 │ 1 │ heap metadata │
│ 4096 │ 1 │ spare new-space │
│ 13432 │ 46 │ lwip │
│ 7712 │ 607 │ heap overhead │
│ 1968 │ 28 │ event source │
│ 46728 │ 225 │ thread/other │
│ 30984 │ 25 │ thread/spawn │
│ 51216 │ 168 │ untagged │
│ 11616 │ 50 │ wifi │
└───────────┴──────────┴─────────────────────────────────────────────────────┘
Total: 241488 bytes in 559 allocations (95%), largest free 12k, total free 15k
[jaguar] WARN: running Jaguar failed due to 'MALLOC_FAILED' (3/3)
[jaguar] INFO: backing off for 5s
[wifi] DEBUG: closing
E (37499) wifi:NAN WiFi stop
fatal: err
abort() was called at PC 0x400f671b on core 1
******************************************************************************
Decoding by `jag`, device has version <2.0.0-alpha.174>
******************************************************************************
Crash in native code:
Backtrace: 0x400861be:0x3ffd3c70 0x40096245:0x3ffd3c90 0x40099e62:0x3ffd3cb0 0x400f671b:0x3ffd3d20 0x400e0925:0x3ffd3d70 0x400e0955:0x3ffd3d90 0x401b793d:0x3ffd3db0 0x400f4bf6:0x3ffd3dd0 0x400e1282:0x3ffd3df0 0x40081d1a:0x3ffd3e10 0x400f56d8:0x3ffd3ea0 0x400f5961:0x3ffd3ef0 0x400f5981:0x3ffd3f20 0x400e8b4a:0x3ffd3f40 0x400e8b71:0x3ffd3f60
0x400861be: panic_abort + 0x12
0x40096245: esp_system_abort + 0x5
0x40099e62: abort + 0x7e
0x400f671b: toit::fail(char const*, ...) + 0x57
0x400e0925: toit::WifiEvents::~WifiEvents() + 0x2d
0x400e0955: toit::WifiEvents::~WifiEvents() + 0x5
0x401b793d: toit::Resource::delete_or_mark_for_deletion() + 0x9
0x400f4bf6: toit::ResourceGroup::tear_down() + 0x4e
0x400e1282: toit::primitive_close(toit::Process*, toit::Object**) + 0x36
0x40081d1a: toit::Interpreter::run() + 0xa42
0x400f56d8: toit::Scheduler::run_process(toit::Locker&, toit::Process*, toit::SchedulerThread*) + 0x60
0x400f5961: toit::Scheduler::run(toit::SchedulerThread*) + 0x5d
0x400f5981: toit::SchedulerThread::entry() + 0x9
0x400e8b4a: toit::Thread::_boot() + 0x22
0x400e8b71: toit::esp_thread_start(void*) + 0x5
******************************************************************************
I haven't yet been able to build a minimal example, but there seem to be two reasons for the crash:
- Jaguar incorrectly believes that it has run out of memory and tries to shut down the module
- During shutdown the call to
esp_wifi_disconnectfails, but is guarded withFATAL_IF_NOT_ESP_OK.
OOM
This happens only if the BLE stack is active.
Jaguar runs into an OOM during udp-send (primitive 3:4).
I have traced it down to low_level_output returning an ERR_MEM: https://github.com/toitware/esp-idf/blob/262239be8bef8d90f2430d13533d789cd3f14a3b/components/esp_netif/lwip/netif/wlanif.c#L109
The call to esp_netif_transmit_wrap transitively just calls esp_wifi_internal_tx. I believe we don't have access to that code.
Note the comment on top of the low_level_output function:
* @note Returning ERR_MEM here if a DMA queue of your MAC is full can lead to
* strange results. You might consider waiting for space in the DMA queue
* to become available since the stack doesn't retry to send a packet
* dropped because of memory failure (except for the TCP timers).
*/
I added debug-prints, and the amount of data is always 303 bytes. Given that the OOM message states that there is at least 12K free, it looks like the ERR_MEM isn't for "normal" memory, but for the DMA queue.
Crash
The crash is in the destructor of the WifiEvents:
toit/src/resources/wifi_esp32.cc
Line 169 in ad90af0
| FATAL_IF_NOT_ESP_OK(err); |
The error code is 0x3002 ESP_ERR_WIFI_NOT_STARTED
It looks like running a scan makes the WiFi driver think that the driver isn't started. It's probably safe to continue to the esp_wifi_stop, although it's not clear if we could run into a race condition where the scan stops, reactivates the wifi and the esp_wifi_stop is then failing.
Example
I didn't manage to have both errors at the same time in my example.
Run this code without Jaguar active (as it will keep the network active).
import ble
import monitor
import net
import net.wifi
import net.udp
import expect show *
main:
test
scan:
channels := ByteArray 14: it + 1
ap-list := wifi.scan channels --period-per-channel-ms=120
ssids := ap-list.map: | ap/wifi.AccessPoint | ap.ssid
print ssids
spawn-udp:
spawn::
// Now try the same thing while the network is open
// and broadcasting UDP packets.
network := net.open
payload := "scan-test".to-byte-array
broadcast-address := net.IpAddress.parse "255.255.255.255"
test-port := 54541
test-address := net.SocketAddress broadcast-address test-port
datagram ::= udp.Datagram payload test-address
socket := network.udp-open
try:
socket.broadcast = true
start := Time.now
while start.to-now < (Duration --s=100):
socket.send datagram
sleep --ms=200
finally:
print_ "closing down"
socket.close
network.close
spawn-network-close:
spawn::
network := net.open
sleep --ms=8_000
network.close
test:
// Try to do a scan without any distractions.
scan
spawn-udp // <================ OOM error.
// spawn-network-close // <================= native crash.
adapter := ble.Adapter
peripheral := adapter.peripheral
advertisement := ble.Advertisement
--name="scan-test"
scan-response := ble.Advertisement
--manufacturer-specific="testing"
peripheral.start-advertise
advertisement
--scan-response=scan-response
--interval=(Duration --ms=160)
--connection-mode=ble.BLE-CONNECT-MODE-UNDIRECTIONAL
sleep --ms=200 // Give the spawned process time to start broadcasting.
start := Time.now
while start.to-now < (Duration --s=50):
scan
peripheral.close
adapter.close