Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BLE + WiFi Scan + UDP = crash #2714

Open
floitsch opened this issue Feb 17, 2025 · 0 comments
Open

BLE + WiFi Scan + UDP = crash #2714

floitsch opened this issue Feb 17, 2025 · 0 comments

Comments

@floitsch
Copy link
Member

floitsch commented Feb 17, 2025

The Provision example currently crashes (with pretty high probability):
https://github.com/toitware/toit-provision/blob/53b6c211e3ea34de6ca090c3533e95cf164a3cc6/examples/ble-provision.toit

Use https://play.google.com/store/apps/details?id=com.espressif.provble on Android to test with that example.

[jaguar] INFO: program 12fc88d3-3617-b983-1f34-41ad35e7c4e4 started
For a QR code, open the following URL in a browser:

https://espressif.github.io/esp-jumpstart/qrcode.html?data={"ver":"v1","name":"PROV_6377b0","transport":"ble","username":wifiprov,"pop":abcd1234}
period-ms: 120
Heap report @ out of memory in primitive 3:4:
  ┌───────────┬──────────┬─────────────────────────────────────────────────────┐
  │   Bytes   │  Count   │  Type                                               │
  ├───────────┼──────────┼─────────────────────────────────────────────────────┤
  │       8   │      1   │  external byte array                                │
  │   53248   │     15   │  toit processes                                     │
  │     16384 │        4 │    system    0 c5a862de-48ea-7d8d-fe94-43b66a0d649a │
  │     20480 │        5 │    current   1 a777996d-4f76-abbb-a2f6-0ae36a60bea4 │
  │     16384 │        4 │    other     2 12fc88d3-3617-b983-1f34-41ad35e7c4e4 │
  │   16384   │      1   │  heap metadata                                      │
  │    4096   │      1   │  spare new-space                                    │
  │   13440   │     46   │  lwip                                               │
  │    7656   │    600   │  heap overhead                                      │
  │    1968   │     28   │  event source                                       │
  │   46600   │    222   │  thread/other                                       │
  │   30984   │     25   │  thread/spawn                                       │
  │   51216   │    168   │  untagged                                           │
  │   11616   │     50   │  wifi                                               │
  └───────────┴──────────┴─────────────────────────────────────────────────────┘
  Total: 237216 bytes in 555 allocations (93%), largest free 12k, total free 19k
[jaguar] WARN: running Jaguar failed due to 'MALLOC_FAILED' (1/3)
[jaguar.http] INFO: running Jaguar device 'dark-cabinet' (id: '6db5548c-fc75-400b-860e-ec2869aa4d4a') on 'http://192.168.88.117:9000'
Heap report @ out of memory in primitive 3:4:
  ┌───────────┬──────────┬─────────────────────────────────────────────────────┐
  │   Bytes   │  Count   │  Type                                               │
  ├───────────┼──────────┼─────────────────────────────────────────────────────┤
  │       8   │      1   │  external byte array                                │
  │   57344   │     16   │  toit processes                                     │
  │     20480 │        5 │    system    0 c5a862de-48ea-7d8d-fe94-43b66a0d649a │
  │     20480 │        5 │    current   1 a777996d-4f76-abbb-a2f6-0ae36a60bea4 │
  │     16384 │        4 │    other     2 12fc88d3-3617-b983-1f34-41ad35e7c4e4 │
  │   16384   │      1   │  heap metadata                                      │
  │    4096   │      1   │  spare new-space                                    │
  │   13440   │     46   │  lwip                                               │
  │    7728   │    609   │  heap overhead                                      │
  │    1968   │     28   │  event source                                       │
  │   46800   │    228   │  thread/other                                       │
  │   30984   │     25   │  thread/spawn                                       │
  │   51216   │    168   │  untagged                                           │
  │   11616   │     50   │  wifi                                               │
  └───────────┴──────────┴─────────────────────────────────────────────────────┘
  Total: 241584 bytes in 562 allocations (95%), largest free 12k, total free 15k
[jaguar] WARN: running Jaguar failed due to 'MALLOC_FAILED' (2/3)
[jaguar.http] INFO: running Jaguar device 'dark-cabinet' (id: '6db5548c-fc75-400b-860e-ec2869aa4d4a') on 'http://192.168.88.117:9000'
Heap report @ out of memory in primitive 3:4:
  ┌───────────┬──────────┬─────────────────────────────────────────────────────┐
  │   Bytes   │  Count   │  Type                                               │
  ├───────────┼──────────┼─────────────────────────────────────────────────────┤
  │       8   │      1   │  external byte array                                │
  │   57344   │     16   │  toit processes                                     │
  │     20480 │        5 │    system    0 c5a862de-48ea-7d8d-fe94-43b66a0d649a │
  │     20480 │        5 │    current   1 a777996d-4f76-abbb-a2f6-0ae36a60bea4 │
  │     16384 │        4 │    other     2 12fc88d3-3617-b983-1f34-41ad35e7c4e4 │
  │   16384   │      1   │  heap metadata                                      │
  │    4096   │      1   │  spare new-space                                    │
  │   13432   │     46   │  lwip                                               │
  │    7712   │    607   │  heap overhead                                      │
  │    1968   │     28   │  event source                                       │
  │   46728   │    225   │  thread/other                                       │
  │   30984   │     25   │  thread/spawn                                       │
  │   51216   │    168   │  untagged                                           │
  │   11616   │     50   │  wifi                                               │
  └───────────┴──────────┴─────────────────────────────────────────────────────┘
  Total: 241488 bytes in 559 allocations (95%), largest free 12k, total free 15k
[jaguar] WARN: running Jaguar failed due to 'MALLOC_FAILED' (3/3)
[jaguar] INFO: backing off for 5s
[wifi] DEBUG: closing
E (37499) wifi:NAN WiFi stop
fatal: err

abort() was called at PC 0x400f671b on core 1



******************************************************************************
Decoding by `jag`, device has version <2.0.0-alpha.174>
******************************************************************************
Crash in native code:
Backtrace: 0x400861be:0x3ffd3c70 0x40096245:0x3ffd3c90 0x40099e62:0x3ffd3cb0 0x400f671b:0x3ffd3d20 0x400e0925:0x3ffd3d70 0x400e0955:0x3ffd3d90 0x401b793d:0x3ffd3db0 0x400f4bf6:0x3ffd3dd0 0x400e1282:0x3ffd3df0 0x40081d1a:0x3ffd3e10 0x400f56d8:0x3ffd3ea0 0x400f5961:0x3ffd3ef0 0x400f5981:0x3ffd3f20 0x400e8b4a:0x3ffd3f40 0x400e8b71:0x3ffd3f60
0x400861be: panic_abort + 0x12
0x40096245: esp_system_abort + 0x5
0x40099e62: abort + 0x7e
0x400f671b: toit::fail(char const*, ...) + 0x57
0x400e0925: toit::WifiEvents::~WifiEvents() + 0x2d
0x400e0955: toit::WifiEvents::~WifiEvents() + 0x5
0x401b793d: toit::Resource::delete_or_mark_for_deletion() + 0x9
0x400f4bf6: toit::ResourceGroup::tear_down() + 0x4e
0x400e1282: toit::primitive_close(toit::Process*, toit::Object**) + 0x36
0x40081d1a: toit::Interpreter::run() + 0xa42
0x400f56d8: toit::Scheduler::run_process(toit::Locker&, toit::Process*, toit::SchedulerThread*) + 0x60
0x400f5961: toit::Scheduler::run(toit::SchedulerThread*) + 0x5d
0x400f5981: toit::SchedulerThread::entry() + 0x9
0x400e8b4a: toit::Thread::_boot() + 0x22
0x400e8b71: toit::esp_thread_start(void*) + 0x5
******************************************************************************

I haven't yet been able to build a minimal example, but there seem to be two reasons for the crash:

  1. Jaguar incorrectly believes that it has run out of memory and tries to shut down the module
  2. During shutdown the call to esp_wifi_disconnect fails, but is guarded with FATAL_IF_NOT_ESP_OK.

OOM

This happens only if the BLE stack is active.

Jaguar runs into an OOM during udp-send (primitive 3:4).
I have traced it down to low_level_output returning an ERR_MEM: https://github.com/toitware/esp-idf/blob/262239be8bef8d90f2430d13533d789cd3f14a3b/components/esp_netif/lwip/netif/wlanif.c#L109

The call to esp_netif_transmit_wrap transitively just calls esp_wifi_internal_tx. I believe we don't have access to that code.

Note the comment on top of the low_level_output function:

 * @note Returning ERR_MEM here if a DMA queue of your MAC is full can lead to
 *       strange results. You might consider waiting for space in the DMA queue
 *       to become available since the stack doesn't retry to send a packet
 *       dropped because of memory failure (except for the TCP timers).
 */

I added debug-prints, and the amount of data is always 303 bytes. Given that the OOM message states that there is at least 12K free, it looks like the ERR_MEM isn't for "normal" memory, but for the DMA queue.

Crash

The crash is in the destructor of the WifiEvents:

FATAL_IF_NOT_ESP_OK(err);

The error code is 0x3002 ESP_ERR_WIFI_NOT_STARTED

It looks like running a scan makes the WiFi driver think that the driver isn't started. It's probably safe to continue to the esp_wifi_stop, although it's not clear if we could run into a race condition where the scan stops, reactivates the wifi and the esp_wifi_stop is then failing.

Example

I didn't manage to have both errors at the same time in my example.
Run this code without Jaguar active (as it will keep the network active).

import ble
import monitor
import net
import net.wifi
import net.udp
import expect show *

import .test

main:
  run-test: test

scan:
  channels := ByteArray 14: it + 1
  ap-list := wifi.scan channels --period-per-channel-ms=120
  ssids := ap-list.map: | ap/wifi.AccessPoint | ap.ssid
  print ssids

spawn-udp:
  spawn::
    // Now try the same thing while the network is open
    // and broadcasting UDP packets.
    network := net.open
    payload := "scan-test".to-byte-array
    broadcast-address := net.IpAddress.parse "255.255.255.255"
    test-port := 54541
    test-address := net.SocketAddress broadcast-address test-port
    datagram ::= udp.Datagram payload test-address
    socket := network.udp-open
    try:
      socket.broadcast = true
      start := Time.now
      while start.to-now < (Duration --s=100):
        socket.send datagram
        sleep --ms=200
    finally:
      print_ "closing down"
      socket.close
      network.close

spawn-network-close:
  spawn::
    network := net.open
    sleep --ms=8_000
    network.close

test:
  // Try to do a scan without any distractions.
  scan

  spawn-udp  // <================ OOM error.
  // spawn-network-close   // <================= native crash.

  adapter := ble.Adapter
  peripheral := adapter.peripheral
  advertisement := ble.Advertisement
      --name="scan-test"
  scan-response := ble.Advertisement
      --manufacturer-specific="testing"
  peripheral.start-advertise
      advertisement
      --scan-response=scan-response
      --interval=(Duration --ms=160)
      --connection-mode=ble.BLE-CONNECT-MODE-UNDIRECTIONAL

  sleep --ms=200  // Give the spawned process time to start broadcasting.
  start := Time.now
  while start.to-now < (Duration --s=50):
    scan
  peripheral.close
  adapter.close
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

1 participant