Description
Problem
bleak-retry-connector handles connection establishment but provides no mechanism for monitoring an established connection. A BLE connection can die silently, leaving a state in which:
- BlueZ still reports Connected=True
- No disconnect callback fires
- No BLE notifications are received
- The radio link is effectively dead
The service sits forever believing it's connected, publishing stale data. Only a human looking at the logs notices the problem.
This is common on embedded Linux where RF interference, distance changes, or BlueZ bugs silently kill the radio link without generating a disconnect event.
Environment
- Victron Cerbo GX, Venus OS v3.67, BlueZ 5.x
- 2 USB BLE adapters (hci0, hci1)
- BLE devices: BMS batteries (Nordic UART GATT, expected notification every ~5s), power monitor (custom GATT)
Production Evidence
Battery BMS connection: last connect at 23:15:26, data flowing until ~00:13:00, then complete silence. Both service processes showed the same thread count (3); the daemon was alive but producing nothing. There was no disconnect event in the logs, and BleakClient.is_connected still returned True.
This has also been observed immediately after adapter failover: a connection succeeds on the fallback adapter but goes zombie within seconds -- HCI handle present, Connected: yes, ServicesResolved: yes, but zero notification traffic.
Proposed Approach
Add a ConnectionWatchdog class that the caller uses alongside their connection:

    from __future__ import annotations

    from collections.abc import Awaitable, Callable

    from bleak import BleakClient
    from bleak.backends.device import BLEDevice


    class ConnectionWatchdog:
        def __init__(
            self,
            timeout: float,  # required -- only the caller knows the expected cadence
            on_timeout: Callable[[], Awaitable[None]] | None = None,
            client: BleakClient | None = None,
            device: BLEDevice | None = None,
        ) -> None: ...

        def notify_activity(self) -> None: ...
        def start(self) -> None: ...
        def stop(self) -> None: ...

Key design decisions:
- timeout is required with no default -- a battery BMS expects data every ~5s (timeout 30s), a temperature sensor every ~60s (timeout 180s). No single default makes sense.
- The caller calls notify_activity() from their notification callback to reset the timer.
- When client and device are both provided, the watchdog performs BlueZ-level cleanup on timeout: client.disconnect() (with a 5s timeout to prevent a hang on phantom connections) followed by clear_cache(device.address) to remove the device from BlueZ so the next establish_connection() starts fresh.
- The on_timeout callback fires after cleanup, where the caller can trigger reconnection.
- When client/device are not provided, the watchdog just fires the callback -- the caller handles cleanup themselves.
- Uses asyncio.Task for the monitoring loop -- no threads needed.
- Fires once and stops. After the timeout, the caller reconnects via the callback and creates a new watchdog for the new connection.
Usage:

    watchdog = ConnectionWatchdog(
        timeout=30.0,
        on_timeout=my_reconnect_callback,
        client=client,
        device=device,
    )
    watchdog.start()

    # In the notification callback:
    watchdog.notify_activity()

The class lives in a new file, watchdog.py, and is exported via __all__. There are no changes to establish_connection() or the existing API.
What This Fixes
- Zombie connections: Radio link silently dies, no disconnect callback fires, no notifications arrive. Without a watchdog, the service sits indefinitely believing it's connected. With the watchdog, the dead connection is detected within timeout seconds, cleaned up at the BlueZ level, and the caller can reconnect.
- Post-failover zombies: After adapter rotation, a connection succeeds but immediately goes zombie. The watchdog catches this within one timeout cycle.
This is complementary to bleak's built-in disconnect callback, which handles clean disconnects where BlueZ reports Connected=False. The watchdog handles the case where BlueZ never reports a disconnect -- Connected stays True but no data flows.
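One way a caller might combine the two paths is to route both into a single reconnect signal, so the reconnect logic does not care which detector fired first. The sketch below is only an illustration: ReconnectSignal and its method names are hypothetical, on_disconnected is shaped like bleak's disconnected_callback (called with the client), and on_watchdog_timeout matches the proposed on_timeout signature.

```python
from __future__ import annotations

import asyncio
from typing import Any


class ReconnectSignal:
    """Hypothetical helper: the first trigger wins, later ones are ignored."""

    def __init__(self) -> None:
        # Construct this inside a running event loop (e.g. in the service's main coroutine).
        self._event = asyncio.Event()
        self.reason: str | None = None

    def on_disconnected(self, client: Any) -> None:
        # Clean disconnect: BlueZ reported Connected=False.
        self._trigger("disconnect callback")

    async def on_watchdog_timeout(self) -> None:
        # Zombie connection: Connected stayed True but no data arrived.
        self._trigger("watchdog timeout")

    def _trigger(self, reason: str) -> None:
        if not self._event.is_set():
            self.reason = reason
            self._event.set()

    async def wait(self) -> str:
        # Block until either path fires, then report which one it was.
        await self._event.wait()
        return self.reason or ""
```

A supervising coroutine could await wait(), tear down the old client and watchdog, call establish_connection() again, and then create a fresh signal and watchdog for the new connection.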
Reference Implementation
Branch with code and tests: feat/notification-watchdog