
No mechanism to detect zombie connections -- silent radio death goes undetected #228

@cgoudie


Problem

bleak-retry-connector handles connection establishment but provides no mechanism for monitoring a connection once it is established. A BLE connection can die silently, leaving a state in which:

  • BlueZ still reports Connected=True
  • No disconnect callback fires
  • No BLE notifications are received
  • The radio link is effectively dead

The service sits forever believing it's connected, publishing stale data. Only a human looking at the logs notices the problem.

This is common on embedded Linux where RF interference, distance changes, or BlueZ bugs silently kill the radio link without generating a disconnect event.

Environment

  • Victron Cerbo GX, Venus OS v3.67, BlueZ 5.x
  • 2 USB BLE adapters (hci0, hci1)
  • BLE devices: BMS batteries (Nordic UART GATT, expected notification every ~5s), power monitor (custom GATT)

Production Evidence

Battery BMS connection: last connect at 23:15:26, data flowing until ~00:13:00, then complete silence. Both service processes had the same thread count (3); the daemon was alive but producing nothing. There was no disconnect event in the logs, and BleakClient.is_connected still returned True.

This has also been observed immediately after adapter failover: a connection succeeds on the fallback adapter but goes zombie within seconds -- HCI handle present, Connected: yes, ServicesResolved: yes, but zero notification traffic.

Proposed Approach

Add a ConnectionWatchdog class that the caller uses alongside their connection:

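from __future__ import annotations

from collections.abc import Awaitable, Callable

from bleak import BleakClient
from bleak.backends.device import BLEDevice


class ConnectionWatchdog:
    def __init__(
        self,
        timeout: float,          # required -- only caller knows expected cadence
        on_timeout: Callable[[], Awaitable[None]] | None = None,
        client: BleakClient | None = None,   # optional: enables cleanup on timeout
        device: BLEDevice | None = None,     # optional: enables cleanup on timeout
    ) -> None: ...

    def notify_activity(self) -> None: ...   # reset the inactivity timer
    def start(self) -> None: ...             # begin monitoring
    def stop(self) -> None: ...              # cancel monitoring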

Key design decisions:

  • timeout is required with no default -- a battery BMS expects data every ~5s (timeout 30s), a temperature sensor every ~60s (timeout 180s). No single default makes sense.
  • The caller calls notify_activity() from their notification callback to reset the timer.
  • When client and device are both provided, the watchdog performs BlueZ-level cleanup on timeout: client.disconnect() (with a 5s timeout to prevent hang on phantom connections) followed by clear_cache(device.address) to remove the device from BlueZ so the next establish_connection() starts fresh.
  • The on_timeout callback fires after cleanup, where the caller can trigger reconnection.
  • When client/device are not provided, the watchdog just fires the callback -- the caller handles cleanup themselves.
  • Uses asyncio.Task for the monitoring loop -- no threads needed.
  • Fires once and stops. After the timeout, the caller reconnects via the callback and creates a new watchdog for the new connection.

# Usage:
watchdog = ConnectionWatchdog(
    timeout=30.0,
    on_timeout=my_reconnect_callback,
    client=client,
    device=device,
)
watchdog.start()

# In notification callback:
watchdog.notify_activity()

New file watchdog.py, exported via __all__. No changes to establish_connection() or existing API.
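
To make the timer behaviour in the design decisions concrete, here is a minimal sketch of the callback-only mode (no client/device cleanup). It is not the code on the reference branch; the class name WatchdogSketch and its internals are hypothetical, and it assumes start() is called from inside a running event loop.

```python
from __future__ import annotations

import asyncio
from collections.abc import Awaitable, Callable


class WatchdogSketch:
    """Hypothetical stand-in for the proposed ConnectionWatchdog (callback-only)."""

    def __init__(
        self,
        timeout: float,
        on_timeout: Callable[[], Awaitable[None]] | None = None,
    ) -> None:
        self._timeout = timeout
        self._on_timeout = on_timeout
        self._last_activity = 0.0
        self._task: asyncio.Task[None] | None = None

    def notify_activity(self) -> None:
        # Called from the notification callback; moves the deadline forward.
        self._last_activity = asyncio.get_running_loop().time()

    def start(self) -> None:
        # Assumption: invoked from within a running event loop.
        self.notify_activity()
        self._task = asyncio.get_running_loop().create_task(self._monitor())

    def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            self._task = None

    async def _monitor(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Sleep only until the current deadline; activity that arrives
            # while sleeping extends the deadline and the loop sleeps again.
            remaining = self._last_activity + self._timeout - loop.time()
            if remaining <= 0:
                break
            await asyncio.sleep(remaining)
        # Fires once and stops: the task ends after the callback returns.
        self._task = None
        if self._on_timeout is not None:
            await self._on_timeout()
```

Sleeping until the deadline rather than polling on a fixed interval keeps wakeups to roughly one per timeout period when data is flowing, which may matter on small embedded targets.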

What This Fixes

  • Zombie connections: Radio link silently dies, no disconnect callback fires, no notifications arrive. Without a watchdog, the service sits indefinitely believing it's connected. With the watchdog, the dead connection is detected within timeout seconds, cleaned up at the BlueZ level, and the caller can reconnect.
  • Post-failover zombies: After adapter rotation, a connection succeeds but immediately goes zombie. The watchdog catches this within one timeout cycle.
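
The BlueZ-level cleanup step (a bounded disconnect followed by a cache clear) could look roughly like the sketch below. The function name cleanup_zombie is hypothetical. The client is duck-typed as anything with an async disconnect(), and the cache-clearing coroutine is passed in as a parameter so the sketch does not depend on a specific import; in the proposal that role is played by clear_cache(device.address).

```python
from __future__ import annotations

import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


async def cleanup_zombie(
    client: Any,                                   # e.g. a BleakClient
    address: str,                                  # e.g. device.address
    clear_cache: Callable[[str], Awaitable[Any]],  # injected cache-clear coroutine
    disconnect_timeout: float = 5.0,
) -> bool:
    """Return True if disconnect() completed within the time limit."""
    clean = True
    try:
        # Bound the disconnect: on a phantom connection it may never return.
        await asyncio.wait_for(client.disconnect(), timeout=disconnect_timeout)
    except asyncio.TimeoutError:
        clean = False
    except Exception:
        # Assumption: any backend error here is treated as "not clean" and
        # cleanup continues, since the link is already considered dead.
        clean = False
    # Always clear the cached device so the next connection attempt starts fresh.
    await clear_cache(address)
    return clean
```

The cache clear runs even when the disconnect times out, since a hung disconnect is exactly the case where stale BlueZ state is most likely to interfere with the next attempt.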

This is complementary to bleak's built-in disconnect callback, which handles clean disconnects where BlueZ reports Connected=False. The watchdog handles the case where BlueZ never reports a disconnect -- Connected stays True but no data flows.
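
One way a caller might combine the two paths is to funnel both into a single "link lost" signal, so reconnect logic runs once regardless of which mechanism noticed the failure. The LinkLossSignal class below is a hypothetical helper, not part of bleak or bleak-retry-connector. It assumes the disconnect callback is invoked with the client as its only argument and that the object is created inside a running event loop.

```python
from __future__ import annotations

import asyncio
from typing import Any


class LinkLossSignal:
    """Hypothetical helper: first of 'disconnect' or 'watchdog' wins."""

    def __init__(self) -> None:
        # Create inside a running loop for compatibility with older Pythons.
        self._event = asyncio.Event()
        self.reason: str | None = None

    def on_disconnected(self, _client: Any) -> None:
        # Clean disconnect path: BlueZ reported Connected=False.
        self._trigger("disconnect")

    async def on_watchdog_timeout(self) -> None:
        # Zombie path: Connected stayed True but no data arrived in time.
        self._trigger("watchdog")

    def _trigger(self, reason: str) -> None:
        # Record only the first cause so reconnect logic runs once.
        if not self._event.is_set():
            self.reason = reason
            self._event.set()

    async def wait(self) -> str | None:
        await self._event.wait()
        return self.reason
```

The caller would pass on_disconnected as the client's disconnect callback and on_watchdog_timeout as the watchdog's on_timeout, then await wait() in its supervision loop before reconnecting.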

Reference Implementation

Branch with code and tests: feat/notification-watchdog
