No mechanism to detect zombie connections -- silent radio death goes undetected

### Problem

`bleak-retry-connector` handles connection *establishment* but provides no mechanism for monitoring an *established* connection. A BLE connection can die silently, leaving a state in which:

- BlueZ still reports `Connected=True`
- No disconnect callback fires
- No BLE notifications are received
- The radio link is effectively dead

The service sits forever believing it's connected, publishing stale data. Only a human looking at the logs notices the problem.

This is common on embedded Linux where RF interference, distance changes, or BlueZ bugs silently kill the radio link without generating a disconnect event.

### Environment

- Victron Cerbo GX, Venus OS v3.67, BlueZ 5.x
- 2 USB BLE adapters (hci0, hci1)
- BLE devices: BMS batteries (Nordic UART GATT, expected notification every ~5s), power monitor (custom GATT)

### Production Evidence

Battery BMS connection: last connect at 23:15:26, data flowing until ~00:13:00, then complete silence. Both service processes had the same thread count (3), and the daemon was alive but producing nothing. There was no disconnect event in the logs, and `BleakClient.is_connected` still returned `True`.

This has also been observed immediately after adapter failover: a connection succeeds on the fallback adapter but goes zombie within seconds -- HCI handle present, `Connected: yes`, `ServicesResolved: yes`, but zero notification traffic.

### Proposed Approach

Add a `ConnectionWatchdog` class that the caller uses alongside their connection:

```python
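from __future__ import annotations

from collections.abc import Awaitable, Callable

from bleak import BleakClient
from bleak.backends.device import BLEDevice


class ConnectionWatchdog: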
    def __init__(
        self,
        timeout: float,          # required -- only caller knows expected cadence
        on_timeout: Callable[[], Awaitable[None]] | None = None,
        client: BleakClient | None = None,
        device: BLEDevice | None = None,
    ) -> None: ...

    def notify_activity(self) -> None: ...
    def start(self) -> None: ...
    def stop(self) -> None: ...
```

**Key design decisions:**

- `timeout` is required with no default -- a battery BMS expects data every ~5s (timeout 30s), a temperature sensor every ~60s (timeout 180s). No single default makes sense.
- The caller calls `notify_activity()` from their notification callback to reset the timer.
- When `client` and `device` are both provided, the watchdog performs BlueZ-level cleanup on timeout: `client.disconnect()` (with a 5s timeout to prevent hang on phantom connections) followed by `clear_cache(device.address)` to remove the device from BlueZ so the next `establish_connection()` starts fresh.
- The `on_timeout` callback fires after cleanup, where the caller can trigger reconnection.
- When `client`/`device` are not provided, the watchdog just fires the callback -- the caller handles cleanup themselves.
- Uses `asyncio.Task` for the monitoring loop -- no threads needed.
- Fires once and stops. After the timeout, the caller reconnects via the callback and creates a new watchdog for the new connection.
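
The timer behaviour described in the list above can be sketched with a single `asyncio.Task`. This is a minimal illustration only, not the code from the linked branch: the class name `WatchdogSketch` and the `fired` attribute are hypothetical, and the optional `client`/`device` cleanup is left out so the sketch has no dependency on bleak.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


class WatchdogSketch:
    """Illustrative inactivity watchdog (hypothetical name, callback-only mode)."""

    def __init__(
        self,
        timeout: float,
        on_timeout: Optional[Callable[[], Awaitable[None]]] = None,
    ) -> None:
        self._timeout = timeout
        self._on_timeout = on_timeout
        self._last_activity = time.monotonic()
        self._task: Optional[asyncio.Task] = None
        self.fired = False  # hypothetical flag, handy for inspection in tests

    def notify_activity(self) -> None:
        # Called from the notification callback; only records a timestamp.
        self._last_activity = time.monotonic()

    def start(self) -> None:
        # Must be called while an event loop is running.
        self._last_activity = time.monotonic()
        self._task = asyncio.get_running_loop().create_task(self._run())

    def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            self._task = None

    async def _run(self) -> None:
        while True:
            idle = time.monotonic() - self._last_activity
            remaining = self._timeout - idle
            if remaining <= 0:
                break
            # Sleep until the earliest moment the deadline could expire,
            # then re-check in case activity arrived in the meantime.
            await asyncio.sleep(remaining)
        # Fire once, then stop: the caller creates a new watchdog after reconnecting.
        self.fired = True
        self._task = None
        if self._on_timeout is not None:
            await self._on_timeout()
```

Sleeping for the remaining time rather than polling at a fixed interval keeps the loop idle while data is flowing, and using `time.monotonic()` avoids false timeouts if the wall clock is adjusted.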

```python
# Usage:
watchdog = ConnectionWatchdog(
    timeout=30.0,
    on_timeout=my_reconnect_callback,
    client=client,
    device=device,
)
watchdog.start()

# In notification callback:
watchdog.notify_activity()
```

New file `watchdog.py`, exported via `__all__`. No changes to `establish_connection()` or existing API.

### What This Fixes

- **Zombie connections**: Radio link silently dies, no disconnect callback fires, no notifications arrive. Without a watchdog, the service sits indefinitely believing it's connected. With the watchdog, the dead connection is detected within `timeout` seconds, cleaned up at the BlueZ level, and the caller can reconnect.
- **Post-failover zombies**: After adapter rotation, a connection succeeds but immediately goes zombie. The watchdog catches this within one timeout cycle.
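
The BlueZ-level cleanup mentioned above (a bounded `disconnect()` followed by clearing the device from the cache) might look roughly like the sketch below. The helper name `cleanup_zombie` is hypothetical, the client is duck-typed as anything with an async `disconnect()`, and the cache-clearing function is injected as a parameter on the assumption that it is an async callable taking the device address.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


async def cleanup_zombie(
    client: Any,
    address: str,
    clear_cache: Optional[Callable[[str], Awaitable[Any]]] = None,
    disconnect_timeout: float = 5.0,
) -> bool:
    """Best-effort teardown; returns True if disconnect() finished in time."""
    disconnected = True
    try:
        # Bound the call: disconnect() can hang on a phantom connection.
        await asyncio.wait_for(client.disconnect(), timeout=disconnect_timeout)
    except asyncio.TimeoutError:
        disconnected = False
    except Exception:
        # Any backend error is treated as "not cleanly disconnected".
        disconnected = False

    if clear_cache is not None:
        try:
            # Assumed to remove the device so the next connect starts fresh.
            await clear_cache(address)
        except Exception:
            pass  # cleanup is best-effort; the caller reconnects regardless
    return disconnected
```

Cache clearing runs even when the disconnect times out, since that is exactly the zombie case the cleanup is meant to recover from.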

This is complementary to bleak's built-in disconnect callback, which handles clean disconnects where BlueZ reports `Connected=False`. The watchdog handles the case where BlueZ *never* reports a disconnect -- `Connected` stays `True` but no data flows.

### Reference Implementation

Branch with code and tests: [`feat/notification-watchdog`](https://github.com/TechBlueprints/bleak-retry-connector/tree/feat/notification-watchdog)
