establish_connection() has no way to validate a connection is actually usable before returning it #232

@cgoudie

Problem

establish_connection() returns a BleakClient as soon as connect() completes without raising. However, there are several failure modes in which connect() succeeds yet the connection is non-functional:

  • BlueZ reports Connected=True for a phantom device (no real HCI handle), and connect() returns immediately with a dead client
  • GATT service discovery silently fails, leaving client.services empty
  • The device's application firmware has crashed -- BLE link is alive but no notification responses arrive
  • The connection is genuine but the device requires a specific handshake to be usable

Applications must implement their own post-connect validation and reconnect loop on top of establish_connection(), duplicating the retry/backoff logic the library already provides.

Environment

  • Victron Cerbo GX, Venus OS v3.67, BlueZ 5.x
  • 2 USB BLE adapters (hci0, hci1)
  • BLE devices: BMS batteries (Nordic UART GATT), power monitor (custom GATT), relay switches (Telink TLSR8266, SPP GATT)

Production Evidence

The service calls establish_connection(), gets a client, and then tries to use it:

await client.start_notify(UART_RX_UUID, callback)
→ BleakCharacteristicNotFoundError  (GATT services empty)

Or:

data = await client.read_gatt_char(STATUS_UUID)
→ [org.bluez.Error.Failed] Not connected  (phantom)

In both cases the application must catch the error, disconnect, and call establish_connection() again -- reimplementing retry logic the library already has.
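
For illustration, the application-side workaround looks roughly like the sketch below. It is not taken from any real service: `connect_until_usable`, `connect`, `probe`, and `backoff` are hypothetical names, and the client is only assumed to expose an async `disconnect()`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


async def connect_until_usable(
    connect: Callable[[], Awaitable[Any]],
    probe: Callable[[Any], Awaitable[None]],
    max_attempts: int = 4,
    backoff: float = 0.25,
) -> Any:
    """Hypothetical application loop: connect, probe, retry on failure."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        # `connect` stands in for a call to establish_connection().
        client = await connect()
        try:
            # `probe` stands in for the first real GATT operation,
            # e.g. start_notify() or read_gatt_char().
            await probe(client)
            return client
        except Exception as exc:
            last_error = exc
            try:
                await client.disconnect()
            except Exception:
                pass  # the client may already be dead
            # Linear backoff, duplicating what the library does internally.
            await asyncio.sleep(backoff * attempt)
    raise ConnectionError(
        f"no usable connection after {max_attempts} attempts"
    ) from last_error
```

Every service that talks to an unreliable device ends up carrying some variant of this loop, with its own attempt counter and backoff sitting on top of the library's.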

Proposed Approach

An optional validate_connection callback parameter on establish_connection():

async def establish_connection(
    ...,
    validate_connection: Callable[[AnyBleakClient], Awaitable[bool]] | None = None,
    **kwargs,
) -> AnyBleakClient:

After every successful connect():

  1. If validate_connection is None (default), behavior is unchanged.
  2. If provided, call await validate_connection(client).
  3. If it returns True, return the client.
  4. If it returns False or raises any exception, disconnect the client, count as a connect error, and retry until max_attempts is exhausted.

Any exception from the callback is caught and treated as False.
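
The four steps above can be sketched as follows. This is a simplified stand-in for the retry loop, not the library's actual code and not the code on the reference branch: `connect_with_validation` and `make_client` are hypothetical names, backoff is omitted, and the client is only assumed to expose async `connect()` and `disconnect()`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional

Validator = Callable[[Any], Awaitable[bool]]


async def connect_with_validation(
    make_client: Callable[[], Any],
    max_attempts: int = 4,
    validate_connection: Optional[Validator] = None,
) -> Any:
    """Illustrative retry loop with an optional post-connect validator."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        client = make_client()
        try:
            await client.connect()
        except Exception as exc:
            last_error = exc
            continue  # ordinary connect failure, uses one attempt
        if validate_connection is None:
            return client  # step 1: behavior unchanged
        try:
            valid = await validate_connection(client)  # step 2
        except Exception as exc:
            # An exception from the callback is treated as False.
            valid, last_error = False, exc
        if valid:
            return client  # step 3
        # Step 4: drop the unusable client; this also uses one attempt.
        try:
            await client.disconnect()
        except Exception:
            pass
    raise ConnectionError(
        f"no valid connection after {max_attempts} attempts"
    ) from last_error
```

Because `except Exception` does not catch `asyncio.CancelledError` on Python 3.8 and later, cancelling the caller still propagates through the validator in this sketch.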

Usage notes:

  • The callback must be async (Callable[[AnyBleakClient], Awaitable[bool]]), matching the signature above.
  • The callback should include its own timeout for GATT operations (e.g., wrap reads in asyncio.wait_for(..., timeout=5.0)), since establish_connection() does not enforce a callback timeout.
  • Validation failures count against max_attempts, sharing the retry budget with connect failures. Callers with flaky validators should increase max_attempts.
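
Since the timeout is left to the caller, one option is to bound the whole validator rather than each GATT call inside it. The sketch below shows a possible wrapper; `with_timeout` is a hypothetical helper and is not part of the proposal or the library.

```python
import asyncio
from typing import Any, Awaitable, Callable

Validator = Callable[[Any], Awaitable[bool]]


def with_timeout(validator: Validator, timeout: float = 5.0) -> Validator:
    """Wrap a validator so a hung GATT operation counts as a failed check."""

    async def wrapped(client: Any) -> bool:
        try:
            return await asyncio.wait_for(validator(client), timeout=timeout)
        except asyncio.TimeoutError:
            # The device never answered within the budget.
            return False

    return wrapped
```

A caller would then pass `validate_connection=with_timeout(validate, 5.0)`, which keeps a stalled read from blocking the connect call indefinitely.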

Example:

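import asyncio

from bleak import BleakClient
from bleak_retry_connector import establish_connection

async def validate(client: BleakClient) -> bool:
    # Bound the read here; establish_connection() does not time out the callback.
    data = await asyncio.wait_for(
        client.read_gatt_char("0000fff1-..."), timeout=5.0
    )
    return len(data) > 0

# Inside a coroutine, with `device` being a previously discovered BLEDevice:
client = await establish_connection(
    BleakClient, device, "my-device",
    validate_connection=validate,
)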

No behavior change when validate_connection is not provided. No new dependencies.

What This Fixes

  • Phantom connection adoption: connect() "succeeds" on a phantom but the callback's GATT operation fails, triggering disconnect and retry instead of returning a broken client.
  • Silent GATT discovery failure: The callback can check client.services or attempt start_notify() -- empty services cause retry.
  • Application-level handshake failures: Services that require a command/response sequence can validate it using the library's existing retry budget.
  • Dead device firmware: If the device's BLE stack is alive but its application firmware crashed, a command expecting a response times out in the callback, triggering retry.
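
As a sketch of how one validator could cover both the empty-services case and the dead-firmware case, the factory below checks that the notify characteristic was discovered and then waits for any notification in reply to a ping. `make_echo_validator`, the UUID arguments, and the `ping` payload are hypothetical. The client is assumed to behave like a BleakClient (`services.get_characteristic()`, `start_notify()`, `write_gatt_char()`, `stop_notify()`), and the device is assumed to answer the ping with a notification.

```python
import asyncio
from typing import Any, Awaitable, Callable


def make_echo_validator(
    notify_uuid: str,
    write_uuid: str,
    ping: bytes,
    timeout: float = 5.0,
) -> Callable[[Any], Awaitable[bool]]:
    """Build a validator that requires a notification in reply to `ping`."""

    async def validate(client: Any) -> bool:
        # Silent discovery failure: the characteristic is simply missing.
        if client.services.get_characteristic(notify_uuid) is None:
            return False

        got_reply = asyncio.Event()

        def on_notify(_characteristic: Any, _data: bytearray) -> None:
            got_reply.set()

        await client.start_notify(notify_uuid, on_notify)
        try:
            await client.write_gatt_char(write_uuid, ping)
            # Dead firmware: the link is up but no reply ever arrives.
            await asyncio.wait_for(got_reply.wait(), timeout=timeout)
            return True
        except asyncio.TimeoutError:
            return False
        finally:
            # Leave the client clean; the caller subscribes again as needed.
            await client.stop_notify(notify_uuid)

    return validate
```

On a phantom connection the first GATT call would typically raise instead of timing out; under the proposed semantics that exception is treated the same as returning False.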

This is complementary to pre-connect checks (like detecting inactive connections via ServicesResolved): pre-connect checks prevent adopting known-bad connections, while validate_connection catches any post-connect failure the pre-connect checks missed.

Reference Implementation

Branch with code and tests: feat/validate-connection


Related Upstream Issues

  • #107 — Cache should expire when services are removed: Reported a KeyError: 'org.bluez.GattService1' when cached services became stale. A validate_connection callback that checks client.services or attempts a GATT read would catch this condition and trigger a retry instead of returning a broken client.
