Description
Problem
establish_connection() returns a BleakClient as soon as connect() succeeds without error. But there are multiple failure modes where connect() succeeds yet the connection is non-functional:
- BlueZ reports Connected=True for a phantom device (no real HCI handle), and connect() returns immediately with a dead client
- GATT service discovery silently fails, leaving client.services empty
- The device's application firmware has crashed: the BLE link is alive but no notification responses arrive
- The connection is genuine but the device requires a specific handshake to be usable
Applications must implement their own post-connect validation and reconnect loop on top of establish_connection(), duplicating the retry/backoff logic the library already provides.
Environment
- Victron Cerbo GX, Venus OS v3.67, BlueZ 5.x
- 2 USB BLE adapters (hci0, hci1)
- BLE devices: BMS batteries (Nordic UART GATT), power monitor (custom GATT), relay switches (Telink TLSR8266, SPP GATT)
Production Evidence
The service calls establish_connection(), gets a client, and then tries to use it:
await client.start_notify(UART_RX_UUID, callback)
→ BleakCharacteristicNotFoundError (GATT services empty)
Or:
data = await client.read_gatt_char(STATUS_UUID)
→ [org.bluez.Error.Failed] Not connected (phantom)
In both cases the application must catch the error, disconnect, and call establish_connection() again -- reimplementing retry logic the library already has.
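The application-side workaround described above can be sketched roughly as follows. This is a minimal illustration, not code from any of the affected services: `connect_until_usable`, `ClientLike`, the `connect` and `probe` parameters, and the linear backoff are all hypothetical names and choices, and `ConnectionError` stands in for whatever error the application would actually raise.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Protocol


class ClientLike(Protocol):
    """Hypothetical stand-in for the part of a BLE client this sketch touches."""

    async def disconnect(self) -> None: ...


async def connect_until_usable(
    connect: Callable[[], Awaitable[ClientLike]],
    probe: Callable[[ClientLike], Awaitable[None]],
    max_attempts: int = 3,
    backoff: float = 0.25,
) -> ClientLike:
    """Reconnect until `probe` stops raising, or give up after max_attempts."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        # `connect` would wrap establish_connection() in a real application.
        client = await connect()
        try:
            # `probe` would be start_notify() or read_gatt_char() on the new client.
            await probe(client)
            return client
        except Exception as err:  # phantom link, empty services, ...
            last_error = err
            await client.disconnect()
            # Simple linear backoff, duplicating what the library already does.
            await asyncio.sleep(backoff * attempt)
    raise ConnectionError(
        f"no usable connection after {max_attempts} attempts"
    ) from last_error
```

Every service that talks to a flaky device ends up carrying some variant of this loop, which is the duplication the proposal aims to remove.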
Proposed Approach
An optional validate_connection callback parameter on establish_connection():
async def establish_connection(
    ...,
    validate_connection: Callable[[AnyBleakClient], Awaitable[bool]] | None = None,
    **kwargs,
) -> AnyBleakClient:

After every successful connect():
- If validate_connection is None (default), behavior is unchanged.
- If provided, call await validate_connection(client).
- If it returns True, return the client.
- If it returns False or raises any exception, disconnect the client, count as a connect error, and retry until max_attempts is exhausted.
Any exception from the callback is caught and treated as False.
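The control flow described above can be illustrated with a small self-contained model. This is a sketch under stated assumptions, not the reference implementation: `FakeClient`, `connect_with_validation`, and `make_client` are hypothetical names, backoff between attempts is omitted, and `ConnectionError` is a placeholder for whatever the library raises once max_attempts is exhausted.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class FakeClient:
    """Hypothetical stand-in for a BLE client; not a real bleak class."""

    def __init__(self, healthy: bool) -> None:
        self.healthy = healthy
        self.is_connected = False

    async def connect(self) -> None:
        # connect() "succeeds" even when the link is not actually usable.
        self.is_connected = True

    async def disconnect(self) -> None:
        self.is_connected = False


Validator = Callable[[FakeClient], Awaitable[bool]]


async def connect_with_validation(
    make_client: Callable[[], FakeClient],
    validate_connection: Optional[Validator] = None,
    max_attempts: int = 4,
) -> FakeClient:
    for _attempt in range(max_attempts):
        client = make_client()
        await client.connect()
        if validate_connection is None:
            return client  # default: behavior unchanged
        try:
            ok = await validate_connection(client)
        except Exception:
            ok = False  # any exception from the callback is treated as False
        if ok:
            return client
        # Validation failed: drop the client and spend one attempt.
        await client.disconnect()
    raise ConnectionError("validation failed on every attempt")
```

Because a failed validation consumes an attempt exactly as a failed connect() would, the caller sees a single retry budget.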
Usage notes:
- The callback must be async (Callable[[BleakClient], Awaitable[bool]]).
- The callback should include its own timeout for GATT operations (e.g., wrap reads in asyncio.wait_for(..., timeout=5.0)), since establish_connection() does not enforce a callback timeout.
- Validation failures count against max_attempts, sharing the retry budget with connect failures. Callers with flaky validators should increase max_attempts.
Example:
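import asyncio

from bleak import BleakClient
from bleak_retry_connector import establish_connection

async def validate(client: BleakClient) -> bool:
    # Bound the read so a dead link cannot stall the retry loop.
    data = await asyncio.wait_for(
        client.read_gatt_char("0000fff1-..."), timeout=5.0
    )
    return len(data) > 0

# Inside a coroutine, with `device` already discovered:
client = await establish_connection(
    BleakClient, device, "my-device",
    validate_connection=validate,
)

No behavior change when validate_connection is not provided. No new dependencies.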
What This Fixes
- Phantom connection adoption: connect() "succeeds" on a phantom but the callback's GATT operation fails, triggering disconnect and retry instead of returning a broken client.
- Silent GATT discovery failure: The callback can check client.services or attempt start_notify(); empty services cause a retry.
- Application-level handshake failures: Services that require a command/response sequence can validate it using the library's existing retry budget.
- Dead device firmware: If the device's BLE stack is alive but its application firmware crashed, a command expecting a response times out in the callback, triggering retry.
This is complementary to pre-connect checks (like detecting inactive connections via ServicesResolved): pre-connect checks prevent adopting known-bad connections, while validate_connection catches any post-connect failure the pre-connect checks missed.
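As an illustration of how one callback might cover several of these cases at once, here is a hedged sketch of a UART-style validator. The UUID strings, the `PING` command, and the function name `validate_uart_handshake` are hypothetical placeholders, and the client is typed as `Any` so the sketch does not depend on a particular client class. It assumes the client exposes `services.get_characteristic()`, `start_notify()`, `stop_notify()`, and `write_gatt_char()` in the way bleak's client does.

```python
import asyncio
from typing import Any

# Hypothetical identifiers, for illustration only.
RX_UUID = "rx-characteristic-uuid"
TX_UUID = "tx-characteristic-uuid"
PING = b"\x01"


async def validate_uart_handshake(client: Any, timeout: float = 5.0) -> bool:
    # Silent discovery failure: the expected characteristic is missing.
    if client.services.get_characteristic(RX_UUID) is None:
        return False

    got_reply = asyncio.Event()

    def on_notify(_sender: Any, _data: bytearray) -> None:
        got_reply.set()

    # On a phantom link these calls are expected to raise; under the
    # proposal, an exception from the callback is treated as False.
    await client.start_notify(RX_UUID, on_notify)
    try:
        await client.write_gatt_char(TX_UUID, PING)
        # Crashed firmware: the link is up but no reply ever arrives.
        await asyncio.wait_for(got_reply.wait(), timeout)
    except asyncio.TimeoutError:
        return False
    finally:
        # Leave the subscription clean for the application's own handler.
        await client.stop_notify(RX_UUID)
    return True
```

A function like this would be passed as validate_connection; the timeout is the callback's own, since the proposal does not enforce one.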
Reference Implementation
Branch with code and tests: feat/validate-connection
Related Upstream Issues
- #107, "Cache should expire when services are removed": Reported a KeyError: 'org.bluez.GattService1' when cached services became stale. A validate_connection callback that checks client.services or attempts a GATT read would catch this condition and trigger a retry instead of returning a broken client.