[Binance] WS client fails to recover from server-initiated close mid-pong #4020

@M-at-ti-a

Summary

BinanceWebSocketClient.send_pong() raises an uncaught RuntimeError("Cannot send pong: connection not active") when the underlying Rust WebSocket has been closed between the ping arriving and the pong-send task running. The exception:

  1. escapes the try / except WebSocketClientError in send_pong, because the underlying nautilus_pyo3.WebSocketClient raises RuntimeError (not WebSocketClientError) for this condition;
  2. propagates out of the fire-and-forget task scheduled by _handle_ping, where it surfaces as an unhandled-exception log and the task finishes;
  3. never triggers _handle_reconnect, because that callback fires only via the Rust client's post_reconnection hook on a successful re-handshake — not on a peer-initiated close that we observed mid-pong.

The connection state remains "dead". No automatic reconnect is attempted from the Python layer; subsequent inbound frames stop arriving. For the Binance USDT-Futures market-data WS in our setup, this means Strategy.on_bar is never invoked again until the host process is restarted.

We observed two such stalls within a two-hour window on 2026-05-06 (DEV, demo testnet). They lasted roughly 61 and 49 minutes respectively (see the watchdog log under "Reproduction context") before our process-level watchdog hit its threshold and forced a Docker restart. In the second event, a parallel websockets-library client running in the same process recovered from the same Binance keepalive ping timeout in ~6 seconds, so the failure is specific to NT's reconnect path.

Affected versions

  • nautilus_trader==1.226.0 (verified in production).
  • develop HEAD as of 2026-05-06 (verified by inspecting nautilus_trader/adapters/binance/websocket/client.py on the develop branch — the same try/except shape is still present).

The exception type contract between nautilus_pyo3.WebSocketClient.send_pong and the Python wrapper has not changed in either direction across this window.

Root cause — exact code path

nautilus_trader/adapters/binance/websocket/client.py (develop, lines 256–274):

def _handle_ping(self, client_id: int, raw: bytes) -> None:
    task = self._loop.create_task(self.send_pong(client_id, raw))
    self._tasks.add(task)

async def send_pong(self, client_id: int, raw: bytes) -> None:
    """
    Send the given raw payload to the server as a PONG message.
    """
    client = self._clients.get(client_id)
    if client is None:
        return

    try:
        await client.send_pong(raw)
    except WebSocketClientError as e:
        self._log.error(f"ws-client {client_id}: {e!s}")

Three structural issues compound:

  1. Exception-type mismatch. The underlying nautilus_pyo3.WebSocketClient.send_pong() raises Python RuntimeError("Cannot send pong: connection not active") when the Rust-side connection state is not Active. The wrapper above catches only WebSocketClientError, so the RuntimeError propagates.

  2. Fire-and-forget task with no exception handler. _handle_ping schedules send_pong as self._loop.create_task(...) and only stores the task in a WeakSet. There is no task.add_done_callback(...) to inspect task.exception(). When the RuntimeError propagates, asyncio surfaces it via the default unhandled-exception handler, the task finishes, and the WeakSet drops it. Nothing triggers a reconnect.

  3. Reconnect path doesn't cover this case. _handle_reconnect is wired into the Rust client as post_reconnection=lambda: self._handle_reconnect(client_id), so it fires only after the Rust layer has completed a successful re-handshake. The pong-on-dead-connection path observed here is the opposite situation: the connection has dropped, and as far as the Python layer can tell, the Rust layer has not yet entered a reconnect cycle. Result: _handle_reconnect never fires for this manifestation.
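
The first two issues can be reproduced in isolation without NT installed. The sketch below is only an illustration: WebSocketClientError and FakeRustClient are hypothetical stand-ins (not the real nautilus_trader classes), and the wrapper merely copies the try/except shape quoted above.

```python
from __future__ import annotations

import asyncio


class WebSocketClientError(Exception):
    """Stand-in for the adapter's error type (assumption, not the real class)."""


class FakeRustClient:
    """Hypothetical stand-in for the Rust-backed client after a peer close."""

    async def send_pong(self, raw: bytes) -> None:
        # Same message as in the observed traceback.
        raise RuntimeError("Cannot send pong: connection not active")


async def send_pong(client: FakeRustClient, raw: bytes, log: list[str]) -> None:
    # Same shape as the wrapper: only WebSocketClientError is caught.
    try:
        await client.send_pong(raw)
    except WebSocketClientError as e:
        log.append(f"handled: {e!s}")


async def run_once() -> tuple[BaseException | None, list[str]]:
    log: list[str] = []
    # Fire-and-forget scheduling, as in _handle_ping.
    task = asyncio.get_running_loop().create_task(
        send_pong(FakeRustClient(), b"ping", log),
    )
    # Wait for the task to finish without re-raising its exception.
    await asyncio.wait([task])
    return task.exception(), log


exc, log = asyncio.run(run_once())
print(type(exc).__name__, log)
```

The task ends with a RuntimeError stored on it and the except branch never runs, which matches the traceback below: nothing in the wrapper gets a chance to react to the dead connection.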

Observed traceback

future: <Task finished name='Task-4001' coro=<BinanceWebSocketClient.send_pong() done,
    defined at /usr/local/lib/python3.12/site-packages/nautilus_trader/adapters/binance/websocket/client.py:260>
    exception=RuntimeError('Cannot send pong: connection not active')>
  File "/usr/local/lib/python3.12/site-packages/nautilus_trader/adapters/binance/websocket/client.py", line 269, in send_pong
    await client.send_pong(raw)
RuntimeError: Cannot send pong: connection not active

(Line 269 in 1.226.0 is the await client.send_pong(raw) call.)

Reproduction context

Production setup:

  • nautilus_trader==1.226.0, Python 3.12.13, Debian GNU/Linux 13 (Docker, kernel 6.19.13), single TradingNode.
  • BinanceLiveDataClient configured for USDT Futures (demo testnet, wss://stream.binancefuture.com/market).
  • Four perpetual instruments subscribed to 5m bar streams plus mark-price and aggTrade.
  • Strategy registers on_bar(bar: Bar) and is the sole bar consumer.

The trigger is a routine Binance server-side keepalive ping timeout — not a deliberate client close. From the venue's perspective this is a normal stale-connection cleanup that happens periodically across all sessions; the bug is that NT's pong path doesn't survive the race between Binance closing the socket and NT serving the pong.

Two failure events within two hours on 2026-05-06 UTC, both on the same NT instance:

07:36:22  bar pipeline stalled — last on_bar was 3683s ago (61 min)
09:49:19  bar pipeline stalled — last on_bar was 2959s ago (49 min)

Smoking-gun comparison from the second event. All three clients run in the same Python process, on the same network, against Binance:

09:00:08  binance_spot_ws (websockets-lib, our code)
            ConnectionClosedError: no close frame received or sent
            -> reconnect loop fires, healthy in 6 seconds

09:00:38  binance_futures_ws (websockets-lib, our code)
            ConnectionClosedError: sent 1011 (internal error) keepalive ping timeout
            -> reconnect loop fires, healthy in 6 seconds

09:00:??  BinanceWebSocketClient (nautilus_trader)
            RuntimeError('Cannot send pong: connection not active')
            -> NO reconnect attempted; stays dead
            -> on_bar callbacks STOP for 49 minutes

09:49:24  Host process restarted by external watchdog
09:49:30  NT recreates the WS connection during fresh boot

The same Binance event (keepalive ping timeout, server-initiated 1011 close) drops both the bot's own clients and NT's. Only NT fails to recover.

Suggested fixes (open to alternatives)

The minimal-blast-radius fix is in nautilus_trader/adapters/binance/websocket/client.py itself, ideally combined with a slightly stronger contract on the Rust side.

(a) Catch the actual exception type the Rust layer raises. Either broaden the Python except clause, or have nautilus_pyo3.WebSocketClient.send_pong raise WebSocketClientError for closed-connection conditions instead of bare RuntimeError. Example of the Python-side change:

async def send_pong(self, client_id: int, raw: bytes) -> None:
    client = self._clients.get(client_id)
    if client is None:
        return
    try:
        await client.send_pong(raw)
    except WebSocketClientError as e:
        self._log.error(f"ws-client {client_id}: {e!s}")
    except RuntimeError as e:
        # Rust layer raises bare RuntimeError when the connection is
        # closed between ping arrival and pong send. Treat as a
        # benign race — the close path will (or already did) trigger
        # the Rust reconnect cycle, which fires post_reconnection.
        self._log.warning(
            f"ws-client {client_id}: send_pong skipped on closed connection: {e!s}",
        )

This stops the unhandled-exception log noise but on its own does not fix the no-reconnect symptom.

(b) Trigger an explicit reconnect when send_pong observes a dead connection. Detect the closed state at this seam and force the reconnect path the same way _handle_reconnect would after a normal post_reconnection. This is the structurally correct fix — the symptom we hit is "NT learned the connection was closed and did nothing about it".
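
A rough sketch of what (b) could look like, under loudly stated assumptions: FakeClient, its reconnect() method, PongHandler and _force_reconnect are all hypothetical names invented for illustration, and matching on the error message is a stopgap that would be unnecessary if the Rust layer raised a dedicated exception type. The lock is there so that several pongs hitting the same dead connection trigger only one reconnect.

```python
import asyncio


class FakeClient:
    """Hypothetical stand-in for the Rust-backed client; not the real API."""

    def __init__(self) -> None:
        self.active = False  # simulate a peer-initiated close

    async def send_pong(self, raw: bytes) -> None:
        if not self.active:
            raise RuntimeError("Cannot send pong: connection not active")

    async def reconnect(self) -> None:  # hypothetical method
        await asyncio.sleep(0)  # stands in for the real handshake
        self.active = True


class PongHandler:
    """Illustrative wrapper; must be constructed inside a running event loop."""

    def __init__(self, client: FakeClient) -> None:
        self.client = client
        self.reconnects = 0
        self._reconnect_lock = asyncio.Lock()

    async def send_pong(self, raw: bytes) -> bool:
        """Return True if the pong was sent, False if a reconnect was forced."""
        try:
            await self.client.send_pong(raw)
            return True
        except RuntimeError as e:
            # Assumption: the message is the only way to recognise this case.
            if "connection not active" not in str(e):
                raise
            await self._force_reconnect()
            return False

    async def _force_reconnect(self) -> None:
        async with self._reconnect_lock:
            if self.client.active:
                return  # another pong task already reconnected
            await self.client.reconnect()
            self.reconnects += 1


async def demo() -> tuple[list[bool], int, bool]:
    handler = PongHandler(FakeClient())
    # Two pongs race against the same dead connection.
    first_two = await asyncio.gather(
        handler.send_pong(b"a"), handler.send_pong(b"b"),
    )
    # After the single reconnect, pongs go through again.
    third = await handler.send_pong(b"c")
    return first_two, handler.reconnects, third
```

In the real adapter the reconnect step would also have to restore subscriptions, presumably by reusing whatever _handle_reconnect already does; that part is deliberately left out here.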

(c) Attach a done_callback to the fire-and-forget task in _handle_ping so unhandled exceptions in send_pong are surfaced to a single chokepoint that can decide whether to reconnect or just log. This pairs naturally with (a)/(b).
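
A minimal sketch of (c) using only the standard asyncio API. The helper name make_pong_done_callback and the two callables passed to it are hypothetical; the point is only to show one place where the task's exception is retrieved and a decision is made.

```python
import asyncio
from typing import Callable


def make_pong_done_callback(
    client_id: int,
    on_dead_connection: Callable[[int], None],  # hypothetical reconnect hook
    log_error: Callable[[str], None],           # hypothetical logger hook
) -> Callable[[asyncio.Task], None]:
    def _on_done(task: asyncio.Task) -> None:
        if task.cancelled():
            return
        exc = task.exception()  # retrieving it silences the default handler
        if exc is None:
            return
        if isinstance(exc, RuntimeError):
            # Assumption: RuntimeError here means "connection not active".
            on_dead_connection(client_id)
        else:
            log_error(f"ws-client {client_id}: {exc!r}")

    return _on_done


async def failing_pong() -> None:
    raise RuntimeError("Cannot send pong: connection not active")


async def demo() -> tuple[list[int], list[str], int]:
    tasks: set[asyncio.Task] = set()
    dead: list[int] = []
    errors: list[str] = []

    task = asyncio.get_running_loop().create_task(failing_pong())
    tasks.add(task)                         # strong reference while running
    task.add_done_callback(tasks.discard)   # drop the reference when done
    task.add_done_callback(make_pong_done_callback(7, dead.append, errors.append))

    await asyncio.wait([task])
    return dead, errors, len(tasks)
```

Here the callback records client id 7 as needing a reconnect instead of letting the exception fall through to asyncio's default handler.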

The fix should also make nautilus_pyo3.WebSocketClient.send_pong either silent or WebSocketClientError-raising for "connection not active" — bare RuntimeError is hard for callers to match without overcatching.

Possibly related

  • 1.226.0 release notes mention a Bybit-specific pong-frame fix (Ignore JSON pong websocket frames in Bybit #3936) and a WebSocketClient migration onto a WsTransport trait. The same exception-type-mismatch shape may exist in adapters that share the same Python wrapper; we have not exercised them.
  • The Python websockets library has historically raised ConnectionClosedError (a subclass of Exception, not RuntimeError) for this condition. If nautilus_pyo3 is wrapping a Rust WS implementation with similar semantics, a Python-side translation layer would be a clean place to normalise the exception type contract.
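
One possible shape for such a translation layer, sketched as a decorator. WebSocketClientError below is a local stand-in for the real class, and normalise_ws_errors is a hypothetical name. Note that it converts every RuntimeError, which may be too broad in practice; a narrower exception type raised by the Rust side would remove that concern.

```python
import asyncio
import functools
from typing import Any, Awaitable, Callable


class WebSocketClientError(Exception):
    """Stand-in for the adapter's error type (assumption, not the real class)."""


def normalise_ws_errors(
    fn: Callable[..., Awaitable[Any]],
) -> Callable[..., Awaitable[Any]]:
    """Re-raise RuntimeError from the wrapped coroutine as WebSocketClientError."""

    @functools.wraps(fn)
    async def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return await fn(*args, **kwargs)
        except RuntimeError as e:
            # Keep the original exception as __cause__ for debugging.
            raise WebSocketClientError(str(e)) from e

    return wrapper


@normalise_ws_errors
async def raw_send_pong(raw: bytes) -> None:
    # Simulates the low-level call failing on a closed connection.
    raise RuntimeError("Cannot send pong: connection not active")


async def demo() -> str:
    try:
        await raw_send_pong(b"ping")
    except WebSocketClientError as e:
        return f"{type(e).__name__} <- {type(e.__cause__).__name__}"
    return "no error"
```

With this in place, the existing except WebSocketClientError clause in send_pong would see the closed-connection case without any change to its own shape.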

Metadata

Labels: bug (Something isn't working)

Status: Done