Summary
BinanceWebSocketClient.send_pong() raises an uncaught RuntimeError("Cannot send pong: connection not active") when the underlying Rust WebSocket has been closed between the ping arriving and the pong-send task running. The exception:
escapes the try / except WebSocketClientError in send_pong, because the underlying nautilus_pyo3.WebSocketClient raises RuntimeError (not WebSocketClientError) for this condition;
propagates out of the fire-and-forget task scheduled by _handle_ping, where it surfaces as an unhandled-exception log and the task finishes;
never triggers _handle_reconnect, because that callback fires only via the Rust client's post_reconnection hook on a successful re-handshake — not on a peer-initiated close that we observed mid-pong.
The connection state remains "dead". No automatic reconnect is attempted from the Python layer; subsequent inbound frames stop arriving. For the Binance USDT-Futures market-data WS in our setup, this means Strategy.on_bar is never invoked again until the host process is restarted.
We observed two such stalls within a two-hour window on 2026-05-06 (DEV, demo testnet). They lasted roughly 61 and 49 minutes respectively, until our process-level watchdog hit its threshold and forced a Docker restart. The same Binance keepalive ping timeout that triggered NT's stall was recovered from in ~6 seconds by a parallel websockets-library client running in the same process, so it is specifically NT's reconnect path that fails.
Affected versions
nautilus_trader==1.226.0 (verified in production).
develop HEAD as of 2026-05-06 (verified by inspecting nautilus_trader/adapters/binance/websocket/client.py on the develop branch — the same try/except shape is still present).
The exception type contract between nautilus_pyo3.WebSocketClient.send_pong and the Python wrapper has not changed in either direction across this window.
Root cause: exact code path
nautilus_trader/adapters/binance/websocket/client.py (develop, lines 256–274):

    def _handle_ping(self, client_id: int, raw: bytes) -> None:
        task = self._loop.create_task(self.send_pong(client_id, raw))
        self._tasks.add(task)

    async def send_pong(self, client_id: int, raw: bytes) -> None:
        """
        Send the given raw payload to the server as a PONG message.
        """
        client = self._clients.get(client_id)
        if client is None:
            return

        try:
            await client.send_pong(raw)
        except WebSocketClientError as e:
            self._log.error(f"ws-client {client_id}: {e!s}")

Three structural issues compound:
Exception-type mismatch. The underlying nautilus_pyo3.WebSocketClient.send_pong() raises Python RuntimeError("Cannot send pong: connection not active") when the Rust-side connection state is not Active. The wrapper above catches only WebSocketClientError, so the RuntimeError propagates.
Fire-and-forget task with no exception handler. _handle_ping schedules send_pong as self._loop.create_task(...) and only stores the task in a WeakSet. There is no task.add_done_callback(...) to inspect task.exception(). When the RuntimeError propagates, asyncio surfaces it via the default unhandled-exception handler, the task finishes, and the WeakSet drops it. Nothing triggers a reconnect.
Reconnect path doesn't cover this case. _handle_reconnect is wired into the Rust client as post_reconnection=lambda: self._handle_reconnect(client_id). It fires only after the Rust layer has successfully re-handshaked. The pong-on-dead-connection path observed here is the opposite: the connection has dropped and the Rust layer hasn't yet entered a reconnect cycle from the Python layer's perspective. Result: _handle_reconnect never fires for this manifestation.
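The first two issues can be reproduced in isolation. The following is a minimal sketch, not NT code: FakeRustClient and the local WebSocketClientError are hypothetical stand-ins, and the only thing taken from the report is the RuntimeError message. It shows that the exception bypasses the except clause and ends up stored on a task that nobody awaits.

```python
import asyncio
from typing import Optional


class WebSocketClientError(Exception):
    """Stand-in for the adapter's error type (hypothetical definition)."""


class FakeRustClient:
    """Stand-in for a client whose connection is already closed."""

    async def send_pong(self, raw: bytes) -> None:
        # Same message as in the report; the real Rust client is not used here.
        raise RuntimeError("Cannot send pong: connection not active")


async def send_pong(client: FakeRustClient, raw: bytes) -> None:
    try:
        await client.send_pong(raw)
    except WebSocketClientError as e:
        # Not reached: RuntimeError is not a WebSocketClientError.
        print(f"handled: {e!s}")


async def main() -> Optional[BaseException]:
    # Fire-and-forget, as in _handle_ping: the task is never awaited.
    task = asyncio.get_running_loop().create_task(
        send_pong(FakeRustClient(), b"ping"),
    )
    # Wait for completion without re-raising the task's exception.
    await asyncio.wait({task})
    # The error is parked on the task; no other code path was notified.
    return task.exception()


escaped = asyncio.run(main())
```

In this sketch the exception is retrieved explicitly so it can be inspected; in the reported scenario nothing retrieves it, which is why asyncio logs it through its default handler.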
Observed traceback
    future: <Task finished name='Task-4001' coro=<BinanceWebSocketClient.send_pong() done, defined at /usr/local/lib/python3.12/site-packages/nautilus_trader/adapters/binance/websocket/client.py:260> exception=RuntimeError('Cannot send pong: connection not active')>
      File "/usr/local/lib/python3.12/site-packages/nautilus_trader/adapters/binance/websocket/client.py", line 269, in send_pong
        await client.send_pong(raw)
    RuntimeError: Cannot send pong: connection not active
(Line 269 in 1.226.0 is the await client.send_pong(raw) call.)
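Reproduction context
Production setup:
nautilus_trader==1.226.0, Python 3.12.13, Debian GNU/Linux 13 (Docker, kernel 6.19.13), single TradingNode.
BinanceLiveDataClient configured for USDT Futures (demo testnet, wss://stream.binancefuture.com/market).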
Four perpetual instruments subscribed to 5m bar streams plus mark-price and aggTrade.
Strategy registers on_bar(bar: Bar) and is the sole bar consumer.
The trigger is a routine Binance server-side keepalive ping timeout — not a deliberate client close. From the venue's perspective this is a normal stale-connection cleanup that happens periodically across all sessions; the bug is that NT's pong path doesn't survive the race between Binance closing the socket and NT serving the pong.
Two failure events within two hours on 2026-05-06 UTC, both on the same NT instance:
07:36:22 bar pipeline stalled — last on_bar was 3683s ago (61 min)
09:49:19 bar pipeline stalled — last on_bar was 2959s ago (49 min)
Smoking-gun comparison from the second event. All three of these are running in the same Python process, against the same network, against Binance:
    09:00:08  binance_spot_ws (websockets-lib, our code)
              ConnectionClosedError: no close frame received or sent
              -> reconnect loop fires, healthy in 6 seconds
    09:00:38  binance_futures_ws (websockets-lib, our code)
              ConnectionClosedError: sent 1011 (internal error) keepalive ping timeout
              -> reconnect loop fires, healthy in 6 seconds
    09:00:??  BinanceWebSocketClient (nautilus_trader)
              RuntimeError('Cannot send pong: connection not active')
              -> NO reconnect attempted; stays dead
              -> on_bar callbacks STOP for 49 minutes
    09:49:24  Host process restarted by external watchdog
    09:49:30  NT recreates the WS connection during fresh boot
The same Binance event (keepalive ping timeout, server-initiated 1011 close) drops both the bot's own clients and NT's. Only NT fails to recover.
Suggested fixes (open to alternatives)
The minimal-blast-radius fix is in nautilus_trader/adapters/binance/websocket/client.py itself; ideally combined with a slightly stronger contract on the Rust side.
(a) Catch the actual exception type the Rust layer raises. Either broaden the Python except clause, or have nautilus_pyo3.WebSocketClient.send_pong raise WebSocketClientError for closed-connection conditions instead of bare RuntimeError. Example of the Python-side change:

    async def send_pong(self, client_id: int, raw: bytes) -> None:
        client = self._clients.get(client_id)
        if client is None:
            return

        try:
            await client.send_pong(raw)
        except WebSocketClientError as e:
            self._log.error(f"ws-client {client_id}: {e!s}")
        except RuntimeError as e:
            # Rust layer raises bare RuntimeError when the connection is
            # closed between ping arrival and pong send. Treat as a
            # benign race: the close path will (or already did) trigger
            # the Rust reconnect cycle, which fires post_reconnection.
            self._log.warning(
                f"ws-client {client_id}: send_pong skipped on closed connection: {e!s}",
            )

This stops the unhandled-exception log noise but on its own does not fix the no-reconnect symptom.
(b) Trigger an explicit reconnect when send_pong observes a dead connection. Detect the closed state at this seam and force the reconnect path the same way _handle_reconnect would after a normal post_reconnection. This is the structurally correct fix — the symptom we hit is "NT learned the connection was closed and did nothing about it".
(c) Attach a done_callback to the fire-and-forget task in _handle_ping so unhandled exceptions in send_pong are surfaced to a single chokepoint that can decide whether to reconnect or just log. This pairs naturally with (a)/(b).
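Options (b) and (c) could be combined roughly as follows. This is only a sketch under stated assumptions: make_done_callback and on_dead_connection are hypothetical names, and the policy of treating every failed pong as a dead connection is a placeholder rather than NT's behaviour. How a reconnect would actually be requested from the Rust client is left open.

```python
import asyncio
from typing import Callable, List, Set


def make_done_callback(
    client_id: int,
    tasks: Set[asyncio.Task],
    on_dead_connection: Callable[[int], None],
) -> Callable[[asyncio.Task], None]:
    """Build a done-callback that routes pong-task failures to one place."""

    def _on_done(task: asyncio.Task) -> None:
        tasks.discard(task)
        if task.cancelled():
            return
        exc = task.exception()  # also marks the exception as retrieved
        if exc is None:
            return
        # Placeholder policy (assumption): any failed pong means the
        # connection is dead, so ask the owner to reconnect.
        on_dead_connection(client_id)

    return _on_done


async def failing_send_pong() -> None:
    # Simulates the failure described in this report.
    raise RuntimeError("Cannot send pong: connection not active")


async def demo() -> List[int]:
    reconnect_requests: List[int] = []
    tasks: Set[asyncio.Task] = set()  # strong references keep tasks alive
    task = asyncio.get_running_loop().create_task(failing_send_pong())
    tasks.add(task)
    task.add_done_callback(
        make_done_callback(7, tasks, reconnect_requests.append),
    )
    # Our callback was registered first, so it runs before wait() returns.
    await asyncio.wait({task})
    return reconnect_requests


requests = asyncio.run(demo())
```

The callback is the single chokepoint: it can log, request a reconnect, or both, and it keeps that decision out of send_pong itself.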
The fix should also make nautilus_pyo3.WebSocketClient.send_pong either silent or WebSocketClientError-raising for "connection not active" — bare RuntimeError is hard for callers to match without overcatching.
Possibly related
1.226.0 release notes mention a Bybit-specific pong-frame fix (Ignore JSON pong websocket frames in Bybit #3936) and a WebSocketClient migration onto a WsTransport trait. The same exception-type-mismatch shape may exist in adapters that share the same Python wrapper; we have not exercised them.
The Python websockets library has historically raised ConnectionClosedError (a subclass of Exception, not RuntimeError) for this condition. If nautilus_pyo3 is wrapping a Rust WS implementation with similar semantics, a Python-side translation layer would be a clean place to normalise the exception type contract.
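A Python-side translation layer of that kind might look like the sketch below. It assumes the only available signal is the message text quoted in this report, which is brittle; send_pong_translated, ClosedClient, and OtherFailureClient are hypothetical names, and WebSocketClientError is redefined locally as a stand-in.

```python
import asyncio


class WebSocketClientError(Exception):
    """Stand-in for the adapter's error type (hypothetical definition)."""


# Assumption: match on the message text quoted in this report.
_NOT_ACTIVE = "connection not active"


async def send_pong_translated(client, raw: bytes) -> None:
    """Call client.send_pong and normalise the closed-connection error."""
    try:
        await client.send_pong(raw)
    except RuntimeError as e:
        if _NOT_ACTIVE in str(e):
            raise WebSocketClientError(str(e)) from e
        raise  # unrelated RuntimeErrors pass through unchanged


class ClosedClient:
    async def send_pong(self, raw: bytes) -> None:
        raise RuntimeError("Cannot send pong: connection not active")


class OtherFailureClient:
    async def send_pong(self, raw: bytes) -> None:
        raise RuntimeError("unrelated failure")


async def classify(client) -> str:
    try:
        await send_pong_translated(client, b"")
    except WebSocketClientError:
        return "translated"
    except RuntimeError:
        return "passed through"
    return "ok"


results = [
    asyncio.run(classify(ClosedClient())),
    asyncio.run(classify(OtherFailureClient())),
]
```

Narrowing on the message avoids catching every RuntimeError, but a dedicated exception type raised by the Rust layer would remove the need for string matching altogether.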