fix(ws): cancel spawned tasks on Client drop to prevent reconnect-loop leak#40
fix(ws): cancel spawned tasks on Client drop to prevent reconnect-loop leak#40kockas wants to merge 7 commits into
Conversation
…p leak
Dropping `clob::ws::Client` (or `rtds::Client`) did not cancel the
background tasks spawned by `ConnectionManager::new` and
`SubscriptionManager::start_reconnection_handler`. They kept running
for the lifetime of the process, emitting `tracing::error!("Error
handling connection: …")` on every reconnect attempt against an
endpoint that no longer had any subscribers.
Root cause was an Arc cycle: the resubscribe task captured
`Arc<SubscriptionManager>`, which owned a `ConnectionManager` clone,
which kept `state_tx`/`sender_tx` alive — so neither the `state_rx`
nor the `sender_rx` await ever errored, and neither task could exit.
Fix:
- `ws::ConnectionManager` stores its connection-loop `JoinHandle` in
an internal `Arc<AbortOnDropHandle>`. When the last
`ConnectionManager` clone drops, the abort guard fires and the
spawned task is cancelled.
- `clob::ws::SubscriptionManager` and `rtds::SubscriptionManager`
store their resubscribe `JoinHandle` in a `OnceLock` and expose
`pub fn abort_reconnection_handler(&self)`.
- `ChannelResources` (clob) gains a `Drop` impl that calls
`abort_reconnection_handler` so dropping the `Client` cleans up.
- `rtds::ClientInner` gains a `ResubAbortGuard` field that owns an
extra `Arc<SubscriptionManager>` clone and aborts on drop. (A
`Drop` impl on `ClientInner` itself isn't viable because
`authenticate()` and `deauthenticate()` destructure it.)
The cleanup chain on `Client` drop is now: ChannelResources/ClientInner
drops → resubscribe task aborted → its `Arc<SubscriptionManager>`
released → `SubscriptionManager` drops → its `ConnectionManager`
clone drops → `Arc<AbortOnDropHandle>` strong count hits zero →
connection task aborted.
Adds two `tokio::test` regression cases in `clob/ws/subscription.rs`
that prove the cycle break by `Arc::downgrade` then watching the
`Weak::upgrade` resolve to `None` after `abort_reconnection_handler`.
Without the fix, the weak handle would upgrade indefinitely.
Closes Polymarket#39
… leak-safe Per Cursor Bugbot review on PR Polymarket#40: the first version of the fix unconditionally spawned the resubscribe task before attempting `OnceLock::set`. On a repeat call the second `JoinHandle` was dropped, but Tokio drop only detaches a task — the orphan kept its `Arc<Self>` clone alive forever, recreating the very Arc cycle the PR exists to break. Switch to `OnceLock::get_or_init`: the closure runs at most once across all callers, and its captured `Arc<Self>` is only consumed on the first call. Mirror the change in both `clob::ws::SubscriptionManager` and `rtds::SubscriptionManager`. Add a regression test `start_reconnection_handler_called_twice_does_not_leak` that calls the handler twice and asserts the manager drops within 500ms after abort. Without this fix, the test would never converge — the second-call orphan task holds the strong ref forever.
|
Thanks for the catch — bot is correct. The first version unconditionally Pushed
CI gates locally clean: |
Polymarket's `POST /auth/api-key` is NOT idempotent: each successful
call (different timestamp ⇒ different EIP-712 signature) registers
a NEW api-key for the wallet, but only ONE — the canonical one —
is recognised by the authenticated user-WSS channel
(`wss://ws-subscriptions-clob.polymarket.com/ws/user`). Subscribing
with a non-canonical key is silently rejected: server resets the
connection within ~30 ms with `Connection reset without closing
handshake`. The non-canonical key still works for REST balance and
order calls, so the bug is invisible until the operator opens
`subscribe_user_events`.
Empirical reproduction 2026-05-07 with this fork on the same wallet
+ nonce=1:
- Old `create_then_derive`: returned `29184383-…`, valid for
REST orders, rejected on user-WSS.
- `GET /auth/derive-api-key` only: returned `57f6d806-…` for
the same wallet+nonce, accepted on user-WSS.
The Python SDK (py-clob-client-v2) reproduces the issue when
running against the same wallet — its `create_or_derive_api_key`
also POSTs first. The bug is endpoint-side; clients that only
call `/auth/derive-api-key` get the canonical key.
Fix
---
`create_or_derive_api_key` now derives first. Only when derive
404s (no key has ever been created for this wallet+nonce) do we
POST to create one — then re-derive to pick up the now-canonical
key. The HTTP round-trip count goes from 2 (POST + GET) to 1 (GET)
on the hot path; cold-start adds the POST + a second GET.
Tests
-----
- `create_or_derive_api_key_returns_canonical_via_derive_without_creating`
asserts POST is NEVER hit when derive succeeds (regression-pin
for the WSS rejection).
- `create_or_derive_api_key_creates_then_re_derives_when_no_canonical_exists`
pins the cold-start path — derive 404 → POST create → derive
again.
- Existing `create_authenticated` test helper and
`sign_order_should_succeed` updated: `/time` is now called
once on the derive-only path (was twice when both endpoints
were hit). 451/451 SDK tests pass.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit ce7357e. Configure here.
Cursor Bugbot caught: the previous fallback guard `err.kind() == ErrorKind::Status` matched ANY non-2xx response (500/503/429/etc.). Because the new fallback now performs a mutation (`create_api_key` registers a new non-canonical key on every successful POST), a transient blip on the derive endpoint would silently register a new key — exactly the bug derive-first is meant to prevent. Tighten the guard to HTTP 404 only via a small helper `is_not_found(&Error)` that downcasts the dyn error to `crate::error::Status` and checks `status_code == NOT_FOUND`. Other status errors propagate unchanged. Test: `create_or_derive_api_key_propagates_non_404_status_without_creating` loops over 500, 503, 429 and asserts derive returns the error verbatim, with create_api_key hit zero times in each case. Full SDK suite: 452/452 passing.
|
Bugbot is right — original Pushed
Full SDK suite green: 452/452. |

Summary
Fixes #39.
clob::ws::Clientandrtds::Clientpreviously leaked the background tasks spawned byConnectionManager::newandSubscriptionManager::start_reconnection_handler— they kept running for the lifetime of the process and emittingtracing::error!("Error handling connection: …")on every reconnect attempt long after the user had dropped the client.Root cause is an Arc cycle: the resubscribe task captures
Arc<SubscriptionManager>, which owns aConnectionManagerclone, which keepsstate_tx/sender_txalive, so neither task's exit condition can ever fire.This PR breaks the cycle by making client teardown explicit and external:
ws::ConnectionManagernow stores its connection-loopJoinHandlein an internalArc<AbortOnDropHandle>. When the lastConnectionManagerclone drops, the abort guard fires and the spawned task is cancelled.clob::ws::SubscriptionManagerandrtds::SubscriptionManagerstore their resubscribeJoinHandlein aOnceLockand exposepub fn abort_reconnection_handler(&self).ChannelResources(clob) gains aDropimpl that callsabort_reconnection_handler.rtds::ClientInnergains aResubAbortGuardfield that owns an extraArc<SubscriptionManager>clone and aborts on drop. ADropimpl onClientInneritself isn't viable becauseauthenticate()anddeauthenticate()destructure it.The cleanup chain on
Clientdrop is now:ChannelResources/ClientInner drops → resubscribe task aborted → its
Arc<SubscriptionManager>released →SubscriptionManagerdrops → itsConnectionManagerclone drops →Arc<AbortOnDropHandle>strong count hits zero → connection task aborted.Test plan
The two added regression tests in clob/ws/subscription.rs:
Both use ws://127.0.0.1:1/never-connects so the test exercises only the spawned-task lifecycle, never real network traffic.
Public API impact
Additive only:
Existing fields are kept as-is. ConnectionManager gains a private field; not constructed externally so no callers are affected.
Note
Medium Risk
Touches WebSocket lifecycle management and API-key acquisition behavior; mistakes could cause lingering background reconnect loops or incorrect auth keys for user WebSocket connections.
Overview
Fixes long-lived background task leaks in WebSocket clients by making spawned reconnect/connection-loop tasks explicitly cancellable on drop.
ConnectionManagernow aborts its connection loop when the last clone is dropped, and both CLOB/RTDSSubscriptionManagers store the resubscribe taskJoinHandle(viaOnceLock) with a newabort_reconnection_handler()used fromChannelResources::drop(CLOB) and aResubAbortGuard(RTDS).Changes CLOB auth key bootstrapping to derive-first:
create_or_derive_api_keynow callsGET /auth/derive-api-keyfirst and only falls back toPOST /auth/api-keyon HTTP 404 (with a re-derive), avoiding creation of non-canonical keys that break user WebSocket auth. Updates/adds tests to lock in the new API-key behavior and reduced server-time calls.Reviewed by Cursor Bugbot for commit 8563bd6. Bugbot is set up for automated code reviews on this repo. Configure here.