Skip to content

fix(ws): cancel spawned tasks on Client drop to prevent reconnect-loop leak#40

Open
kockas wants to merge 7 commits into
Polymarket:mainfrom
kockas:fix/ws-task-leak-on-drop
Open

fix(ws): cancel spawned tasks on Client drop to prevent reconnect-loop leak#40
kockas wants to merge 7 commits into
Polymarket:mainfrom
kockas:fix/ws-task-leak-on-drop

Conversation

@kockas

@kockas kockas commented May 4, 2026

Copy link
Copy Markdown

Summary

Fixes #39.

clob::ws::Client and rtds::Client previously leaked the background tasks spawned by ConnectionManager::new and SubscriptionManager::start_reconnection_handler — they kept running for the lifetime of the process and emitting tracing::error!("Error handling connection: …") on every reconnect attempt long after the user had dropped the client.

Root cause is an Arc cycle: the resubscribe task captures Arc<SubscriptionManager>, which owns a ConnectionManager clone, which keeps state_tx/sender_tx alive, so neither task's exit condition can ever fire.

This PR breaks the cycle by making client teardown explicit and external:

  • ws::ConnectionManager now stores its connection-loop JoinHandle in an internal Arc<AbortOnDropHandle>. When the last ConnectionManager clone drops, the abort guard fires and the spawned task is cancelled.
  • clob::ws::SubscriptionManager and rtds::SubscriptionManager store their resubscribe JoinHandle in a OnceLock and expose pub fn abort_reconnection_handler(&self).
  • ChannelResources (clob) gains a Drop impl that calls abort_reconnection_handler.
  • rtds::ClientInner gains a ResubAbortGuard field that owns an extra Arc<SubscriptionManager> clone and aborts on drop. A Drop impl on ClientInner itself isn't viable because authenticate() and deauthenticate() destructure it.

The cleanup chain on Client drop is now:

ChannelResources/ClientInner drops → resubscribe task aborted → its Arc<SubscriptionManager> released → SubscriptionManager drops → its ConnectionManager clone drops → Arc<AbortOnDropHandle> strong count hits zero → connection task aborted.

Test plan

  • cargo build --all-targets --all-features — clean.
  • cargo test --all-features — 165 lib tests pass (was 163; +2 regression tests).
  • cargo clippy --all-targets --all-features -- -D warnings — clean.
  • cargo clippy --all-targets -- -D warnings — clean.
  • cargo +nightly fmt --all -- --check — clean.
  • cargo sort --check — clean.

The two added regression tests in clob/ws/subscription.rs:

  • abort_reconnection_handler_breaks_arc_cycle — Arc::downgrade a SubscriptionManager, call abort_reconnection_handler, drop the strong ref, and assert Weak::upgrade returns None within 500ms. Without the fix this never happens — the resubscribe task holds the strong ref forever.
  • connection_manager_clones_release_when_subscription_manager_aborted — same shape, exercising the chained release of the ConnectionManager clones held inside SubscriptionManager.

Both use ws://127.0.0.1:1/never-connects so the test exercises only the spawned-task lifecycle, never real network traffic.

Public API impact

Additive only:

  • clob::ws::SubscriptionManager::abort_reconnection_handler(&self)
  • rtds::SubscriptionManager::abort_reconnection_handler(&self)

Existing fields are kept as-is. ConnectionManager gains a private field; not constructed externally so no callers are affected.


Note

Medium Risk
Touches WebSocket lifecycle management and API-key acquisition behavior; mistakes could cause lingering background reconnect loops or incorrect auth keys for user WebSocket connections.

Overview
Fixes long-lived background task leaks in WebSocket clients by making spawned reconnect/connection-loop tasks explicitly cancellable on drop. ConnectionManager now aborts its connection loop when the last clone is dropped, and both CLOB/RTDS SubscriptionManagers store the resubscribe task JoinHandle (via OnceLock) with a new abort_reconnection_handler() used from ChannelResources::drop (CLOB) and a ResubAbortGuard (RTDS).

Changes CLOB auth key bootstrapping to derive-first: create_or_derive_api_key now calls GET /auth/derive-api-key first and only falls back to POST /auth/api-key on HTTP 404 (with a re-derive), avoiding creation of non-canonical keys that break user WebSocket auth. Updates/adds tests to lock in the new API-key behavior and reduced server-time calls.

Reviewed by Cursor Bugbot for commit 8563bd6. Bugbot is set up for automated code reviews on this repo. Configure here.

…p leak

Dropping `clob::ws::Client` (or `rtds::Client`) did not cancel the
background tasks spawned by `ConnectionManager::new` and
`SubscriptionManager::start_reconnection_handler`. They kept running
for the lifetime of the process, emitting `tracing::error!("Error
handling connection: …")` on every reconnect attempt against an
endpoint that no longer had any subscribers.

Root cause was an Arc cycle: the resubscribe task captured
`Arc<SubscriptionManager>`, which owned a `ConnectionManager` clone,
which kept `state_tx`/`sender_tx` alive — so neither the `state_rx`
nor the `sender_rx` await ever errored, and neither task could exit.

Fix:

- `ws::ConnectionManager` stores its connection-loop `JoinHandle` in
  an internal `Arc<AbortOnDropHandle>`. When the last
  `ConnectionManager` clone drops, the abort guard fires and the
  spawned task is cancelled.

- `clob::ws::SubscriptionManager` and `rtds::SubscriptionManager`
  store their resubscribe `JoinHandle` in a `OnceLock` and expose
  `pub fn abort_reconnection_handler(&self)`.

- `ChannelResources` (clob) gains a `Drop` impl that calls
  `abort_reconnection_handler` so dropping the `Client` cleans up.

- `rtds::ClientInner` gains a `ResubAbortGuard` field that owns an
  extra `Arc<SubscriptionManager>` clone and aborts on drop. (A
  `Drop` impl on `ClientInner` itself isn't viable because
  `authenticate()` and `deauthenticate()` destructure it.)

The cleanup chain on `Client` drop is now: ChannelResources/ClientInner
drops → resubscribe task aborted → its `Arc<SubscriptionManager>`
released → `SubscriptionManager` drops → its `ConnectionManager`
clone drops → `Arc<AbortOnDropHandle>` strong count hits zero →
connection task aborted.

Adds two `tokio::test` regression cases in `clob/ws/subscription.rs`
that prove the cycle break by `Arc::downgrade` then watching the
`Weak::upgrade` resolve to `None` after `abort_reconnection_handler`.
Without the fix, the weak handle would upgrade indefinitely.

Closes Polymarket#39
Comment thread src/clob/ws/subscription.rs Outdated
… leak-safe

Per Cursor Bugbot review on PR Polymarket#40: the first version of the fix
unconditionally spawned the resubscribe task before attempting
`OnceLock::set`. On a repeat call the second `JoinHandle` was
dropped, but Tokio drop only detaches a task — the orphan kept its
`Arc<Self>` clone alive forever, recreating the very Arc cycle the
PR exists to break.

Switch to `OnceLock::get_or_init`: the closure runs at most once
across all callers, and its captured `Arc<Self>` is only consumed
on the first call. Mirror the change in both
`clob::ws::SubscriptionManager` and `rtds::SubscriptionManager`.

Add a regression test
`start_reconnection_handler_called_twice_does_not_leak` that calls
the handler twice and asserts the manager drops within 500ms after
abort. Without this fix, the test would never converge — the
second-call orphan task holds the strong ref forever.
@kockas

kockas commented May 4, 2026

Copy link
Copy Markdown
Author

Thanks for the catch — bot is correct. The first version unconditionally tokio::spawned the resubscribe task before attempting OnceLock::set; on a second call the new JoinHandle was dropped, but Tokio drop only detaches the task. The orphan kept its Arc<Self> clone alive forever, recreating the Arc cycle this PR is meant to break.

Pushed d23a7f2:

  • start_reconnection_handler now spawns inside OnceLock::get_or_init, so the closure runs at most once across all callers and the captured Arc<Self> is only consumed on the first call.
  • Same change in rtds::SubscriptionManager for parity.
  • Added regression test start_reconnection_handler_called_twice_does_not_leak that calls the handler twice, then aborts and drops, and asserts the manager's Weak resolves to None within 500 ms. Without this fix the test never converges.

CI gates locally clean: cargo build --all-targets --all-features, cargo test --all-features, cargo clippy --all-targets --all-features -- -D warnings, cargo clippy --all-targets -- -D warnings, cargo +nightly fmt --all -- --check, cargo sort --check.

kockas added 4 commits May 7, 2026 23:07
Polymarket's `POST /auth/api-key` is NOT idempotent: each successful
call (different timestamp ⇒ different EIP-712 signature) registers
a NEW api-key for the wallet, but only ONE — the canonical one —
is recognised by the authenticated user-WSS channel
(`wss://ws-subscriptions-clob.polymarket.com/ws/user`). Subscribing
with a non-canonical key is silently rejected: server resets the
connection within ~30 ms with `Connection reset without closing
handshake`. The non-canonical key still works for REST balance and
order calls, so the bug is invisible until the operator opens
`subscribe_user_events`.

Empirical reproduction 2026-05-07 with this fork on the same wallet
+ nonce=1:
   - Old `create_then_derive`: returned `29184383-…`, valid for
     REST orders, rejected on user-WSS.
   - `GET /auth/derive-api-key` only:  returned `57f6d806-…` for
     the same wallet+nonce, accepted on user-WSS.

The Python SDK (py-clob-client-v2) reproduces the issue when
running against the same wallet — its `create_or_derive_api_key`
also POSTs first. The bug is endpoint-side; clients that only
call `/auth/derive-api-key` get the canonical key.

Fix
---
`create_or_derive_api_key` now derives first. Only when derive
404s (no key has ever been created for this wallet+nonce) do we
POST to create one — then re-derive to pick up the now-canonical
key. The HTTP round-trip count goes from 2 (POST + GET) to 1 (GET)
on the hot path; cold-start adds the POST + a second GET.

Tests
-----
- `create_or_derive_api_key_returns_canonical_via_derive_without_creating`
  asserts POST is NEVER hit when derive succeeds (regression-pin
  for the WSS rejection).
- `create_or_derive_api_key_creates_then_re_derives_when_no_canonical_exists`
  pins the cold-start path — derive 404 → POST create → derive
  again.
- Existing `create_authenticated` test helper and
  `sign_order_should_succeed` updated: `/time` is now called
  once on the derive-only path (was twice when both endpoints
  were hit). 451/451 SDK tests pass.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit ce7357e. Configure here.

Comment thread src/clob/client.rs
Cursor Bugbot caught: the previous fallback guard
`err.kind() == ErrorKind::Status` matched ANY non-2xx response
(500/503/429/etc.). Because the new fallback now performs a
mutation (`create_api_key` registers a new non-canonical key on
every successful POST), a transient blip on the derive endpoint
would silently register a new key — exactly the bug derive-first
is meant to prevent.

Tighten the guard to HTTP 404 only via a small helper
`is_not_found(&Error)` that downcasts the dyn error to
`crate::error::Status` and checks `status_code == NOT_FOUND`.
Other status errors propagate unchanged.

Test: `create_or_derive_api_key_propagates_non_404_status_without_creating`
loops over 500, 503, 429 and asserts derive returns the error
verbatim, with create_api_key hit zero times in each case.

Full SDK suite: 452/452 passing.
@kockas

kockas commented May 7, 2026

Copy link
Copy Markdown
Author

Bugbot is right — original err.kind() == ErrorKind::Status matched any non-2xx. Because the new fallback (create_api_key) is a mutation that registers a non-canonical key on every successful POST, a transient 500/503/429 from /auth/derive-api-key would silently create yet another non-canonical key — recreating the bug derive-first is meant to fix.

Pushed 8563bd6:

  • Fallback now scoped to HTTP 404 specifically via is_not_found(&Error), which downcasts to crate::error::Status and checks status_code == StatusCode::NOT_FOUND.
  • Other status codes (5xx/429/etc.) propagate verbatim.
  • Regression test create_or_derive_api_key_propagates_non_404_status_without_creating loops over 500, 503, 429 and asserts (a) the error propagates and (b) create_api_key is hit zero times.

Full SDK suite green: 452/452.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WebSocket Client leaks spawned tasks on drop (Arc cycle in SubscriptionManager)

1 participant