admin/live: cancel sessions by worker id (fix cross-CP cancel) (#855)
* admin/live: cancel sessions by worker id so cross-CP cancel works
The Live view's Cancel button failed with "no active session with pid N":
KillSession scans only the SERVING replica's stacks, but behind the admin ALB
the session usually lives on a different CP. And fanning the pid-based cancel
out is unsafe — pids collide across CPs (each CP's counter starts at 1000), so
a pid fan-out could kill the wrong replica's session.
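To make the collision concrete, here is a minimal Go sketch, not the project's code. The `replica` type, its fields, and the worker ids `worker-17` and `worker-42` are hypothetical; the only detail taken from the text above is that each CP's pid counter starts at 1000.

```go
package main

import "fmt"

// replica is a hypothetical stand-in for one control-plane (CP) process.
type replica struct {
	name    string
	nextPID int            // per-replica counter; assumed to start at 1000
	byPID   map[int]string // local pid -> worker id
}

func newReplica(name string) *replica {
	return &replica{name: name, nextPID: 1000, byPID: map[int]string{}}
}

// open registers a session on workerID and returns its replica-local pid.
func (r *replica) open(workerID string) int {
	pid := r.nextPID
	r.nextPID++
	r.byPID[pid] = workerID
	return pid
}

// ownersOfPID lists every replica that has some session under pid.
func ownersOfPID(replicas []*replica, pid int) []string {
	var owners []string
	for _, r := range replicas {
		if _, ok := r.byPID[pid]; ok {
			owners = append(owners, r.name)
		}
	}
	return owners
}

// ownersOfWorker lists every replica that has a session on workerID.
func ownersOfWorker(replicas []*replica, workerID string) []string {
	var owners []string
	for _, r := range replicas {
		for _, wid := range r.byPID {
			if wid == workerID {
				owners = append(owners, r.name)
			}
		}
	}
	return owners
}

func main() {
	a, b := newReplica("cp-a"), newReplica("cp-b")
	pidA := a.open("worker-17")
	pidB := b.open("worker-42")
	all := []*replica{a, b}

	fmt.Println(pidA == pidB)                     // true: both replicas issued pid 1000
	fmt.Println(ownersOfPID(all, pidA))           // [cp-a cp-b]: a pid fan-out matches both
	fmt.Println(ownersOfWorker(all, "worker-42")) // [cp-b]: the worker id names one owner
}
```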
Fix: cancel by the CLUSTER-UNIQUE worker id (the key the detail view already
uses), with fan-out — the same collision-safe pattern.
- KillSessionByWorkerID(wid): destroys the session on that worker on this
replica (0/1), located via SessionForWorker.
- POST /sessions/by-worker/:wid/cancel: kills locally, and only if this replica
didn't own it, fans out to peers (?scope=local recursion guard) and sums the
killed count; 404 only if no replica owns the worker. The old pid route stays
(local-only, documented).
- UI: the query-row, session-row, and detail-dialog Cancel buttons all address
by worker_id now.
- Tests: TestCancelByWorkerFansOut (local hit skips fan-out, peer-owned via
fan-out, scope=local no-recursion, unknown→404); harness
admin_cancel_by_worker kills a real session by worker id + asserts unknown→404.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NUq2EVxvKQFq3YEDNLF5HP
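The routing rule in the bullets above (kill locally, fan out only on a local miss, never fan out from a scope=local leg, 404 when nobody owns the worker) can be sketched as a pure Go function. This is an illustration under assumptions, not the handler from the commit: `cancelByWorker`, `peerCancel`, `owns`, and `errNoOwner` are hypothetical names, and the HTTP layer is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoOwner stands in for the 404 case: no replica owns the worker.
var errNoOwner = errors.New("no replica owns this worker")

// peerCancel is a hypothetical callback that asks one peer to cancel with
// scope=local and reports how many sessions that peer destroyed (0 or 1).
type peerCancel func(workerID string) int

// cancelByWorker sketches the routing. killLocal plays the role that
// KillSessionByWorkerID plays on the serving replica.
func cancelByWorker(workerID string, scopeLocal bool, killLocal func(string) int, peers []peerCancel) (int, error) {
	killed := killLocal(workerID)
	// Fan out only when this replica did not own the worker, and never when
	// the request is itself a fan-out leg (scope=local), to avoid recursion.
	if killed == 0 && !scopeLocal {
		for _, p := range peers {
			killed += p(workerID)
		}
	}
	if killed == 0 {
		return 0, errNoOwner
	}
	return killed, nil
}

// owns returns a kill function for a replica that owns exactly one worker id.
func owns(owned string) func(string) int {
	return func(workerID string) int {
		if workerID == owned {
			return 1
		}
		return 0
	}
}

func main() {
	local := owns("w-1")
	peers := []peerCancel{owns("w-2")}

	fmt.Println(cancelByWorker("w-1", false, local, peers)) // local hit, no fan-out
	fmt.Println(cancelByWorker("w-2", false, local, peers)) // peer-owned, found via fan-out
	fmt.Println(cancelByWorker("w-2", true, local, peers))  // scope=local: no fan-out, error
	fmt.Println(cancelByWorker("w-9", false, local, peers)) // unknown worker: error
}
```

The four calls in `main` mirror the cases named for TestCancelByWorkerFansOut: local hit, peer-owned, scope=local, and unknown worker.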
* admin/live: address review nits on cancel-by-worker
- Drop cp_responders/cp_total from the by-worker cancel response: a worker is
owned by exactly one CP, so non-owning peers 404 (dropped by the fetcher) and
the coverage count would undercount and mislead. It's a single-owner op —
return just {killed}. (The per-user kill keeps coverage; it IS an aggregate.)
- harness admin_cancel_by_worker: hold the session (sleep) longer than the
appear-poll budget so a slow cold-start can't exit the client before the
session is observed (still cancels well before the 60s idle timeout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NUq2EVxvKQFq3YEDNLF5HP
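As a rough illustration of the response-shape change in the first nit, the Go sketch below contrasts a single-owner response with an aggregate one. The struct names are hypothetical; only the JSON keys `killed`, `cp_responders`, and `cp_total` come from the text, and the numbers are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cancelResponse models the by-worker cancel result: one owner, so only a
// kill count is meaningful. The type name is an assumption.
type cancelResponse struct {
	Killed int `json:"killed"`
}

// userKillResponse models the per-user kill, which aggregates across CPs and
// therefore keeps the coverage counters. The type name is an assumption.
type userKillResponse struct {
	Killed       int `json:"killed"`
	CPResponders int `json:"cp_responders"`
	CPTotal      int `json:"cp_total"`
}

// encode marshals v to a JSON string and panics on failure, which cannot
// happen for these plain structs.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(encode(cancelResponse{Killed: 1}))
	// {"killed":1}
	fmt.Println(encode(userKillResponse{Killed: 3, CPResponders: 2, CPTotal: 2}))
	// {"killed":3,"cp_responders":2,"cp_total":2}
}
```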
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
  | `GET /api/v1/workers/fleet` | viewer | cluster worker counts by lifecycle state |
  | `GET /api/v1/cluster/instances` | viewer | live CP replicas (self-flagged) |
- | `POST /api/v1/sessions/:pid/cancel` | admin | tear down a session + its worker |
+ | `POST /api/v1/sessions/:pid/cancel` | admin | tear down a session by pid; LOCAL only (pid is per-CP); prefer the worker-id form |
+ | `POST /api/v1/sessions/by-worker/:wid/cancel` | admin | tear down the session on a cluster-unique worker id; fans out to whichever CP owns it (pid can't be fanned out: it collides across CPs). Returns `{killed, cp_responders, cp_total}` |
  | `POST /api/v1/orgs/:id/users/:username/kill` | admin | per-user kill switch (one-shot): tear down ALL of a user's sessions + in-flight queries cluster-wide. Returns `{killed, cp_responders, cp_total}`. Does NOT block reconnects |
  | `POST /api/v1/orgs/:id/users/:username/disable` | admin | persist `disabled=true` (refused at connect on PG wire + Flight), reload the snapshot cluster-wide so the block is immediate, AND kill the user's live sessions. Returns `{disabled, killed, …}` |
  | `POST /api/v1/orgs/:id/users/:username/enable` | admin | persist `disabled=false` + reload cluster-wide so the user can reconnect at once |
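For callers of the worker-id route in the table above, a request could be built along these lines in Go. This is a sketch under assumptions: the base URL `https://admin.example.internal` and the worker id `worker-42` are placeholders, and authentication for the admin role is omitted because the source does not describe it.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// cancelByWorkerRequest builds a POST for the worker-id cancel route.
// baseURL is a placeholder; real deployments would also attach admin
// credentials, which this sketch leaves out.
func cancelByWorkerRequest(baseURL, workerID string) (*http.Request, error) {
	// PathEscape keeps an unusual worker id from altering the route.
	u := baseURL + "/api/v1/sessions/by-worker/" + url.PathEscape(workerID) + "/cancel"
	return http.NewRequest(http.MethodPost, u, nil)
}

func main() {
	req, err := cancelByWorkerRequest("https://admin.example.internal", "worker-42")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
	// POST /api/v1/sessions/by-worker/worker-42/cancel
}
```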