[Jobs] Pin active_workspace to cluster row in terminate_cluster#9808
Conversation
The jobs controller's `terminate_cluster` (used by both cancel and
recovery-teardown paths) calls `core.down` directly. `core.down` →
`backend.teardown` → `_teardown` does a RAW
`backend_utils.check_owner_identity` (cloud_vm_ray_backend.py:4519,
not auto-wrapped). The controller runs in the system/daemon process —
the executor's per-request resolver doesn't fire here (skipped by the
daemon gate in `_should_apply_workspace_resolver`), so
`skypilot_config.get_active_workspace()` falls back to 'default' and
the owner-identity check raises for any cluster whose recorded
workspace is not 'default':
ClusterOwnerIdentityMismatchError: The cluster 'e82-X' is in
workspace 'pub', but the active workspace is 'default'.
Pre-existing latent bug, surfaced more often after #9755 (per-user
preferred workspace) because users no longer need an explicit
`active_workspace:` in `~/.sky/config.yaml` to land on a non-default
workspace.
Fix: in `terminate_cluster`, look up the cluster row once before the
retry loop and wrap `core.down` in
`local_active_workspace_ctx(cluster.workspace)` — the same pattern
`_update_cluster_status_no_lock` already uses for its owner check.
Falls back to `nullcontext()` when the cluster row is gone (core.down
will raise `ClusterDoesNotExist` immediately and we return — no
workspace to pin).
Scope:
* Only `terminate_cluster` is fixed because it's the only controller-
side caller that goes through a RAW (non-auto-wrapped) owner check.
* `controller.py:1935 core.status` and `:1947 core.cancel` route their
owner check through `_update_cluster_status_no_lock`, which already
auto-wraps in `local_active_workspace_ctx(record['workspace'])`.
* `recovery_strategy.py`'s `sdk.launch` / `sdk.exec` / `sdk.cancel`
go HTTP → executor → resolver. The controller's process env carries
the original submitter's `SKYPILOT_USER_ID` / `SKYPILOT_USER`
(injected by `controller_utils.controller_only_vars_to_fill`), so
the executor builds the original `User` row including its
`preferred_workspace` and the resolver picks the right workspace.
* SkyServe replicas + jobs pool reuse the same controller-env
injection and use `sdk.down` (HTTP) for teardown, so no equivalent
bug there.
Tested:
* `pytest tests/unit_tests/test_jobs_utils.py -k terminate_cluster`
— 6 passed (4 existing + 2 new):
- `test_terminate_cluster_pins_active_workspace_from_cluster_record`
mocks `get_cluster_from_name` to return `workspace='team-a'`,
captures `get_active_workspace()` at the `core.down` call time
and asserts 'team-a'. Revert-check: drop the `with workspace_ctx:`
wrap → assertion flips.
- `test_terminate_cluster_no_record_skips_workspace_pin` confirms
the function doesn't crash when the cluster row is already gone.
* Manual: launched a managed job in workspace 'pub' with
`preferred_workspace='pub'`. Both `sky jobs cancel` and triggered
recovery (force-deleted underlying VM) now tear down + re-provision
cleanly. Without the fix both surfaced
ClusterOwnerIdentityMismatchError.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request ensures that the active workspace is pinned to the cluster's recorded workspace before terminating a cluster in terminate_cluster, preventing owner-identity check failures for non-default workspaces. It also adds unit tests to verify this behavior. The review feedback correctly points out that instantiating the context manager outside the retry loop will cause a RuntimeError if a retry occurs, as generator-based context managers cannot be reused. It is recommended to instantiate the context manager inside the while True loop.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
`skypilot_config.local_active_workspace_ctx` is a `@contextlib.contextmanager` generator and cannot be re-entered. The previous version built the ctx once outside the `while True:` retry loop — if `core.down` fails on the first try and the loop retries, the second `with workspace_ctx:` raises `RuntimeError` (from the spent generator) and masks the real underlying failure. Move the ctx construction inside the loop so each retry attempt gets a fresh generator. The DB lookup for `cluster_workspace` stays outside the loop (cluster workspace is immutable for the life of the cluster, no point re-querying per retry). Caught by gemini-code-assist review on PR #9808. Tested: * New `test_terminate_cluster_retry_reenters_workspace_ctx`: mocks `get_cluster_from_name` to return `workspace='team-a'` (so the live ctx path is exercised, not `nullcontext()`), fails `core.down` twice then succeeds. Records `get_active_workspace()` inside each attempt and asserts `['team-a', 'team-a', 'team-a']`. Confirmed FAILS on the prior code with `AttributeError: '_GeneratorContextManager' object has no attribute 'args'` — exactly the symptom Gemini predicted. PASSES after this fix. * `pytest tests/unit_tests/test_jobs_utils.py -k terminate_cluster` — 7 passed (5 prior + the new retry-reentry test). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
aylei
left a comment
There was a problem hiding this comment.
LGTM, thanks @DanielZhangQD !
Summary
Fixes
ClusterOwnerIdentityMismatchErrorraised by managed-job cancel and recovery when the underlying cluster lives in a non-default workspace:Root cause. The jobs controller's
terminate_clustercallscore.downdirectly.core.down→backend.teardown→_teardowndoes a RAWbackend_utils.check_owner_identity(not auto-wrapped). The controller runs in the system/daemon process — the executor's per-request resolver doesn't fire here (skipped by the daemon gate), soskypilot_config.get_active_workspace()falls back to'default'and the check raises for any cluster recorded in another workspace.Pre-existing latent bug, surfaced more often after #9755 (per-user preferred workspace) because users no longer need an explicit
active_workspace:in~/.sky/config.yamlto land on a non-default workspace.Fix. In
terminate_cluster, look up the cluster row once before the retry loop and wrapcore.downinlocal_active_workspace_ctx(cluster.workspace)— same pattern_update_cluster_status_no_lockalready uses for its own owner check.nullcontext()when the cluster row is gone (core.downwill raiseClusterDoesNotExistimmediately and we return).Why other controller-side call sites don't need a fix
controller.py:1935 core.statusand:1947 core.cancelroute their owner check through_update_cluster_status_no_lock, which already auto-wraps inlocal_active_workspace_ctx(record['workspace']).recovery_strategy.py'ssdk.launch/sdk.exec/sdk.cancelgo HTTP → executor → resolver. The controller's process env carries the original submitter'sSKYPILOT_USER_ID/SKYPILOT_USER(injected bycontroller_utils.controller_only_vars_to_fill), so the executor builds the originalUserrow including itspreferred_workspaceand the resolver picks the right workspace.sdk.down(HTTP) for teardown, so no equivalent bug.Test plan
pytest tests/unit_tests/test_jobs_utils.py -k terminate_cluster— 6 passed (4 existing + 2 new):test_terminate_cluster_pins_active_workspace_from_cluster_recordmocksget_cluster_from_nameto returnworkspace='team-a', capturesget_active_workspace()at thecore.downcall time and asserts'team-a'. Revert-check: drop thewith workspace_ctx:wrap → assertion flips.test_terminate_cluster_no_record_skips_workspace_pinconfirms the function doesn't crash when the cluster row is already gone.pubwithpreferred_workspace='pub'. Bothsky jobs canceland triggered recovery (force-deleted underlying VM from cloud console) now tear down + re-provision cleanly. Without the fix both surfacedClusterOwnerIdentityMismatchError.