[Jobs] Pin active_workspace to cluster row in terminate_cluster by DanielZhangQD · Pull Request #9808 · skypilot-org/skypilot

DanielZhangQD · 2026-06-05T10:04:02Z

Summary

Fixes ClusterOwnerIdentityMismatchError raised by managed-job cancel and recovery when the underlying cluster lives in a non-default workspace:

Failed to terminate the cluster e82-X. Retrying.
Details: ClusterOwnerIdentityMismatchError: The cluster 'e82-X' is in
workspace 'pub', but the active workspace is 'default'.

Root cause. The jobs controller's terminate_cluster calls core.down directly. core.down → backend.teardown → _teardown does a RAW backend_utils.check_owner_identity (not auto-wrapped). The controller runs in the system/daemon process — the executor's per-request resolver doesn't fire here (skipped by the daemon gate), so skypilot_config.get_active_workspace() falls back to 'default' and the check raises for any cluster recorded in another workspace.

Pre-existing latent bug, surfaced more often after #9755 (per-user preferred workspace) because users no longer need an explicit active_workspace: in ~/.sky/config.yaml to land on a non-default workspace.

Fix. In terminate_cluster, look up the cluster row once before the retry loop and wrap core.down in local_active_workspace_ctx(cluster.workspace) — same pattern _update_cluster_status_no_lock already uses for its own owner check. nullcontext() when the cluster row is gone (core.down will raise ClusterDoesNotExist immediately and we return).

Why other controller-side call sites don't need a fix

controller.py:1935 core.status and :1947 core.cancel route their owner check through _update_cluster_status_no_lock, which already auto-wraps in local_active_workspace_ctx(record['workspace']).
recovery_strategy.py's sdk.launch / sdk.exec / sdk.cancel go HTTP → executor → resolver. The controller's process env carries the original submitter's SKYPILOT_USER_ID / SKYPILOT_USER (injected by controller_utils.controller_only_vars_to_fill), so the executor builds the original User row including its preferred_workspace and the resolver picks the right workspace.
SkyServe replicas + jobs pool reuse the same controller-env injection and use sdk.down (HTTP) for teardown, so no equivalent bug.

Test plan

pytest tests/unit_tests/test_jobs_utils.py -k terminate_cluster — 6 passed (4 existing + 2 new):
- test_terminate_cluster_pins_active_workspace_from_cluster_record mocks get_cluster_from_name to return workspace='team-a', captures get_active_workspace() at the core.down call time and asserts 'team-a'. Revert-check: drop the with workspace_ctx: wrap → assertion flips.
- test_terminate_cluster_no_record_skips_workspace_pin confirms the function doesn't crash when the cluster row is already gone.
Manual: launched a managed job in workspace pub with preferred_workspace='pub'. Both sky jobs cancel and triggered recovery (force-deleted underlying VM from cloud console) now tear down + re-provision cleanly. Without the fix both surfaced ClusterOwnerIdentityMismatchError.

The jobs controller's `terminate_cluster` (used by both cancel and recovery-teardown paths) calls `core.down` directly. `core.down` → `backend.teardown` → `_teardown` does a RAW `backend_utils.check_owner_identity` (cloud_vm_ray_backend.py:4519, not auto-wrapped). The controller runs in the system/daemon process — the executor's per-request resolver doesn't fire here (skipped by the daemon gate in `_should_apply_workspace_resolver`), so `skypilot_config.get_active_workspace()` falls back to 'default' and the owner-identity check raises for any cluster whose recorded workspace is not 'default': ClusterOwnerIdentityMismatchError: The cluster 'e82-X' is in workspace 'pub', but the active workspace is 'default'. Pre-existing latent bug, surfaced more often after #9755 (per-user preferred workspace) because users no longer need an explicit `active_workspace:` in `~/.sky/config.yaml` to land on a non-default workspace. Fix: in `terminate_cluster`, look up the cluster row once before the retry loop and wrap `core.down` in `local_active_workspace_ctx(cluster.workspace)` — the same pattern `_update_cluster_status_no_lock` already uses for its owner check. Falls back to `nullcontext()` when the cluster row is gone (core.down will raise `ClusterDoesNotExist` immediately and we return — no workspace to pin). Scope: * Only `terminate_cluster` is fixed because it's the only controller- side caller that goes through a RAW (non-auto-wrapped) owner check. * `controller.py:1935 core.status` and `:1947 core.cancel` route their owner check through `_update_cluster_status_no_lock`, which already auto-wraps in `local_active_workspace_ctx(record['workspace'])`. * `recovery_strategy.py`'s `sdk.launch` / `sdk.exec` / `sdk.cancel` go HTTP → executor → resolver. The controller's process env carries the original submitter's `SKYPILOT_USER_ID` / `SKYPILOT_USER` (injected by `controller_utils.controller_only_vars_to_fill`), so the executor builds the original `User` row including its `preferred_workspace` and the resolver picks the right workspace. * SkyServe replicas + jobs pool reuse the same controller-env injection and use `sdk.down` (HTTP) for teardown, so no equivalent bug there. Tested: * `pytest tests/unit_tests/test_jobs_utils.py -k terminate_cluster` — 6 passed (4 existing + 2 new): - `test_terminate_cluster_pins_active_workspace_from_cluster_record` mocks `get_cluster_from_name` to return `workspace='team-a'`, captures `get_active_workspace()` at the `core.down` call time and asserts 'team-a'. Revert-check: drop the `with workspace_ctx:` wrap → assertion flips. - `test_terminate_cluster_no_record_skips_workspace_pin` confirms the function doesn't crash when the cluster row is already gone. * Manual: launched a managed job in workspace 'pub' with `preferred_workspace='pub'`. Both `sky jobs cancel` and triggered recovery (force-deleted underlying VM) now tear down + re-provision cleanly. Without the fix both surfaced ClusterOwnerIdentityMismatchError. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request ensures that the active workspace is pinned to the cluster's recorded workspace before terminating a cluster in terminate_cluster, preventing owner-identity check failures for non-default workspaces. It also adds unit tests to verify this behavior. The review feedback correctly points out that instantiating the context manager outside the retry loop will cause a RuntimeError if a retry occurs, as generator-based context managers cannot be reused. It is recommended to instantiate the context manager inside the while True loop.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

`skypilot_config.local_active_workspace_ctx` is a `@contextlib.contextmanager` generator and cannot be re-entered. The previous version built the ctx once outside the `while True:` retry loop — if `core.down` fails on the first try and the loop retries, the second `with workspace_ctx:` raises `RuntimeError` (from the spent generator) and masks the real underlying failure. Move the ctx construction inside the loop so each retry attempt gets a fresh generator. The DB lookup for `cluster_workspace` stays outside the loop (cluster workspace is immutable for the life of the cluster, no point re-querying per retry). Caught by gemini-code-assist review on PR #9808. Tested: * New `test_terminate_cluster_retry_reenters_workspace_ctx`: mocks `get_cluster_from_name` to return `workspace='team-a'` (so the live ctx path is exercised, not `nullcontext()`), fails `core.down` twice then succeeds. Records `get_active_workspace()` inside each attempt and asserts `['team-a', 'team-a', 'team-a']`. Confirmed FAILS on the prior code with `AttributeError: '_GeneratorContextManager' object has no attribute 'args'` — exactly the symptom Gemini predicted. PASSES after this fix. * `pytest tests/unit_tests/test_jobs_utils.py -k terminate_cluster` — 7 passed (5 prior + the new retry-reentry test). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

aylei

LGTM, thanks @DanielZhangQD !

gemini-code-assist Bot reviewed Jun 5, 2026

View reviewed changes

Comment thread sky/jobs/utils.py Outdated

Comment thread sky/jobs/utils.py

DanielZhangQD requested a review from aylei June 5, 2026 11:23

aylei approved these changes Jun 8, 2026

View reviewed changes

DanielZhangQD merged commit f999179 into master Jun 8, 2026
21 checks passed

DanielZhangQD deleted the jobs-terminate-ws-ctx branch June 8, 2026 11:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Jobs] Pin active_workspace to cluster row in terminate_cluster#9808

[Jobs] Pin active_workspace to cluster row in terminate_cluster#9808
DanielZhangQD merged 2 commits into
masterfrom
jobs-terminate-ws-ctx

DanielZhangQD commented Jun 5, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

aylei left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

DanielZhangQD commented Jun 5, 2026

Summary

Why other controller-side call sites don't need a fix

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

aylei left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants