Skip to content

[Jobs] Pin active_workspace to cluster row in terminate_cluster#9808

Merged
DanielZhangQD merged 2 commits into
masterfrom
jobs-terminate-ws-ctx
Jun 8, 2026
Merged

[Jobs] Pin active_workspace to cluster row in terminate_cluster#9808
DanielZhangQD merged 2 commits into
masterfrom
jobs-terminate-ws-ctx

Conversation

@DanielZhangQD

Copy link
Copy Markdown
Collaborator

Summary

Fixes ClusterOwnerIdentityMismatchError raised by managed-job cancel and recovery when the underlying cluster lives in a non-default workspace:

Failed to terminate the cluster e82-X. Retrying.
Details: ClusterOwnerIdentityMismatchError: The cluster 'e82-X' is in
workspace 'pub', but the active workspace is 'default'.

Root cause. The jobs controller's terminate_cluster calls core.down directly. core.downbackend.teardown_teardown does a RAW backend_utils.check_owner_identity (not auto-wrapped). The controller runs in the system/daemon process — the executor's per-request resolver doesn't fire here (skipped by the daemon gate), so skypilot_config.get_active_workspace() falls back to 'default' and the check raises for any cluster recorded in another workspace.

Pre-existing latent bug, surfaced more often after #9755 (per-user preferred workspace) because users no longer need an explicit active_workspace: in ~/.sky/config.yaml to land on a non-default workspace.

Fix. In terminate_cluster, look up the cluster row once before the retry loop and wrap core.down in local_active_workspace_ctx(cluster.workspace) — same pattern _update_cluster_status_no_lock already uses for its own owner check. nullcontext() when the cluster row is gone (core.down will raise ClusterDoesNotExist immediately and we return).

Why other controller-side call sites don't need a fix

  • controller.py:1935 core.status and :1947 core.cancel route their owner check through _update_cluster_status_no_lock, which already auto-wraps in local_active_workspace_ctx(record['workspace']).
  • recovery_strategy.py's sdk.launch / sdk.exec / sdk.cancel go HTTP → executor → resolver. The controller's process env carries the original submitter's SKYPILOT_USER_ID / SKYPILOT_USER (injected by controller_utils.controller_only_vars_to_fill), so the executor builds the original User row including its preferred_workspace and the resolver picks the right workspace.
  • SkyServe replicas + jobs pool reuse the same controller-env injection and use sdk.down (HTTP) for teardown, so no equivalent bug.

Test plan

  • pytest tests/unit_tests/test_jobs_utils.py -k terminate_cluster6 passed (4 existing + 2 new):
    • test_terminate_cluster_pins_active_workspace_from_cluster_record mocks get_cluster_from_name to return workspace='team-a', captures get_active_workspace() at the core.down call time and asserts 'team-a'. Revert-check: drop the with workspace_ctx: wrap → assertion flips.
    • test_terminate_cluster_no_record_skips_workspace_pin confirms the function doesn't crash when the cluster row is already gone.
  • Manual: launched a managed job in workspace pub with preferred_workspace='pub'. Both sky jobs cancel and triggered recovery (force-deleted underlying VM from cloud console) now tear down + re-provision cleanly. Without the fix both surfaced ClusterOwnerIdentityMismatchError.

The jobs controller's `terminate_cluster` (used by both cancel and
recovery-teardown paths) calls `core.down` directly. `core.down` →
`backend.teardown` → `_teardown` does a RAW
`backend_utils.check_owner_identity` (cloud_vm_ray_backend.py:4519,
not auto-wrapped). The controller runs in the system/daemon process —
the executor's per-request resolver doesn't fire here (skipped by the
daemon gate in `_should_apply_workspace_resolver`), so
`skypilot_config.get_active_workspace()` falls back to 'default' and
the owner-identity check raises for any cluster whose recorded
workspace is not 'default':

    ClusterOwnerIdentityMismatchError: The cluster 'e82-X' is in
    workspace 'pub', but the active workspace is 'default'.

Pre-existing latent bug, surfaced more often after #9755 (per-user
preferred workspace) because users no longer need an explicit
`active_workspace:` in `~/.sky/config.yaml` to land on a non-default
workspace.

Fix: in `terminate_cluster`, look up the cluster row once before the
retry loop and wrap `core.down` in
`local_active_workspace_ctx(cluster.workspace)` — the same pattern
`_update_cluster_status_no_lock` already uses for its owner check.
Falls back to `nullcontext()` when the cluster row is gone (core.down
will raise `ClusterDoesNotExist` immediately and we return — no
workspace to pin).

Scope:
* Only `terminate_cluster` is fixed because it's the only controller-
  side caller that goes through a RAW (non-auto-wrapped) owner check.
* `controller.py:1935 core.status` and `:1947 core.cancel` route their
  owner check through `_update_cluster_status_no_lock`, which already
  auto-wraps in `local_active_workspace_ctx(record['workspace'])`.
* `recovery_strategy.py`'s `sdk.launch` / `sdk.exec` / `sdk.cancel`
  go HTTP → executor → resolver. The controller's process env carries
  the original submitter's `SKYPILOT_USER_ID` / `SKYPILOT_USER`
  (injected by `controller_utils.controller_only_vars_to_fill`), so
  the executor builds the original `User` row including its
  `preferred_workspace` and the resolver picks the right workspace.
* SkyServe replicas + jobs pool reuse the same controller-env
  injection and use `sdk.down` (HTTP) for teardown, so no equivalent
  bug there.

Tested:
* `pytest tests/unit_tests/test_jobs_utils.py -k terminate_cluster`
  — 6 passed (4 existing + 2 new):
  - `test_terminate_cluster_pins_active_workspace_from_cluster_record`
    mocks `get_cluster_from_name` to return `workspace='team-a'`,
    captures `get_active_workspace()` at the `core.down` call time
    and asserts 'team-a'. Revert-check: drop the `with workspace_ctx:`
    wrap → assertion flips.
  - `test_terminate_cluster_no_record_skips_workspace_pin` confirms
    the function doesn't crash when the cluster row is already gone.
* Manual: launched a managed job in workspace 'pub' with
  `preferred_workspace='pub'`. Both `sky jobs cancel` and triggered
  recovery (force-deleted underlying VM) now tear down + re-provision
  cleanly. Without the fix both surfaced
  ClusterOwnerIdentityMismatchError.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request ensures that the active workspace is pinned to the cluster's recorded workspace before terminating a cluster in terminate_cluster, preventing owner-identity check failures for non-default workspaces. It also adds unit tests to verify this behavior. The review feedback correctly points out that instantiating the context manager outside the retry loop will cause a RuntimeError if a retry occurs, as generator-based context managers cannot be reused. It is recommended to instantiate the context manager inside the while True loop.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread sky/jobs/utils.py Outdated
Comment thread sky/jobs/utils.py
`skypilot_config.local_active_workspace_ctx` is a
`@contextlib.contextmanager` generator and cannot be re-entered. The
previous version built the ctx once outside the `while True:` retry
loop — if `core.down` fails on the first try and the loop retries,
the second `with workspace_ctx:` raises `RuntimeError` (from the
spent generator) and masks the real underlying failure.

Move the ctx construction inside the loop so each retry attempt gets
a fresh generator. The DB lookup for `cluster_workspace` stays
outside the loop (cluster workspace is immutable for the life of the
cluster, no point re-querying per retry).

Caught by gemini-code-assist review on PR #9808.

Tested:
* New `test_terminate_cluster_retry_reenters_workspace_ctx`:
  mocks `get_cluster_from_name` to return `workspace='team-a'` (so
  the live ctx path is exercised, not `nullcontext()`), fails
  `core.down` twice then succeeds. Records `get_active_workspace()`
  inside each attempt and asserts `['team-a', 'team-a', 'team-a']`.
  Confirmed FAILS on the prior code with
  `AttributeError: '_GeneratorContextManager' object has no
  attribute 'args'` — exactly the symptom Gemini predicted. PASSES
  after this fix.
* `pytest tests/unit_tests/test_jobs_utils.py -k terminate_cluster`
  — 7 passed (5 prior + the new retry-reentry test).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@DanielZhangQD DanielZhangQD requested a review from aylei June 5, 2026 11:23

@aylei aylei left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @DanielZhangQD !

@DanielZhangQD DanielZhangQD merged commit f999179 into master Jun 8, 2026
21 checks passed
@DanielZhangQD DanielZhangQD deleted the jobs-terminate-ws-ctx branch June 8, 2026 11:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants