[Workspaces] Auto-select / persist per-user default workspace#9755
Merged
Conversation
cd95497 to
418ba4d
Compare
Collaborator
Author
|
/smoke-test --kubernetes -k test_workspace_switching |
Collaborator
Author
|
/quicktest-core --kubernetes --base-branch master |
Add per-user `preferred_workspace` so users with access to multiple
workspaces don't get cryptic "permission denied to default" errors on
launch, and can persist a default once via the CLI instead of editing
`~/.sky/config.yaml` on every machine.
What it does
------------
* New column `users.preferred_workspace` (migration 019, additive,
default null).
* New resolver `sky/workspaces/core.py:resolve_workspace_for_user()` with
documented precedence:
1. explicit (--workspace flag / active_workspace in merged config)
2. preferred_workspace (if still accessible — re-validated at use)
3. 'default' if accessible (back-compat safety net for legacy
multi-workspace users and admins; without this, every existing
multi-ws user would AMBIGUOUS on upgrade)
4. exactly one accessible workspace — auto-select
5. zero accessible — NoWorkspaceAccessError
6. multiple, no preference, no 'default' access —
WorkspaceAmbiguousError (returns guidance, never silently picks)
* Resolver wired into `override_request_env_and_config` with three skip
gates: daemon requests, old clients
(MIN_PREFERRED_WORKSPACE_API_VERSION), and explicitly-set
active_workspace. Reads preferred from the User dataclass already
loaded by `add_or_update_user(return_user=True)` — no extra DB read
per request.
* Endpoint `POST /users/me/workspace { preferred: str|null }` echoes
the persisted value back. No GET — preferred ships inline on the User
row returned by `/api/health`.
* CLI:
- `sky workspace use <name>` / `sky workspace use --clear`
- `--workspace/-w` on `sky launch` and `sky jobs launch` (sugar over
`--config active_workspace=X` via a thin click callback)
- `sky api info` shows `Preferred workspace: 'X'` (or `(not set)`),
read from the existing `/api/health` payload
* D8 relaxation: cluster-op workspace check now uses RBAC access, not
active-workspace equality — without this, switching preferred would
lose ability to manage previously-launched clusters.
* `WorkspaceAmbiguousError` carries an optional `note` for RBAC drift
("preferred 'team-x' not accessible") so users understand why their
preference wasn't honored.
What it does NOT include
------------------------
* No dashboard UI for setting preferred. First iteration shipped a
user-menu picker; dropped because (a) the OSS sidebar is replaced
wholesale by feature-plugin's Sidebar.tsx in enterprise installs, so
the picker code is dead there, and (b) a flat list of N workspaces
in the user menu is the wrong density for a setting users touch
once. CLI-only entry point is sufficient. A future `/settings`-page
surface can read the same `/api/health` field and POST to the same
endpoint with no server change.
* Sky Serve workspace-awareness is not touched (separate scope).
* Admin assignment (SKY-4210) deferred — additive follow-up on the
same column.
Test plan
---------
Unit tests (55 pass):
* `test_resolve_workspace_for_user.py` — every precedence branch +
RBAC drift + admin case + exception class signatures
* `test_resolver_compat_matrix.py` — 41-case parametric matrix
asserting the back-compat invariant: every case where the legacy
server succeeded still succeeds on the new resolver with the same
workspace pick, when the user has not opted in
* `test_preferred_workspace_endpoints.py` — POST happy path, 401/403/
404 surfaces, version-gate constant, daemon/old-client/explicit
skip gates on the executor
* `test_cli_workspace.py` — `sky workspace use`, `--workspace` flag
callback, AMBIGUOUS message round-trip via CLI
* `test_viewer_route_coverage.py` — POST registered as viewer-denied
Smoke (`test_workspace_e2e_via_cli`): end-to-end CLI →
`/users/me/workspace` → resolver → cluster.workspace recording, with
three resolver branches exercised (preferred, explicit, default-
fallback) and `sky api info` round-tripping the persisted value.
The smoke test `test_workspace_switching` is updated for D8 (no
mismatch error on cross-workspace cluster ops; access check only).
Local: `bash format.sh` clean on changed files (pylint 9.94/10).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`sky api info` now prints a `Preferred workspace: (not set)` line between the `User:` and `Endpoint` lines (reads `preferred_workspace` from the /api/health user row). The existing test asserted indexes against the pre-change layout and broke in CI: Python Tests - CLI Tests / Limited Deps - Jobs, Serve & CLI (3.9) tests/test_cli.py::TestWithNoCloudEnabled::test_cli_api_info AssertionError: assert '├── Preferre...ce: (not set)' == '└── Endpoint...' Shift indexes: Endpoint line moves from output[4] to output[5], total length 6 → 7. Add an explicit assertion on the new line so future formatting changes catch their own regression. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Five fixes prompted by Gemini and Devin review comments on the PR, plus their regression tests: 1) ContextVar does not propagate to worker processes — resolver gate never fires (Devin) `_should_apply_workspace_resolver` read the client API version via `versions.get_remote_api_version()`, a `contextvars.ContextVar` set by `APIVersionMiddleware` in the FastAPI dispatch process. Request execution actually runs in a worker spawned by `BurstableExecutor` (a `ProcessPoolExecutor`); ContextVar values do not propagate across process boundaries, so the worker always saw `None` and the gate treated every request as "old client" — silently disabling the new resolver. Fix: add `client_api_version: Optional[int]` to `RequestBody` defaulting to `server_constants.API_VERSION` at client construction time. The field is serialized to the request DB and read back in the worker, so the gate has the real value without depending on ContextVar propagation. The gate signature becomes `(is_daemon, client_api_version)`. Old clients (pre-this-PR) do not set the field; Pydantic defaults to None and the gate treats them as "old client" — same legacy semantics as before. Drops the now unused `versions` import from `executor.py`. 2) `set_preferred_workspace` SDK method missing `@versions.minimal_api_version(...)` decorator (Devin) New client talking to an older server hit a confusing HTTP 404 on the unknown `/users/me/workspace` route instead of a clean `APINotSupportedError`. Added the decorator pointing at `MIN_PREFERRED_WORKSPACE_API_VERSION` so the version-skew error surfaces with the SDK method name. 3) `WorkspaceAmbiguousError` corrupted by pickle round-trip (Devin) The exception's executor-side pickle path (`sky/server/requests/serializers/encoders.py`) reconstructs it on the client. Python's default exception pickle calls `cls(*self.args)` where `self.args` is the formatted message string — `sorted(message_string)` then sorted individual characters into `accessible`, and `super().__init__` rebuilt a garbled guidance message. Added `__reduce__` returning `(cls, (self.accessible, self.note))` so the original constructor arguments survive the round-trip. (Matches the pattern used by `ExecutionRetryableError`.) 4) Drop dead `get_user_preferred_workspace` from `global_user_state.py` (Gemini) Gemini suggested `.scalar()` over `filter_by(...).first().attr`, but the function had no remaining callers — resolver now reads `user.preferred_workspace` directly from the `User` dataclass already loaded by `add_or_update_user(return_user=True)`. Deleted the function entirely (also moots the `.scalar()` suggestion). 5) `tests/unit_tests/test_lifecycle_hooks.py` was brittle (incidental — surfaced when this PR bumped API_VERSION) `test_tail_hook_logs_minimal_api_version_required` mocked the remote version to `server_constants.API_VERSION - 1`. That formula only fires the `tail_hook_logs` decorator (`@minimal_api_version (52)`) when API_VERSION is exactly 52; this PR bumped it to 53, so `53 - 1 = 52` no longer trips the gate and the test fell through to a real cluster lookup. Pinned the mock to `51` (one less than the function's own minimum) so the test is not coupled to global API_VERSION bumps. Regression tests added: * `TestSetPreferredWorkspaceSdkVersionGate` — exercises the decorator chain on the SDK function with a pinned old remote version. * `TestRequestBodyCarriesClientApiVersion` — three tests covering default population, explicit override, and Pydantic JSON round-trip of the new field. * `TestWorkspaceAmbiguousErrorSignature.test_pickle_roundtrip...` — pickles the exception with accessible + note, asserts both attributes and `str(e)` survive intact. Validation: * 60+ UTs pass locally; `bash format.sh` clean (pylint 9.94/10). * Pickle round-trip verified empirically with `python -c` repro before/after the `__reduce__` fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`reviewable` CI job runs `git diff --exit-code agent/skills/skypilot/references/` after `agent/scripts/generate_references.py` and was failing on this PR because the new surface (--workspace flag on sky launch / sky jobs launch, sky.set_preferred_workspace SDK) was not reflected in the auto-generated reference docs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Devin review caught that `local_active_workspace_ctx` in sky/skypilot_config.py does not wrap its `yield` in try/finally, so a caller whose `with` block raises leaks the new thread-local workspace value into subsequent code paths in the same worker process (ProcessPoolExecutor reuses workers). This is a pre-existing latent bug, not introduced by this PR — but this PR's new executor caller in `override_request_env_and_config` deliberately raises PermissionDeniedError from `reject_request_for_unauthorized_workspace` for workspace-rejection, making the leak path part of normal flow rather than an edge case. The practical impact on the resolver gate is masked because `override_skypilot_config` already rebinds the global `_active_workspace_context` to a fresh `threading.local()` at the start of every request, so the leak does not actually corrupt the next request's gate. But the function itself should be correct contextmanager shape, and the other ~10 callers (sky check, core launch, jobs server, serve server, backend_utils) also benefit. Fix: wrap the yield in try/finally so the original workspace is restored even when an exception escapes. Regression test added: TestIsActiveWorkspaceSet .test_local_active_workspace_ctx_cleanup_on_exception — uses a synthetic exception inside `with local_active_workspace_ctx(...)` and asserts the post-condition equals the pre-condition value. Verified the test fails when the try/finally is removed (revert check via standalone repro). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two related bugs caught by Devin's review on the previous push, both
guarded by new regression tests.
1) WorkspaceAmbiguousError fails JSON deserialization (multiple values
for 'accessible')
The request executor's exception transit is NOT direct pickle — it
is `serialize_exception` -> dict {args, attributes, ...} ->
`pickle_and_encode(dict)` -> request DB -> `decode_and_unpickle`
-> `deserialize_exception` which reconstructs via
`cls(*args, **attributes)`. For this exception:
args = (formatted_message,) # from super().__init__(msg)
attributes = {'accessible': [...], 'note': ...} # from __dict__
ctor = (accessible, note=None)
`cls(formatted_message, accessible=[...], note=...)` -> TypeError:
__init__() got multiple values for argument 'accessible'. Result:
instead of the helpful "You belong to multiple workspaces… run
`sky workspace use`" guidance, the affected client sees a TypeError
— the feature is unusable for the exact users it targets.
The previous `__reduce__` override I added only fixed the direct
pickle path; the JSON serialize path runs through
`serialize_exception` which ignores `__reduce__` entirely.
Fix: extend SkyPilotExcludeArgsBaseException (existing marker in
sky/exceptions.py) so `serialize_exception` zeros out `args`. Then
deserialize calls `cls(**attributes)` cleanly. The `__reduce__`
override is kept for the direct-pickle path (defense in depth).
Regression test
`test_serialize_deserialize_roundtrip_via_request_executor_path`
exercises the actual serialize_exception → deserialize_exception
path and asserts args==() + full str(e) round-trips.
2) `client_api_version` server-side default-fill bypasses the gate
for old clients
`RequestBody.__init__` unconditionally filled `client_api_version`
with `server_constants.API_VERSION` if absent. Correct on the
CLIENT side (the client stamps its own version). WRONG on the
SERVER side: when an OLD client (pre-this-PR) sends a body
omitting the field, Pydantic deserialization calls `__init__`
again on the server, and the server's API_VERSION (53) gets
stamped — making the executor gate think it's a new client. The
resolver runs and may raise `WorkspaceAmbiguousError`, which the
old client cannot deserialize.
Fix: only apply the default-fill when `annotations.is_on_api_server`
is False (i.e., inside a `@client_api` SDK wrapper). On the server
side the field stays at its Pydantic-declared `None` default for
missing input, which the executor gate correctly interprets as
"old client → skip resolver".
Regression tests
`test_server_side_parse_of_old_client_body_leaves_field_none` and
`test_server_side_parse_of_new_client_body_preserves_field` cover
both legs of the wire format (missing field stays None; present
field is preserved). `test_default_construction_populates_local_api_version`
was renamed to `test_client_construction_stamps_local_api_version`
and now explicitly flips `is_on_api_server` to False to mimic the
`@client_api` runtime context.
Validation:
* 62 UTs pass locally (60 + 2 new server-side parse tests).
* Empirical repro before/after both fixes verified with `python -c`
scripts driving `serialize_exception/deserialize_exception` and
`RequestBody.model_validate_json` with the old-client JSON shape.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three related fixes uncovered while taking the dashboard end-to-end
through `test00` (a user with no 'default' workspace access).
1) Header-driven client version (replaces in-body client-side stamping)
Previously `RequestBody.__init__` stamped `client_api_version` on
the client side (gated by `annotations.is_on_api_server`). That
worked for the Python SDK but left the dashboard out — dashboard
JS does not construct `RequestBody`. Dropping the guard would not
help either, because server-side Pydantic deserialization of an
old client's body would then silently populate the field with the
server's own API_VERSION.
Single-source-of-truth design: the SDK already sends
`X-SkyPilot-API-Version` / `X-SkyPilot-Version` headers on every
request; `APIVersionMiddleware` parses them and sets the
`_remote_api_version` ContextVar in the FastAPI dispatch context.
`prepare_request_async` now reads that ContextVar and stamps it
onto the request body before persisting — one place, every
transport. Client-side `__init__` stamping is removed.
For the dashboard, the same two version headers are added to all
outgoing requests through `apiClient` plus three direct-fetch
sites that bypass apiClient and POST to queued routes
(`/jobs/download_logs`, `ssh_node_pools/{deploy,down}`).
`_check_version_compatibility` requires BOTH headers, so the
dashboard sends a placeholder readable version (`'dashboard;'`)
alongside `CLIENT_API_VERSION`.
2) Thread-local restoration in `local_active_workspace_ctx`
The previous try/finally fix still restored the wrong value on
exit: `get_active_workspace()` falls back to the literal
`SKYPILOT_DEFAULT_WORKSPACE` ('default') when nothing is set, so
a worker that handled one request entered the ctx with no
attribute and exited with the attribute SET to 'default'. The
next request from the same worker (especially dashboard requests
with empty `override_skypilot_config`, which short-circuits the
global rebind in `override_skypilot_config`) saw
`is_active_workspace_set()` == True, skipped the resolver, and
rejected users without 'default' access.
Now restores the thread-local attribute's STATE (had / didn't
have), not the resolved string. Downstream `get_active_workspace`
still chains to layer-2 / layer-3 fallbacks correctly.
3) CLI / SDK no longer hard-code 'default' into outgoing requests
`sky status` (cli/command.py) and `sdk.enabled_clouds` read
`get_active_workspace()` client-side and passed the result as an
explicit `?workspace=...`. With nothing set on the client this
resolved to the literal 'default' and was stamped on the wire as
explicit intent — which the server's resolver gate (c) honors,
skipping the resolver and rejecting users without 'default'
access. Both sites now send the workspace only when
`is_active_workspace_set()` is True; otherwise None and the
server-side resolver picks. `_show_enabled_infra` accepts
`Optional[str]` and skips the `(workspace: 'X')` annotation when
None, since the resolved workspace happens server-side.
Regression tests
* `test_local_active_workspace_ctx_removes_attr_when_originally_unset`
— enter with no prior attribute, exit, assert the attribute is
GONE and `is_active_workspace_set` reports False.
* `TestPrepareRequestAsyncStampsClientApiVersion` — exercises the
ContextVar -> body stamping with header-present and -absent.
* Rewrote `TestRequestBodyCarriesClientApiVersion` to assert no
client-side stamping happens in `__init__`.
Validation
* 65 UT pass locally.
* End-to-end: `sky status` and the dashboard at
`test00:test00@127.0.0.1:46580` (user with access only to `pub`)
now succeed where they previously returned
`PermissionDeniedError: ... 'default'`. Server log confirms
`resolved workspace 'pub' from single-membership for user test00`
on every queued request.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…fault' access
When a managed job is submitted, the recorded `workspace` was read via
`get_active_workspace(force_user_workspace=True)`, which deliberately
bypasses the per-thread workspace context and falls back to the
config-side value (or the literal `'default'` when nothing is set).
That hack predates the per-user workspace resolver introduced by this
PR: the resolver now sets the thread-local at the request boundary
to whichever workspace the user actually intends for this request, so
force-bypassing it instead records the legacy `'default'` literal —
silently breaking users who have no access to the `'default'`
workspace (the exact scenario SKY-5468 targets).
Concretely, a user with access only to `'pub'` would submit a
managed job and see:
cluster_table.workspace = 'pub' (recorded via the resolver-set
thread-local)
job_info_table.workspace = 'default' (recorded via
force_user_workspace=True)
and the controller would then reject the job at the
`reject_request_for_unauthorized_workspace` check.
Fix: drop `force_user_workspace=True` at the two job-submission sites
that write `job_info.workspace`:
* `_maybe_submit_job_locally` (consolidation mode, jobs/server/core.py).
The call point is before any nested
`local_active_workspace_ctx(...)` is entered, so the thread-local
here is the resolver's pick.
* `_submit_remotely` (non-consolidation mode, jobs/server/core.py).
`_ensure_controller_up` has already entered AND exited its own
controller-provisioning workspace ctx by this line, so the
thread-local is restored to the outer (resolver-set) workspace.
`_exec_code_on_head` also reads `force_user_workspace=True` to populate
`ManagedJobInfo.workspace` on the gRPC `QueueJobRequest`, but the
skylet-side handler in `services.py::QueueJob` does not currently
read `request.managed_job` at all — the protobuf field is defined for
future use but is dead on the wire today. Left untouched aside from a
short comment so a future skylet that wires it up can find the
context.
Validation
* Existing UTs pass (65).
* `sky status` for a `test00` user with no `default` workspace access
succeeds end-to-end against the local API server, where it
previously hit
`PermissionDeniedError: User test00 does not have permission to
access workspace 'default'` on the managed-jobs view path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…surfaces Auth middlewares (BasicAuth, AuthProxy, BearerToken, plugin-supplied) populate request.state.auth_user from headers/JWT/credentials. The header-only paths (AuthProxy and BearerToken) build the User from id + name + type and never load the persisted preferred_workspace column. Worker-side reads were already fine — executor.prepare_request_async runs add_or_update_user(return_user=True) before the resolver sees the user — but /api/health is synchronous and returned the middleware- built User verbatim, which made `sky api info` print "(not set)" even right after `sky workspace use <name>` succeeded. Refresh the User from the DB inside health() so the response reflects the persisted preference regardless of which middleware authenticated the request. Tests: 3 new tests in test_server.py covering the refresh, the DB-miss fallback, and the unauthenticated path's skip of the lookup. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The executor's reload_for_new_request pipeline refreshes config and the request-scoped lru cache on every queued worker request, but plugin endpoints that respond as sync FastAPI handlers (i.e. not queued via executor.schedule_request_async / prepare_request_async) bypass it. Their process-cached workspace config snapshot can therefore lag behind a worker process that just ran sky workspace create/update — a sync handler filtering by accessible_workspaces will exclude newly-created workspaces and include workspaces that have since been removed. Add a refresh_workspace_state_for_sync_handler() helper next to reload_for_new_request that plugin sync handlers call before reading workspace/RBAC state. It does the minimum subset of the executor's reload: safe_reload_config() to pick up new YAML/DB state, plus clear_request_level_cache() to flush the per-request _load_workspaces lru cache so the fresh config is observed. Casbin's in-memory enforcer is refreshed lazily by permission_service.get_user_roles (called inside get_accessible_workspace_names), so it doesn't need a separate refresh here. Companion fix in feature-plugin wires this into the pagination and secrets-manager plugins; both have sync handlers that read workspaces_core.get_workspaces() / check_workspace_permission and were returning stale lists post-workspace-update. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
9f3d5a2 to
bcadb83
Compare
Collaborator
Author
|
/quicktest-core --kubernetes --base-branch master |
Collaborator
Author
|
/smoke-test --kubernetes |
Collaborator
Author
|
/quicktest-core --kubernetes --base-branch master |
DanielZhangQD
added a commit
that referenced
this pull request
Jun 4, 2026
… shared constants (#9797) * [Workspaces] Polish per-user resolver UX: scoped error, surfaced log, shared constants Follow-ups after the per-user workspace resolver merge (#9755): * AMBIGUOUS recovery message no longer advertises `--workspace` as a universal fix — it only exists on `sky launch` / `sky jobs launch`, so callers running `sky status` / `sky queue` etc. were being pointed at a flag that doesn't apply. The three recovery options now read as parallel choices instead of replacements ("Or, for a one-shot override..."). * On `sky launch` / `sky jobs launch`, when the resolver implicitly picks PREFERRED or SINGLE_MEMBERSHIP, log "Using workspace 'X' (source: ...)" at INFO so users see where their cluster / job actually landed. EXPLICIT (user-named) and DEFAULT_FALLBACK (pre-existing implicit behavior) stay silent to avoid noise on the common cases. The whitelist correctly uses the prefixed task name (`sky.launch`) — the previous draft compared against the bare enum value and never matched at runtime. * `sky workspace info` payload contract: when the resolver raises WorkspaceAmbiguousError / NoWorkspaceAccessError / PermissionDenied, the handler now returns a state-coded `source` (ambiguous / no-access / permission-denied) plus a short single-line `note`, instead of dumping the multi-line exception message into `note` (which broke the CLI tree renderer). The CLI appends the AMBIGUOUS recovery hint as a separate paragraph below the tree, via a `WorkspaceAmbiguousError.recovery_hint()` classmethod shared between the exception and the CLI. * Move the `WORKSPACE_SOURCE_*` constants to a new leaf module `sky/workspaces/constants.py` so client-side callers (CLI) no longer drag the resolver's server-only deps (check, global_user_state, backend_utils, permission, resource_checker) into a read path. Tested: * `pytest tests/unit_tests/test_sky/test_cli_workspace.py tests/unit_tests/test_sky/users/test_preferred_workspace_endpoints.py tests/unit_tests/test_sky/workspaces/test_resolve_workspace_for_user.py tests/unit_tests/test_sky/server/requests/test_executor.py tests/unit_tests/test_sky/workspaces/test_resolver_compat_matrix.py` — 103 passed, including new regressions: - bare `'launch'` (no `sky.` prefix) MUST NOT match the whitelist (pins the prefix protocol from both sides) - DEFAULT_FALLBACK must stay silent (locks the new silent set) - AMBIGUOUS message keeps `--workspace` only in the launch footnote - ambiguous / no-access / permission-denied payloads carry their state code + short note Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * address comment --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2 tasks
DanielZhangQD
added a commit
that referenced
this pull request
Jun 8, 2026
* [Jobs] Pin active_workspace to cluster row in terminate_cluster
The jobs controller's `terminate_cluster` (used by both cancel and
recovery-teardown paths) calls `core.down` directly. `core.down` →
`backend.teardown` → `_teardown` does a RAW
`backend_utils.check_owner_identity` (cloud_vm_ray_backend.py:4519,
not auto-wrapped). The controller runs in the system/daemon process —
the executor's per-request resolver doesn't fire here (skipped by the
daemon gate in `_should_apply_workspace_resolver`), so
`skypilot_config.get_active_workspace()` falls back to 'default' and
the owner-identity check raises for any cluster whose recorded
workspace is not 'default':
ClusterOwnerIdentityMismatchError: The cluster 'e82-X' is in
workspace 'pub', but the active workspace is 'default'.
Pre-existing latent bug, surfaced more often after #9755 (per-user
preferred workspace) because users no longer need an explicit
`active_workspace:` in `~/.sky/config.yaml` to land on a non-default
workspace.
Fix: in `terminate_cluster`, look up the cluster row once before the
retry loop and wrap `core.down` in
`local_active_workspace_ctx(cluster.workspace)` — the same pattern
`_update_cluster_status_no_lock` already uses for its owner check.
Falls back to `nullcontext()` when the cluster row is gone (core.down
will raise `ClusterDoesNotExist` immediately and we return — no
workspace to pin).
Scope:
* Only `terminate_cluster` is fixed because it's the only controller-
side caller that goes through a RAW (non-auto-wrapped) owner check.
* `controller.py:1935 core.status` and `:1947 core.cancel` route their
owner check through `_update_cluster_status_no_lock`, which already
auto-wraps in `local_active_workspace_ctx(record['workspace'])`.
* `recovery_strategy.py`'s `sdk.launch` / `sdk.exec` / `sdk.cancel`
go HTTP → executor → resolver. The controller's process env carries
the original submitter's `SKYPILOT_USER_ID` / `SKYPILOT_USER`
(injected by `controller_utils.controller_only_vars_to_fill`), so
the executor builds the original `User` row including its
`preferred_workspace` and the resolver picks the right workspace.
* SkyServe replicas + jobs pool reuse the same controller-env
injection and use `sdk.down` (HTTP) for teardown, so no equivalent
bug there.
Tested:
* `pytest tests/unit_tests/test_jobs_utils.py -k terminate_cluster`
— 6 passed (4 existing + 2 new):
- `test_terminate_cluster_pins_active_workspace_from_cluster_record`
mocks `get_cluster_from_name` to return `workspace='team-a'`,
captures `get_active_workspace()` at the `core.down` call time
and asserts 'team-a'. Revert-check: drop the `with workspace_ctx:`
wrap → assertion flips.
- `test_terminate_cluster_no_record_skips_workspace_pin` confirms
the function doesn't crash when the cluster row is already gone.
* Manual: launched a managed job in workspace 'pub' with
`preferred_workspace='pub'`. Both `sky jobs cancel` and triggered
recovery (force-deleted underlying VM) now tear down + re-provision
cleanly. Without the fix both surfaced
ClusterOwnerIdentityMismatchError.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* [Jobs] Construct workspace_ctx inside retry loop
`skypilot_config.local_active_workspace_ctx` is a
`@contextlib.contextmanager` generator and cannot be re-entered. The
previous version built the ctx once outside the `while True:` retry
loop — if `core.down` fails on the first try and the loop retries,
the second `with workspace_ctx:` raises `RuntimeError` (from the
spent generator) and masks the real underlying failure.
Move the ctx construction inside the loop so each retry attempt gets
a fresh generator. The DB lookup for `cluster_workspace` stays
outside the loop (cluster workspace is immutable for the life of the
cluster, no point re-querying per retry).
Caught by gemini-code-assist review on PR #9808.
Tested:
* New `test_terminate_cluster_retry_reenters_workspace_ctx`:
mocks `get_cluster_from_name` to return `workspace='team-a'` (so
the live ctx path is exercised, not `nullcontext()`), fails
`core.down` twice then succeeds. Records `get_active_workspace()`
inside each attempt and asserts `['team-a', 'team-a', 'team-a']`.
Confirmed FAILS on the prior code with
`AttributeError: '_GeneratorContextManager' object has no
attribute 'args'` — exactly the symptom Gemini predicted. PASSES
after this fix.
* `pytest tests/unit_tests/test_jobs_utils.py -k terminate_cluster`
— 7 passed (5 prior + the new retry-reentry test).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add per-user
preferred_workspaceso users with access to multiple workspaces stop getting cryptic "permission denied to 'default'" on launch, and can persist a default once via the CLI instead of editing~/.sky/config.yamlon every machine.The fix targets users for whom the implicit
'default'is not valid (no access). Legacy multi-workspace users and admins for whom'default'is valid keep today's behavior — no surprise regressions on upgrade.Resolver
sky/workspaces/core.py:resolve_workspace_for_user()with documented precedence:--workspaceflag /active_workspacein merged client confignoteexplaining why)'default'if accessible — back-compat safety net (without this, every existing multi-ws user and every admin wouldAMBIGUOUSon upgrade)NoWorkspaceAccessError'default'access —WorkspaceAmbiguousError(returns guidance, never silently picks)Resolver runs in
override_request_env_and_configbehind three skip gates: daemon requests, old clients (MIN_PREFERRED_WORKSPACE_API_VERSION = 53), and explicitly-setactive_workspace. Reads preferred from theUserdataclass already loaded byadd_or_update_user(return_user=True)— no extra DB read per request. Net hot-path increment over master is one batch Casbin call.CLI
sky workspace use <name>— set preferred (RBAC-validated, errors with 403/404 surfacing)sky workspace use --clear— unset--workspace/-wonsky launchandsky jobs launch— sugar over--config active_workspace=Xvia a thin click callbacksky workspace info— Shows the workspace your next request lands in by default, plus your saved preferred and the workspaces you can access.Endpoint
POST /users/me/workspace { preferred: str|null }— echoes the persisted value backGET— Shows the workspace your next request lands in by default, plus your saved preferred and the workspaces you can access.Tested
Unit tests (55 pass, ~9s):
test_resolve_workspace_for_user.py— every precedence branch + RBAC drift + admin case + exception class signaturestest_resolver_compat_matrix.py— 41-case parametric matrix asserting the back-compat invariant: every case where the legacy server succeeded still succeeds on the new resolver with the same workspace pick when the user has not opted intest_preferred_workspace_endpoints.py—POSThappy path, 401 / 403 / 404 surfaces, version-gate constant, daemon / old-client / explicit skip gates on the executortest_cli_workspace.py—sky workspace use,--workspaceflag callback,AMBIGUOUSmessage round-trip via CLItest_viewer_route_coverage.py—POSTregistered as viewer-deniedLocal manual verification:
bash format.shclean on changed files (pylint 9.94/10, no new diagnostics).🤖 Generated with Claude Code