[Workspaces] Auto-select / persist per-user default workspace by DanielZhangQD · Pull Request #9755 · skypilot-org/skypilot

DanielZhangQD · 2026-05-29T10:33:25Z

Summary

Add per-user preferred_workspace so users with access to multiple workspaces stop getting cryptic "permission denied to 'default'" on launch, and can persist a default once via the CLI instead of editing ~/.sky/config.yaml on every machine.

The fix targets users for whom the implicit 'default' is not valid (no access). Legacy multi-workspace users and admins for whom 'default' is valid keep today's behavior — no surprise regressions on upgrade.

Resolver

sky/workspaces/core.py:resolve_workspace_for_user() with documented precedence:

explicit — --workspace flag / active_workspace in merged client config
preferred_workspace — if still accessible (RBAC re-validated at use; drifted preferences fall through with a note explaining why)
'default' if accessible — back-compat safety net (without this, every existing multi-ws user and every admin would AMBIGUOUS on upgrade)
exactly one accessible workspace — auto-select
zero accessible — NoWorkspaceAccessError
multiple, no preference, no 'default' access — WorkspaceAmbiguousError (returns guidance, never silently picks)
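The precedence above can be sketched in a few lines of Python. This is an illustrative model only, not the code in `sky/workspaces/core.py`: the function name `resolve_workspace`, the source labels, and the exception constructors are assumptions made for the example, and the RBAC check on an explicit workspace is assumed to happen elsewhere.

```python
from typing import List, Optional, Tuple


class NoWorkspaceAccessError(Exception):
    """Stand-in for the PR's exception; the real signature may differ."""


class WorkspaceAmbiguousError(Exception):
    """Stand-in; carries the accessible list and an optional drift note."""

    def __init__(self, accessible: List[str], note: Optional[str] = None):
        self.accessible = sorted(accessible)
        self.note = note
        super().__init__(f'Multiple workspaces: {", ".join(self.accessible)}')


def resolve_workspace(explicit: Optional[str], preferred: Optional[str],
                      accessible: List[str]) -> Tuple[str, str]:
    """Return (workspace, source) following the documented precedence."""
    note = None
    # 1. An explicit choice always wins (access is validated elsewhere).
    if explicit is not None:
        return explicit, 'explicit'
    # 2. The saved preference, only if the user can still access it.
    if preferred is not None:
        if preferred in accessible:
            return preferred, 'preferred'
        note = f"preferred '{preferred}' not accessible"
    # 3. Back-compat safety net.
    if 'default' in accessible:
        return 'default', 'default-fallback'
    # 4. Exactly one accessible workspace: auto-select it.
    if len(accessible) == 1:
        return accessible[0], 'single-membership'
    # 5. Nothing accessible at all.
    if not accessible:
        raise NoWorkspaceAccessError('No accessible workspace.')
    # 6. Several candidates and nothing to break the tie: never guess.
    raise WorkspaceAmbiguousError(accessible, note=note)
```

A drifted preference falls through to the later branches, and the note only surfaces if the result ends up ambiguous.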

Resolver runs in override_request_env_and_config behind three skip gates: daemon requests, old clients (MIN_PREFERRED_WORKSPACE_API_VERSION = 53), and explicitly-set active_workspace. Reads preferred from the User dataclass already loaded by add_or_update_user(return_user=True) — no extra DB read per request. Net hot-path increment over master is one batch Casbin call.
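As a rough illustration of the three skip gates, the sketch below folds them into one predicate. The function name `should_apply_resolver` and its third parameter are hypothetical; only the constant's name and value (53) come from this PR.

```python
from typing import Optional

# Value stated in this PR description.
MIN_PREFERRED_WORKSPACE_API_VERSION = 53


def should_apply_resolver(is_daemon: bool,
                          client_api_version: Optional[int],
                          active_workspace_explicitly_set: bool) -> bool:
    """Hypothetical combined gate: True only if no skip condition holds."""
    # Gate (a): daemon requests keep legacy behavior.
    if is_daemon:
        return False
    # Gate (b): old clients (missing or too-low version) cannot handle
    # the new resolver errors, so they also keep legacy behavior.
    if (client_api_version is None or
            client_api_version < MIN_PREFERRED_WORKSPACE_API_VERSION):
        return False
    # Gate (c): an explicit active_workspace is honored as-is.
    if active_workspace_explicitly_set:
        return False
    return True
```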

CLI

sky workspace use <name> — set preferred (RBAC-validated, errors with 403/404 surfacing)
sky workspace use --clear — unset
--workspace/-w on sky launch and sky jobs launch — sugar over --config active_workspace=X via a thin click callback
sky workspace info — Shows the workspace your next request lands in by default, plus your saved preferred and the workspaces you can access.

Endpoint

POST /users/me/workspace { preferred: str|null } — echoes the persisted value back
GET — Shows the workspace your next request lands in by default, plus your saved preferred and the workspaces you can access.

Tested

Unit tests (55 pass, ~9s):

test_resolve_workspace_for_user.py — every precedence branch + RBAC drift + admin case + exception class signatures
test_resolver_compat_matrix.py — 41-case parametric matrix asserting the back-compat invariant: every case where the legacy server succeeded still succeeds on the new resolver with the same workspace pick when the user has not opted in
test_preferred_workspace_endpoints.py — POST happy path, 401 / 403 / 404 surfaces, version-gate constant, daemon / old-client / explicit skip gates on the executor
test_cli_workspace.py — sky workspace use, --workspace flag callback, AMBIGUOUS message round-trip via CLI
test_viewer_route_coverage.py — POST registered as viewer-denied

Local manual verification:

Set/clear preferred workspace

$ sky workspace use team-b
Set preferred workspace to 'team-b'.
$ sky workspace info | grep Preferred
   ├── Preferred workspace: 'team-b'
$ sky workspace use --clear
Cleared preferred workspace.
$ sky workspace info | grep Preferred
   ├── Preferred workspace: (not set)

No active_workspace set in Server
- No active_workspace set in Client
  - Users with User role
    - 0 accessible ws
    - 1 accessible ws - default
    - 1 accessible ws - non-default
    - N accessible ws - default included
    - N accessible ws - default not included
  - Users with admin role
    - 1 ws
    - N ws
active_workspace is set in client, use this one
- Preferred workspace is set or not set
- active_workspace is set in server
active_workspace is set in the server (and not in the client), use this one
- Preferred workspace is set or not set

bash format.sh clean on changed files (pylint 9.94/10, no new diagnostics).

🤖 Generated with Claude Code

DanielZhangQD · 2026-06-02T09:10:00Z

/smoke-test --kubernetes -k test_workspace_switching
/smoke-test --kubernetes -k test_workspace_e2e_via_cli

DanielZhangQD · 2026-06-02T09:10:35Z

/quicktest-core --kubernetes --base-branch master

Add per-user `preferred_workspace` so users with access to multiple workspaces don't get cryptic "permission denied to default" errors on launch, and can persist a default once via the CLI instead of editing `~/.sky/config.yaml` on every machine.

What it does
------------

* New column `users.preferred_workspace` (migration 019, additive, default null).
* New resolver `sky/workspaces/core.py:resolve_workspace_for_user()` with documented precedence:
  1. explicit (--workspace flag / active_workspace in merged config)
  2. preferred_workspace (if still accessible; re-validated at use)
  3. 'default' if accessible (back-compat safety net for legacy multi-workspace users and admins; without this, every existing multi-ws user would AMBIGUOUS on upgrade)
  4. exactly one accessible workspace: auto-select
  5. zero accessible: NoWorkspaceAccessError
  6. multiple, no preference, no 'default' access: WorkspaceAmbiguousError (returns guidance, never silently picks)
* Resolver wired into `override_request_env_and_config` with three skip gates: daemon requests, old clients (MIN_PREFERRED_WORKSPACE_API_VERSION), and explicitly-set active_workspace. Reads preferred from the User dataclass already loaded by `add_or_update_user(return_user=True)`, so there is no extra DB read per request.
* Endpoint `POST /users/me/workspace { preferred: str|null }` echoes the persisted value back. No GET; preferred ships inline on the User row returned by `/api/health`.
* CLI:
  - `sky workspace use <name>` / `sky workspace use --clear`
  - `--workspace/-w` on `sky launch` and `sky jobs launch` (sugar over `--config active_workspace=X` via a thin click callback)
  - `sky api info` shows `Preferred workspace: 'X'` (or `(not set)`), read from the existing `/api/health` payload
* D8 relaxation: cluster-op workspace check now uses RBAC access, not active-workspace equality. Without this, switching preferred would lose ability to manage previously-launched clusters.
* `WorkspaceAmbiguousError` carries an optional `note` for RBAC drift ("preferred 'team-x' not accessible") so users understand why their preference wasn't honored.

What it does NOT include
------------------------

* No dashboard UI for setting preferred. First iteration shipped a user-menu picker; dropped because (a) the OSS sidebar is replaced wholesale by feature-plugin's Sidebar.tsx in enterprise installs, so the picker code is dead there, and (b) a flat list of N workspaces in the user menu is the wrong density for a setting users touch once. CLI-only entry point is sufficient. A future `/settings`-page surface can read the same `/api/health` field and POST to the same endpoint with no server change.
* Sky Serve workspace-awareness is not touched (separate scope).
* Admin assignment (SKY-4210) deferred; additive follow-up on the same column.

Test plan
---------

Unit tests (55 pass):

* `test_resolve_workspace_for_user.py`: every precedence branch + RBAC drift + admin case + exception class signatures
* `test_resolver_compat_matrix.py`: 41-case parametric matrix asserting the back-compat invariant: every case where the legacy server succeeded still succeeds on the new resolver with the same workspace pick, when the user has not opted in
* `test_preferred_workspace_endpoints.py`: POST happy path, 401/403/404 surfaces, version-gate constant, daemon/old-client/explicit skip gates on the executor
* `test_cli_workspace.py`: `sky workspace use`, `--workspace` flag callback, AMBIGUOUS message round-trip via CLI
* `test_viewer_route_coverage.py`: POST registered as viewer-denied

Smoke (`test_workspace_e2e_via_cli`): end-to-end CLI → `/users/me/workspace` → resolver → cluster.workspace recording, with three resolver branches exercised (preferred, explicit, default-fallback) and `sky api info` round-tripping the persisted value.

The smoke test `test_workspace_switching` is updated for D8 (no mismatch error on cross-workspace cluster ops; access check only).

Local: `bash format.sh` clean on changed files (pylint 9.94/10).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

`sky api info` now prints a `Preferred workspace: (not set)` line between the `User:` and `Endpoint` lines (reads `preferred_workspace` from the /api/health user row). The existing test asserted indexes against the pre-change layout and broke in CI:

    Python Tests - CLI Tests / Limited Deps - Jobs, Serve & CLI (3.9)
    tests/test_cli.py::TestWithNoCloudEnabled::test_cli_api_info
    AssertionError: assert '├── Preferre...ce: (not set)' == '└── Endpoint...'

Shift indexes: Endpoint line moves from output[4] to output[5], total length 6 → 7. Add an explicit assertion on the new line so future formatting changes catch their own regression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Five fixes prompted by Gemini and Devin review comments on the PR, plus their regression tests:

1) ContextVar does not propagate to worker processes, so the resolver gate never fires (Devin)

   `_should_apply_workspace_resolver` read the client API version via `versions.get_remote_api_version()`, a `contextvars.ContextVar` set by `APIVersionMiddleware` in the FastAPI dispatch process. Request execution actually runs in a worker spawned by `BurstableExecutor` (a `ProcessPoolExecutor`); ContextVar values do not propagate across process boundaries, so the worker always saw `None` and the gate treated every request as "old client", silently disabling the new resolver.

   Fix: add `client_api_version: Optional[int]` to `RequestBody` defaulting to `server_constants.API_VERSION` at client construction time. The field is serialized to the request DB and read back in the worker, so the gate has the real value without depending on ContextVar propagation. The gate signature becomes `(is_daemon, client_api_version)`. Old clients (pre-this-PR) do not set the field; Pydantic defaults to None and the gate treats them as "old client", the same legacy semantics as before.

   Drops the now unused `versions` import from `executor.py`.

2) `set_preferred_workspace` SDK method missing `@versions.minimal_api_version(...)` decorator (Devin)

   New client talking to an older server hit a confusing HTTP 404 on the unknown `/users/me/workspace` route instead of a clean `APINotSupportedError`. Added the decorator pointing at `MIN_PREFERRED_WORKSPACE_API_VERSION` so the version-skew error surfaces with the SDK method name.

3) `WorkspaceAmbiguousError` corrupted by pickle round-trip (Devin)

   The exception's executor-side pickle path (`sky/server/requests/serializers/encoders.py`) reconstructs it on the client. Python's default exception pickle calls `cls(*self.args)` where `self.args` is the formatted message string; `sorted(message_string)` then sorted individual characters into `accessible`, and `super().__init__` rebuilt a garbled guidance message. Added `__reduce__` returning `(cls, (self.accessible, self.note))` so the original constructor arguments survive the round-trip. (Matches the pattern used by `ExecutionRetryableError`.)

4) Drop dead `get_user_preferred_workspace` from `global_user_state.py` (Gemini)

   Gemini suggested `.scalar()` over `filter_by(...).first().attr`, but the function had no remaining callers: the resolver now reads `user.preferred_workspace` directly from the `User` dataclass already loaded by `add_or_update_user(return_user=True)`. Deleted the function entirely (also moots the `.scalar()` suggestion).

5) `tests/unit_tests/test_lifecycle_hooks.py` was brittle (incidental; surfaced when this PR bumped API_VERSION)

   `test_tail_hook_logs_minimal_api_version_required` mocked the remote version to `server_constants.API_VERSION - 1`. That formula only fires the `tail_hook_logs` decorator (`@minimal_api_version(52)`) when API_VERSION is exactly 52; this PR bumped it to 53, so `53 - 1 = 52` no longer trips the gate and the test fell through to a real cluster lookup. Pinned the mock to `51` (one less than the function's own minimum) so the test is not coupled to global API_VERSION bumps.

Regression tests added:

* `TestSetPreferredWorkspaceSdkVersionGate`: exercises the decorator chain on the SDK function with a pinned old remote version.
* `TestRequestBodyCarriesClientApiVersion`: three tests covering default population, explicit override, and Pydantic JSON round-trip of the new field.
* `TestWorkspaceAmbiguousErrorSignature.test_pickle_roundtrip...`: pickles the exception with accessible + note, asserts both attributes and `str(e)` survive intact.

Validation:

* 60+ UTs pass locally; `bash format.sh` clean (pylint 9.94/10).
* Pickle round-trip verified empirically with `python -c` repro before/after the `__reduce__` fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
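The pickle problem in fix 3 can be reproduced with a small stand-in exception. The classes below are simplified assumptions for illustration, not the real `WorkspaceAmbiguousError`; they only mimic the shape described in the commit message (constructor takes `accessible` and `note`, while `args` holds the formatted message).

```python
import pickle
from typing import List, Optional


class AmbiguousWithoutReduce(Exception):
    """Simplified stand-in: args holds the message, not the ctor inputs."""

    def __init__(self, accessible: List[str], note: Optional[str] = None):
        self.accessible = sorted(accessible)
        self.note = note
        super().__init__(
            f'Accessible workspaces: {", ".join(self.accessible)}')


class AmbiguousWithReduce(AmbiguousWithoutReduce):
    """Same exception, but pickled via its original constructor inputs."""

    def __reduce__(self):
        return (self.__class__, (self.accessible, self.note))


def roundtrip(exc: Exception) -> Exception:
    """Pickle and unpickle, as a cross-process transfer would."""
    return pickle.loads(pickle.dumps(exc))


original_bad = AmbiguousWithoutReduce(['team-b', 'team-a'], note='drift')
original_good = AmbiguousWithReduce(['team-b', 'team-a'], note='drift')

# Default exception pickling re-runs __init__ with the message string as
# `accessible`, so the rebuilt message lists individual characters.
garbled = roundtrip(original_bad)
intact = roundtrip(original_good)
```

With the default reduction, `str(garbled)` no longer matches `str(original_bad)`, while the `__reduce__` variant reproduces both the message and the attributes.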

`reviewable` CI job runs `git diff --exit-code agent/skills/skypilot/references/` after `agent/scripts/generate_references.py` and was failing on this PR because the new surface (--workspace flag on sky launch / sky jobs launch, sky.set_preferred_workspace SDK) was not reflected in the auto-generated reference docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Devin review caught that `local_active_workspace_ctx` in sky/skypilot_config.py does not wrap its `yield` in try/finally, so a caller whose `with` block raises leaks the new thread-local workspace value into subsequent code paths in the same worker process (ProcessPoolExecutor reuses workers).

This is a pre-existing latent bug, not introduced by this PR. However, this PR's new executor caller in `override_request_env_and_config` deliberately raises PermissionDeniedError from `reject_request_for_unauthorized_workspace` for workspace-rejection, making the leak path part of normal flow rather than an edge case.

The practical impact on the resolver gate is masked because `override_skypilot_config` already rebinds the global `_active_workspace_context` to a fresh `threading.local()` at the start of every request, so the leak does not actually corrupt the next request's gate. But the function itself should be correct contextmanager shape, and the other ~10 callers (sky check, core launch, jobs server, serve server, backend_utils) also benefit.

Fix: wrap the yield in try/finally so the original workspace is restored even when an exception escapes.

Regression test added: `TestIsActiveWorkspaceSet.test_local_active_workspace_ctx_cleanup_on_exception` uses a synthetic exception inside `with local_active_workspace_ctx(...)` and asserts the post-condition equals the pre-condition value. Verified the test fails when the try/finally is removed (revert check via standalone repro).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
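The leak described here is generic to `@contextlib.contextmanager` generators, and a toy version makes it easy to see. The names below (`_ctx`, `get_ws`, `leaky_ctx`, `safe_ctx`) are hypothetical stand-ins, not the code in `sky/skypilot_config.py`.

```python
import contextlib
import threading

# Toy thread-local store standing in for the active-workspace context.
_ctx = threading.local()


def get_ws() -> str:
    return getattr(_ctx, 'workspace', 'default')


@contextlib.contextmanager
def leaky_ctx(workspace: str):
    original = get_ws()
    _ctx.workspace = workspace
    yield
    # Skipped when the with-body raises: the exception is re-raised at
    # the yield and the generator never reaches this line.
    _ctx.workspace = original


@contextlib.contextmanager
def safe_ctx(workspace: str):
    original = get_ws()
    _ctx.workspace = workspace
    try:
        yield
    finally:
        # Runs on both normal exit and exception.
        _ctx.workspace = original


def workspace_after_failure(ctx_factory) -> str:
    """Enter the ctx from 'outer', raise inside it, report what is left."""
    _ctx.workspace = 'outer'
    try:
        with ctx_factory('team-x'):
            raise PermissionError('rejected')
    except PermissionError:
        pass
    return get_ws()
```

Calling `workspace_after_failure(leaky_ctx)` leaves `'team-x'` behind, while `workspace_after_failure(safe_ctx)` restores `'outer'`.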

Two related bugs caught by Devin's review on the previous push, both guarded by new regression tests.

1) WorkspaceAmbiguousError fails JSON deserialization (multiple values for 'accessible')

   The request executor's exception transit is NOT direct pickle. It is `serialize_exception` -> dict {args, attributes, ...} -> `pickle_and_encode(dict)` -> request DB -> `decode_and_unpickle` -> `deserialize_exception`, which reconstructs via `cls(*args, **attributes)`. For this exception:

       args       = (formatted_message,)                # from super().__init__(msg)
       attributes = {'accessible': [...], 'note': ...}  # from __dict__
       ctor       = (accessible, note=None)

   `cls(formatted_message, accessible=[...], note=...)` -> TypeError: __init__() got multiple values for argument 'accessible'.

   Result: instead of the helpful "You belong to multiple workspaces… run `sky workspace use`" guidance, the affected client sees a TypeError, so the feature is unusable for the exact users it targets.

   The previous `__reduce__` override I added only fixed the direct pickle path; the JSON serialize path runs through `serialize_exception`, which ignores `__reduce__` entirely.

   Fix: extend SkyPilotExcludeArgsBaseException (existing marker in sky/exceptions.py) so `serialize_exception` zeros out `args`. Then deserialize calls `cls(**attributes)` cleanly. The `__reduce__` override is kept for the direct-pickle path (defense in depth).

   Regression test `test_serialize_deserialize_roundtrip_via_request_executor_path` exercises the actual serialize_exception → deserialize_exception path and asserts args==() + full str(e) round-trips.

2) `client_api_version` server-side default-fill bypasses the gate for old clients

   `RequestBody.__init__` unconditionally filled `client_api_version` with `server_constants.API_VERSION` if absent. Correct on the CLIENT side (the client stamps its own version). WRONG on the SERVER side: when an OLD client (pre-this-PR) sends a body omitting the field, Pydantic deserialization calls `__init__` again on the server, and the server's API_VERSION (53) gets stamped, making the executor gate think it's a new client. The resolver runs and may raise `WorkspaceAmbiguousError`, which the old client cannot deserialize.

   Fix: only apply the default-fill when `annotations.is_on_api_server` is False (i.e., inside a `@client_api` SDK wrapper). On the server side the field stays at its Pydantic-declared `None` default for missing input, which the executor gate correctly interprets as "old client → skip resolver".

   Regression tests `test_server_side_parse_of_old_client_body_leaves_field_none` and `test_server_side_parse_of_new_client_body_preserves_field` cover both legs of the wire format (missing field stays None; present field is preserved). `test_default_construction_populates_local_api_version` was renamed to `test_client_construction_stamps_local_api_version` and now explicitly flips `is_on_api_server` to False to mimic the `@client_api` runtime context.

Validation:

* 62 UTs pass locally (60 + 2 new server-side parse tests).
* Empirical repro before/after both fixes verified with `python -c` scripts driving `serialize_exception/deserialize_exception` and `RequestBody.model_validate_json` with the old-client JSON shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
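The `cls(*args, **attributes)` collision from item 1 is easy to reproduce in isolation. The `serialize` and `deserialize` helpers below are simplified, hypothetical stand-ins for SkyPilot's `serialize_exception` / `deserialize_exception`; the `exclude_args` flag models the effect of the exclude-args marker class rather than its real mechanism.

```python
from typing import Any, Dict, List, Optional, Type


class Ambiguous(Exception):
    """Simplified stand-in for an exception with a custom constructor."""

    def __init__(self, accessible: List[str], note: Optional[str] = None):
        self.accessible = sorted(accessible)
        self.note = note
        super().__init__(
            f'Accessible workspaces: {", ".join(self.accessible)}')


def serialize(exc: Exception, exclude_args: bool = False) -> Dict[str, Any]:
    """Capture positional args and instance attributes separately."""
    return {
        'args': () if exclude_args else exc.args,
        'attributes': dict(vars(exc)),
    }


def deserialize(cls: Type[Exception], payload: Dict[str, Any]) -> Exception:
    """Rebuild by passing args positionally and attributes by keyword."""
    return cls(*payload['args'], **payload['attributes'])


exc = Ambiguous(['team-b', 'team-a'], note='drift')

try:
    # The message lands in the first positional slot (`accessible`),
    # which is then also passed by keyword.
    deserialize(Ambiguous, serialize(exc))
    collision = None
except TypeError as e:
    collision = e

# With args zeroed out, only the keyword attributes are passed.
rebuilt = deserialize(Ambiguous, serialize(exc, exclude_args=True))
```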

Three related fixes uncovered while taking the dashboard end-to-end through `test00` (a user with no 'default' workspace access).

1) Header-driven client version (replaces in-body client-side stamping)

   Previously `RequestBody.__init__` stamped `client_api_version` on the client side (gated by `annotations.is_on_api_server`). That worked for the Python SDK but left the dashboard out, because dashboard JS does not construct `RequestBody`. Dropping the guard would not help either, because server-side Pydantic deserialization of an old client's body would then silently populate the field with the server's own API_VERSION.

   Single-source-of-truth design: the SDK already sends `X-SkyPilot-API-Version` / `X-SkyPilot-Version` headers on every request; `APIVersionMiddleware` parses them and sets the `_remote_api_version` ContextVar in the FastAPI dispatch context. `prepare_request_async` now reads that ContextVar and stamps it onto the request body before persisting: one place, every transport. Client-side `__init__` stamping is removed.

   For the dashboard, the same two version headers are added to all outgoing requests through `apiClient` plus three direct-fetch sites that bypass apiClient and POST to queued routes (`/jobs/download_logs`, `ssh_node_pools/{deploy,down}`). `_check_version_compatibility` requires BOTH headers, so the dashboard sends a placeholder readable version (`'dashboard;'`) alongside `CLIENT_API_VERSION`.

2) Thread-local restoration in `local_active_workspace_ctx`

   The previous try/finally fix still restored the wrong value on exit: `get_active_workspace()` falls back to the literal `SKYPILOT_DEFAULT_WORKSPACE` ('default') when nothing is set, so a worker that handled one request entered the ctx with no attribute and exited with the attribute SET to 'default'. The next request from the same worker (especially dashboard requests with empty `override_skypilot_config`, which short-circuits the global rebind in `override_skypilot_config`) saw `is_active_workspace_set()` == True, skipped the resolver, and rejected users without 'default' access.

   Now restores the thread-local attribute's STATE (had / didn't have), not the resolved string. Downstream `get_active_workspace` still chains to layer-2 / layer-3 fallbacks correctly.

3) CLI / SDK no longer hard-code 'default' into outgoing requests

   `sky status` (cli/command.py) and `sdk.enabled_clouds` read `get_active_workspace()` client-side and passed the result as an explicit `?workspace=...`. With nothing set on the client this resolved to the literal 'default' and was stamped on the wire as explicit intent, which the server's resolver gate (c) honors, skipping the resolver and rejecting users without 'default' access.

   Both sites now send the workspace only when `is_active_workspace_set()` is True; otherwise None and the server-side resolver picks. `_show_enabled_infra` accepts `Optional[str]` and skips the `(workspace: 'X')` annotation when None, since the resolved workspace happens server-side.

Regression tests

* `test_local_active_workspace_ctx_removes_attr_when_originally_unset`: enter with no prior attribute, exit, assert the attribute is GONE and `is_active_workspace_set` reports False.
* `TestPrepareRequestAsyncStampsClientApiVersion`: exercises the ContextVar -> body stamping with header-present and -absent.
* Rewrote `TestRequestBodyCarriesClientApiVersion` to assert no client-side stamping happens in `__init__`.

Validation

* 65 UT pass locally.
* End-to-end: `sky status` and the dashboard at `test00:test00@127.0.0.1:46580` (user with access only to `pub`) now succeed where they previously returned `PermissionDeniedError: ... 'default'`. Server log confirms `resolved workspace 'pub' from single-membership for user test00` on every queued request.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…fault' access

When a managed job is submitted, the recorded `workspace` was read via `get_active_workspace(force_user_workspace=True)`, which deliberately bypasses the per-thread workspace context and falls back to the config-side value (or the literal `'default'` when nothing is set). That hack predates the per-user workspace resolver introduced by this PR: the resolver now sets the thread-local at the request boundary to whichever workspace the user actually intends for this request, so force-bypassing it instead records the legacy `'default'` literal, silently breaking users who have no access to the `'default'` workspace (the exact scenario SKY-5468 targets).

Concretely, a user with access only to `'pub'` would submit a managed job and see:

    cluster_table.workspace  = 'pub'      (recorded via the resolver-set thread-local)
    job_info_table.workspace = 'default'  (recorded via force_user_workspace=True)

and the controller would then reject the job at the `reject_request_for_unauthorized_workspace` check.

Fix: drop `force_user_workspace=True` at the two job-submission sites that write `job_info.workspace`:

* `_maybe_submit_job_locally` (consolidation mode, jobs/server/core.py). The call point is before any nested `local_active_workspace_ctx(...)` is entered, so the thread-local here is the resolver's pick.
* `_submit_remotely` (non-consolidation mode, jobs/server/core.py). `_ensure_controller_up` has already entered AND exited its own controller-provisioning workspace ctx by this line, so the thread-local is restored to the outer (resolver-set) workspace.

`_exec_code_on_head` also reads `force_user_workspace=True` to populate `ManagedJobInfo.workspace` on the gRPC `QueueJobRequest`, but the skylet-side handler in `services.py::QueueJob` does not currently read `request.managed_job` at all: the protobuf field is defined for future use but is dead on the wire today. Left untouched aside from a short comment so a future skylet that wires it up can find the context.

Validation

* Existing UTs pass (65).
* `sky status` for a `test00` user with no `default` workspace access succeeds end-to-end against the local API server, where it previously hit `PermissionDeniedError: User test00 does not have permission to access workspace 'default'` on the managed-jobs view path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…surfaces

Auth middlewares (BasicAuth, AuthProxy, BearerToken, plugin-supplied) populate request.state.auth_user from headers/JWT/credentials. The header-only paths (AuthProxy and BearerToken) build the User from id + name + type and never load the persisted preferred_workspace column.

Worker-side reads were already fine (executor.prepare_request_async runs add_or_update_user(return_user=True) before the resolver sees the user), but /api/health is synchronous and returned the middleware-built User verbatim, which made `sky api info` print "(not set)" even right after `sky workspace use <name>` succeeded.

Refresh the User from the DB inside health() so the response reflects the persisted preference regardless of which middleware authenticated the request.

Tests: 3 new tests in test_server.py covering the refresh, the DB-miss fallback, and the unauthenticated path's skip of the lookup.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The executor's reload_for_new_request pipeline refreshes config and the request-scoped lru cache on every queued worker request, but plugin endpoints that respond as sync FastAPI handlers (i.e. not queued via executor.schedule_request_async / prepare_request_async) bypass it. Their process-cached workspace config snapshot can therefore lag behind a worker process that just ran sky workspace create/update: a sync handler filtering by accessible_workspaces will exclude newly-created workspaces and include workspaces that have since been removed.

Add a refresh_workspace_state_for_sync_handler() helper next to reload_for_new_request that plugin sync handlers call before reading workspace/RBAC state. It does the minimum subset of the executor's reload: safe_reload_config() to pick up new YAML/DB state, plus clear_request_level_cache() to flush the per-request _load_workspaces lru cache so the fresh config is observed. Casbin's in-memory enforcer is refreshed lazily by permission_service.get_user_roles (called inside get_accessible_workspace_names), so it doesn't need a separate refresh here.

Companion fix in feature-plugin wires this into the pagination and secrets-manager plugins; both have sync handlers that read workspaces_core.get_workspaces() / check_workspace_permission and were returning stale lists post-workspace-update.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

DanielZhangQD · 2026-06-02T13:01:30Z

/quicktest-core --kubernetes --base-branch master
https://buildkite.com/skypilot-1/quicktest-core/builds/3329#_
/smoke-test --kubernetes -k test_workspace_switching
https://buildkite.com/skypilot-1/smoke-tests/builds/11078#_

aylei

The rest LGTM!

DanielZhangQD · 2026-06-02T13:47:42Z

/smoke-test --kubernetes
https://buildkite.com/skypilot-1/smoke-tests/builds/11079#_

DanielZhangQD · 2026-06-03T04:34:32Z

/quicktest-core --kubernetes --base-branch master

… shared constants (#9797)

* [Workspaces] Polish per-user resolver UX: scoped error, surfaced log, shared constants

Follow-ups after the per-user workspace resolver merge (#9755):

* AMBIGUOUS recovery message no longer advertises `--workspace` as a universal fix. It only exists on `sky launch` / `sky jobs launch`, so callers running `sky status` / `sky queue` etc. were being pointed at a flag that doesn't apply. The three recovery options now read as parallel choices instead of replacements ("Or, for a one-shot override...").
* On `sky launch` / `sky jobs launch`, when the resolver implicitly picks PREFERRED or SINGLE_MEMBERSHIP, log "Using workspace 'X' (source: ...)" at INFO so users see where their cluster / job actually landed. EXPLICIT (user-named) and DEFAULT_FALLBACK (pre-existing implicit behavior) stay silent to avoid noise on the common cases. The whitelist correctly uses the prefixed task name (`sky.launch`); the previous draft compared against the bare enum value and never matched at runtime.
* `sky workspace info` payload contract: when the resolver raises WorkspaceAmbiguousError / NoWorkspaceAccessError / PermissionDenied, the handler now returns a state-coded `source` (ambiguous / no-access / permission-denied) plus a short single-line `note`, instead of dumping the multi-line exception message into `note` (which broke the CLI tree renderer). The CLI appends the AMBIGUOUS recovery hint as a separate paragraph below the tree, via a `WorkspaceAmbiguousError.recovery_hint()` classmethod shared between the exception and the CLI.
* Move the `WORKSPACE_SOURCE_*` constants to a new leaf module `sky/workspaces/constants.py` so client-side callers (CLI) no longer drag the resolver's server-only deps (check, global_user_state, backend_utils, permission, resource_checker) into a read path.

Tested:

* `pytest tests/unit_tests/test_sky/test_cli_workspace.py tests/unit_tests/test_sky/users/test_preferred_workspace_endpoints.py tests/unit_tests/test_sky/workspaces/test_resolve_workspace_for_user.py tests/unit_tests/test_sky/server/requests/test_executor.py tests/unit_tests/test_sky/workspaces/test_resolver_compat_matrix.py`: 103 passed, including new regressions:
  - bare `'launch'` (no `sky.` prefix) MUST NOT match the whitelist (pins the prefix protocol from both sides)
  - DEFAULT_FALLBACK must stay silent (locks the new silent set)
  - AMBIGUOUS message keeps `--workspace` only in the launch footnote
  - ambiguous / no-access / permission-denied payloads carry their state code + short note

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* address comment

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* [Jobs] Pin active_workspace to cluster row in terminate_cluster

  The jobs controller's `terminate_cluster` (used by both cancel and recovery-teardown paths) calls `core.down` directly. `core.down` → `backend.teardown` → `_teardown` does a RAW `backend_utils.check_owner_identity` (cloud_vm_ray_backend.py:4519, not auto-wrapped). The controller runs in the system/daemon process, so the executor's per-request resolver doesn't fire here (it is skipped by the daemon gate in `_should_apply_workspace_resolver`). As a result, `skypilot_config.get_active_workspace()` falls back to 'default' and the owner-identity check raises for any cluster whose recorded workspace is not 'default':

      ClusterOwnerIdentityMismatchError: The cluster 'e82-X' is in workspace 'pub', but the active workspace is 'default'.

  Pre-existing latent bug, surfaced more often after #9755 (per-user preferred workspace) because users no longer need an explicit `active_workspace:` in `~/.sky/config.yaml` to land on a non-default workspace.

  Fix: in `terminate_cluster`, look up the cluster row once before the retry loop and wrap `core.down` in `local_active_workspace_ctx(cluster.workspace)`, the same pattern `_update_cluster_status_no_lock` already uses for its owner check. Falls back to `nullcontext()` when the cluster row is gone (`core.down` will raise `ClusterDoesNotExist` immediately and we return, so there is no workspace to pin).

  Scope:
  * Only `terminate_cluster` is fixed because it's the only controller-side caller that goes through a RAW (non-auto-wrapped) owner check.
  * `controller.py:1935 core.status` and `:1947 core.cancel` route their owner check through `_update_cluster_status_no_lock`, which already auto-wraps in `local_active_workspace_ctx(record['workspace'])`.
  * `recovery_strategy.py`'s `sdk.launch` / `sdk.exec` / `sdk.cancel` go HTTP → executor → resolver. The controller's process env carries the original submitter's `SKYPILOT_USER_ID` / `SKYPILOT_USER` (injected by `controller_utils.controller_only_vars_to_fill`), so the executor builds the original `User` row including its `preferred_workspace` and the resolver picks the right workspace.
  * SkyServe replicas + jobs pool reuse the same controller-env injection and use `sdk.down` (HTTP) for teardown, so no equivalent bug there.

  Tested:
  * `pytest tests/unit_tests/test_jobs_utils.py -k terminate_cluster`: 6 passed (4 existing + 2 new):
    - `test_terminate_cluster_pins_active_workspace_from_cluster_record` mocks `get_cluster_from_name` to return `workspace='team-a'`, captures `get_active_workspace()` at the `core.down` call time and asserts 'team-a'. Revert-check: drop the `with workspace_ctx:` wrap → assertion flips.
    - `test_terminate_cluster_no_record_skips_workspace_pin` confirms the function doesn't crash when the cluster row is already gone.
  * Manual: launched a managed job in workspace 'pub' with `preferred_workspace='pub'`. Both `sky jobs cancel` and triggered recovery (force-deleted underlying VM) now tear down + re-provision cleanly. Without the fix both surfaced ClusterOwnerIdentityMismatchError.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [Jobs] Construct workspace_ctx inside retry loop

  `skypilot_config.local_active_workspace_ctx` is a `@contextlib.contextmanager` generator and cannot be re-entered. The previous version built the ctx once outside the `while True:` retry loop. If `core.down` fails on the first try and the loop retries, the second `with workspace_ctx:` raises `RuntimeError` (from the spent generator) and masks the real underlying failure.

  Move the ctx construction inside the loop so each retry attempt gets a fresh generator. The DB lookup for `cluster_workspace` stays outside the loop (cluster workspace is immutable for the life of the cluster, no point re-querying per retry).

  Caught by gemini-code-assist review on PR #9808.

  Tested:
  * New `test_terminate_cluster_retry_reenters_workspace_ctx`: mocks `get_cluster_from_name` to return `workspace='team-a'` (so the live ctx path is exercised, not `nullcontext()`), fails `core.down` twice then succeeds. Records `get_active_workspace()` inside each attempt and asserts `['team-a', 'team-a', 'team-a']`. Confirmed FAILS on the prior code with `AttributeError: '_GeneratorContextManager' object has no attribute 'args'`, exactly the symptom Gemini predicted. PASSES after this fix.
  * `pytest tests/unit_tests/test_jobs_utils.py -k terminate_cluster`: 7 passed (6 prior + the new retry-reentry test).

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
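
The two commits above boil down to one pattern: pin the cluster's workspace around the teardown call, and build the context manager fresh on every retry. The following is a minimal, self-contained sketch of that pattern, not the actual SkyPilot code. `local_active_workspace_ctx` and `get_active_workspace` here are simplified stand-ins with the same names, while `TeardownError`, `terminate_with_retries`, and `make_flaky_down` are hypothetical helpers introduced only for illustration.

```python
import contextlib
from typing import Callable, Iterator, List, Optional

# Simplified stand-in for the active-workspace state (assumption, not
# SkyPilot's real implementation).
_active_workspace = 'default'


class TeardownError(Exception):
    """Hypothetical transient failure raised by the teardown call."""


def get_active_workspace() -> str:
    return _active_workspace


@contextlib.contextmanager
def local_active_workspace_ctx(workspace: str) -> Iterator[None]:
    """Temporarily override the active workspace, restoring it on exit."""
    global _active_workspace
    previous = _active_workspace
    _active_workspace = workspace
    try:
        yield
    finally:
        _active_workspace = previous


def terminate_with_retries(cluster_workspace: Optional[str],
                           down: Callable[[], None],
                           max_attempts: int = 3) -> List[str]:
    """Run `down()` with the workspace pinned; return the workspace seen per attempt."""
    seen: List[str] = []
    for attempt in range(max_attempts):
        # A generator-based context manager is single-use, so build a new
        # one for each attempt. No recorded workspace means nothing to pin.
        ctx = (local_active_workspace_ctx(cluster_workspace)
               if cluster_workspace is not None else contextlib.nullcontext())
        try:
            with ctx:
                seen.append(get_active_workspace())
                down()
            return seen
        except TeardownError:
            if attempt == max_attempts - 1:
                raise
    return seen


def make_flaky_down(failures: int) -> Callable[[], None]:
    """Build a teardown stub that fails `failures` times, then succeeds."""
    remaining = {'count': failures}

    def down() -> None:
        if remaining['count'] > 0:
            remaining['count'] -= 1
            raise TeardownError('transient teardown failure')

    return down


# Two failures, then success: every attempt runs under the pinned workspace.
observed = terminate_with_retries('team-a', make_flaky_down(2))

# Reusing one generator-based context manager fails on the second entry.
# The exact exception type depends on the Python version.
spent = local_active_workspace_ctx('team-a')
with spent:
    pass
try:
    with spent:
        pass
    reused_ok = True
except (RuntimeError, AttributeError):
    reused_ok = False
```

In this sketch `observed` holds one entry per attempt, all under the pinned workspace, and the active workspace is back to 'default' afterwards. The `reused_ok` check shows why the context manager has to be constructed inside the loop: a second `with` on the same object raises before the body runs, which would hide the original teardown failure.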

DanielZhangQD marked this pull request as ready for review May 29, 2026 10:34

DanielZhangQD force-pushed the sky-5468-workspace-default branch from cd95497 to 418ba4d Compare June 2, 2026 06:03

DanielZhangQD and others added 12 commits June 2, 2026 20:53

update comments

338c476

update code

bcadb83

DanielZhangQD force-pushed the sky-5468-workspace-default branch from 9f3d5a2 to bcadb83 Compare June 2, 2026 12:53

DanielZhangQD requested a review from aylei June 2, 2026 13:09

aylei approved these changes Jun 2, 2026

Comment thread sky/client/cli/command.py Outdated

DanielZhangQD added 3 commits June 3, 2026 10:34

add getting workspace info command

8b80cf8

update docstr

ca43bb4

update docs

8a2fa5a

DanielZhangQD added 2 commits June 3, 2026 11:43

update rbac

ba77ce3

update docstr

3eecdd7

DanielZhangQD merged commit f67ca60 into master Jun 3, 2026
22 of 23 checks passed

DanielZhangQD deleted the sky-5468-workspace-default branch June 3, 2026 05:35

DanielZhangQD mentioned this pull request Jun 4, 2026

[Workspaces] Polish per-user resolver UX: scoped error, surfaced log, shared constants #9797

Merged

DanielZhangQD mentioned this pull request Jun 5, 2026

[Jobs] Pin active_workspace to cluster row in terminate_cluster #9808

Merged

2 tasks


