Skip to content

[Workspaces] Auto-select / persist per-user default workspace#9755

Merged
DanielZhangQD merged 17 commits into
masterfrom
sky-5468-workspace-default
Jun 3, 2026
Merged

[Workspaces] Auto-select / persist per-user default workspace#9755
DanielZhangQD merged 17 commits into
masterfrom
sky-5468-workspace-default

Conversation

@DanielZhangQD

@DanielZhangQD DanielZhangQD commented May 29, 2026

Copy link
Copy Markdown
Collaborator

Summary

Add per-user preferred_workspace so users with access to multiple workspaces stop getting cryptic "permission denied to 'default'" on launch, and can persist a default once via the CLI instead of editing ~/.sky/config.yaml on every machine.

The fix targets users for whom the implicit 'default' is not valid (no access). Legacy multi-workspace users and admins for whom 'default' is valid keep today's behavior — no surprise regressions on upgrade.

Resolver

sky/workspaces/core.py:resolve_workspace_for_user() with documented precedence:

  1. explicit--workspace flag / active_workspace in merged client config
  2. preferred_workspace — if still accessible (RBAC re-validated at use; drifted preferences fall through with a note explaining why)
  3. 'default' if accessible — back-compat safety net (without this, every existing multi-ws user and every admin would AMBIGUOUS on upgrade)
  4. exactly one accessible workspace — auto-select
  5. zero accessibleNoWorkspaceAccessError
  6. multiple, no preference, no 'default' accessWorkspaceAmbiguousError (returns guidance, never silently picks)

Resolver runs in override_request_env_and_config behind three skip gates: daemon requests, old clients (MIN_PREFERRED_WORKSPACE_API_VERSION = 53), and explicitly-set active_workspace. Reads preferred from the User dataclass already loaded by add_or_update_user(return_user=True) — no extra DB read per request. Net hot-path increment over master is one batch Casbin call.

CLI

  • sky workspace use <name> — set preferred (RBAC-validated, errors with 403/404 surfacing)
  • sky workspace use --clear — unset
  • --workspace/-w on sky launch and sky jobs launch — sugar over --config active_workspace=X via a thin click callback
  • sky workspace info — Shows the workspace your next request lands in by default, plus your saved preferred and the workspaces you can access.

Endpoint

  • POST /users/me/workspace { preferred: str|null } — echoes the persisted value back
  • GET — Shows the workspace your next request lands in by default, plus your saved preferred and the workspaces you can access.

Tested

Unit tests (55 pass, ~9s):

  • test_resolve_workspace_for_user.py — every precedence branch + RBAC drift + admin case + exception class signatures
  • test_resolver_compat_matrix.py — 41-case parametric matrix asserting the back-compat invariant: every case where the legacy server succeeded still succeeds on the new resolver with the same workspace pick when the user has not opted in
  • test_preferred_workspace_endpoints.pyPOST happy path, 401 / 403 / 404 surfaces, version-gate constant, daemon / old-client / explicit skip gates on the executor
  • test_cli_workspace.pysky workspace use, --workspace flag callback, AMBIGUOUS message round-trip via CLI
  • test_viewer_route_coverage.pyPOST registered as viewer-denied

Local manual verification:

  • Set/clear preferred workspace
$ sky workspace use team-b
Set preferred workspace to 'team-b'.
$ sky workspace info | grep Preferred
   ├── Preferred workspace: 'team-b'
$ sky workspace use --clear
Cleared preferred workspace.
$ sky workspace info | grep Preferred
   ├── Preferred workspace: (not set)
  • No active_workspace set in Server
    • No active_workspace set in Client
      • Users with User role
        • 0 accessible ws
        • 1 accessible ws - default
        • 1 accessible ws - non-default
        • N accessible ws - default included
        • N accessible ws - default not included
      • Users with admin role
        • 1 ws
        • N ws
  • active_workspace is set in client, use this one
    • Preferred workspace is set or not set
    • active_workspace is set in server
  • active_workspace is set in the server (and not in the client), use this one
    • Preferred workspace is set or not set

bash format.sh clean on changed files (pylint 9.94/10, no new diagnostics).

🤖 Generated with Claude Code

@DanielZhangQD DanielZhangQD marked this pull request as ready for review May 29, 2026 10:34
gemini-code-assist[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

@DanielZhangQD DanielZhangQD force-pushed the sky-5468-workspace-default branch from cd95497 to 418ba4d Compare June 2, 2026 06:03
@DanielZhangQD

Copy link
Copy Markdown
Collaborator Author

/smoke-test --kubernetes -k test_workspace_switching
/smoke-test --kubernetes -k test_workspace_e2e_via_cli

@DanielZhangQD

Copy link
Copy Markdown
Collaborator Author

/quicktest-core --kubernetes --base-branch master

DanielZhangQD and others added 12 commits June 2, 2026 20:53
Add per-user `preferred_workspace` so users with access to multiple
workspaces don't get cryptic "permission denied to default" errors on
launch, and can persist a default once via the CLI instead of editing
`~/.sky/config.yaml` on every machine.

What it does
------------
* New column `users.preferred_workspace` (migration 019, additive,
  default null).
* New resolver `sky/workspaces/core.py:resolve_workspace_for_user()` with
  documented precedence:
  1. explicit (--workspace flag / active_workspace in merged config)
  2. preferred_workspace (if still accessible — re-validated at use)
  3. 'default' if accessible (back-compat safety net for legacy
     multi-workspace users and admins; without this, every existing
     multi-ws user would AMBIGUOUS on upgrade)
  4. exactly one accessible workspace — auto-select
  5. zero accessible — NoWorkspaceAccessError
  6. multiple, no preference, no 'default' access —
     WorkspaceAmbiguousError (returns guidance, never silently picks)
* Resolver wired into `override_request_env_and_config` with three skip
  gates: daemon requests, old clients
  (MIN_PREFERRED_WORKSPACE_API_VERSION), and explicitly-set
  active_workspace. Reads preferred from the User dataclass already
  loaded by `add_or_update_user(return_user=True)` — no extra DB read
  per request.
* Endpoint `POST /users/me/workspace { preferred: str|null }` echoes
  the persisted value back. No GET — preferred ships inline on the User
  row returned by `/api/health`.
* CLI:
  - `sky workspace use <name>` / `sky workspace use --clear`
  - `--workspace/-w` on `sky launch` and `sky jobs launch` (sugar over
    `--config active_workspace=X` via a thin click callback)
  - `sky api info` shows `Preferred workspace: 'X'` (or `(not set)`),
    read from the existing `/api/health` payload
* D8 relaxation: cluster-op workspace check now uses RBAC access, not
  active-workspace equality — without this, switching preferred would
  lose ability to manage previously-launched clusters.
* `WorkspaceAmbiguousError` carries an optional `note` for RBAC drift
  ("preferred 'team-x' not accessible") so users understand why their
  preference wasn't honored.

What it does NOT include
------------------------
* No dashboard UI for setting preferred. First iteration shipped a
  user-menu picker; dropped because (a) the OSS sidebar is replaced
  wholesale by feature-plugin's Sidebar.tsx in enterprise installs, so
  the picker code is dead there, and (b) a flat list of N workspaces
  in the user menu is the wrong density for a setting users touch
  once. CLI-only entry point is sufficient. A future `/settings`-page
  surface can read the same `/api/health` field and POST to the same
  endpoint with no server change.
* Sky Serve workspace-awareness is not touched (separate scope).
* Admin assignment (SKY-4210) deferred — additive follow-up on the
  same column.

Test plan
---------
Unit tests (55 pass):
* `test_resolve_workspace_for_user.py` — every precedence branch +
  RBAC drift + admin case + exception class signatures
* `test_resolver_compat_matrix.py` — 41-case parametric matrix
  asserting the back-compat invariant: every case where the legacy
  server succeeded still succeeds on the new resolver with the same
  workspace pick, when the user has not opted in
* `test_preferred_workspace_endpoints.py` — POST happy path, 401/403/
  404 surfaces, version-gate constant, daemon/old-client/explicit
  skip gates on the executor
* `test_cli_workspace.py` — `sky workspace use`, `--workspace` flag
  callback, AMBIGUOUS message round-trip via CLI
* `test_viewer_route_coverage.py` — POST registered as viewer-denied

Smoke (`test_workspace_e2e_via_cli`): end-to-end CLI →
`/users/me/workspace` → resolver → cluster.workspace recording, with
three resolver branches exercised (preferred, explicit, default-
fallback) and `sky api info` round-tripping the persisted value.

The smoke test `test_workspace_switching` is updated for D8 (no
mismatch error on cross-workspace cluster ops; access check only).

Local: `bash format.sh` clean on changed files (pylint 9.94/10).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`sky api info` now prints a `Preferred workspace: (not set)` line between
the `User:` and `Endpoint` lines (reads `preferred_workspace` from the
/api/health user row). The existing test asserted indexes against the
pre-change layout and broke in CI:

  Python Tests - CLI Tests / Limited Deps - Jobs, Serve & CLI (3.9)
  tests/test_cli.py::TestWithNoCloudEnabled::test_cli_api_info
  AssertionError: assert '├── Preferre...ce: (not set)' == '└── Endpoint...'

Shift indexes: Endpoint line moves from output[4] to output[5], total
length 6 → 7. Add an explicit assertion on the new line so future
formatting changes catch their own regression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Five fixes prompted by Gemini and Devin review comments on the PR, plus
their regression tests:

1) ContextVar does not propagate to worker processes — resolver gate
   never fires (Devin)
   `_should_apply_workspace_resolver` read the client API version via
   `versions.get_remote_api_version()`, a `contextvars.ContextVar` set
   by `APIVersionMiddleware` in the FastAPI dispatch process. Request
   execution actually runs in a worker spawned by `BurstableExecutor`
   (a `ProcessPoolExecutor`); ContextVar values do not propagate
   across process boundaries, so the worker always saw `None` and the
   gate treated every request as "old client" — silently disabling
   the new resolver.
   Fix: add `client_api_version: Optional[int]` to `RequestBody`
   defaulting to `server_constants.API_VERSION` at client construction
   time. The field is serialized to the request DB and read back in
   the worker, so the gate has the real value without depending on
   ContextVar propagation. The gate signature becomes
   `(is_daemon, client_api_version)`. Old clients (pre-this-PR) do not
   set the field; Pydantic defaults to None and the gate treats them
   as "old client" — same legacy semantics as before. Drops the now
   unused `versions` import from `executor.py`.

2) `set_preferred_workspace` SDK method missing
   `@versions.minimal_api_version(...)` decorator (Devin)
   New client talking to an older server hit a confusing HTTP 404 on
   the unknown `/users/me/workspace` route instead of a clean
   `APINotSupportedError`. Added the decorator pointing at
   `MIN_PREFERRED_WORKSPACE_API_VERSION` so the version-skew error
   surfaces with the SDK method name.

3) `WorkspaceAmbiguousError` corrupted by pickle round-trip (Devin)
   The exception's executor-side pickle path
   (`sky/server/requests/serializers/encoders.py`) reconstructs it on
   the client. Python's default exception pickle calls
   `cls(*self.args)` where `self.args` is the formatted message
   string — `sorted(message_string)` then sorted individual
   characters into `accessible`, and `super().__init__` rebuilt a
   garbled guidance message. Added `__reduce__` returning
   `(cls, (self.accessible, self.note))` so the original constructor
   arguments survive the round-trip. (Matches the pattern used by
   `ExecutionRetryableError`.)

4) Drop dead `get_user_preferred_workspace` from `global_user_state.py`
   (Gemini)
   Gemini suggested `.scalar()` over `filter_by(...).first().attr`,
   but the function had no remaining callers — resolver now reads
   `user.preferred_workspace` directly from the `User` dataclass
   already loaded by `add_or_update_user(return_user=True)`. Deleted
   the function entirely (also moots the `.scalar()` suggestion).

5) `tests/unit_tests/test_lifecycle_hooks.py` was brittle
   (incidental — surfaced when this PR bumped API_VERSION)
   `test_tail_hook_logs_minimal_api_version_required` mocked the
   remote version to `server_constants.API_VERSION - 1`. That formula
   only fires the `tail_hook_logs` decorator (`@minimal_api_version
   (52)`) when API_VERSION is exactly 52; this PR bumped it to 53, so
   `53 - 1 = 52` no longer trips the gate and the test fell through
   to a real cluster lookup. Pinned the mock to `51` (one less than
   the function's own minimum) so the test is not coupled to global
   API_VERSION bumps.

Regression tests added:
* `TestSetPreferredWorkspaceSdkVersionGate` — exercises the
  decorator chain on the SDK function with a pinned old remote
  version.
* `TestRequestBodyCarriesClientApiVersion` — three tests covering
  default population, explicit override, and Pydantic JSON round-trip
  of the new field.
* `TestWorkspaceAmbiguousErrorSignature.test_pickle_roundtrip...` —
  pickles the exception with accessible + note, asserts both
  attributes and `str(e)` survive intact.

Validation:
* 60+ UTs pass locally; `bash format.sh` clean (pylint 9.94/10).
* Pickle round-trip verified empirically with `python -c` repro
  before/after the `__reduce__` fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`reviewable` CI job runs `git diff --exit-code
agent/skills/skypilot/references/` after `agent/scripts/generate_references.py`
and was failing on this PR because the new surface (--workspace flag on
sky launch / sky jobs launch, sky.set_preferred_workspace SDK) was not
reflected in the auto-generated reference docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Devin review caught that `local_active_workspace_ctx` in
sky/skypilot_config.py does not wrap its `yield` in try/finally, so a
caller whose `with` block raises leaks the new thread-local workspace
value into subsequent code paths in the same worker process
(ProcessPoolExecutor reuses workers).

This is a pre-existing latent bug, not introduced by this PR — but
this PR's new executor caller in `override_request_env_and_config`
deliberately raises PermissionDeniedError from
`reject_request_for_unauthorized_workspace` for workspace-rejection,
making the leak path part of normal flow rather than an edge case.

The practical impact on the resolver gate is masked because
`override_skypilot_config` already rebinds the global
`_active_workspace_context` to a fresh `threading.local()` at the
start of every request, so the leak does not actually corrupt the
next request's gate. But the function itself should be correct
contextmanager shape, and the other ~10 callers (sky check, core
launch, jobs server, serve server, backend_utils) also benefit.

Fix: wrap the yield in try/finally so the original workspace is
restored even when an exception escapes.

Regression test added: TestIsActiveWorkspaceSet
.test_local_active_workspace_ctx_cleanup_on_exception — uses a
synthetic exception inside `with local_active_workspace_ctx(...)`
and asserts the post-condition equals the pre-condition value.
Verified the test fails when the try/finally is removed (revert
check via standalone repro).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two related bugs caught by Devin's review on the previous push, both
guarded by new regression tests.

1) WorkspaceAmbiguousError fails JSON deserialization (multiple values
   for 'accessible')

   The request executor's exception transit is NOT direct pickle — it
   is `serialize_exception` -> dict {args, attributes, ...} ->
   `pickle_and_encode(dict)` -> request DB -> `decode_and_unpickle`
   -> `deserialize_exception` which reconstructs via
   `cls(*args, **attributes)`. For this exception:

     args        = (formatted_message,)  # from super().__init__(msg)
     attributes  = {'accessible': [...], 'note': ...}  # from __dict__
     ctor        = (accessible, note=None)

   `cls(formatted_message, accessible=[...], note=...)` -> TypeError:
   __init__() got multiple values for argument 'accessible'. Result:
   instead of the helpful "You belong to multiple workspaces… run
   `sky workspace use`" guidance, the affected client sees a TypeError
   — the feature is unusable for the exact users it targets.

   The previous `__reduce__` override I added only fixed the direct
   pickle path; the JSON serialize path runs through
   `serialize_exception` which ignores `__reduce__` entirely.

   Fix: extend SkyPilotExcludeArgsBaseException (existing marker in
   sky/exceptions.py) so `serialize_exception` zeros out `args`. Then
   deserialize calls `cls(**attributes)` cleanly. The `__reduce__`
   override is kept for the direct-pickle path (defense in depth).

   Regression test
   `test_serialize_deserialize_roundtrip_via_request_executor_path`
   exercises the actual serialize_exception → deserialize_exception
   path and asserts args==() + full str(e) round-trips.

2) `client_api_version` server-side default-fill bypasses the gate
   for old clients

   `RequestBody.__init__` unconditionally filled `client_api_version`
   with `server_constants.API_VERSION` if absent. Correct on the
   CLIENT side (the client stamps its own version). WRONG on the
   SERVER side: when an OLD client (pre-this-PR) sends a body
   omitting the field, Pydantic deserialization calls `__init__`
   again on the server, and the server's API_VERSION (53) gets
   stamped — making the executor gate think it's a new client. The
   resolver runs and may raise `WorkspaceAmbiguousError`, which the
   old client cannot deserialize.

   Fix: only apply the default-fill when `annotations.is_on_api_server`
   is False (i.e., inside a `@client_api` SDK wrapper). On the server
   side the field stays at its Pydantic-declared `None` default for
   missing input, which the executor gate correctly interprets as
   "old client → skip resolver".

   Regression tests
   `test_server_side_parse_of_old_client_body_leaves_field_none` and
   `test_server_side_parse_of_new_client_body_preserves_field` cover
   both legs of the wire format (missing field stays None; present
   field is preserved). `test_default_construction_populates_local_api_version`
   was renamed to `test_client_construction_stamps_local_api_version`
   and now explicitly flips `is_on_api_server` to False to mimic the
   `@client_api` runtime context.

Validation:
* 62 UTs pass locally (60 + 2 new server-side parse tests).
* Empirical repro before/after both fixes verified with `python -c`
  scripts driving `serialize_exception/deserialize_exception` and
  `RequestBody.model_validate_json` with the old-client JSON shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three related fixes uncovered while taking the dashboard end-to-end
through `test00` (a user with no 'default' workspace access).

1) Header-driven client version (replaces in-body client-side stamping)

Previously `RequestBody.__init__` stamped `client_api_version` on
the client side (gated by `annotations.is_on_api_server`). That
worked for the Python SDK but left the dashboard out — dashboard
JS does not construct `RequestBody`. Dropping the guard would not
help either, because server-side Pydantic deserialization of an
old client's body would then silently populate the field with the
server's own API_VERSION.

Single-source-of-truth design: the SDK already sends
`X-SkyPilot-API-Version` / `X-SkyPilot-Version` headers on every
request; `APIVersionMiddleware` parses them and sets the
`_remote_api_version` ContextVar in the FastAPI dispatch context.
`prepare_request_async` now reads that ContextVar and stamps it
onto the request body before persisting — one place, every
transport. Client-side `__init__` stamping is removed.

For the dashboard, the same two version headers are added to all
outgoing requests through `apiClient` plus three direct-fetch
sites that bypass apiClient and POST to queued routes
(`/jobs/download_logs`, `ssh_node_pools/{deploy,down}`).
`_check_version_compatibility` requires BOTH headers, so the
dashboard sends a placeholder readable version (`'dashboard;'`)
alongside `CLIENT_API_VERSION`.

2) Thread-local restoration in `local_active_workspace_ctx`

The previous try/finally fix still restored the wrong value on
exit: `get_active_workspace()` falls back to the literal
`SKYPILOT_DEFAULT_WORKSPACE` ('default') when nothing is set, so
a worker that handled one request entered the ctx with no
attribute and exited with the attribute SET to 'default'. The
next request from the same worker (especially dashboard requests
with empty `override_skypilot_config`, which short-circuits the
global rebind in `override_skypilot_config`) saw
`is_active_workspace_set()` == True, skipped the resolver, and
rejected users without 'default' access.

Now restores the thread-local attribute's STATE (had / didn't
have), not the resolved string. Downstream `get_active_workspace`
still chains to layer-2 / layer-3 fallbacks correctly.

3) CLI / SDK no longer hard-code 'default' into outgoing requests

`sky status` (cli/command.py) and `sdk.enabled_clouds` read
`get_active_workspace()` client-side and passed the result as an
explicit `?workspace=...`. With nothing set on the client this
resolved to the literal 'default' and was stamped on the wire as
explicit intent — which the server's resolver gate (c) honors,
skipping the resolver and rejecting users without 'default'
access. Both sites now send the workspace only when
`is_active_workspace_set()` is True; otherwise None and the
server-side resolver picks. `_show_enabled_infra` accepts
`Optional[str]` and skips the `(workspace: 'X')` annotation when
None, since the resolved workspace happens server-side.

Regression tests
* `test_local_active_workspace_ctx_removes_attr_when_originally_unset`
  — enter with no prior attribute, exit, assert the attribute is
  GONE and `is_active_workspace_set` reports False.
* `TestPrepareRequestAsyncStampsClientApiVersion` — exercises the
  ContextVar -> body stamping with header-present and -absent.
* Rewrote `TestRequestBodyCarriesClientApiVersion` to assert no
  client-side stamping happens in `__init__`.

Validation
* 65 UT pass locally.
* End-to-end: `sky status` and the dashboard at
  `test00:test00@127.0.0.1:46580` (user with access only to `pub`)
  now succeed where they previously returned
  `PermissionDeniedError: ... 'default'`. Server log confirms
  `resolved workspace 'pub' from single-membership for user test00`
  on every queued request.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…fault' access

When a managed job is submitted, the recorded `workspace` was read via
`get_active_workspace(force_user_workspace=True)`, which deliberately
bypasses the per-thread workspace context and falls back to the
config-side value (or the literal `'default'` when nothing is set).
That hack predates the per-user workspace resolver introduced by this
PR: the resolver now sets the thread-local at the request boundary
to whichever workspace the user actually intends for this request, so
force-bypassing it instead records the legacy `'default'` literal —
silently breaking users who have no access to the `'default'`
workspace (the exact scenario SKY-5468 targets).

Concretely, a user with access only to `'pub'` would submit a
managed job and see:
  cluster_table.workspace  = 'pub'    (recorded via the resolver-set
                                       thread-local)
  job_info_table.workspace = 'default' (recorded via
                                        force_user_workspace=True)
and the controller would then reject the job at the
`reject_request_for_unauthorized_workspace` check.

Fix: drop `force_user_workspace=True` at the two job-submission sites
that write `job_info.workspace`:

* `_maybe_submit_job_locally` (consolidation mode, jobs/server/core.py).
  The call point is before any nested
  `local_active_workspace_ctx(...)` is entered, so the thread-local
  here is the resolver's pick.
* `_submit_remotely` (non-consolidation mode, jobs/server/core.py).
  `_ensure_controller_up` has already entered AND exited its own
  controller-provisioning workspace ctx by this line, so the
  thread-local is restored to the outer (resolver-set) workspace.

`_exec_code_on_head` also reads `force_user_workspace=True` to populate
`ManagedJobInfo.workspace` on the gRPC `QueueJobRequest`, but the
skylet-side handler in `services.py::QueueJob` does not currently
read `request.managed_job` at all — the protobuf field is defined for
future use but is dead on the wire today. Left untouched aside from a
short comment so a future skylet that wires it up can find the
context.

Validation
* Existing UTs pass (65).
* `sky status` for a `test00` user with no `default` workspace access
  succeeds end-to-end against the local API server, where it
  previously hit
  `PermissionDeniedError: User test00 does not have permission to
  access workspace 'default'` on the managed-jobs view path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…surfaces

Auth middlewares (BasicAuth, AuthProxy, BearerToken, plugin-supplied)
populate request.state.auth_user from headers/JWT/credentials. The
header-only paths (AuthProxy and BearerToken) build the User from id +
name + type and never load the persisted preferred_workspace column.

Worker-side reads were already fine — executor.prepare_request_async
runs add_or_update_user(return_user=True) before the resolver sees the
user — but /api/health is synchronous and returned the middleware-
built User verbatim, which made `sky api info` print "(not set)" even
right after `sky workspace use <name>` succeeded.

Refresh the User from the DB inside health() so the response reflects
the persisted preference regardless of which middleware authenticated
the request.

Tests: 3 new tests in test_server.py covering the refresh, the
DB-miss fallback, and the unauthenticated path's skip of the lookup.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The executor's reload_for_new_request pipeline refreshes config and the
request-scoped lru cache on every queued worker request, but plugin
endpoints that respond as sync FastAPI handlers (i.e. not queued via
executor.schedule_request_async / prepare_request_async) bypass it.

Their process-cached workspace config snapshot can therefore lag behind
a worker process that just ran sky workspace create/update — a sync
handler filtering by accessible_workspaces will exclude newly-created
workspaces and include workspaces that have since been removed.

Add a refresh_workspace_state_for_sync_handler() helper next to
reload_for_new_request that plugin sync handlers call before reading
workspace/RBAC state. It does the minimum subset of the executor's
reload: safe_reload_config() to pick up new YAML/DB state, plus
clear_request_level_cache() to flush the per-request _load_workspaces
lru cache so the fresh config is observed.

Casbin's in-memory enforcer is refreshed lazily by
permission_service.get_user_roles (called inside
get_accessible_workspace_names), so it doesn't need a separate refresh
here.

Companion fix in feature-plugin wires this into the pagination and
secrets-manager plugins; both have sync handlers that read
workspaces_core.get_workspaces() / check_workspace_permission and were
returning stale lists post-workspace-update.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@DanielZhangQD DanielZhangQD force-pushed the sky-5468-workspace-default branch from 9f3d5a2 to bcadb83 Compare June 2, 2026 12:53
@DanielZhangQD

DanielZhangQD commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator Author

/quicktest-core --kubernetes --base-branch master
https://buildkite.com/skypilot-1/quicktest-core/builds/3329#_
/smoke-test --kubernetes -k test_workspace_switching
https://buildkite.com/skypilot-1/smoke-tests/builds/11078#_

@DanielZhangQD DanielZhangQD requested a review from aylei June 2, 2026 13:09

@aylei aylei left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest LGTM!

Comment thread sky/client/cli/command.py Outdated
@DanielZhangQD

DanielZhangQD commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator Author

/smoke-test --kubernetes
https://buildkite.com/skypilot-1/smoke-tests/builds/11079#_

@DanielZhangQD

Copy link
Copy Markdown
Collaborator Author

/quicktest-core --kubernetes --base-branch master

@DanielZhangQD DanielZhangQD merged commit f67ca60 into master Jun 3, 2026
22 of 23 checks passed
@DanielZhangQD DanielZhangQD deleted the sky-5468-workspace-default branch June 3, 2026 05:35
DanielZhangQD added a commit that referenced this pull request Jun 4, 2026
… shared constants (#9797)

* [Workspaces] Polish per-user resolver UX: scoped error, surfaced log, shared constants

Follow-ups after the per-user workspace resolver merge (#9755):

* AMBIGUOUS recovery message no longer advertises `--workspace` as a
  universal fix — it only exists on `sky launch` / `sky jobs launch`,
  so callers running `sky status` / `sky queue` etc. were being pointed
  at a flag that doesn't apply. The three recovery options now read as
  parallel choices instead of replacements ("Or, for a one-shot
  override...").
* On `sky launch` / `sky jobs launch`, when the resolver implicitly
  picks PREFERRED or SINGLE_MEMBERSHIP, log "Using workspace 'X'
  (source: ...)" at INFO so users see where their cluster / job
  actually landed. EXPLICIT (user-named) and DEFAULT_FALLBACK
  (pre-existing implicit behavior) stay silent to avoid noise on the
  common cases. The whitelist correctly uses the prefixed task name
  (`sky.launch`) — the previous draft compared against the bare enum
  value and never matched at runtime.
* `sky workspace info` payload contract: when the resolver raises
  WorkspaceAmbiguousError / NoWorkspaceAccessError / PermissionDenied,
  the handler now returns a state-coded `source`
  (ambiguous / no-access / permission-denied) plus a short single-line
  `note`, instead of dumping the multi-line exception message into
  `note` (which broke the CLI tree renderer). The CLI appends the
  AMBIGUOUS recovery hint as a separate paragraph below the tree, via
  a `WorkspaceAmbiguousError.recovery_hint()` classmethod shared
  between the exception and the CLI.
* Move the `WORKSPACE_SOURCE_*` constants to a new leaf module
  `sky/workspaces/constants.py` so client-side callers (CLI) no longer
  drag the resolver's server-only deps (check, global_user_state,
  backend_utils, permission, resource_checker) into a read path.

Tested:
* `pytest tests/unit_tests/test_sky/test_cli_workspace.py
   tests/unit_tests/test_sky/users/test_preferred_workspace_endpoints.py
   tests/unit_tests/test_sky/workspaces/test_resolve_workspace_for_user.py
   tests/unit_tests/test_sky/server/requests/test_executor.py
   tests/unit_tests/test_sky/workspaces/test_resolver_compat_matrix.py`
   — 103 passed, including new regressions:
  - bare `'launch'` (no `sky.` prefix) MUST NOT match the whitelist
    (pins the prefix protocol from both sides)
  - DEFAULT_FALLBACK must stay silent (locks the new silent set)
  - AMBIGUOUS message keeps `--workspace` only in the launch footnote
  - ambiguous / no-access / permission-denied payloads carry their
    state code + short note

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* address comment

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
DanielZhangQD added a commit that referenced this pull request Jun 8, 2026
* [Jobs] Pin active_workspace to cluster row in terminate_cluster

The jobs controller's `terminate_cluster` (used by both cancel and
recovery-teardown paths) calls `core.down` directly. `core.down` →
`backend.teardown` → `_teardown` does a RAW
`backend_utils.check_owner_identity` (cloud_vm_ray_backend.py:4519,
not auto-wrapped). The controller runs in the system/daemon process —
the executor's per-request resolver doesn't fire here (skipped by the
daemon gate in `_should_apply_workspace_resolver`), so
`skypilot_config.get_active_workspace()` falls back to 'default' and
the owner-identity check raises for any cluster whose recorded
workspace is not 'default':

    ClusterOwnerIdentityMismatchError: The cluster 'e82-X' is in
    workspace 'pub', but the active workspace is 'default'.

Pre-existing latent bug, surfaced more often after #9755 (per-user
preferred workspace) because users no longer need an explicit
`active_workspace:` in `~/.sky/config.yaml` to land on a non-default
workspace.

Fix: in `terminate_cluster`, look up the cluster row once before the
retry loop and wrap `core.down` in
`local_active_workspace_ctx(cluster.workspace)` — the same pattern
`_update_cluster_status_no_lock` already uses for its owner check.
Falls back to `nullcontext()` when the cluster row is gone (core.down
will raise `ClusterDoesNotExist` immediately and we return — no
workspace to pin).

Scope:
* Only `terminate_cluster` is fixed because it's the only controller-
  side caller that goes through a RAW (non-auto-wrapped) owner check.
* `controller.py:1935 core.status` and `:1947 core.cancel` route their
  owner check through `_update_cluster_status_no_lock`, which already
  auto-wraps in `local_active_workspace_ctx(record['workspace'])`.
* `recovery_strategy.py`'s `sdk.launch` / `sdk.exec` / `sdk.cancel`
  go HTTP → executor → resolver. The controller's process env carries
  the original submitter's `SKYPILOT_USER_ID` / `SKYPILOT_USER`
  (injected by `controller_utils.controller_only_vars_to_fill`), so
  the executor builds the original `User` row including its
  `preferred_workspace` and the resolver picks the right workspace.
* SkyServe replicas + jobs pool reuse the same controller-env
  injection and use `sdk.down` (HTTP) for teardown, so no equivalent
  bug there.

Tested:
* `pytest tests/unit_tests/test_jobs_utils.py -k terminate_cluster`
  — 6 passed (4 existing + 2 new):
  - `test_terminate_cluster_pins_active_workspace_from_cluster_record`
    mocks `get_cluster_from_name` to return `workspace='team-a'`,
    captures `get_active_workspace()` at the `core.down` call time
    and asserts 'team-a'. Revert-check: drop the `with workspace_ctx:`
    wrap → assertion flips.
  - `test_terminate_cluster_no_record_skips_workspace_pin` confirms
    the function doesn't crash when the cluster row is already gone.
* Manual: launched a managed job in workspace 'pub' with
  `preferred_workspace='pub'`. Both `sky jobs cancel` and triggered
  recovery (force-deleted underlying VM) now tear down + re-provision
  cleanly. Without the fix both surfaced
  ClusterOwnerIdentityMismatchError.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [Jobs] Construct workspace_ctx inside retry loop

`skypilot_config.local_active_workspace_ctx` is a
`@contextlib.contextmanager` generator and cannot be re-entered. The
previous version built the ctx once outside the `while True:` retry
loop — if `core.down` fails on the first try and the loop retries,
the second `with workspace_ctx:` raises `RuntimeError` (from the
spent generator) and masks the real underlying failure.

Move the ctx construction inside the loop so each retry attempt gets
a fresh generator. The DB lookup for `cluster_workspace` stays
outside the loop (cluster workspace is immutable for the life of the
cluster, no point re-querying per retry).

Caught by gemini-code-assist review on PR #9808.

Tested:
* New `test_terminate_cluster_retry_reenters_workspace_ctx`:
  mocks `get_cluster_from_name` to return `workspace='team-a'` (so
  the live ctx path is exercised, not `nullcontext()`), fails
  `core.down` twice then succeeds. Records `get_active_workspace()`
  inside each attempt and asserts `['team-a', 'team-a', 'team-a']`.
  Confirmed FAILS on the prior code with
  `AttributeError: '_GeneratorContextManager' object has no
  attribute 'args'` — exactly the symptom Gemini predicted. PASSES
  after this fix.
* `pytest tests/unit_tests/test_jobs_utils.py -k terminate_cluster`
  — 7 passed (5 prior + the new retry-reentry test).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants