add yotta #2

Draft
panf2333 wants to merge 138 commits into master from add_yotta_cloud

@panf2333
Collaborator

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

Philmod and others added 30 commits January 15, 2026 10:30
…ilot-org#8556)

* rsync: add --no-owner --no-group for both uploads and downloads

skypilot-org#8549

* Update command_runner.py
…ilot-org#8599)

* [Dashboard] Add external link to Grafana in GPU Metrics section

Add a button to the GPU Metrics header on the cluster detail page that
opens the full Grafana GPU metrics dashboard in a new tab. The link
includes the current time range and cluster name as query parameters.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* [Dashboard] Refactor Grafana URL to use URLSearchParams

Use URLSearchParams for cleaner URL construction as suggested in code review.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* Drop Python 3.7 and 3.8 support

* Update tests/test_ssh_proxy_lag.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update CONTRIBUTING.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* mypy

* fix pytest

* add more test log

* [CLI] Disable pool status check when specific clusters are queried

This PR's dependency upgrade (SQLAlchemy 2.0) increases database contention, causing 'sky status -r' to time out on file locks when concurrent 'pool_status' checks run. This fix disables the unnecessary pool checks for specific cluster queries, resolving the race condition.

* update CI python version

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…gy (skypilot-org#8537)

* [Helm] Allow RWX persistent storage with RollingUpdate upgrade strategy

Previously, using RollingUpdate upgrade strategy with an external database
required storage.enabled=false, causing file mounts and logs to be lost
during rolling updates.

This change allows persistent storage with RollingUpdate when using a
ReadWriteMany (RWX) access mode storage class. RWX storage (like Google
Filestore, AWS EFS, or Azure Files) allows multiple pods to mount the
same PVC simultaneously during rolling updates.

Changes:
- Modified api-deployment.yaml to allow storage.enabled=true with
  RollingUpdate when storage.accessMode=ReadWriteMany
- Updated values.yaml with comprehensive documentation about storage
  options and cloud-specific guidance for RWX storage classes
- Updated unit tests to verify the new behavior

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
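
The compatibility rule described in this commit can be sketched in Python. This is only an illustration of the logic: the real check lives in the Helm template (api-deployment.yaml), the function name `storage_config_is_valid` is hypothetical, and the assumption that strategies other than RollingUpdate accept any access mode is inferred from the message rather than confirmed by it.

```python
def storage_config_is_valid(upgrade_strategy: str,
                            storage_enabled: bool,
                            access_mode: str) -> bool:
    """Hypothetical mirror of the Helm-side rule, for illustration only."""
    # Without persistent storage there is nothing to share between pods.
    if not storage_enabled:
        return True
    # Assumption: only RollingUpdate runs old and new pods side by side.
    if upgrade_strategy != 'RollingUpdate':
        return True
    # During a rolling update both pods mount the same PVC, so the
    # volume must support ReadWriteMany (RWX).
    return access_mode == 'ReadWriteMany'
```

Under this sketch, `storage_config_is_valid('RollingUpdate', True, 'ReadWriteOnce')` is rejected, while the same call with `'ReadWriteMany'` is accepted.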

* [Helm] Persist managed job logs across rolling updates

When using RWX storage with RollingUpdate strategy, persist only the
specific managed job log directories to minimize storage usage:
- /root/sky_logs/jobs_controller - Controller logs (sky jobs logs --controller)
- /root/sky_logs/managed_jobs - Task execution logs

Transient logs (api_server, sky-* cluster logs) are NOT persisted as
they can be regenerated and would consume unnecessary storage.

This ensures job logs remain accessible after rolling updates while
keeping PVC storage usage minimal.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* [Helm] Skip file mount warning when persistent storage enabled

Add SKYPILOT_STORAGE_ENABLED environment variable that is set when
storage.enabled=true in Helm values. The file mounts rolling update
warning is now skipped when persistent storage is enabled since
file mounts are persisted to the PVC and survive rolling updates.

Also update the warning message to mention persistent storage as an
option and include a link to the Kubernetes deployment docs.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Helm] Rename storage env var to SKYPILOT_API_SERVER_STORAGE_ENABLED

Rename SKYPILOT_STORAGE_ENABLED to SKYPILOT_API_SERVER_STORAGE_ENABLED
to clarify that it controls persistent storage for the API server,
including managed job logs, file mounts, and API server state.

Changes:
- Rename constant in sky/skylet/constants.py with enhanced documentation
- Update usage in sky/jobs/server/core.py to use new constant name
- Default to 'true' when env var is unset (for local deployments)
- Update Helm template to always set env var explicitly to "true" or "false"
- Add comprehensive documentation to helm-values-spec.rst explaining:
  - What data is persisted (managed job logs, file mounts, API server state)
  - Environment variable behavior and defaults
  - Storage requirements for RollingUpdate strategy

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
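
A minimal sketch of the defaulting behavior described above, assuming a helper named `api_server_storage_enabled` (hypothetical; the actual constant and its usage in sky/skylet/constants.py and sky/jobs/server/core.py are not shown in this log). Only the variable name and the "unset means true" rule come from the commit message.

```python
import os

# Variable name taken from the commit message.
ENV_VAR = 'SKYPILOT_API_SERVER_STORAGE_ENABLED'


def api_server_storage_enabled(environ=None) -> bool:
    """Hypothetical helper: an unset variable is treated as 'true'."""
    env = os.environ if environ is None else environ
    # The Helm template always sets "true" or "false" explicitly, so the
    # default only matters for local (non-Helm) deployments.
    return env.get(ENV_VAR, 'true').strip().lower() == 'true'
```

Passing a plain dict instead of `os.environ` keeps the helper easy to unit test.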

* [Docs] Fix storage documentation to only mention persisted data

Remove mentions of "API server state" and "SSH keys" from storage
documentation. Only managed job logs and file mounts are actually
persisted to the PVC when storage.enabled=true.

Also add reference link to storage.accessMode in the upgradeStrategy
section for better documentation navigation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Helm] Address review comments for RWX storage documentation

- Remove explicit accessMode in test to verify default value behavior
- Remove unnecessary "with an external database" phrase from values.yaml
- Update helm-values-spec.rst with detailed storage.accessMode documentation
  including RWO/RWX compatibility with upgrade strategies

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* [Helm] Fix formatting and sync with master

- Fix line-too-long pylint issue in sky/jobs/server/core.py
- Sync helm values and docs with master (remove deprecated unified ingress)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* [Helm] Regenerate values schema

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* [Helm] Restore unified ingress field accidentally removed

Restore ingress.unified and grafana.ingress.annotations that were
accidentally removed during format.sh sync with master.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…ilot-org#8584)

* [Dashboard] Add waitForPlugins with requires_early_init support

Add a mechanism for plugins to signal when they need to initialize
before dashboard API calls (e.g., for fetch interception).

- Add `requires_early_init` property to BasePlugin (default: False)
- Include `requires_early_init` in /api/plugins manifest response
- PluginProvider adds `data-requires-early-init` attribute to script tags
- client.js waits for window.__skyPluginsReady when attribute is present

Zero latency for deployments without plugins needing early init.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
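
The server-side half of this mechanism might look roughly like the sketch below. Only the `requires_early_init` property and its default of False come from the commit message; the example subclass, the `plugin_manifest` function, and the manifest field layout are assumptions made for illustration.

```python
class BasePlugin:
    """Stand-in for the real BasePlugin; only the property is sketched."""

    @property
    def requires_early_init(self) -> bool:
        # Default from the commit message: plugins do not block the
        # dashboard's first API calls.
        return False


class FetchInterceptorPlugin(BasePlugin):
    """Hypothetical plugin that must load before any dashboard fetch."""

    @property
    def requires_early_init(self) -> bool:
        return True


def plugin_manifest(plugins):
    """Hypothetical shape of the /api/plugins manifest entries."""
    return [{
        'name': type(p).__name__,
        'requires_early_init': p.requires_early_init,
    } for p in plugins]
```

A client can then skip waiting entirely when no entry has `requires_early_init` set, which is what keeps the added latency at zero for deployments without such plugins.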

* Fix Prettier formatting in PluginProvider.jsx

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…pilot-org#8601)

- Replace Prometheus Operator guidance with community prometheus chart
- Remove SkyPilot prometheus-server chart references
- Add prometheus-values.yaml configuration with proper kube-state-metrics settings
- Use release name 'skypilot-prometheus' to create required 'skypilot-prometheus-server' service
- Add warning about Prometheus Operator's 'exported_' label prefix issue
- Remove ServiceMonitor YAML file (Prometheus Operator CRD)
- Remove Prometheus Operator screenshots from Nebius docs
- Remove charts/external-metrics directory (no longer recommended)
- Update publish-helm.yml workflow to remove external-metrics references

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* Extend plugin for more functionality

* gemini feedback
…ypilot-org#8524)

* feat: Add PVC error reporting for volumes

Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com>

* feat: Add error message column to volumes

Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com>

* Refactor: Improve volume error reporting and add kubectl commands

Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com>

* feat: Add refresh option to volume list and update usage tracking

Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com>

* Refactor volume refresh logic and add refresh option to volume_list

Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com>

* Refactor volume list to use SHORT schedule type

Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com>

* Test: Add k8s volume error handling and refresh tests

Adds unit tests for Kubernetes volume error detection and the volume refresh functionality. This includes testing for PVC pending states, access mode mismatches, and lost PVCs. It also verifies that `volume_list` correctly uses the `refresh` flag and displays error messages.

Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com>

* Refactor: Read volume usedby from DB, not cloud APIs

Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com>

* Add get_all_volumes_errors to kubernetes provision

Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com>

* Refactor: Rename VolumeStatus ERROR to NOT_READY

Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com>

* Refactor: Consolidate volume DB schema changes into one migration

Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com>

* [Dashboard] Use sentence case for volume status tooltip

- Change StatusBadge to use NonCapitalizedTooltip instead of CustomTooltip
- Remove className prop from StatusBadge tooltip to prevent style interference
- Add normal-case class to NonCapitalizedTooltip to explicitly prevent capitalization
- Strip className from props in NonCapitalizedTooltip to avoid CSS conflicts

This ensures error messages are displayed in sentence case rather than having
every word capitalized.

Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com>

* minor updates

* address comments

* update ut

* Update error str

* get pending reason from event

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: DanielZhang <zhanghailong810@aliyun.com>
)

* show cordon and taint info when showing gpu info

* address comments

* count gpu of cordon nodes as not ready

* update ut

* count gpus on tainted node as not ready

* update dashboard

* add column node status

* update style of node status column

* add ut for show-gpus
* keys

* reversion

* edit existing columns

* gemini comment 1

* gemini comment 2

* format
* support infiniband for together ai

* update image and add ut

* update image
…#8607)

* Do not persist sqlite db when rolling-update is enabled

Signed-off-by: Aylei <rayingecho@gmail.com>

* Fix

Signed-off-by: Aylei <rayingecho@gmail.com>

* Fix typo

Signed-off-by: Aylei <rayingecho@gmail.com>

---------

Signed-off-by: Aylei <rayingecho@gmail.com>
…utostop (skypilot-org#8412)

* autostop hook

* backward compat

* proto

* 1h timeout

* resolve PR comment

* resolve skylet comment

* add test case

* fix skylet

* fix test

* fix bug

* setup step

* fix test failure

* fix test failure

* fix test failure

* Update docs/source/reference/auto-stop.rst

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* resolve PR comment

* resolve PR comment

* add an AUTOSTOPPING state

* fix status bug

* support sky logs --autostop

* replace with wandb sync example

* abstract _handle_autostopping_cluster function

* timeout support for hook

* fix test failure

* fix test failure

* Update docs/source/reference/auto-stop.rst

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* Update docs/source/reference/auto-stop.rst

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* Update docs/source/reference/auto-stop.rst

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* resolve PR comment

* add more tests

* [CI] Fix helm publish workflow by allowing flexible separator length in README

* Update sky/client/cli/command.py

Co-authored-by: Christopher Cooper <christopher@cg505.com>

* docstring

* bump API VERSION

* enforce that hook_timeout is only set if hook is set

* hook

* resolve check_cluster_available

* remove returnval check

* remove inherit

* hook and hook timeout allow None

* [Skylet] Fix backward compatibility for hook_timeout in get_autostop_config

* resolve PR comment

* -f string

* remove getattr

* remove proc kill

* resolve comment

* debug log

* Revert refresh_autostop_status to is_definitely_autostopping

* remove forward compat test

* doc update

* update user case

* update example

* [Test] Fix codegen snapshot tests for timeout handling

Update snapshot tests to reflect simplified subprocess timeout cleanup
logic. The proc.wait() retry block was removed upstream but snapshots
were not updated.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* [Core] Fix AUTOSTOPPING status behavioral regression

The AUTOSTOPPING status was added to show users when autostop hooks
are executing. However, this created a behavioral regression where
jobs/serve recovery was triggered unintentionally (since
AUTOSTOPPING != UP).

This fix treats AUTOSTOPPING as "UP but shutting down":
- UP-like: SSH, URL, API access, controller operations work
- STOPPED-like: New task submissions rejected, no recovery

Changes:
- backend_utils.py: Accept AUTOSTOPPING in is_controller_accessible()
- execution.py: Reject new tasks on AUTOSTOPPING with clear error
- cli/command.py: Accept AUTOSTOPPING in sky url
- server/server.py: Accept AUTOSTOPPING in API SSH endpoint
- jobs/controller.py: Skip recovery for AUTOSTOPPING
- jobs/recovery_strategy.py: Skip recovery for AUTOSTOPPING
- serve/replica_managers.py: Don't treat AUTOSTOPPING as preempted
- serve/serve_utils.py: Return True for AUTOSTOPPING in is_cluster_up()

Tests:
- Add 5 unit tests for AUTOSTOPPING behavior
- Add 1 consolidated smoke test for AUTOSTOPPING behaviors

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
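
The "UP but shutting down" semantics can be summarized with three predicates. This is a simplified sketch, not the code from the files listed above: the enum is a stand-in, the function names are hypothetical, and treating every status other than UP and AUTOSTOPPING as a recovery trigger is an assumption.

```python
import enum


class ClusterStatus(enum.Enum):
    """Stand-in for the real status enum; only four states are modeled."""
    INIT = 'INIT'
    UP = 'UP'
    AUTOSTOPPING = 'AUTOSTOPPING'
    STOPPED = 'STOPPED'


def is_accessible(status: ClusterStatus) -> bool:
    # UP-like: SSH, endpoint, and controller access keep working.
    return status in (ClusterStatus.UP, ClusterStatus.AUTOSTOPPING)


def accepts_new_tasks(status: ClusterStatus) -> bool:
    # STOPPED-like: a cluster that is shutting down rejects new work.
    return status is ClusterStatus.UP


def should_recover(status: ClusterStatus) -> bool:
    # AUTOSTOPPING is an intentional shutdown, not a preemption, so it
    # must not trigger jobs/serve recovery.
    return status not in (ClusterStatus.UP, ClusterStatus.AUTOSTOPPING)
```

The regression described above corresponds to `should_recover` being written as `status != UP`, which returns True for AUTOSTOPPING.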

* doc update

* [Test] Fix pytest errors for autostopping tests

- Fix test_is_controller_accessible_accepts_autostopping: use
  Controllers.JOBS_CONTROLLER instead of non-existent Controllers.JOBS
- Fix test_check_cluster_available_accepts_autostopping and
  test_check_cluster_available_rejects_init: add mock for
  get_backend_from_handle to avoid NotImplementedError
- Update codegen snapshot files to match current formatting

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix test codegen

* restore doc change

* [Core] Make sky launch wait for AUTOSTOPPING to complete

When `sky launch` encounters a cluster in AUTOSTOPPING state, it now
waits for the autostop process (including hook execution) to complete,
then restarts the cluster. This prevents interrupting important cleanup
work being done by the autostop hook.

- Add polling loop with spinner in execution.py to wait for autostop
- Make CLUSTER_STATUS_CACHE_DURATION_SECONDS public in backend_utils.py
- Update test_launch_waits_for_autostopping smoke test

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
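
A polling loop of the kind described here could be sketched as follows. The function name, parameters, and string statuses are hypothetical; the real loop in execution.py also drives a spinner and refreshes the cached cluster status, which this sketch leaves out.

```python
import time
from typing import Callable, Optional


def wait_for_autostop(get_status: Callable[[], str],
                      poll_interval: float = 2.0,
                      timeout: Optional[float] = None,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Block until the cluster leaves AUTOSTOPPING; return the new status."""
    waited = 0.0
    while True:
        status = get_status()
        if status != 'AUTOSTOPPING':
            # The hook has finished (or timed out); the caller can now
            # restart the cluster.
            return status
        if timeout is not None and waited >= timeout:
            raise TimeoutError('cluster is still AUTOSTOPPING')
        sleep(poll_interval)
        waited += poll_interval
```

Injecting `get_status` and `sleep` lets a test drive the loop with a canned status sequence instead of a real cluster.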

* [Core] Replace ArgumentValidationError with ValueError in autostop

Use standard ValueError instead of custom ArgumentValidationError
exception for hook_timeout validation, and remove the unused
ArgumentValidationError class from exceptions.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
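
The validation rule referred to here (a hook_timeout is only valid together with a hook, reported through ValueError) can be sketched as below. The function name and error text are hypothetical; only the rule and the exception type come from the commit messages in this log.

```python
from typing import Optional


def validate_autostop_hook(hook: Optional[str],
                           hook_timeout: Optional[int]) -> None:
    """Hypothetical validator: hook_timeout requires hook to be set."""
    if hook_timeout is not None and hook is None:
        # A standard ValueError replaces the removed custom exception.
        raise ValueError('hook_timeout can only be set when hook is set.')
```

Both arguments may be None, which matches the earlier "hook and hook timeout allow None" commit.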

* remove redundant warning

* [Test] Fix test_launch_waits_for_autostopping spinner message capture

Use `script` command to capture terminal output including spinner
messages that use ANSI escape codes. The previous approach using
shell command substitution `$()` doesn't capture spinner text since
it's rendered in-place on the terminal rather than written as regular
stdout lines.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* [Test] Fix --port typo to --ports in test_autostopping_behaviors

The sky launch command uses --ports (plural), not --port.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* [Core] Fix UnboundLocalError for hook_timeout in _start

Initialize hook_timeout to None at the beginning of _start function,
similar to how hook is initialized. Without this, hook_timeout is only
defined when controller autostop config exists, but it's referenced
on line 628 for all clusters.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
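
The failure mode is easy to reproduce in isolation. The two functions below are hypothetical reductions of `_start`, not its actual body: the variable is assigned on only one branch and then read unconditionally.

```python
def start_buggy(controller_autostop_config):
    hook = None
    if controller_autostop_config is not None:
        hook = controller_autostop_config.get('hook')
        hook_timeout = controller_autostop_config.get('hook_timeout')
    # hook_timeout is unbound here for non-controller clusters.
    return hook, hook_timeout


def start_fixed(controller_autostop_config):
    hook = None
    hook_timeout = None  # initialized up front, like hook
    if controller_autostop_config is not None:
        hook = controller_autostop_config.get('hook')
        hook_timeout = controller_autostop_config.get('hook_timeout')
    return hook, hook_timeout
```

Calling `start_buggy(None)` raises UnboundLocalError, while `start_fixed(None)` returns `(None, None)`.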

* [Test] Fix test_autostopping_behaviors to use correct commands

- Replace non-existent `sky url` with `sky status --endpoint 8080`
- Fix SSH test to use `ssh {name}` instead of `sky ssh {name} --command`
- Update docstring to reflect the actual tests

Co-Authored-By: Claude <noreply@anthropic.com>

* fix endpoint error

* [Core] Reject sky exec on AUTOSTOPPING clusters

Add check in exec() to reject task submission when cluster is in
AUTOSTOPPING state, similar to how STOPPED clusters are rejected.
This prevents jobs from being submitted to a cluster that is shutting
down, which provides safer and more predictable behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* [Test] Fix timeout in test_autostopping_behaviors

Reduce hook_duration from 300 to 120 seconds. The hook duration must be
shorter than the autostop_timeout (250s) to allow the test to wait for
the cluster to reach STOPPED status.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* test autostop hook timeout

* [Core] Clean up timeout logic in run_with_log and add unit tests

Simplify the timeout handling in run_with_log by separating two cases:
- With stream processing: timer is the effective timeout mechanism,
  proc.wait() is called without timeout since process already terminated
- Without stream processing: proc.wait(timeout=timeout) is the primary
  timeout mechanism

Add unit tests for run_with_log timeout functionality covering both
process_stream=True and process_stream=False cases.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
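
The two timeout paths can be illustrated with a small standalone function. This is a sketch of the general technique under stated assumptions, not the actual run_with_log: the name `run_with_timeout`, its signature, and the choice to raise `subprocess.TimeoutExpired` in both cases are made up for the example.

```python
import subprocess
import threading


def run_with_timeout(cmd, timeout, process_stream):
    """Return the exit code, or raise subprocess.TimeoutExpired."""
    if not process_stream:
        # No stream to read: proc.wait(timeout=...) is the timeout.
        proc = subprocess.Popen(cmd,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        try:
            return proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()  # reap the killed process
            raise

    # Streaming: reading stdout blocks, so a timer kills the process.
    proc = subprocess.Popen(cmd,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            text=True)
    timed_out = threading.Event()

    def _on_timeout():
        if proc.poll() is None:  # still running when the timer fires
            timed_out.set()
            proc.kill()

    timer = threading.Timer(timeout, _on_timeout)
    timer.start()
    try:
        with proc.stdout:
            for _line in proc.stdout:
                pass  # a real implementation would log each line here
        # The stream is closed, so the process has already exited or
        # been killed; no timeout is needed on this wait.
        returncode = proc.wait()
    finally:
        timer.cancel()
    if timed_out.is_set():
        raise subprocess.TimeoutExpired(cmd, timeout)
    return returncode
```

In the streaming branch the read loop only ends once the child's output is closed, which is why the timer, not `proc.wait()`, has to enforce the deadline there.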

* [Test] Adjust autostop hook timeout test parameters

- Increase hook_timeout from 5s to 30s to ensure AUTOSTOPPING state is
  visible during status polling
- Increase hook_duration from 60s to 120s to ensure timeout triggers
- Simplify launch validation to avoid false failures when restarting
  a just-stopped cluster

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix unit test

---------

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Co-authored-by: Christopher Cooper <christopher@cg505.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* nightly build add job consolidation tests

* update trigger args

* log job submitted

* [CI] Fix helm publish workflow by allowing flexible separator length in README
…g a different CPU count (skypilot-org#8465)

* Fix AWS p5e.48xlarge instance type recognition (skypilot-org#8451)

The AWS API returns incorrect accelerator info ('NVIDIA') for p5e.48xlarge
instances, similar to p5en.48xlarge. This caused sky launch to fail when
specifying p5e.48xlarge directly.

Added p5e.48xlarge to the existing workaround that corrects the accelerator
name to 'H200' with count 8. Both p5e.48xlarge and p5en.48xlarge have
8x H200 GPUs, so they should use the same correction logic.
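
As a rough sketch of the correction described above: the instance types, the 'H200' name, and the count of 8 come from this commit message, while the function name, the set name, and the call shape are hypothetical and do not reproduce fetch_aws.py.

```python
# Instance types for which the AWS API reports a generic 'NVIDIA'
# accelerator name, per the commit message.
H200_INSTANCE_TYPES = {'p5e.48xlarge', 'p5en.48xlarge'}


def correct_accelerator(instance_type, acc_name, acc_count):
    """Hypothetical helper applying the H200 override."""
    if instance_type in H200_INSTANCE_TYPES:
        return 'H200', 8
    # Every other instance type keeps what the API reported.
    return acc_name, acc_count
```

Keeping both instance types in one set is one way to guarantee they share the same correction logic.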

* Add unit tests for AWS p5e/p5en H200 instance type workaround

* Remove test file that duplicates logic instead of testing actual code

The test was duplicating the workaround logic from fetch_aws.py rather
than testing the actual code, making it ineffective as a regression test.

* Replace gcc with build-essential in Dockerfile

build-essential includes gcc plus additional build tools needed for
compiling Python packages with native extensions.

* Use elif for mutually exclusive instance type checks

Improves performance by avoiding unnecessary checks after a condition
has been met.

* address code reviews

* fix formatting

* Revert Dockerfile change (keep gcc instead of build-essential)

The Dockerfile change is unrelated to the AWS p5e instance type fix.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* [GCP] Feat Queued Resources

* Update sky/provision/gcp/instance_utils.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update sky/resources.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update sky/provision/gcp/instance_utils.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* chore: linting

* reverted try except in waiting qr

* reverted line deletion in example config

* moved use_queued_resource to accelerator_args

* renamed argument to gcp_queued_resource

* reverted new line changes

---------

Co-authored-by: m-braganca <m-braganca@instadeep.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…r is used (skypilot-org#8326)

* Avoid erroring out for unknown instance type when GKE autoscaler is used

* format

* Update sky/provision/kubernetes/utils.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* add NFS to title

* wip

* updates

* update volumes

* comments

* update docs
* add examples

* Update

* Search tooling

* Search tooling

* update readme with blog URL

* add news

* update example

---------

Co-authored-by: makneee <makneee@node0.r650.sdcs-pg0.clemson.cloudlab.us>
Co-authored-by: makneee <makneee@node1.r6615.sdcs-pg0.clemson.cloudlab.us>
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* [Dashboard] Add glassy loading effect to infra page

- Add CSS styles for shimmer animation, skeleton placeholders, row shimmer,
  and glass overlay with backdrop blur
- Replace CircularProgress spinners with SkeletonBadge component for smoother
  loading experience
- Add glass overlay to tables during refresh with proper handling for
  Kubernetes progressive loading
- Fix row shimmer to work correctly for Slurm/SSH contexts by using hasGpuData
  instead of loadedContexts check
- Add error handling for GPU data fetch promises to prevent infinite spinner
  when a context fails to load

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Address code review feedback

- Use theme CSS variables (--border, --secondary) instead of hardcoded colors
- Use bg-muted and bg-card instead of bg-gray-100 and bg-white
- Use hover:bg-muted/50 instead of hover:bg-gray-50 for theme-aware hover
- Extract isTableRefreshing variable to reduce code duplication
- Use .finally() for promise cleanup to avoid duplicated pending count logic

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* plugin url normalization

* remove unnecessary args
* [Examples] Update NVIDIA Dynamo examples to use image_id instead of docker-compose

- Use official NGC container images (nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.7.1)
  instead of docker-compose for running NATS/etcd services
- Single-node: Use --store-kv file to skip etcd dependency, only run NATS
- Multi-node: Run NATS and etcd on head node, workers connect via env vars
- Remove setup phase that required Docker socket access
- Update README with container image information and feature highlights
- Upgrade Dynamo version from 0.4.1/0.5.0 to 0.7.1 for both examples

This simplifies deployment and improves Kubernetes compatibility by removing
the need for docker-compose inside pods.

* [Examples] Add Kubernetes pod config and H200 support for Dynamo examples

- Add config.kubernetes.pod_config to run container as root for rsync
- Add H200 GPU support alongside H100 in accelerator requirements
- Document Kubernetes configuration in README

Tested successfully on CoreWeave H200 Kubernetes cluster.

* Remove unnecessary comments from Dynamo examples

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
* [lint] update mypy to 1.19.1

The bug that forced us to use --cache-dir=/dev/null is fixed in 1.16,
so update to the latest version. Previously this was not possible
because newer versions did not support 3.8, but now we have dropped
3.8.

* remove --cache-dir null from pre-commit config
kevinmingtarja and others added 27 commits January 30, 2026 01:09
skypilot-org#8720)

* add smoke test

* improve test yaml

* add fix

* fix

* add todo

* check for ProctrackType

* unit tests

* improve comment
…ntainers (skypilot-org#8754)

* [Slurm] Only set up SSH keys and bashrc inside container when using containers

* simplify conditionals
…ot-org#8748)

* [Dev] Add Cursor worktrees.json for automated workspace setup

Adds a worktrees.json configuration file that automates the development
environment setup for Cursor IDE worktrees. This runs the standard setup
commands from CLAUDE.md:
- Creates Python 3.11 virtual environment with uv
- Installs the package in editable mode with all cloud support
- Installs development dependencies
- Builds the dashboard

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Dev] Simplify worktrees.json setup commands

Combine uv pip install commands and remove unnecessary venv activation.
uv automatically detects and uses the .venv directory.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
…#8756)

Add `apt-get upgrade` to update system packages in Stage 3 of the
Dockerfile. This fixes CVE-2025-15467 (CRITICAL) in libssl3t64,
openssl, and openssl-provider-legacy packages that was causing the
"Scan Docker Image Vulnerabilities" CI workflow to fail on master.

Co-authored-by: Cursor <cursoragent@cursor.com>
* fix error for the watch client

* update ut
* Introduce proxy auth support

Signed-off-by: Aylei <rayingecho@gmail.com>

* Helm docs

Signed-off-by: Aylei <rayingecho@gmail.com>

* Update sky/server/config.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update sky/server/config.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Simplify the doc

Signed-off-by: Aylei <rayingecho@gmail.com>

* Format

Signed-off-by: Aylei <rayingecho@gmail.com>

* Refine UT

Signed-off-by: Aylei <rayingecho@gmail.com>

---------

Signed-off-by: Aylei <rayingecho@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Expose the username of the user who launched the job via the
SKYPILOT_USER environment variable. This is available in both
setup and run scripts.

- Add SKYPILOT_USER to _skypilot_predefined_env_vars() in backend
- Document the new env var for both setup and run stages
- Add smoke test to verify SKYPILOT_USER is set
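
From inside a task's setup or run script, the new variable can be read like any other environment variable. The helper below is a hypothetical usage sketch; only the name SKYPILOT_USER comes from the commit message, and the fallback text is made up.

```python
import os


def describe_launcher(environ=None) -> str:
    """Hypothetical helper for use inside a setup or run script."""
    env = os.environ if environ is None else environ
    user = env.get('SKYPILOT_USER')
    # Fall back gracefully on versions that do not set the variable.
    return f'launched by {user}' if user else 'launcher unknown'
```

This could be used, for example, to tag experiment runs or log lines with the user who submitted the job.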
… are still deleting (skypilot-org#8752)

* force delete pods during cluster launch if pods from previous cluster are still deleting

* Apply suggestions from code review

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* format

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [CI] Fix helm publish workflow by allowing flexible separator length in README

* [Test] Fix flaky test_kubernetes_context_failover

The test was failing intermittently because of a race condition
between the API server's background on-boot check and test execution.

When the unreachable_context fixture restarts the API server, it
schedules a background `sky check` that runs asynchronously. This
check includes the unreachable context which causes connection
timeouts (30+ seconds). If the test runs before the background
check completes, the enabled clouds cache is empty, causing
"Kubernetes is not enabled" errors.

Fix by running `sky check kubernetes` synchronously after the API
server restart, ensuring the cache is populated before the test runs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* [Test] Add --infra kubernetes to enforce Kubernetes-only launches

Without --infra kubernetes, the optimizer was free to choose any
enabled cloud (e.g., Slurm), causing the test to fail when checking
that clusters launched on specific Kubernetes contexts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* Working autoscaling.

* format

* Fixes.

* Many fixes.

* Fix some of the policy.

* More fixes.

* Simplify logic, add quick downscaling.

* More simplification.

* Change some schema validation.

* Change spec.

* Add unit tests.

* Format

* Fix queue length setting.

* Fix updating.

* Update sky/serve/service_spec.py

Co-authored-by: Christopher Cooper <christopher@cg505.com>

* Add todo about scaling factor.

* Fix min replicas.

---------

Co-authored-by: Christopher Cooper <christopher@cg505.com>
* Initial commit.

* Do file mount and workdir validation.

* Add volume validation.

* Use enums for better validation.

* Remove categories.

* Move category over.

* Change recipe name structure.

* Add examples.

* Fix width issues.

* Fix timestamp.

* Add name to details page

* Refactoring.

* Make insert atomic.

* Atomic DB operations.

* Formatting.

* Remove unnecessary files

* Update tests.

* Change title for tab.

* Many dashboard fixes.

* More dashboard fixes.

* Formatting.

* Fix type dropdown.

* More fixes.

* Improve email tooltip.

* More fixes

* Fix vertical scrolling.

* Fix pool apply bug.

* Do remote yaml fetching

* Working volume launching.

* Move name validation into unit tests.

* Format.

* Simplify recipe retrieval.

* Formatting.

* Formatting again

* Fix unit tests.

* Small change.

* Add CI fix.

* Reload config.

* Change reload.
…kypilot-org#8765)

* [Test] Fix flaky test_interactive_auth_via_pty_and_unix_socket

* Add exec_event to break the deadlock

* Fix channel closure in interactive auth to prevent hangs

* Add comments back

Add missing comment
skip dependency tests for plugin upload