add yotta by panf2333 · Pull Request #2 · yottalabsai/skypilot

panf2333 · 2026-01-19T08:12:40Z

Tested (run the relevant ones):

Code formatting: install pre-commit (auto-check on commit) or bash format.sh
Any manual or new tests for this PR (please specify below)
All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)


…ilot-org#8599)
* [Dashboard] Add external link to Grafana in GPU Metrics section
  Add a button to the GPU Metrics header on the cluster detail page that opens the full Grafana GPU metrics dashboard in a new tab. The link includes the current time range and cluster name as query parameters.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* [Dashboard] Refactor Grafana URL to use URLSearchParams
  Use URLSearchParams for cleaner URL construction as suggested in code review.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

* Drop Python 3.7 and 3.8 support * Update tests/test_ssh_proxy_lag.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update CONTRIBUTING.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * mypy * fix pytest * add more test log * [CLI] Disable pool status check when specific clusters are queried This PR's dependency upgrade (SQLAlchemy 2.0) increases database contention, causing 'sky status -r' to time out on file locks when concurrent 'pool_status' checks run. This fix disables the unnecessary pool checks for specific cluster queries, resolving the race condition. * update CI python version --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

…gy (skypilot-org#8537) * [Helm] Allow RWX persistent storage with RollingUpdate upgrade strategy Previously, using RollingUpdate upgrade strategy with an external database required storage.enabled=false, causing file mounts and logs to be lost during rolling updates. This change allows persistent storage with RollingUpdate when using a ReadWriteMany (RWX) access mode storage class. RWX storage (like Google Filestore, AWS EFS, or Azure Files) allows multiple pods to mount the same PVC simultaneously during rolling updates. Changes: - Modified api-deployment.yaml to allow storage.enabled=true with RollingUpdate when storage.accessMode=ReadWriteMany - Updated values.yaml with comprehensive documentation about storage options and cloud-specific guidance for RWX storage classes - Updated unit tests to verify the new behavior Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * [Helm] Persist managed job logs across rolling updates When using RWX storage with RollingUpdate strategy, persist only the specific managed job log directories to minimize storage usage: - /root/sky_logs/jobs_controller - Controller logs (sky jobs logs --controller) - /root/sky_logs/managed_jobs - Task execution logs Transient logs (api_server, sky-* cluster logs) are NOT persisted as they can be regenerated and would consume unnecessary storage. This ensures job logs remain accessible after rolling updates while keeping PVC storage usage minimal. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * [Helm] Skip file mount warning when persistent storage enabled Add SKYPILOT_STORAGE_ENABLED environment variable that is set when storage.enabled=true in Helm values. The file mounts rolling update warning is now skipped when persistent storage is enabled since file mounts are persisted to the PVC and survive rolling updates. Also update the warning message to mention persistent storage as an option and include a link to the Kubernetes deployment docs. 
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * [Helm] Rename storage env var to SKYPILOT_API_SERVER_STORAGE_ENABLED Rename SKYPILOT_STORAGE_ENABLED to SKYPILOT_API_SERVER_STORAGE_ENABLED to clarify that it controls persistent storage for the API server, including managed job logs, file mounts, and API server state. Changes: - Rename constant in sky/skylet/constants.py with enhanced documentation - Update usage in sky/jobs/server/core.py to use new constant name - Default to 'true' when env var is unset (for local deployments) - Update Helm template to always set env var explicitly to "true" or "false" - Add comprehensive documentation to helm-values-spec.rst explaining: - What data is persisted (managed job logs, file mounts, API server state) - Environment variable behavior and defaults - Storage requirements for RollingUpdate strategy Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * [Docs] Fix storage documentation to only mention persisted data Remove mentions of "API server state" and "SSH keys" from storage documentation. Only managed job logs and file mounts are actually persisted to the PVC when storage.enabled=true. Also add reference link to storage.accessMode in the upgradeStrategy section for better documentation navigation. 
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * [Helm] Address review comments for RWX storage documentation - Remove explicit accessMode in test to verify default value behavior - Remove unnecessary "with an external database" phrase from values.yaml - Update helm-values-spec.rst with detailed storage.accessMode documentation including RWO/RWX compatibility with upgrade strategies Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * [Helm] Fix formatting and sync with master - Fix line-too-long pylint issue in sky/jobs/server/core.py - Sync helm values and docs with master (remove deprecated unified ingress) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * [Helm] Regenerate values schema Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * [Helm] Restore unified ingress field accidentally removed Restore ingress.unified and grafana.ingress.annotations that were accidentally removed during format.sh sync with master. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
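The environment variable behavior described in the commits above (default to 'true' when unset, skip the file mount warning when storage is enabled) can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: only the variable name SKYPILOT_API_SERVER_STORAGE_ENABLED comes from the commit message, while both helper functions and the exact string parsing are assumptions.

```python
import os

# Variable name taken from the commit message; everything else is illustrative.
STORAGE_ENABLED_ENV_VAR = 'SKYPILOT_API_SERVER_STORAGE_ENABLED'


def is_storage_enabled(environ=None) -> bool:
    """Hypothetical helper: treat an unset variable as 'true'."""
    env = os.environ if environ is None else environ
    # Assumption: the value is compared case-insensitively against 'true'.
    return env.get(STORAGE_ENABLED_ENV_VAR, 'true').strip().lower() == 'true'


def should_warn_about_file_mounts(rolling_update: bool, environ=None) -> bool:
    """Hypothetical helper: warn only if a rolling update could lose file mounts."""
    return rolling_update and not is_storage_enabled(environ)
```

Passing a plain dict as `environ` keeps the sketch easy to exercise without touching the real process environment.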

…ilot-org#8584) * [Dashboard] Add waitForPlugins with requires_early_init support Add a mechanism for plugins to signal when they need to initialize before dashboard API calls (e.g., for fetch interception). - Add `requires_early_init` property to BasePlugin (default: False) - Include `requires_early_init` in /api/plugins manifest response - PluginProvider adds `data-requires-early-init` attribute to script tags - client.js waits for window.__skyPluginsReady when attribute is present Zero latency for deployments without plugins needing early init. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Fix Prettier formatting in PluginProvider.jsx Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

…pilot-org#8601)
- Replace Prometheus Operator guidance with community prometheus chart
- Remove SkyPilot prometheus-server chart references
- Add prometheus-values.yaml configuration with proper kube-state-metrics settings
- Use release name 'skypilot-prometheus' to create required 'skypilot-prometheus-server' service
- Add warning about Prometheus Operator's 'exported_' label prefix issue
- Remove ServiceMonitor YAML file (Prometheus Operator CRD)
- Remove Prometheus Operator screenshots from Nebius docs
- Remove charts/external-metrics directory (no longer recommended)
- Update publish-helm.yml workflow to remove external-metrics references
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>


…ypilot-org#8524) * feat: Add PVC error reporting for volumes Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com> * feat: Add error message column to volumes Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com> * Refactor: Improve volume error reporting and add kubectl commands Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com> * feat: Add refresh option to volume list and update usage tracking Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com> * Refactor volume refresh logic and add refresh option to volume_list Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com> * Refactor volume list to use SHORT schedule type Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com> * Test: Add k8s volume error handling and refresh tests Adds unit tests for Kubernetes volume error detection and the volume refresh functionality. This includes testing for PVC pending states, access mode mismatches, and lost PVCs. It also verifies that `volume_list` correctly uses the `refresh` flag and displays error messages. 
Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com> * Refactor: Read volume usedby from DB, not cloud APIs Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com> * Add get_all_volumes_errors to kubernetes provision Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com> * Refactor: Rename VolumeStatus ERROR to NOT_READY Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com> * Refactor: Consolidate volume DB schema changes into one migration Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com> * [Dashboard] Use sentence case for volume status tooltip - Change StatusBadge to use NonCapitalizedTooltip instead of CustomTooltip - Remove className prop from StatusBadge tooltip to prevent style interference - Add normal-case class to NonCapitalizedTooltip to explicitly prevent capitalization - Strip className from props in NonCapitalizedTooltip to avoid CSS conflicts This ensures error messages are displayed in sentence case rather than having every word capitalized. Co-authored-by: romil.bhardwaj <romil.bhardwaj@gmail.com> * minor updates * address comments * update ut * Update error str * get pending reason from event --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: DanielZhang <zhanghailong810@aliyun.com>

) * show cordon and taint info when showing gpu info * address comments * count gpu of cordon nodes as not ready * update ut * count gpus on tainted node as not ready * update dashboard * add column node status * update style of node status column * add ut for show-gpus


…#8607) * Do not persist sqlite db when rolling-update is enabled Signed-off-by: Aylei <rayingecho@gmail.com> * Fix Signed-off-by: Aylei <rayingecho@gmail.com> * Fix typo Signed-off-by: Aylei <rayingecho@gmail.com> --------- Signed-off-by: Aylei <rayingecho@gmail.com>

…utostop (skypilot-org#8412) * autostop hook * backward compact * proto * 1h timeout * resolve PR comment * resolve skylet comment * add test case * fix skylet * fix test * fix bug * setup step * fix test failure * fix test failure * fix test failure * Update docs/source/reference/auto-stop.rst Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * resolve PR comment * resolve PR comment * add a AUTOSTOPPING state * fix status bug * support sky logs --autostop * replace with wandb sync example * abstract _handle_autostopping_cluster function * timeout support for hook * fix test failure * fix test failure * Update docs/source/reference/auto-stop.rst Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * Update docs/source/reference/auto-stop.rst Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * Update docs/source/reference/auto-stop.rst Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * resovle PR comment * add more tests * [CI] Fix helm publish workflow by allowing flexible separator length in README * Update sky/client/cli/command.py Co-authored-by: Christopher Cooper <christopher@cg505.com> * docstring * bump API VERSION * enforce that hook_timeout is only set if hook is set * hook * resolve check_cluster_available * remove returnval check * remove inherit * hook and hook timeout allow None * [Skylet] Fix backward compatibility for hook_timeout in get_autostop_config * resolve PR comment * -f string * remove getattr * remove proc kill * resolve comment * debug log * Revert refresh_autostop_status to is_definitely_autostopping * remove forward compact test * doc update * update user case * update example * [Test] Fix codegen snapshot tests for timeout handling Update snapshot tests to reflect simplified subprocess timeout cleanup logic. The proc.wait() retry block was removed upstream but snapshots were not updated. 
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * [Core] Fix AUTOSTOPPING status behavioral regression The AUTOSTOPPING status was added to show users when autostop hooks are executing. However, this created a behavioral regression where jobs/serve recovery was triggered unintentionally (since AUTOSTOPPING != UP). This fix treats AUTOSTOPPING as "UP but shutting down": - UP-like: SSH, URL, API access, controller operations work - STOPPED-like: New task submissions rejected, no recovery Changes: - backend_utils.py: Accept AUTOSTOPPING in is_controller_accessible() - execution.py: Reject new tasks on AUTOSTOPPING with clear error - cli/command.py: Accept AUTOSTOPPING in sky url - server/server.py: Accept AUTOSTOPPING in API SSH endpoint - jobs/controller.py: Skip recovery for AUTOSTOPPING - jobs/recovery_strategy.py: Skip recovery for AUTOSTOPPING - serve/replica_managers.py: Don't treat AUTOSTOPPING as preempted - serve/serve_utils.py: Return True for AUTOSTOPPING in is_cluster_up() Tests: - Add 5 unit tests for AUTOSTOPPING behavior - Add 1 consolidated smoke test for AUTOSTOPPING behaviors Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * doc update * [Test] Fix pytest errors for autostopping tests - Fix test_is_controller_accessible_accepts_autostopping: use Controllers.JOBS_CONTROLLER instead of non-existent Controllers.JOBS - Fix test_check_cluster_available_accepts_autostopping and test_check_cluster_available_rejects_init: add mock for get_backend_from_handle to avoid NotImplementedError - Update codegen snapshot files to match current formatting Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix test codegen * restore doc change * [Core] Make sky launch wait for AUTOSTOPPING to complete When `sky launch` encounters a cluster in AUTOSTOPPING state, it now waits for the autostop process (including hook execution) to complete, then restarts the cluster. This prevents interrupting important cleanup work being done by the autostop hook. 
- Add polling loop with spinner in execution.py to wait for autostop - Make CLUSTER_STATUS_CACHE_DURATION_SECONDS public in backend_utils.py - Update test_launch_waits_for_autostopping smoke test Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * [Core] Replace ArgumentValidationError with ValueError in autostop Use standard ValueError instead of custom ArgumentValidationError exception for hook_timeout validation, and remove the unused ArgumentValidationError class from exceptions.py. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * remove redundant warning * [Test] Fix test_launch_waits_for_autostopping spinner message capture Use `script` command to capture terminal output including spinner messages that use ANSI escape codes. The previous approach using shell command substitution `$()` doesn't capture spinner text since it's rendered in-place on the terminal rather than written as regular stdout lines. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * [Test] Fix --port typo to --ports in test_autostopping_behaviors The sky launch command uses --ports (plural), not --port. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * [Core] Fix UnboundLocalError for hook_timeout in _start Initialize hook_timeout to None at the beginning of _start function, similar to how hook is initialized. Without this, hook_timeout is only defined when controller autostop config exists, but it's referenced on line 628 for all clusters. 
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * [Test] Fix test_autostopping_behaviors to use correct commands - Replace non-existent `sky url` with `sky status --endpoint 8080` - Fix SSH test to use `ssh {name}` instead of `sky ssh {name} --command` - Update docstring to reflect the actual tests Co-Authored-By: Claude <noreply@anthropic.com> * fix endpoint error * [Core] Reject sky exec on AUTOSTOPPING clusters Add check in exec() to reject task submission when cluster is in AUTOSTOPPING state, similar to how STOPPED clusters are rejected. This prevents jobs from being submitted to a cluster that is shutting down, which provides safer and more predictable behavior. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * [Test] Fix timeout in test_autostopping_behaviors Reduce hook_duration from 300 to 120 seconds. The hook duration must be shorter than the autostop_timeout (250s) to allow the test to wait for the cluster to reach STOPPED status. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * test autostop hook timeout * [Core] Clean up timeout logic in run_with_log and add unit tests Simplify the timeout handling in run_with_log by separating two cases: - With stream processing: timer is the effective timeout mechanism, proc.wait() is called without timeout since process already terminated - Without stream processing: proc.wait(timeout=timeout) is the primary timeout mechanism Add unit tests for run_with_log timeout functionality covering both process_stream=True and process_stream=False cases. 
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * [Test] Adjust autostop hook timeout test parameters - Increase hook_timeout from 5s to 30s to ensure AUTOSTOPPING state is visible during status polling - Increase hook_duration from 60s to 120s to ensure timeout triggers - Simplify launch validation to avoid false failures when restarting a just-stopped cluster Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix unit test --------- Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> Co-authored-by: Christopher Cooper <christopher@cg505.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
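The "UP but shutting down" treatment of AUTOSTOPPING described in the commits above can be summarized in a small sketch. The enum and the three predicate functions below are hypothetical stand-ins written for illustration; they are not the project's actual classes or function names, and they only encode the rules stated in the commit message (access still works, new tasks are rejected, recovery is skipped).

```python
import enum
from typing import Optional


class ClusterStatus(enum.Enum):
    # Hypothetical stand-in for the project's cluster status enum.
    INIT = 'INIT'
    UP = 'UP'
    AUTOSTOPPING = 'AUTOSTOPPING'
    STOPPED = 'STOPPED'


# Statuses in which the cluster is still reachable (the "UP-like" side).
_ACCESSIBLE = (ClusterStatus.UP, ClusterStatus.AUTOSTOPPING)


def is_accessible(status: Optional[ClusterStatus]) -> bool:
    """SSH, endpoint and controller access keep working while autostop hooks run."""
    return status in _ACCESSIBLE


def accepts_new_tasks(status: Optional[ClusterStatus]) -> bool:
    """New task submissions are rejected once the cluster is shutting down."""
    return status is ClusterStatus.UP


def needs_recovery(status: Optional[ClusterStatus]) -> bool:
    """AUTOSTOPPING is an intentional shutdown, so it must not trigger recovery."""
    return status not in _ACCESSIBLE
```

The regression fixed in the PR corresponds to `needs_recovery` previously being equivalent to `status != UP`, which fired for AUTOSTOPPING as well.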


…g a different CPU count (skypilot-org#8465) * Fix AWS p5e.48xlarge instance type recognition (skypilot-org#8451) The AWS API returns incorrect accelerator info ('NVIDIA') for p5e.48xlarge instances, similar to p5en.48xlarge. This caused sky launch to fail when specifying p5e.48xlarge directly. Added p5e.48xlarge to the existing workaround that corrects the accelerator name to 'H200' with count 8. Both p5e.48xlarge and p5en.48xlarge have 8x H200 GPUs, so they should use the same correction logic. * Add unit tests for AWS p5e/p5en H200 instance type workaround * Remove test file that duplicates logic instead of testing actual code The test was duplicating the workaround logic from fetch_aws.py rather than testing the actual code, making it ineffective as a regression test. * Replace gcc with build-essential in Dockerfile build-essential includes gcc plus additional build tools needed for compiling Python packages with native extensions. * Use elif for mutually exclusive instance type checks Improves performance by avoiding unnecessary checks after a condition has been met. * address code reviews * fix formatting * Revert Dockerfile change (keep gcc instead of build-essential) The Dockerfile change is unrelated to the AWS p5e instance type fix. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
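The accelerator correction described above might look roughly like the following. This is a sketch under stated assumptions: the function name, its signature, and the tuple of instance types are invented for illustration and do not reproduce fetch_aws.py; only the instance type names, the incorrect 'NVIDIA' value, and the corrected 'H200' with count 8 come from the commit message.

```python
from typing import Optional, Tuple

# Both instance types carry 8x H200 GPUs according to the commit message.
_H200_INSTANCE_TYPES = ('p5e.48xlarge', 'p5en.48xlarge')


def correct_accelerator(
        instance_type: str,
        acc_name: Optional[str],
        acc_count: Optional[int],
) -> Tuple[Optional[str], Optional[int]]:
    """Hypothetical helper: override accelerator info the API reports incorrectly."""
    if instance_type in _H200_INSTANCE_TYPES:
        # The API returns the generic name 'NVIDIA' here, so pin the real values.
        return 'H200', 8
    # Every other instance type keeps whatever the API reported.
    return acc_name, acc_count
```

A membership test is used here instead of the `elif` chain mentioned in the commits; either form keeps the checks mutually exclusive.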

* [GCP] Feat Queued Resources * Update sky/provision/gcp/instance_utils.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update sky/resources.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update sky/provision/gcp/instance_utils.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * chore: linting * reverted try except in waiting qr * reverted line deletion in example config * moved use_queued_resource to accelerator_args * renamed argument to gcp_queued_resource * reverted new line changes --------- Co-authored-by: m-braganca <m-braganca@instadeep.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>


…r is used (skypilot-org#8326) * Avoid erroring out for unknown instance type when GKE autoscaler is used * format * Update sky/provision/kubernetes/utils.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>


* add examples * Uppdate * Search tooling * Search tooling * update readme with blog URL * add news * update example --------- Co-authored-by: makneee <makneee@node0.r650.sdcs-pg0.clemson.cloudlab.us> Co-authored-by: makneee <makneee@node1.r6615.sdcs-pg0.clemson.cloudlab.us> Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>


* [Dashboard] Add glassy loading effect to infra page - Add CSS styles for shimmer animation, skeleton placeholders, row shimmer, and glass overlay with backdrop blur - Replace CircularProgress spinners with SkeletonBadge component for smoother loading experience - Add glass overlay to tables during refresh with proper handling for Kubernetes progressive loading - Fix row shimmer to work correctly for Slurm/SSH contexts by using hasGpuData instead of loadedContexts check - Add error handling for GPU data fetch promises to prevent infinite spinner when a context fails to load Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Address code review feedback - Use theme CSS variables (--border, --secondary) instead of hardcoded colors - Use bg-muted and bg-card instead of bg-gray-100 and bg-white - Use hover:bg-muted/50 instead of hover:bg-gray-50 for theme-aware hover - Extract isTableRefreshing variable to reduce code duplication - Use .finally() for promise cleanup to avoid duplicated pending count logic Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>


* [Examples] Update NVIDIA Dynamo examples to use image_id instead of docker-compose - Use official NGC container images (nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.7.1) instead of docker-compose for running NATS/etcd services - Single-node: Use --store-kv file to skip etcd dependency, only run NATS - Multi-node: Run NATS and etcd on head node, workers connect via env vars - Remove setup phase that required Docker socket access - Update README with container image information and feature highlights - Upgrade Dynamo version from 0.4.1/0.5.0 to 0.7.1 for both examples This simplifies deployment and improves Kubernetes compatibility by removing the need for docker-compose inside pods. * [Examples] Add Kubernetes pod config and H200 support for Dynamo examples - Add config.kubernetes.pod_config to run container as root for rsync - Add H200 GPU support alongside H100 in accelerator requirements - Document Kubernetes configuration in README Tested successfully on CoreWeave H200 Kubernetes cluster. * Remove unnecessary comments from Dynamo examples --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com>

* [lint] update mypy to 1.19.1 The bug that forced us to use --cache-dir=/dev/null is fixed in 1.16, so update to the latest version. Previously this was not possible because newer versions did not support 3.8, but now we have dropped 3.8. * remove --cache-dir null from pre-commit config


…ot-org#8748) * [Dev] Add Cursor worktrees.json for automated workspace setup Adds a worktrees.json configuration file that automates the development environment setup for Cursor IDE worktrees. This runs the standard setup commands from CLAUDE.md: - Creates Python 3.11 virtual environment with uv - Installs the package in editable mode with all cloud support - Installs development dependencies - Builds the dashboard Co-authored-by: Cursor <cursoragent@cursor.com> * [Dev] Simplify worktrees.json setup commands Combine uv pip install commands and remove unnecessary venv activation. uv automatically detects and uses the .venv directory. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>

…#8756) Add `apt-get upgrade` to update system packages in Stage 3 of the Dockerfile. This fixes CVE-2025-15467 (CRITICAL) in libssl3t64, openssl, and openssl-provider-legacy packages that was causing the "Scan Docker Image Vulnerabilities" CI workflow to fail on master. Co-authored-by: Cursor <cursoragent@cursor.com>


* Introduce proxy auth support Signed-off-by: Aylei <rayingecho@gmail.com> * Helm docs Signed-off-by: Aylei <rayingecho@gmail.com> * Update sky/server/config.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update sky/server/config.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Simplify the doc Signed-off-by: Aylei <rayingecho@gmail.com> * Format Signed-off-by: Aylei <rayingecho@gmail.com> * Refine UT Signed-off-by: Aylei <rayingecho@gmail.com> --------- Signed-off-by: Aylei <rayingecho@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Expose the username of the user who launched the job via the SKYPILOT_USER environment variable. This is available in both setup and run scripts.
- Add SKYPILOT_USER to _skypilot_predefined_env_vars() in backend
- Document the new env var for both setup and run stages
- Add smoke test to verify SKYPILOT_USER is set
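From inside a task, a Python process started by a setup or run script could read the variable along these lines. The helper name and the 'unknown' fallback are assumptions made for this sketch; only the SKYPILOT_USER variable name comes from the commit message.

```python
import os


def launching_user(environ=None, default: str = 'unknown') -> str:
    """Hypothetical helper: return the user who launched the job, if exposed."""
    env = os.environ if environ is None else environ
    # SKYPILOT_USER is the variable named in the commit message; the fallback
    # value is an assumption for environments where it is not set.
    return env.get('SKYPILOT_USER', default)


if __name__ == '__main__':
    print(f'Job launched by: {launching_user()}')
```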

… are still deleting (skypilot-org#8752) * force delete pods during cluster launch if pods from previous cluster is still deleting * Apply suggestions from code review Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * format --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [CI] Fix helm publish workflow by allowing flexible separator length in README * [Test] Fix flaky test_kubernetes_context_failover The test was failing intermittently because of a race condition between the API server's background on-boot check and test execution. When the unreachable_context fixture restarts the API server, it schedules a background `sky check` that runs asynchronously. This check includes the unreachable context which causes connection timeouts (30+ seconds). If the test runs before the background check completes, the enabled clouds cache is empty, causing "Kubernetes is not enabled" errors. Fix by running `sky check kubernetes` synchronously after the API server restart, ensuring the cache is populated before the test runs. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * [Test] Add --infra kubernetes to enforce Kubernetes-only launches Without --infra kubernetes, the optimizer was free to choose any enabled cloud (e.g., Slurm), causing the test to fail when checking that clusters launched on specific Kubernetes contexts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

* Working autoscaling. * forma * Fixes. * Many fixes. * Fix some of the policy. * More fixes. * Simplify logic, add quick downscaling. * More simplification. * Change some schema validation. * Change spec. * Add unit tests. * Format * Fix queue length setting. * Fix updating. * Update sky/serve/service_spec.py Co-authored-by: Christopher Cooper <christopher@cg505.com> * Add todo about scaling factor. * Fix min replicas. --------- Co-authored-by: Christopher Cooper <christopher@cg505.com>


* Initial commit. * Do file mount and workdir validation. * Add volume validation. * Use enums for better validation. * Remove categories. * Move category over. * Change recipe name structure. * Add examples. * Fix width issues. * Fix timestamp. * Add name to details page * Refactoring. * Make insert atomic. * Atomic DB operations. * Formatting. * Remove unnecessary files * Update tests. * Change title for tab. * Many dashboard fixes. * More dashboard fixes. * Formatting. * Fix type dropdown. * More fixes. * Improve email tooltip. * More fixes * Fix vertical scrolling. * Fix pool apply bug. * Do remote yaml fetching * Working volume launching. * Move name validation into unit tests. * Format. * Simplify recipe retrieval. * Formatting. * Formatting again * Fix unit tests. * Small change. * Add CI fix. * Reload config, * Change reload.


…kypilot-org#8765) * [Test] Fix flaky test_interactive_auth_via_pty_and_unix_socket * Add exec_event to break the deadlock * Fix channel closure in interactive auth to prevent hangs * Add comments back Add missing comment


Philmod and others added 30 commits January 15, 2026 10:30

rsync: add --no-owner --no-group for both uploads and downloads (skyp…

8f559cc

…ilot-org#8556) * rsync: add --no-owner --no-group for both uploads and downloads skypilot-org#8549 * Update command_runner.py

[Slurm] Fix run_in_background option in CommandRunner (skypilot-org#8577)

abc30a0

[Pools] Fix Secret Validation When Updating Pools (skypilot-org#8495)

97a7eb0

Fix secret issue.

Extend plugin for more functionality (skypilot-org#8602)

6ad4552

* Extend plugin for more functionality * gemini feedback

Plugin improvements to table column replacement (skypilot-org#8610)

f98bfb0

* keys * reversion * edit existing columns * gemini comment 1 * gemini comment 2 * format

[Core] Support InfiniBand for Together AI (skypilot-org#8581)

a17f57b

* support infiniband for together ai * update image and add ut * update image

Nightly build add job consolidation tests (skypilot-org#8159)

355fc52

* nightly build add job consolidation tests * update trigger args * log job submitted * [CI] Fix helm publish workflow by allowing flexible separator length in README

Update Mistral docs link (skypilot-org#8623)

45deabc

[Vast] Fix SSH authentication by injecting SkyPilot public key into c…

d63039c

…ontainers (skypilot-org#8614)

[Docs] Remove sky status --k8s from docs, update dashboard screensh…

b8e84a3

…ots (skypilot-org#8603) * Update k8s docs + screenshots * add spacing

[Auth] SSH key race condition in Lambda authentication setup (skypilo…

f51dcc7

…t-org#8224) Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

[Volumes] Update volumes docs (skypilot-org#8612)

3e79de0

* add NFS to title * wip * updates * update volumes * comments * update docs

Fix buildkite test plugin test support (skypilot-org#8625)

ee3c3c3

fix plugin test

Plugin url normalization (skypilot-org#8628)

67fde52

* plugin url normalization * remove unnecessary args

kevinmingtarja and others added 27 commits January 30, 2026 01:09

[Slurm] Fix multi node task execution when proctrack/cgroup is enabled (skypilot-org#8720)

247061b

* add smoke test * improve test yaml * add fix * fix * add todo * check for ProctrackType * unit tests * improve comment

[Slurm] Only setup ssh keys and bashrc inside container when using co…

d948a93

…ntainers (skypilot-org#8754) * [Slurm] Only setup ssh keys and bashrc inside container when using containers * simplify conditionals

[Kubernetes] Fix error for the watch client (skypilot-org#8757)

fe4ef10

* fix error for the watch client * update ut

[Usage] update USAGE_MESSAGE_REDACT_KEYS (skypilot-org#8759)

4897e24

[Storage] Introduce --graceful flag for cluster operations (skypilo…

8bde36c

…t-org#8753)

[Dashboard] Fix service account expiry input not accepting 0 (skypilo…

28fdd59

…t-org#8678)

[Tests] Add no Remote server to Recipe Tests (skypilot-org#8767)

2c2b377

Skip dependency tests for plugin upload (skypilot-org#8768)

032d1a8

skip dependency tests for plugin upload

add yotta

8017afb

lazy import

d09986f

fix get_default_instance_type

95f8650

add disk size

cf576d2

add UNSUPPORTED_FEATURES

594bfa4

update endpoint

7cb70e5

update port example

6c80b80

fix pr comment

1decc40

fix tests/test_optimizer_dryruns.py::test_infer_cloud_from_region_or_…

00472eb

…zone - KeyError: 'AvailabilityZone'

fix comment

f1cf1c1

panf2333 force-pushed the add_yotta_cloud branch from 5b0878f to f1cf1c1 Compare February 2, 2026 13:57

panf2333 added 2 commits February 2, 2026 23:03

fix setup error

898b610

add empty dependencies fro yotta

b368795

panf2333 wants to merge 138 commits into master from add_yotta_cloud


20 participants
