Draft
138 commits
8f559cc
rsync: add --no-owner --no-group for both uploads and downloads (#8556)
Philmod Jan 15, 2026
abc30a0
[Slurm] Fix run_in_background option in CommandRunner (#8577)
kevinmingtarja Jan 15, 2026
97a7eb0
[Pools] Fix Secret Validation When Updating Pools (#8495)
lloyd-brown Jan 15, 2026
d61ea4e
[Dashboard] Add external link to Grafana in GPU Metrics section (#8599)
rohansonecha Jan 16, 2026
1e13c6b
Drop Python 3.7 and 3.8 support (#8489)
zpoint Jan 16, 2026
aa83d5c
[Helm] Allow RWX persistent storage with RollingUpdate upgrade strate…
Michaelvll Jan 16, 2026
ace7977
[Dashboard] Add waitForPlugins with requires_early_init support (#8584)
Michaelvll Jan 16, 2026
1ad2ed0
[Docs] Update GPU metrics docs to use Prometheus community chart (#8601)
rohansonecha Jan 16, 2026
6ad4552
Extend plugin for more functionality (#8602)
SeungjinYang Jan 16, 2026
0ff823b
[Volumes] Surface volume errors, revamp volume background refresh (#8…
romilbhardwaj Jan 17, 2026
a2779fb
[UX] Show cordon and taint info when showing gpu info (#8596)
DanielZhangQD Jan 18, 2026
f98bfb0
Plugin improvements to table column replacement (#8610)
SeungjinYang Jan 19, 2026
a17f57b
[Core] Support InfiniBand for Together AI (#8581)
DanielZhangQD Jan 19, 2026
4dfbe73
Do not persist sqlite db when rolling-update is enabled (#8607)
aylei Jan 19, 2026
ebedc74
[UX/Feature] Allowing an autostop hook that runs before the cluster a…
zpoint Jan 19, 2026
355fc52
Nightly build add job consolidation tests (#8159)
zpoint Jan 19, 2026
45deabc
Update Mistral docs link (#8623)
mk0walsk Jan 19, 2026
d63039c
[Vast] Fix SSH authentication by injecting SkyPilot public key into c…
liuwb Jan 19, 2026
b8e84a3
[Docs] Remove `sky status --k8s` from docs, update dashboard screensh…
romilbhardwaj Jan 19, 2026
adc86b9
[AWS] sky launch --infra aws -t p5e.48xlarge fails with Try specifyin…
otutukingsley Jan 19, 2026
2131001
[GCP] Feat Queued Resources (#8481)
m-braganca Jan 19, 2026
f51dcc7
[Auth] SSH key race condition in Lambda authentication setup (#8224)
atoniolo76 Jan 19, 2026
75ab2a2
[GKE] Avoid erroring out for unknown instance type when GKE autoscale…
Michaelvll Jan 19, 2026
3e79de0
[Volumes] Update volumes docs (#8612)
romilbhardwaj Jan 20, 2026
54e9048
Add examples for VeRL search (#8241)
Maknee Jan 20, 2026
ee3c3c3
Fix buildkite test plugin test support (#8625)
zpoint Jan 20, 2026
4206ef8
[Dashboard] Add glassy loading effect to infra page (#8624)
rohansonecha Jan 20, 2026
67fde52
Plugin url normalization (#8628)
SeungjinYang Jan 20, 2026
3526409
[Examples] Update dynamo example to work on k8s (#8582)
romilbhardwaj Jan 20, 2026
2ee6fe1
[lint] update mypy to 1.19.1 (#8613)
cg505 Jan 21, 2026
9cfbd9b
Fix flaky `test_container_logs_multinode_kubernetes` test (#8605)
zpoint Jan 21, 2026
c8c599c
[Docker] Better Handling for Docker Username (#8632)
kyuds Jan 21, 2026
b4d5204
[EFA] Auto EFA setup on EKS (#8557)
DanielZhangQD Jan 21, 2026
fb9a44b
[k8s] Documentation - Fix Line Number for ClusterRoleBinding and remo…
laimis9133 Jan 21, 2026
b28dd4c
[CLI] Fix usage of pandas>=3.0.0 in show-gpus (#8643)
kevinzwang Jan 21, 2026
54bfa00
Option to toggle plugin display in version info (#8640)
SeungjinYang Jan 21, 2026
c14239f
[CLI] Add hint for k8s nodes with GPU labels but zero GPU resources (…
kevinzwang Jan 21, 2026
8368df8
[Resources] Fix Accelerator Inference Issue with Resource Copy (#8648)
kyuds Jan 21, 2026
c9a2de1
Clusters Page and Plugin System Changes (#8611)
lloyd-brown Jan 21, 2026
a9bc93f
[CLI] Small refactor to avoid pandas 2.x issue (#8650)
nakinnubis Jan 22, 2026
21f308c
[Docs] Restrict overly permissive S3 IAM permissions (#8642)
Michaelvll Jan 22, 2026
a734930
[Core] Add NodeInfoSource extension point for cached Kubernetes node …
rohansonecha Jan 22, 2026
6a3214d
[API] Add compression to dashboard log downloads (#8626)
zpoint Jan 22, 2026
f352a76
[Catalog] Fix flaky github CI test failure `EmptyDataError` (#8633)
zpoint Jan 22, 2026
5c47213
Exit 1 if sky serve status fail (#8406)
zpoint Jan 22, 2026
589a8da
[auth] Add polling-based authentication for sky api login (#8590)
cg505 Jan 22, 2026
5ba6032
API server: honor auth user if present (#8654)
aylei Jan 22, 2026
3769e86
Update outdated documentation links (#8636)
isagi-y22 Jan 22, 2026
051cfaa
[Dashboard] Fix React hook dependency warnings and enforce zero warni…
cg505 Jan 22, 2026
7adfb97
Jobs Pagination Revamp (#8651)
lloyd-brown Jan 22, 2026
6a2aa19
[CI] Reduce format.sh output verbosity (#8652)
cg505 Jan 22, 2026
e56e672
[deprecate] remove lingering uses of X-Request-ID (#8107)
cg505 Jan 22, 2026
a65a990
[Core] Allow `ssh` even if ray cluster is in bad state (#8649)
kyuds Jan 22, 2026
9427783
[Config] Add config flag for pod resource limits (#8644)
Michaelvll Jan 23, 2026
c79a48d
[Test] Allow multiple `test_large_production_performance` to be run a…
zpoint Jan 23, 2026
349351d
[CLI] Add `--secret-file` option to cli (#8646)
kyuds Jan 23, 2026
c7d67a1
[jobs] Add JobGroup support for heterogeneous parallel workloads (#8456)
andylizf Jan 23, 2026
04004c9
[Docs] Add "Viewing GPU availability" section for Slurm (#8666)
kevinmingtarja Jan 23, 2026
06050e4
[Core] Add provision.install_conda config to disable conda install (#…
kevinzwang Jan 23, 2026
b11092d
Fix Pagination with Job Groups (#8664)
lloyd-brown Jan 23, 2026
baf60f2
[Test] Fix test_cluster_labels_in_status to actually fail on assertio…
cg505 Jan 23, 2026
7da24f2
[Slurm] Add test_slurm_storage_mounts_cached (#8670)
kevinmingtarja Jan 23, 2026
921caf2
[Dashboard] Round CPU and memory values to integers on infra page (#8…
rohansonecha Jan 24, 2026
75da63c
[Dashboard] Replace node count asterisk with warning icon for unreach…
rohansonecha Jan 24, 2026
e8c34b1
[Kubernetes] Fix _get_pod_termination_reason for containers with null…
kevinmingtarja Jan 24, 2026
84cbd12
[Docs] Update some NEW badges (#8679)
concretevitamin Jan 24, 2026
f0c38c4
feat(k8s): support multi-container (sidecar) SkyPilot pods (#8444)
php-workx Jan 24, 2026
9dce7fc
[k8s] Fix GPU labeller for L40S GPUs (#8593)
Michaelvll Jan 25, 2026
a1e55df
[Core] Auto-configure Windows SSH config when running in WSL (#8669)
Michaelvll Jan 25, 2026
905fe9d
[JobGroup] Fix networking on custom images without sudo (#8686)
romilbhardwaj Jan 26, 2026
4a06347
[Volume] Add unit tests for is_ephemeral PostgreSQL compatibility (#8…
zpoint Jan 26, 2026
b818986
[Kubernetes] Allow remote_identity override in task config (#8659)
zpoint Jan 26, 2026
d2d892a
[Docs] Improve jobs group docs (#8688)
Michaelvll Jan 26, 2026
8b21094
Improvement for dashboard, chart, and system user role (#8685)
DanielZhangQD Jan 26, 2026
f78d0a8
[Dashboard] Improve cluster detail and slurm context detail pages (#8…
DanielZhangQD Jan 26, 2026
60d0ec6
[RBAC] Collect RBAC rules for plugins before initializing permission …
DanielZhangQD Jan 26, 2026
d351f07
[Dashboard] Show Nodes as primary section on context detail page (#8675)
rohansonecha Jan 26, 2026
20965e8
Add greenlet to install_requires for sqlalchemy asyncio support (#8653)
oelachqar Jan 26, 2026
56f689f
[Pools] Improve Concurrent Job Launch (#7891)
lloyd-brown Jan 27, 2026
4a10950
[Test] Update Lambda smoke test to use A100 instead of A10 (#8706)
zpoint Jan 27, 2026
9c40936
[Test] Fix flaky test_aws_manual_restart_recovery (#8708)
zpoint Jan 27, 2026
1a9ea86
[Kubernetes] Increase WebSocket open timeout for SSH proxy (#8707)
zpoint Jan 27, 2026
0de4ccf
[Test] Add `no_auto_retry` marker for configuring Buildkite CI (#8687)
kevinmingtarja Jan 27, 2026
4f41697
[Slurm] Handle missing config file in slurm_node_info() (#8714)
kevinmingtarja Jan 27, 2026
b08b088
Fix Postgres Issue with Distinct (#8716)
lloyd-brown Jan 27, 2026
e941533
[Jobs][Dashboard] Fix JobGroup logs not displaying due to str/int ser…
cg505 Jan 27, 2026
0f984e3
[Slurm] Fix auth section on cluster yaml and generated local private …
kevinmingtarja Jan 27, 2026
eff6d58
[Nebius] Enable and configure UFW to fix CVE-2023-48022 (#8627)
SalikovAlex Jan 27, 2026
3205dbc
Fix External Link Support (#8717)
lloyd-brown Jan 27, 2026
e081fcc
[Slurm] Add support for pyxis/enroot containers (#8604)
kevinmingtarja Jan 28, 2026
6a7fe1c
[AWS] Change g7e accelerator name in catalog to RTXPRO6000 (#8721)
kevinzwang Jan 28, 2026
1e8de6d
[API Server] Fix Zip Slip vulnerability in /upload endpoint (#8723)
kevinmingtarja Jan 28, 2026
8ad4702
[Azure] Fix blobfuse2 mounting on Debian 13 (trixie) (#8730)
zpoint Jan 28, 2026
1173770
[Test] Fix test_skyserve_llm by Hard Coding Transformers Library (#8732)
lloyd-brown Jan 28, 2026
efd24b4
[AWS][Provisioner] Allow failover when vpc not found in region (#8734)
kyuds Jan 28, 2026
9ae175e
[Test] Fix logging errors and speed up optimizer CI tests (15min → 7…
zpoint Jan 29, 2026
cc8cc21
Plugin support for controllers (#8700)
SeungjinYang Jan 29, 2026
679fb7c
[Chart] Support disabling basic auth middleware (#8694)
DanielZhangQD Jan 29, 2026
d49ea0f
[Slurm] Add SSH support for containers (#8609)
kevinmingtarja Jan 29, 2026
6c03b30
[Test] Fix smoke test for kubernetes clusters with no GPUs and pools …
kevinmingtarja Jan 29, 2026
75fe5eb
[Test] Fail instead of skip when kubernetes cluster has no GPUs for G…
kevinmingtarja Jan 29, 2026
730a6d4
[Volume] Fail fast for not ready volumes (#8739)
DanielZhangQD Jan 29, 2026
ef25877
[Kubernetes] Fix service leak issue on k8s (#8745)
DanielZhangQD Jan 29, 2026
e28c6cf
[Slurm] Allow ssh-agent and default keys fallback for Slurm clusters …
kevinmingtarja Jan 29, 2026
f6da2c0
[AWS] Support multiple VPCs with failover (#8722)
kyuds Jan 29, 2026
fecaf90
[Catalog] Add local disk info to aws catalog (#8661)
kyuds Jan 29, 2026
46500d3
[k8s] fix race condition in k8s client construction (#8705)
cg505 Jan 30, 2026
58b3fb2
[PG] Use NullPool for PG async engines (#8725)
DanielZhangQD Jan 30, 2026
37f7ee0
[Dashboard] Add GPU metrics for managed jobs and job groups (#8718)
rohansonecha Jan 30, 2026
247061b
[Slurm] Fix multi node task execution when proctrack/cgroup is enable…
kevinmingtarja Jan 30, 2026
d948a93
[Slurm] Only setup ssh keys and bashrc inside container when using c…
kevinmingtarja Jan 30, 2026
49b4d21
[Dev] Add Cursor worktrees.json for automated workspace setup (#8748)
kevinzwang Jan 30, 2026
cb813c0
[CI] Fix critical OpenSSL vulnerability in Docker image (#8756)
kevinzwang Jan 30, 2026
fe4ef10
[Kubernetes] Fix error for the watch client (#8757)
DanielZhangQD Jan 30, 2026
270e643
Introduce proxy auth support (#8751)
aylei Jan 30, 2026
4897e24
[Usage] update USAGE_MESSAGE_REDACT_KEYS (#8759)
kevinmingtarja Jan 30, 2026
3e7f1eb
[Core] Add SKYPILOT_USER env var for jobs (#8747)
kevinzwang Jan 30, 2026
41b6a6f
Force delete pods during cluster launch if pods from previous cluster…
SeungjinYang Jan 30, 2026
a002c57
[Test] Fix flaky test_kubernetes_context_failover (#8760)
zpoint Jan 30, 2026
71c5163
[Pools] Add Support for Autoscaling (#8483)
lloyd-brown Jan 30, 2026
8bde36c
[Storage] Introduce `--graceful` flag for cluster operations (#8753)
kyuds Jan 31, 2026
fcba4c0
[Recipes] Initial Version (#8755)
lloyd-brown Jan 31, 2026
28fdd59
[Dashboard] Fix service account expiry input not accepting 0 (#8678)
romilbhardwaj Jan 31, 2026
195c726
[Test] Improve flaky test_interactive_auth_via_pty_and_unix_socket (#…
jason810496 Feb 1, 2026
2c2b377
[Tests] Add no Remote server to Recipe Tests (#8767)
lloyd-brown Feb 2, 2026
032d1a8
Skip dependency tests for plugin upload (#8768)
SeungjinYang Feb 2, 2026
8017afb
add yotta
panf2333 Jan 16, 2026
d09986f
lazy import
panf2333 Jan 19, 2026
95f8650
fix get_default_instance_type
panf2333 Jan 19, 2026
cf576d2
add disk size
panf2333 Jan 20, 2026
594bfa4
add UNSUPPORTED_FEATURES
panf2333 Jan 21, 2026
7cb70e5
update endpoint
panf2333 Jan 21, 2026
6c80b80
update port example
panf2333 Jan 22, 2026
1decc40
fix pr comment
panf2333 Jan 22, 2026
00472eb
fix tests/test_optimizer_dryruns.py::test_infer_cloud_from_region_or_…
panf2333 Jan 27, 2026
f1cf1c1
fix comment
panf2333 Feb 2, 2026
898b610
fix setup error
panf2333 Feb 2, 2026
b368795
add empty dependencies fro yotta
panf2333 Feb 3, 2026
43 changes: 31 additions & 12 deletions .buildkite/generate_pipeline.py
@@ -146,6 +146,7 @@ def _parse_args(args: Optional[str] = None):
parser.add_argument('--grpc', action="store_true")
parser.add_argument('--env-file')
parser.add_argument('--plugin-yaml')
parser.add_argument('--submodule-base-branch')
parser.add_argument('--dependency', nargs='?', const='', default='all')

parsed_args, _ = parser.parse_known_args(args_list)
@@ -190,6 +191,11 @@ def _parse_args(args: Optional[str] = None):
extra_args.append('--grpc')
if parsed_args.env_file:
extra_args.append(f'--env-file {parsed_args.env_file}')
if parsed_args.plugin_yaml:
extra_args.append(f'--plugin-yaml {parsed_args.plugin_yaml}')
if parsed_args.submodule_base_branch:
extra_args.append(
f'--submodule-base-branch {parsed_args.submodule_base_branch}')
if parsed_args.dependency != 'all':
space = ' ' if parsed_args.dependency else ''
extra_args.append(f'--dependency{space}{parsed_args.dependency}')
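
Note on the hunk above: the forwarding logic can be exercised in isolation. The sketch below is a minimal standalone approximation, assuming only the three options visible in this diff. `forward_args` is a hypothetical name used for illustration and is not a function in `generate_pipeline.py`. It shows how `nargs='?'` with `const=''` lets `--dependency` distinguish "flag absent" (default `'all'`) from "flag given without a value" (`''`).

```python
import argparse
from typing import List, Optional


def forward_args(argv: Optional[List[str]] = None) -> List[str]:
    """Re-serialize selected options so they can be appended to a command.

    Hypothetical helper; mirrors the pattern in the hunk above.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--plugin-yaml')
    parser.add_argument('--submodule-base-branch')
    # Absent -> 'all'; present without a value -> ''; present with a value
    # -> that value.
    parser.add_argument('--dependency', nargs='?', const='', default='all')
    # Unknown options are ignored rather than raising an error.
    parsed, _ = parser.parse_known_args(argv)

    extra: List[str] = []
    if parsed.plugin_yaml:
        extra.append(f'--plugin-yaml {parsed.plugin_yaml}')
    if parsed.submodule_base_branch:
        extra.append(
            f'--submodule-base-branch {parsed.submodule_base_branch}')
    if parsed.dependency != 'all':
        # Only emit a separating space when a value was actually supplied.
        space = ' ' if parsed.dependency else ''
        extra.append(f'--dependency{space}{parsed.dependency}')
    return extra
```

With this sketch, a bare `--dependency` is forwarded without a trailing space, while `--dependency aws` is forwarded with its value.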
@@ -198,8 +204,9 @@ def _parse_args(args: Optional[str] = None):


def _extract_marked_tests(
-file_path: str, args: str
-) -> Dict[str, Tuple[List[str], List[str], List[Optional[str]]]]:
+file_path: str, args: str
+) -> Dict[str, Tuple[List[str], List[str], List[Optional[str]], List[str],
+List[bool]]]:
"""Extract test functions and filter clouds using pytest.mark
from a Python test file.

@@ -212,6 +219,10 @@ def _extract_marked_tests(
and run for hours. This makes it hard to visualize the test results and
rerun failures. Additionally, the parallelism would be controlled by pytest
instead of the buildkite job queue.

Returns:
Dict mapping function_name to tuple of:
(clouds, queues, params, extra_args, no_auto_retry_flags)
"""
# Args are already in the format pytest expects (cloud names like --lambda)
cmd = f'pytest {file_path} --collect-only {args}'
@@ -259,6 +270,7 @@ def _extract_marked_tests(
run_on_cloud_kube_backend = ('resource_heavy' in marks and
'kubernetes' in default_clouds_to_run)
benchmark_test = 'benchmark' in marks
no_auto_retry = 'no_auto_retry' in marks

for mark in marks:
if mark not in PYTEST_TO_CLOUD_KEYWORD:
@@ -302,20 +314,19 @@ def _extract_marked_tests(
for cloud in final_clouds_to_include
], param_list, [
extra_args for _ in range(len(final_clouds_to_include))
-])
+], [no_auto_retry for _ in range(len(final_clouds_to_include))])

return function_cloud_map


-def _generate_pipeline(test_file: str,
-args: str,
-auto_retry: bool = False) -> Dict[str, Any]:
+def _generate_pipeline(test_file: str, args: str) -> Dict[str, Any]:
"""Generate a Buildkite pipeline from test files."""
steps = []
generated_steps_set = set()
function_cloud_map = _extract_marked_tests(test_file, args)
for test_function, clouds_queues_param in function_cloud_map.items():
-for cloud, queue, param, extra_args in zip(*clouds_queues_param):
+for cloud, queue, param, extra_args, no_auto_retry in zip(
+*clouds_queues_param):
label = f'{test_function} on {cloud}'
command = f'pytest {test_file}::{test_function} --{cloud}'
if param:
@@ -328,6 +339,7 @@ def _generate_pipeline(test_file: str,
continue
if 'PYTHON_VERSION' in os.environ:
command = f'PYTHONPATH="$PWD:$PYTHONPATH" {command}'

step = {
'label': label,
'command': command,
@@ -338,7 +350,15 @@ def _generate_pipeline(test_file: str,
'queue': queue
}
}
-if auto_retry:
+if no_auto_retry:
+# Disable automatic retries but allow manual retries.
+step['retry'] = {
+'automatic': False,
+'manual': {
+'allowed': True
+}
+}
+else:
step['retry'] = {
# Automatically retry 2 times on any failure by default.
'automatic': True
@@ -391,7 +411,7 @@ def _convert_release(test_files: List[str], args: str, trigger_command: str):
output_file_pipelines = []
for test_file in test_files:
print(f'Converting {test_file} to {yaml_file_path}')
-pipeline = _generate_pipeline(test_file, args, auto_retry=True)
+pipeline = _generate_pipeline(test_file, args)
output_file_pipelines.append(pipeline)
print(f'Converted {test_file} to {yaml_file_path}\n\n')
# Enable all clouds by default for release pipeline.
@@ -462,11 +482,10 @@ def _convert_quick_tests_core(test_files: List[str], args: str,
branch != 'master'):
continue
pipeline = _generate_pipeline(test_file,
-args + f' --base-branch {branch}',
-auto_retry=True)
+args + f' --base-branch {branch}')
output_file_pipelines.append(pipeline)
else:
-pipeline = _generate_pipeline(test_file, args, auto_retry=True)
+pipeline = _generate_pipeline(test_file, args)
output_file_pipelines.append(pipeline)
output_file_pipelines.append(pipeline)
print(f'Converted {test_file} to {yaml_file_path}\n\n')
_dump_pipeline_to_file(yaml_file_path,
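
Note on the retry change in this file: the `auto_retry` parameter is removed, so every generated step gets automatic retries unless the test carries the `no_auto_retry` mark. A minimal sketch of that decision follows. `retry_policy` and `build_step` are hypothetical names for illustration, and the `agents` key is an assumption about the surrounding step layout that this diff only partly shows.

```python
from typing import Any, Dict


def retry_policy(no_auto_retry: bool) -> Dict[str, Any]:
    """Return the `retry` block for one step (hypothetical helper)."""
    if no_auto_retry:
        # Disable automatic retries but still allow a manual retry.
        return {'automatic': False, 'manual': {'allowed': True}}
    # Default path: automatic retries stay enabled.
    return {'automatic': True}


def build_step(label: str,
               command: str,
               queue: str,
               no_auto_retry: bool = False) -> Dict[str, Any]:
    """Assemble a step dict shaped like the ones in the diff above."""
    return {
        'label': label,
        'command': command,
        # Assumption: the queue is nested under 'agents'.
        'agents': {
            'queue': queue
        },
        'retry': retry_policy(no_auto_retry),
    }
```

Keeping the policy in one small function would make the two branches easy to unit test without running `pytest --collect-only`, although the PR itself inlines the logic.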
55 changes: 55 additions & 0 deletions .buildkite/test_buildkite_pipeline_generation.py
@@ -128,6 +128,61 @@ def _extract_test_names_from_pipeline(pipeline_path):
return test_names


def _extract_steps_from_pipeline(pipeline_path):
"""Extract all steps from a pipeline YAML file."""
with open(pipeline_path, 'r') as f:
pipeline = yaml.safe_load(f)

all_steps = []
for group in pipeline['steps']:
if 'steps' in group:
all_steps.extend(group['steps'])
else:
all_steps.append(group)
return all_steps


def test_no_auto_retry_marker():
"""Test that no_auto_retry marker works correctly.

This test uses the actual test_kubernetes_container_status_unknown_status_refresh
test which has the marker applied.
"""
# Generate pipeline for the specific test
env = dict(os.environ)
env['PYTHONPATH'] = f"{pathlib.Path.cwd()}/tests:{env.get('PYTHONPATH', '')}"

subprocess.run([
'python', '.buildkite/generate_pipeline.py', '--args', '--kubernetes',
'--file_pattern', 'test_cluster_job'
],
env=env,
check=True)

# Check the generated pipeline
pipeline_path = pathlib.Path('.buildkite/pipeline_smoke_tests_release.yaml')
steps = _extract_steps_from_pipeline(pipeline_path)

# Find steps for test_kubernetes_container_status_unknown_status_refresh
target_steps = [
s for s in steps
if 'test_kubernetes_container_status_unknown_status_refresh' in s.get(
'label', '')
]

# Should have exactly 1 step
assert len(target_steps) == 1, \
f"Expected 1 step, got {len(target_steps)}"

# Verify no_auto_retry is applied
step = target_steps[0]
retry = step.get('retry', {})
assert retry.get('automatic') is False, \
f"no_auto_retry step should have automatic=False: {retry}"
assert retry.get('manual', {}).get('allowed') is True, \
f"no_auto_retry step should allow manual retry: {retry}"


@pytest.mark.parametrize('args', [
'',
'--aws',
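
Note on the new test: the flattening and retry assertions can be checked without generating a pipeline file. The sketch below is an illustration under assumptions. It uses an in-memory dict in place of the YAML file, and the labels in `sample_pipeline` are made up for the example.

```python
from typing import Any, Dict, List


def flatten_steps(pipeline: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten grouped and ungrouped steps into a single list."""
    all_steps: List[Dict[str, Any]] = []
    for group in pipeline['steps']:
        if 'steps' in group:
            # A group entry: take its nested steps.
            all_steps.extend(group['steps'])
        else:
            # A plain step at the top level.
            all_steps.append(group)
    return all_steps


# Hypothetical pipeline: one group with two steps plus one top-level step.
sample_pipeline = {
    'steps': [
        {
            'group': 'smoke',
            'steps': [
                {'label': 'test_a on kubernetes',
                 'retry': {'automatic': True}},
                {'label': 'test_b on kubernetes',
                 'retry': {'automatic': False,
                           'manual': {'allowed': True}}},
            ],
        },
        {'label': 'test_c on aws', 'retry': {'automatic': True}},
    ]
}

steps = flatten_steps(sample_pipeline)
# Labels of steps whose automatic retries are explicitly disabled.
manual_only = [
    s['label'] for s in steps
    if s.get('retry', {}).get('automatic') is False
]
```

Comparing with `is False` rather than truthiness matters here, since a step with no `retry` key would otherwise be counted as having automatic retries disabled.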
8 changes: 8 additions & 0 deletions .cursor/worktrees.json
@@ -0,0 +1,8 @@
{
"setup-worktree": [
"uv venv --seed --python 3.11",
"uv pip install -e \".[all]\" --prerelease=allow",
"uv pip install -r requirements-dev.txt",
"npm --prefix sky/dashboard install && npm --prefix sky/dashboard run build"
]
}
2 changes: 1 addition & 1 deletion .github/workflows/compile-protos-check.yml
@@ -18,7 +18,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
-python-version: ["3.8"]
+python-version: ["3.9"]
steps:
- uses: actions/checkout@v3
- name: Install the latest version of uv
2 changes: 1 addition & 1 deletion .github/workflows/format.yml
@@ -18,7 +18,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
-python-version: ["3.8"]
+python-version: ["3.9"]
steps:
- uses: actions/checkout@v3
- name: Install the latest version of uv
2 changes: 1 addition & 1 deletion .github/workflows/mypy.yml
@@ -18,7 +18,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
-python-version: ["3.8"]
+python-version: ["3.9"]
steps:
- uses: actions/checkout@v3
- name: Install the latest version of uv