[k8s] Better error message for stale jobs controller #5274

Merged: 7 commits, Apr 25, 2025
Changes from 3 commits
19 changes: 18 additions & 1 deletion sky/task.py
@@ -22,6 +22,7 @@
 from sky.serve import service_spec
 from sky.skylet import constants
 from sky.utils import common_utils
+from sky.utils import controller_utils
 from sky.utils import schemas
 from sky.utils import ux_utils

@@ -323,7 +324,23 @@ def validate(self, workdir_only: bool = False):
         if not workdir_only:
             self.expand_and_validate_file_mounts()
         for r in self.resources:
-            r.validate()
+            try:
+                r.validate()
+            except ValueError as e:
+                if self.managed_job_dag is not None:
+                    # this task is a jobs controller
+                    cluster_name = (controller_utils.Controllers.
+                                    JOBS_CONTROLLER.value.cluster_name)
+                    logger.warning(
+                        f'{colorama.Fore.YELLOW}Failed to validate jobs '
+                        f'controller resource {r.repr_with_region_zone}.'
+                        '\nIf the cluster exists, please check and allow '
+                        'SkyPilot to connect. If this was a '
+                        'previously launched cluster that was '
+                        'taken down or removed, consider removing the '
+                        'cluster from SkyPilot with:\n\n`sky down '
+                        f'{cluster_name} --purge`{colorama.Style.RESET_ALL}')
+                raise e

     def validate_name(self):
         """Validates if the task name is valid."""
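
For readers skimming the diff, the change boils down to a "catch, log a hint, re-raise" pattern around `r.validate()`. The following is a minimal standalone sketch of that pattern, not the SkyPilot implementation: `Resource`, `validate_all`, the `is_jobs_controller` flag, and the cluster name passed in are all hypothetical stand-ins, and plain `logging` replaces the colorama-styled logger used in `sky/task.py`.

```python
import logging

logger = logging.getLogger(__name__)


class Resource:
    """Hypothetical stand-in for a resource object; not the SkyPilot class."""

    def __init__(self, name: str, reachable: bool = True):
        self.name = name
        self.reachable = reachable

    def validate(self) -> None:
        # Simulate a validation failure, e.g. a cluster that no longer exists.
        if not self.reachable:
            raise ValueError(f'Cannot validate resource {self.name}.')


def validate_all(resources, is_jobs_controller: bool,
                 cluster_name: str) -> None:
    """Validate each resource; log a recovery hint before re-raising."""
    for r in resources:
        try:
            r.validate()
        except ValueError:
            if is_jobs_controller:
                # Only the controller case gets the extra hint; the original
                # error still reaches the caller unchanged.
                logger.warning(
                    'Failed to validate jobs controller resource %s.\n'
                    'If the cluster was taken down or removed, consider:\n\n'
                    '`sky down %s --purge`', r.name, cluster_name)
            raise
```

The sketch uses a bare `raise` where the diff writes `raise e`; the same exception propagates either way, so callers that already handle `ValueError` are unaffected by the added warning.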