
[k8s] Better error message for stale jobs controller #5274

Open
wants to merge 4 commits into base: master

Conversation

@kyuds (Contributor) commented Apr 18, 2025

Fixes #4268

I noticed that task.py is the only place that actually validates resources and where we can know for sure that the task in question is indeed the jobs controller.

After getting feedback, I changed the approach to refreshing and checking the cluster status in jobs.launch.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@romilbhardwaj requested a review from cg505, April 18, 2025 01:40

@cg505 (Collaborator) left a comment

We could also hit this in the case where the resources are just invalid, right? Is there some way we can check that the cluster already exists?
Also would love to figure out how to move this out of task.yaml, e.g. into jobs.launch. May require a new special exception type for invalid resources, or global_user_state checks. Maybe too hard.
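For illustration only (not part of this PR or of the review), a dedicated exception type along the lines suggested above could be sketched as follows; the class name and its fields are assumptions, not existing SkyPilot code:

```python
# Hypothetical sketch of a dedicated exception for invalid resources.
# The class name and the cluster_name field are assumptions for illustration.
class InvalidResourcesError(ValueError):
    """Raised when a task's requested resources cannot be satisfied."""

    def __init__(self, message: str, cluster_name: str = ''):
        super().__init__(message)
        # Recording the cluster name would let callers such as jobs.launch
        # distinguish "cluster no longer exists" from "resources are invalid".
        self.cluster_name = cluster_name
```

jobs.launch could then catch this specific type and append a hint asking the user to verify that the Kubernetes cluster or context still exists.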

@kyuds (Contributor, Author) commented Apr 19, 2025

So the existing behavior already throws an error when the resources are simply invalid (I run into this all the time: kind reports that it has no resources with 4+ CPUs). The only difference with this particular error message is that it also asks the user to check whether the Kubernetes cluster in question still exists.

> Also would love to figure out how to move this out of task.yaml, e.g. into jobs.launch. May require a new special exception type for invalid resources, or global_user_state checks. Maybe too hard.

Edit: I found a way to move this into jobs.launch, specifically using backend_utils.refresh_cluster_status_handle. Basically, I refresh the cluster status to see whether the cluster itself is alive, and if it is not, I alert the user with an appropriate message.
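A minimal sketch of that jobs.launch-side check, assuming refresh_cluster_status_handle returns a (status, handle) pair with status None when the cluster can no longer be found; the wrapper function name and the error type used here are illustrative, not the PR's actual code:

```python
from sky.backends import backend_utils


def _check_jobs_controller_alive(controller_name: str) -> None:
    """Surface a clearer error when the jobs controller cluster is stale."""
    # Assumption: refresh_cluster_status_handle returns (status, handle),
    # where status is None when the cluster record cannot be refreshed/found.
    status, _ = backend_utils.refresh_cluster_status_handle(controller_name)
    if status is None:
        raise RuntimeError(
            f'Jobs controller cluster {controller_name!r} could not be '
            'refreshed. Its Kubernetes context may be stale, or the cluster '
            'may no longer exist; please verify the cluster before retrying '
            'jobs.launch.')
```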

@kyuds requested a review from cg505, April 19, 2025 05:35
Successfully merging this pull request may close these issues.

[k8s] Jobs controller on stale context needs better error messages