
[k8s] Better error message for stale jobs controller #5274

Open
wants to merge 4 commits into base: master

Conversation

@kyuds (Contributor) commented Apr 18, 2025

Fixes #4268

I noticed that task.py is the only place that actually validates resources and where we can know for sure that the task in question is indeed the jobs controller.

After getting feedback, I changed the approach to refreshing and checking the cluster status in jobs.launch.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@romilbhardwaj requested a review from cg505, April 18, 2025 01:40

@cg505 (Collaborator) left a comment

We could also hit this in the case where the resources are just invalid, right? Is there some way we can check that the cluster already exists?
Also would love to figure out how to move this out of task.yaml, e.g. into jobs.launch. May require a new special exception type for invalid resources, or global_user_state checks. Maybe too hard.
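For illustration only (not part of this PR or of the review), a dedicated exception type along the lines suggested above could be sketched as follows; the class name and its fields are assumptions, not existing SkyPilot code:

```python
# Hypothetical sketch of a dedicated exception for invalid resources.
# The class name and the cluster_name field are assumptions for illustration.
class InvalidResourcesError(ValueError):
    """Raised when a task's requested resources cannot be satisfied."""

    def __init__(self, message: str, cluster_name: str = ''):
        super().__init__(message)
        # Recording the cluster name would let callers such as jobs.launch
        # distinguish "cluster no longer exists" from "resources are invalid".
        self.cluster_name = cluster_name
```

jobs.launch could then catch this specific type and append a hint asking the user to verify that the Kubernetes cluster or context still exists.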

@kyuds (Contributor, Author) commented Apr 19, 2025

So the existing behavior already throws an error when the resources are simply invalid (I run into this all the time: kind reports that it has no resources with 4+ CPUs). The only difference with this particular error message is that it also asks the user to check whether the Kubernetes cluster in question still exists.

> Also would love to figure out how to move this out of task.yaml, e.g. into jobs.launch. May require a new special exception type for invalid resources, or global_user_state checks. Maybe too hard.

Edit: I found a way to move this into jobs.launch, specifically using backend_utils.refresh_cluster_status_handle. Basically, I refresh the cluster status to see whether the cluster itself is alive, and if it is not, I alert the user with an appropriate message.
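A minimal sketch of that jobs.launch-side check, assuming refresh_cluster_status_handle returns a (status, handle) pair with status None when the cluster can no longer be found; the wrapper function name and the error type used here are illustrative, not the PR's actual code:

```python
from sky.backends import backend_utils


def _check_jobs_controller_alive(controller_name: str) -> None:
    """Surface a clearer error when the jobs controller cluster is stale."""
    # Assumption: refresh_cluster_status_handle returns (status, handle),
    # where status is None when the cluster record cannot be refreshed/found.
    status, _ = backend_utils.refresh_cluster_status_handle(controller_name)
    if status is None:
        raise RuntimeError(
            f'Jobs controller cluster {controller_name!r} could not be '
            'refreshed. Its Kubernetes context may be stale, or the cluster '
            'may no longer exist; please verify the cluster before retrying '
            'jobs.launch.')
```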

@kyuds requested a review from cg505, April 19, 2025 05:35
Successfully merging this pull request may close these issues.

[k8s] Jobs controller on stale context needs better error messages