Restart Kueue if new CRDs become available #1058
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@ChristianZaccaria I am not sure why this solution is not done in Kueue instead of moving it into the Operator.
Do you mean in the Kueue repository? Ultimately, we are deleting the Kueue pod when the Ray or Training Operator CRDs become available, after which the DSC brings the Kueue pod back.
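The restart trigger described above can be sketched as a comparison between the CRDs seen previously and the CRDs seen now. This is a minimal sketch, not the code in this PR: the `watchedCRDs` set, the function names, and the choice of CRD names are assumptions for illustration, and the actual pod deletion is left out.

```go
package main

import "sort"

// watchedCRDs is an illustrative set of CRD names whose appearance would
// trigger a Kueue restart. The selection is an assumption for this sketch.
var watchedCRDs = map[string]bool{
	"rayjobs.ray.io":           true,
	"rayclusters.ray.io":       true,
	"pytorchjobs.kubeflow.org": true,
}

// newlyAvailable returns the watched CRDs that are present in current but
// absent from previous, sorted by name.
func newlyAvailable(previous, current []string) []string {
	seen := make(map[string]bool, len(previous))
	for _, name := range previous {
		seen[name] = true
	}
	var added []string
	for _, name := range current {
		if watchedCRDs[name] && !seen[name] {
			added = append(added, name)
			seen[name] = true // ignore duplicates within current
		}
	}
	sort.Strings(added)
	return added
}

// shouldRestartKueue reports whether at least one watched CRD has become
// available since the previous observation.
func shouldRestartKueue(previous, current []string) bool {
	return len(newlyAvailable(previous, current)) > 0
}
```

A caller would record the CRD list on each reconcile and delete the Kueue pod only when `shouldRestartKueue` returns true, so unrelated CRDs never cause a restart.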
Yes, I was thinking that since Kueue has its own controller, it should already have some Watches() function. It gets easier if they can self-maintain these resources. The main reason KServe has such a check for Service Mesh, Serverless, and Authorino is that these are dependent operators and will be used as common platform resources. At the moment, only KServe is using them, but later on other components will rely on some of these as well. One more thing: if there is no dependency between Kueue and TO or Ray, should you reconsider how these CRDs should be applied?
@zdtsw that makes sense to me. I'll attempt this solution on Kueue directly; I'm just wondering whether I should bring this to upstream Kueue or only to the downstream repos.
I think this is clear to the user, moreover, these components are set to |
Sure, by default we set these to Managed because they are GA components.
@ChristianZaccaria ideally, it should be addressed in the community.
@zdtsw Thanks Wen, you're right. We'll have this clearly documented for the users.
@sutaakar what do you think of re-enabling this PR: opendatahub-io/kueue#33, but for the community? I'm not sure the community will agree to restarting the Kueue pod. So I think I can rework the mentioned PR to start the decoupled controller and webhook on CRD availability, and restart them when new CRDs become available. This is similar to what is done in this PR, except that the Kueue pod doesn't restart, but the job framework controller and webhook do. WDYT?
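The alternative proposed above, starting the job framework controller and webhook when a CRD shows up instead of restarting the whole pod, could look roughly like this. It is only a sketch under stated assumptions: `frameworkManager`, `newFrameworkManager`, `onCRDAvailable`, and the `start` hook are hypothetical names, not Kueue APIs, and the hook stands in for whatever actually wires up the controller and webhook.

```go
package main

import "sync"

// frameworkManager tracks which job-framework integrations have already been
// started. All names in this type are hypothetical.
type frameworkManager struct {
	mu      sync.Mutex
	started map[string]bool
	// start is a caller-supplied hook that would set up the controller and
	// webhook for the framework backed by the given CRD.
	start func(crd string) error
}

func newFrameworkManager(start func(crd string) error) *frameworkManager {
	return &frameworkManager{started: make(map[string]bool), start: start}
}

// onCRDAvailable starts the integration for crd the first time it is seen.
// It returns true only if this call started the integration; repeated
// notifications for the same CRD are no-ops. A failed start is not recorded,
// so a later notification can retry it.
func (m *frameworkManager) onCRDAvailable(crd string) (bool, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.started[crd] {
		return false, nil
	}
	if err := m.start(crd); err != nil {
		return false, err
	}
	m.started[crd] = true
	return true, nil
}
```

Keeping the started set behind a mutex makes the handler safe to call from concurrent CRD events, and the manager process itself keeps running throughout.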
@ChristianZaccaria I think it is worth opening an issue in the Kueue GitHub repository suggesting this approach. Let's see what they think about it. WDYT?
@ChristianZaccaria if we do not need this PR, can we close it?
Closing: Solution to be implemented upstream in Kueue community.
Description
Kueue can function without those CRDs; however, it must be restarted to work with other components such as KubeRay and Training Operator.
JIRA issue:
Jira: https://issues.redhat.com/browse/RHOAIENG-7887
How Has This Been Tested?
Steps:
0. Create a repository in your image registry, e.g., Quay.io, and make the repository public.
make image -e IMG=quay.io/<username>/opendatahub-operator:007
Kueue has been deployed.
Same applies for the Training Operator CRDs.
To apply the CRDs for testing:
Screenshot or short clip:
Merge criteria: