
Restart Kueue if new CRDs become available #1058

Conversation

@ChristianZaccaria (Contributor) commented Jun 13, 2024

Many thanks for submitting your Pull Request ❤️!

Please make sure that your PR meets the following requirements:

  • You have read the contributors guide
  • Pull Request contains description of the issue
  • Pull Request contains link to the JIRA issue
  • Pull Request contains link to any dependent or related Pull Request

Description

  • Checks for the availability of all Ray OR Training Operator CRDs. If they are available, the Kueue pod is restarted and a new condition is set in the DSC.
  • Conversely, if the CRDs are removed, the condition is updated and Kueue continues to function.
  • Kueue will only restart when all the CRDs of Ray OR the Training Operator are available.

Kueue can function without those CRDs; however, it must be restarted to work with the other components, such as KubeRay and the Training Operator. A rough sketch of the mechanism is shown below.
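
For context, the mechanism can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: it assumes controller-runtime, and the `app.kubernetes.io/name: kueue` pod label is an assumption.

```go
package kueue

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	k8serr "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// allCRDsExist reports whether every named CRD is registered in the cluster.
func allCRDsExist(ctx context.Context, cli client.Client, names []string) (bool, error) {
	for _, name := range names {
		crd := &apiextensionsv1.CustomResourceDefinition{}
		if err := cli.Get(ctx, types.NamespacedName{Name: name}, crd); err != nil {
			if k8serr.IsNotFound(err) {
				return false, nil // at least one CRD is still missing
			}
			return false, err
		}
	}
	return true, nil
}

// restartKueue deletes the Kueue pods; their Deployment recreates them.
func restartKueue(ctx context.Context, cli client.Client, namespace string) error {
	return cli.DeleteAllOf(ctx, &corev1.Pod{},
		client.InNamespace(namespace),
		client.MatchingLabels{"app.kubernetes.io/name": "kueue"}) // label is an assumption
}
```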

JIRA issue: https://issues.redhat.com/browse/RHOAIENG-7887

How Has This Been Tested?

Steps:

  0. Create a repository in your image registry, e.g., Quay.io, and make the repository public.
  1. Build and push the image to the registry: make image -e IMG=quay.io/<username>/opendatahub-operator:007
  2. Deploy the operator: make deploy -e IMG=quay.io/<username>/opendatahub-operator:007 -e OPERATOR_NAMESPACE=opendatahub-operator-system
  3. Remove, for example, the RayCluster (Ray) and MPIJob (Training Operator) CRDs (delete commands are given after the apply commands below).
  4. Create the DSCI:

```yaml
kind: DSCInitialization
apiVersion: dscinitialization.opendatahub.io/v1
metadata:
  name: default-dsci
spec:
  devFlags:
    logmode: development
  applicationsNamespace: opendatahub-operator-system
  monitoring:
    managementState: Managed
    namespace: opendatahub-operator-system
  serviceMesh:
    controlPlane:
      metricsCollection: Istio
      name: data-science-smcp
      namespace: istio-system
    managementState: Managed
  trustedCABundle:
    customCABundle: ''
    managementState: Managed
```
  5. Create the DSC:

```yaml
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Managed
    dashboard:
      managementState: Managed
    datasciencepipelines:
      managementState: Managed
    kserve:
      managementState: Managed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Managed
    modelregistry:
      managementState: Managed
    ray:
      managementState: Removed
    trainingoperator:
      managementState: Removed
    trustyai:
      managementState: Managed
    workbenches:
      managementState: Managed
```
  6. There are three scenarios that can be tested here (with Kueue deployed):
  • RayCluster CRD not available - the condition in the DSC should be false.
  • RayCluster CRD becomes available - Kueue should restart and the condition in the DSC should be true.
  • RayCluster CRD initially available, then removed - the condition in the DSC should be false, and Kueue should not restart unless the CRD becomes available again.

The same applies to the Training Operator CRDs. A hypothetical example of the DSC condition is shown below.
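
For illustration only, the new DSC condition could look roughly like this (the type, reason, and message strings are hypothetical; the names actually used by the PR may differ):

```yaml
status:
  conditions:
    - type: KueueReady          # hypothetical condition type
      status: "False"
      reason: MissingCRDs       # hypothetical reason
      message: RayCluster CRD not available
```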

To apply the CRDs for testing:

```sh
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/crd/bases/ray.io_rayclusters.yaml
kubectl apply -f https://raw.githubusercontent.com/kubeflow/training-operator/master/manifests/base/crds/kubeflow.org_mpijobs.yaml
```
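
To remove them again (for step 3), the CRD names can be inferred from the manifest file names:

```sh
kubectl delete crd rayclusters.ray.io
kubectl delete crd mpijobs.kubeflow.org
```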

Screenshot or short clip:

[screenshot]

Merge criteria:

  • Commit messages are meaningful.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

@openshift-ci openshift-ci bot requested review from AjayJagan and zdtsw June 13, 2024 10:35
openshift-ci bot commented Jun 13, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zdtsw for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zdtsw zdtsw requested a review from VaishnaviHire June 13, 2024 10:59
@zdtsw (Member) commented Jun 13, 2024

@ChristianZaccaria I am not sure why this solution is not done in Kueue instead of in the Operator?

@ChristianZaccaria (Contributor, Author)

> @ChristianZaccaria I am not sure why this solution is not done in Kueue instead of in the Operator?

Do you mean in the Kueue repository?
If yes, I was advised to include this in this repository, as something similar is done with KServe. However, Kueue is not dependent on managed components. So I think you have a point; I don't see why this couldn't work in Kueue directly, now that you mention it. I'm open to thoughts on this. @VaishnaviHire @sutaakar

Ultimately, we are deleting the Kueue pod when the Ray OR Training Operator CRDs are available, after which the DSC brings the Kueue pod back.

@zdtsw (Member) commented Jun 13, 2024

> > @ChristianZaccaria I am not sure why this solution is not done in Kueue instead of in the Operator?
>
> Do you mean in the Kueue repository? If yes, I was advised to include this in this repository, as something similar is done with KServe. However, Kueue is not dependent on managed components. So I think you have a point; I don't see why this couldn't work in Kueue directly, now that you mention it. I'm open to thoughts on this. @VaishnaviHire @sutaakar
>
> Ultimately, we are deleting the Kueue pod when the Ray OR Training Operator CRDs are available, after which the DSC brings the Kueue pod back.

Yes, I was thinking that since Kueue has its own controller, it should already have some Watches() function; it gets easier if components can self-maintain these resources.
I think Kueue does not depend on the other components, but rather on these CRDs, which are originally applied by the other components.
In the ODH Operator it can be done as you suggested, but I would rather not build tight coupling between the Operator and any component.
Another reason is that if every component needs its own way of checking extra resources and all that customized logic is added to the Operator, the Operator code eventually becomes hard to maintain.

The main reason KServe has such checks for Service Mesh, Serverless, and Authorino is that these are dependent operators and are used as common platform resources. At the moment only KServe uses them, but later on other components will rely on some of these as well.

One more thing: if there is no dependency between Kueue and the Training Operator or Ray, should you reconsider how these CRDs are applied? Unless it is clear to the user that to use Kueue you must enable component X, a user can opt out of X and have only Kueue enabled.
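
To illustrate the Watches() idea mentioned above, here is a minimal sketch assuming controller-runtime v0.15+. The KueueCRDReconciler type and the controller name are hypothetical, not code from any existing repo:

```go
package kueue

import (
	"context"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// KueueCRDReconciler is a hypothetical reconciler reacting to CRD events.
type KueueCRDReconciler struct {
	client.Client
}

func (r *KueueCRDReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Restart or reconfigure Kueue here when a watched CRD appears or disappears.
	return ctrl.Result{}, nil
}

func (r *KueueCRDReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		Named("kueue-crd-watcher").
		// Watch CRD objects, filtered to the Ray / Training Operator CRDs.
		Watches(&apiextensionsv1.CustomResourceDefinition{},
			&handler.EnqueueRequestForObject{},
			builder.WithPredicates(predicate.NewPredicateFuncs(func(obj client.Object) bool {
				return obj.GetName() == "rayclusters.ray.io" ||
					obj.GetName() == "mpijobs.kubeflow.org"
			}))).
		Complete(r)
}
```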

@ChristianZaccaria (Contributor, Author) commented Jun 14, 2024

@zdtsw that makes sense to me. I'll attempt this solution in Kueue directly; I'm just wondering whether I should bring this to upstream Kueue or only to the downstream repos.

> One more thing: if there is no dependency between Kueue and the Training Operator or Ray, should you reconsider how these CRDs are applied? Unless it is clear to the user that to use Kueue you must enable component X, a user can opt out of X and have only Kueue enabled.

I think this is clear to the user; moreover, these components are set to Managed by default (eventually the Training Operator as well).

@zdtsw (Member) commented Jun 14, 2024

> @zdtsw that makes sense to me. I'll attempt this solution in Kueue directly; I'm just wondering whether I should bring this to upstream Kueue or only to the downstream repos.
>
> > One more thing: if there is no dependency between Kueue and the Training Operator or Ray, should you reconsider how these CRDs are applied? Unless it is clear to the user that to use Kueue you must enable component X, a user can opt out of X and have only Kueue enabled.
>
> I think this is clear to the user; moreover, these components are set to Managed by default (eventually the Training Operator as well).

Sure; by default we set these to Managed because they are GA components.
But a user can choose to configure some of them back to Removed, or, if they do not use the default way of creating the DSC, they can opt out of some. If the solution provided now is based on that assumption, I think it needs to be made clear in the documentation.

@sutaakar

@ChristianZaccaria ideally it should be addressed in the community.
One idea for a temporary workaround (until it is resolved in the community) could be to have a sidecar monitoring the CRDs and restarting the operator pod when a watched CRD becomes available.
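
As a sketch of that workaround idea (an illustration under stated assumptions, not an agreed design): a sidecar could poll for a watched CRD and, once it appears, delete its own pod via the API, relying on the Deployment to restart it. POD_NAME and POD_NAMESPACE are assumed to be injected via the downward API, and the CRD name is an example:

```go
package main

import (
	"context"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	scheme := runtime.NewScheme()
	_ = corev1.AddToScheme(scheme)
	_ = apiextensionsv1.AddToScheme(scheme)

	cli, err := client.New(ctrl.GetConfigOrDie(), client.Options{Scheme: scheme})
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	for {
		// Poll for the watched CRD (name is an illustrative example).
		crd := &apiextensionsv1.CustomResourceDefinition{}
		if err := cli.Get(ctx, types.NamespacedName{Name: "rayclusters.ray.io"}, crd); err == nil {
			// CRD appeared: delete our own pod so the Deployment restarts it.
			pod := &corev1.Pod{}
			pod.Name = os.Getenv("POD_NAME")           // injected via downward API
			pod.Namespace = os.Getenv("POD_NAMESPACE") // injected via downward API
			_ = cli.Delete(ctx, pod)
			return
		}
		time.Sleep(30 * time.Second)
	}
}
```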

@ChristianZaccaria (Contributor, Author)

@zdtsw Thanks Wen, you're right. We'll have this clearly documented for the users.

@ChristianZaccaria (Contributor, Author)

@sutaakar what do you think of re-enabling this PR, opendatahub-io/kueue#33, but for the community?

I'm not sure the community will agree to restarting the Kueue pod. So I think I can rework the mentioned PR to start the decoupled controller and webhook on CRD availability, and to restart them when new CRDs become available - similar to what is done in this PR, except that the Kueue pod doesn't restart; the job-framework controller and webhook do. WDYT?

@sutaakar

@ChristianZaccaria I think it is worth opening an issue in the Kueue GitHub repo, suggesting this approach there. Let's see what they think about it. WDYT?

@zdtsw (Member) commented Jun 14, 2024

@ChristianZaccaria if we do not need this PR, can we close it?

@ChristianZaccaria (Contributor, Author) commented Jun 14, 2024

Closing: the solution is to be implemented upstream in the Kueue community.
