OCPBUGS-57372: Wait for 2 cp nodes before starting TNF jobs #1431
Conversation
|
@slintes: This pull request references Jira Issue OCPBUGS-57372, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
pkg/tnf/operator/starter.go
Outdated
runTnfAuthJobController(ctx, node.GetName(), controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces)
runTnfAfterSetupJobController(ctx, node.GetName(), controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces)
// ensure we have both control plane nodes before creating jobs
nodeList, err := kubeClient.CoreV1().Nodes().List(ctx, metav1.ListOptions{
I know it's premature optimization, but you can also create a lister on top of the informer indexer:
import corev1listers "k8s.io/client-go/listers/core/v1"
controlPlaneNodeLister := corev1listers.NewNodeLister(controlPlaneNodeInformer.GetIndexer())
That way you also save yourself the additional synchronous list call.
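For illustration only, a minimal sketch of the lister approach being suggested here, assuming the informer already watches only control plane nodes; the package and helper name are made up and not part of this PR:

```go
package operator

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	corev1listers "k8s.io/client-go/listers/core/v1"
	"k8s.io/client-go/tools/cache"
)

// listControlPlaneNodesFromCache builds a lister over the informer's indexer and
// reads nodes from the local cache instead of issuing a synchronous API list call.
// The informer is assumed to already be restricted to control plane nodes.
func listControlPlaneNodesFromCache(informer cache.SharedIndexInformer) ([]*corev1.Node, error) {
	lister := corev1listers.NewNodeLister(informer.GetIndexer())
	return lister.List(labels.Everything())
}
```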
Thanks for the hint with the lister 👍🏼 I don't see, though, how that saves the synchronous list call; it's just a nicer way to do that call, isn't it?
klog.Info("not starting TNF jobs yet, waiting for 2 control plane nodes to exist")
return
}
// we can have 2 nodes on the first call of AddFunc already, ensure we create job controllers once only
I know I've pestered you with retries and idempotency before, but do you think this overall construct is still safe?
Why not just poll until the list informer has two entries and then just set up everything at once? Might simplify this a lot...
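For illustration, a rough sketch of what that polling alternative could look like; the package, helper name, and interval are assumptions, not part of the PR:

```go
package operator

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/util/wait"
	corev1listers "k8s.io/client-go/listers/core/v1"
)

// waitForTwoControlPlaneNodes polls the informer-backed lister until at least two
// control plane nodes exist, then returns so the caller can set everything up once.
func waitForTwoControlPlaneNodes(ctx context.Context, lister corev1listers.NodeLister) error {
	return wait.PollUntilContextCancel(ctx, 10*time.Second, true, func(ctx context.Context) (bool, error) {
		nodes, err := lister.List(labels.Everything())
		if err != nil {
			// treat cache errors as transient and keep polling
			return false, nil
		}
		return len(nodes) >= 2, nil
	})
}
```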
I've been drawing up the plans for CP-node replacement. The scope of what we're looking to cover is in-place (i.e. same name/IP replacement).
So my follow-up to this is: can we run the setup jobs each time we detect scaling from 1 to 2 control-plane nodes? Even in the ungraceful shutdown case, the second CP node is still part of the cluster (just marked not-ready).
If we re-run setup upon discovering 2 nodes, we should be covered for the "straightforward" replacement.
--
As an aside, while we can instruct users to never scale up to 3 nodes in TNF, we can't actually guarantee that they won't. One thing that worries me is a user scaling up to three and then back down to 2. If we ran the setup job there, we'd need to destroy the pacemaker cluster in its entirety and rebuild it from nothing, because the added member might not be part of the original node list. I know Fabio advised us against doing this, but I'm just pointing out that we should be ready for customers to do it. If this happens, my recommendation is that we force the users to scale from 3 down to 1 (where the one node is from the original cluster), and then scale back up to two with the original IP/name intact. /endsoapbox
Why not just poll until the list informer has two entries and then just set up everything at once? Might simplify this a lot...
I don't see the advantage of polling instead of listening to events... 🤔
With "setup everything at once", do you mean running 1 job only? That's not possible:
- auth needs to run on each node
- setup needs to run on 1 node only
- after-setup needs to run on each node again (rough ordering sketched below)
Since this is a bugfix for 4.19, node replacement / scaling etc. is out of scope of this PR.
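For illustration, a rough sketch of which controllers get started where. runTnfAuthJobController and runTnfAfterSetupJobController appear in the diff above; runTnfSetupJobController, controlPlaneNodes, and the exact argument lists are assumptions, not a copy of the actual code:

```go
// started once per control plane node
for _, node := range controlPlaneNodes {
	runTnfAuthJobController(ctx, node.GetName(), controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces)
	runTnfAfterSetupJobController(ctx, node.GetName(), controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces)
}

// started once only for the whole cluster (hypothetical name and signature);
// how the auth, setup, and after-setup jobs wait for each other is not shown here
runTnfSetupJobController(ctx, controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces)
```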
|
As discussed in today's TNF meeting, this at least improves the assisted installer process, so we want to merge and backport this. @clobrano do you mind doing a review? Thanks in advance :) |
clobrano left a comment
Considering this is a bugfix for 4.19 and scaling is out of scope, this looks good to me. I left a comment about a possible test case.
want: ClusterConfig{},
wantErr: true,
},
}
Do you think it's valuable to add a test where there are 2 nodes, but only one has the right label?
good idea, done
|
/lgtm not sure if the other threads have been resolved already so |
When being installed with the assisted installer, the auth job is already created when only 1 cp node is available. Wait with the job creation until both nodes exist, ensure jobs are created once only, and double check the node count inside the jobs. Signed-off-by: Marc Sluiter <[email protected]>
- use lister
- log added and handled node names
Signed-off-by: Marc Sluiter <[email protected]>
Signed-off-by: Marc Sluiter <[email protected]>
d4fde97 to 7084f69
|
Sure :) /lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: clobrano, slintes. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
1 similar comment
|
/retest |
|
/hold cancel @dusk125 may I ask for overrides again? :) |
|
@slintes: once the present PR merges, I will cherry-pick it on top of release-4.19.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
1 similar comment
|
/override ci/prow/e2e-aws-cpms |
|
@dusk125: Overrode contexts on behalf of dusk125: ci/prow/e2e-aws-cpms, ci/prow/e2e-aws-ovn-etcd-scaling.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@slintes: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Merged commit d5259e4 into openshift:main
|
@slintes: Jira Issue OCPBUGS-57372: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-57372 has been moved to the MODIFIED state.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@slintes: #1431 failed to apply on top of branch "release-4.19".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[ART PR BUILD NOTIFIER] Distgit: cluster-etcd-operator |
|
Fix included in accepted release 4.20.0-0.nightly-2025-09-08-182033 |
When being installed with the assisted installer, the 1st auth job is already created when only 1 cp node is available. Wait with the job creation until both nodes exist, ensure jobs are created once only, and double check the node count inside the jobs.
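A minimal sketch of the guard described above, assuming a control plane node informer as in the PR; the package name, the start callback, and the handler wiring shown here are placeholders, not the repository's actual code, and the double check of the node count inside the jobs themselves is not shown:

```go
package operator

import (
	"context"
	"sync"

	"k8s.io/apimachinery/pkg/labels"
	corev1listers "k8s.io/client-go/listers/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// startTnfJobControllersWhenReady registers an AddFunc handler that waits until both
// control plane nodes exist and then starts the TNF job controllers exactly once.
func startTnfJobControllersWhenReady(ctx context.Context, informer cache.SharedIndexInformer, start func(context.Context)) {
	lister := corev1listers.NewNodeLister(informer.GetIndexer())
	var once sync.Once
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			nodes, err := lister.List(labels.Everything())
			if err != nil || len(nodes) < 2 {
				klog.Info("not starting TNF jobs yet, waiting for 2 control plane nodes to exist")
				return
			}
			// both nodes can already exist on the first AddFunc call,
			// so make sure the job controllers are created once only
			once.Do(func() { start(ctx) })
		},
	})
}
```

The sync.Once guard is what makes the event-driven variant safe against the "2 nodes already present on the first AddFunc call" case discussed in the review thread.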