OCPBUGS-57372: Wait for 2 cp nodes before starting TNF jobs #1431
Conversation
|
@slintes: This pull request references Jira Issue OCPBUGS-57372, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
pkg/tnf/operator/starter.go
Outdated
runTnfAuthJobController(ctx, node.GetName(), controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces)
runTnfAfterSetupJobController(ctx, node.GetName(), controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces)
// ensure we have both control plane nodes before creating jobs
nodeList, err := kubeClient.CoreV1().Nodes().List(ctx, metav1.ListOptions{
I know it's premature optimization, but you can also create a lister on top of the informer indexer:
import corev1listers "k8s.io/client-go/listers/core/v1"
controlPlaneNodeLister := corev1listers.NewNodeLister(controlPlaneNodeInformer.GetIndexer())
That way you also save yourself the additional synchronous list call.
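For illustration only, a minimal sketch of the lister approach being suggested here, assuming the informer already watches only control plane nodes; the package and helper name are made up and not part of this PR:

```go
package operator

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	corev1listers "k8s.io/client-go/listers/core/v1"
	"k8s.io/client-go/tools/cache"
)

// listControlPlaneNodesFromCache builds a lister over the informer's indexer and
// reads nodes from the local cache instead of issuing a synchronous API list call.
// The informer is assumed to already be restricted to control plane nodes.
func listControlPlaneNodesFromCache(informer cache.SharedIndexInformer) ([]*corev1.Node, error) {
	lister := corev1listers.NewNodeLister(informer.GetIndexer())
	return lister.List(labels.Everything())
}
```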
Thanks for the hint with the lister 👍🏼 I don't see, though, how that saves the synchronous list call; it's just a nicer way to do that call, isn't it?
klog.Info("not starting TNF jobs yet, waiting for 2 control plane nodes to exist")
return
}
// we can have 2 nodes on the first call of AddFunc already, ensure we create job controllers once only
I know I've pestered you with retries and idempotency before, but do you think this overall construct is still safe?
Why not just poll until the list informer has two entries and then just set up everything at once? Might simplify this a lot...
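For illustration, a rough sketch of what that polling alternative could look like; the package, helper name, and interval are assumptions, not part of the PR:

```go
package operator

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/util/wait"
	corev1listers "k8s.io/client-go/listers/core/v1"
)

// waitForTwoControlPlaneNodes polls the informer-backed lister until at least two
// control plane nodes exist, then returns so the caller can set everything up once.
func waitForTwoControlPlaneNodes(ctx context.Context, lister corev1listers.NodeLister) error {
	return wait.PollUntilContextCancel(ctx, 10*time.Second, true, func(ctx context.Context) (bool, error) {
		nodes, err := lister.List(labels.Everything())
		if err != nil {
			// treat cache errors as transient and keep polling
			return false, nil
		}
		return len(nodes) >= 2, nil
	})
}
```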
I've been drawing up the plans for CP-node replacement. The scope of what we're looking to cover is in-place (i.e. same name/IP replacement).
So my follow-up to this is: can we run the setup jobs each time we detect scaling from 1 to 2 control-plane nodes? Even in the ungraceful shutdown case, the second CP node is still part of the cluster (just marked not-ready).
If we re-run setup upon discovering 2 nodes, we should be covered for the "straightforward" replacement.
--
As an aside, while we can instruct users to never scale up to 3 nodes in TNF, we can't actually guarantee that they won't. One thing that worries me is a user scaling up to three and then back down to 2. If we ran the setup job there, we'd need to destroy the pacemaker cluster in its entirety and rebuild it from nothing, because the added member might not be part of the original node list. I know Fabio advised us against doing this, but I'm just pointing out that we should be ready for customers to do it. If this happens, my recommendation is that we force the users to scale from 3 down to 1 (where the one node is from the original cluster), and then scale back up to two with the original IP/name intact. /endsoapbox
Why not just poll until the list informer has two entries and then just set up everything at once? Might simplify this a lot...
I don't see the advantage of polling instead of listening to events... 🤔
With "setup everything at once", do you mean running 1 job only? That's not possible:
- auth needs to run on each node
- setup needs to run on 1 node only
- after-setup needs to run on each node again (rough ordering sketched below)
Since this is a bugfix for 4.19, node replacement / scaling etc. is out of scope of this PR.
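For illustration, a rough sketch of which controllers get started where. runTnfAuthJobController and runTnfAfterSetupJobController appear in the diff above; runTnfSetupJobController, controlPlaneNodes, and the exact argument lists are assumptions, not a copy of the actual code:

```go
// started once per control plane node
for _, node := range controlPlaneNodes {
	runTnfAuthJobController(ctx, node.GetName(), controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces)
	runTnfAfterSetupJobController(ctx, node.GetName(), controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces)
}

// started once only for the whole cluster (hypothetical name and signature);
// how the auth, setup, and after-setup jobs wait for each other is not shown here
runTnfSetupJobController(ctx, controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces)
```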
|
As discussed in today's TNF meeting, this at least improves the assisted installer process, so we want to merge and backport this. @clobrano do you mind doing a review? Thanks in advance :) |
clobrano left a comment
Considering this is a bugfix for 4.19 and scaling is out of scope, this looks good to me. I left a comment about a possible test case.
want: ClusterConfig{},
wantErr: true,
},
}
Do you think it's valuable to add a test where there are 2 nodes, but only one has the right label?
good idea, done
|
/lgtm not sure if the other threads have been resolved already so |
When being installed with the assisted installer, the auth job is already created when only 1 cp node is available. Wait with the job creation until both nodes exist, ensure jobs are created once only, and double check the node count inside the jobs. Signed-off-by: Marc Sluiter <[email protected]>
- use lister
- log added and handled node names
Signed-off-by: Marc Sluiter <[email protected]>
Signed-off-by: Marc Sluiter <[email protected]>
d4fde97 to 7084f69
|
Sure :) /lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: clobrano, slintes. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
1 similar comment
|
/retest |
|
/hold cancel @dusk125 may I ask for overrides again? :) |
|
@slintes: once the present PR merges, I will cherry-pick it on top of release-4.19.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
1 similar comment
|
/override ci/prow/e2e-aws-cpms |
|
@dusk125: Overrode contexts on behalf of dusk125: ci/prow/e2e-aws-cpms, ci/prow/e2e-aws-ovn-etcd-scaling.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@slintes: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Merged commit d5259e4 into openshift:main
|
@slintes: Jira Issue OCPBUGS-57372: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-57372 has been moved to the MODIFIED state.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@slintes: #1431 failed to apply on top of branch "release-4.19".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[ART PR BUILD NOTIFIER] Distgit: cluster-etcd-operator |
|
Fix included in accepted release 4.20.0-0.nightly-2025-09-08-182033 |
When being installed with the assisted installer, the 1st auth job is already created when only 1 cp node is available. Wait with the job creation until both nodes exist, ensure jobs are created once only, and double check the node count inside the jobs.
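A minimal sketch of the guard described above, assuming a control plane node informer as in the PR; the package name, the start callback, and the handler wiring shown here are placeholders, not the repository's actual code, and the double check of the node count inside the jobs themselves is not shown:

```go
package operator

import (
	"context"
	"sync"

	"k8s.io/apimachinery/pkg/labels"
	corev1listers "k8s.io/client-go/listers/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// startTnfJobControllersWhenReady registers an AddFunc handler that waits until both
// control plane nodes exist and then starts the TNF job controllers exactly once.
func startTnfJobControllersWhenReady(ctx context.Context, informer cache.SharedIndexInformer, start func(context.Context)) {
	lister := corev1listers.NewNodeLister(informer.GetIndexer())
	var once sync.Once
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			nodes, err := lister.List(labels.Everything())
			if err != nil || len(nodes) < 2 {
				klog.Info("not starting TNF jobs yet, waiting for 2 control plane nodes to exist")
				return
			}
			// both nodes can already exist on the first AddFunc call,
			// so make sure the job controllers are created once only
			once.Do(func() { start(ctx) })
		},
	})
}
```

The sync.Once guard is what makes the event-driven variant safe against the "2 nodes already present on the first AddFunc call" case discussed in the review thread.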