
Conversation

@slintes
Member

@slintes slintes commented Jun 13, 2025

When installing with the assisted installer, the first auth job is already created while only one control plane (cp) node is available. Wait with the job creation until both nodes exist, ensure the jobs are created only once, and double-check the node count inside the jobs.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 13, 2025
@openshift-ci-robot

@slintes: This pull request references Jira Issue OCPBUGS-57372, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.


In response to this:

When installing with the assisted installer, the first auth job is already created while only one control plane (cp) node is available. Wait with the job creation until both nodes exist, ensure the jobs are created only once, and double-check the node count inside the jobs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jun 13, 2025
@openshift-ci openshift-ci bot requested review from jaypoulz and mshitrit June 13, 2025 10:50
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 13, 2025
runTnfAuthJobController(ctx, node.GetName(), controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces)
runTnfAfterSetupJobController(ctx, node.GetName(), controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces)
// ensure we have both control plane nodes before creating jobs
nodeList, err := kubeClient.CoreV1().Nodes().List(ctx, metav1.ListOptions{
Contributor

I know it's premature optimization, but you can also create a lister on top of the informer indexer:

import corev1listers "k8s.io/client-go/listers/core/v1"
controlPlaneNodeLister := corev1listers.NewNodeLister(controlPlaneNodeInformer.GetIndexer())

so that way you can also save yourself the additional synchronous list call.

Member Author

Thanks for the hint with the lister 👍🏼 I don't see, though, how that can save the synchronous list call; it's just a nicer way to do that call, no?
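
(For context: a lister created with NewNodeLister wraps the informer's indexer, so listing through it reads from the local watch cache instead of calling the API server; that is what saves the synchronous call. A minimal sketch, assuming controlPlaneNodeInformer is the existing shared informer for control plane nodes; the node-count check then proceeds on the cached result exactly as in the snippet below.)

import (
	"k8s.io/apimachinery/pkg/labels"
	corev1listers "k8s.io/client-go/listers/core/v1"
)

controlPlaneNodeLister := corev1listers.NewNodeLister(controlPlaneNodeInformer.GetIndexer())
// List filters the informer's in-memory cache; no API round trip happens here.
nodes, err := controlPlaneNodeLister.List(labels.Everything())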

klog.Info("not starting TNF jobs yet, waiting for 2 control plane nodes to exist")
return
}
// we can have 2 nodes on the first call of AddFunc already, ensure we create job controllers once only
Contributor

I know I've pestered you with retries and idempotency before, but do you think this overall construct is still safe?

Why not just poll until the node informer has two entries and then set up everything at once? That might simplify this a lot...
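
(A minimal sketch of that polling alternative, with the same hypothetical lister as above; wait.PollUntilContextCancel is from k8s.io/apimachinery/pkg/util/wait:)

// Block until the informer cache holds both control plane nodes,
// then set up all TNF job controllers in one place.
err := wait.PollUntilContextCancel(ctx, 10*time.Second, true, func(ctx context.Context) (bool, error) {
	nodes, err := controlPlaneNodeLister.List(labels.Everything())
	if err != nil {
		return false, nil // transient cache error: keep polling
	}
	return len(nodes) >= 2, nil
})
if err == nil {
	// both nodes exist: start the auth, setup, and after-setup job controllers here
}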

Contributor

I've been drawing up the plans for CP-node replacement. The scope of what we're looking to cover is in-place (i.e. same name/IP) replacement.

So my follow-up to this is: can we run the setup jobs each time we detect scaling from 1 to 2 control plane nodes? Even in the ungraceful shutdown case, the second CP node is still part of the cluster (just marked not-ready).

If we re-run setup upon discovering 2 nodes, we should be covered for the "straightforward" replacement.

--

As an aside: while we can instruct users to never scale up to 3 nodes in TNF, we can't actually guarantee that they won't. One thing that worries me is a user scaling up to three and then back down to 2. If we ran the setup job there, we'd need to destroy the pacemaker cluster in its entirety and rebuild it from nothing, because the added member might not be part of the original node list. I know Fabio advised us against doing this, but I'm just pointing out that we should be ready for customers to do it. If this happens, my recommendation is that we force the users to scale from 3 down to 1 (where the one node is from the original cluster), and then scale back up to two with the original IP/name intact. /endsoapbox

Member Author

Why not just poll until the node informer has two entries and then set up everything at once? That might simplify this a lot...

I don't see the advantage of polling instead of listening to events... 🤔

With "setup everything at once", do you mean running 1 job only? That's not possible:

  • auth needs to run on each node
  • setup needs to run on 1 node only
  • after-setup needs to run on each node again

Since this is a bugfix for 4.19, node replacement, scaling, etc. are out of scope for this PR.
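
(For reference, a minimal sketch of the once-only guard under discussion; the names are hypothetical, not the PR's actual code:)

var startJobControllersOnce sync.Once

controlPlaneNodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc: func(obj interface{}) {
		nodes, err := controlPlaneNodeLister.List(labels.Everything())
		if err != nil || len(nodes) < 2 {
			return
		}
		// AddFunc fires once per node, and both nodes can already be cached
		// on the first call; sync.Once keeps the controllers from starting twice.
		startJobControllersOnce.Do(func() {
			// start the auth jobs (each node), the setup job (one node),
			// and the after-setup jobs (each node) from here
		})
	},
})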

@slintes
Member Author

slintes commented Jun 23, 2025

As discussed in today's TNF meeting, this at least improves the assisted installer process, so we want to merge and backport this.

@clobrano do you mind doing a review? Thanks in advance :)

Contributor

@clobrano clobrano left a comment

Considering this is a bugfix for 4.19 and scaling is out of scope, this looks good to me. I left a comment about a possible test case.

want: ClusterConfig{},
wantErr: true,
},
}
Contributor
Do you think it's valuable to add a test where there are 2 nodes, but only one has the right label?

Member Author

good idea, done
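
(A sketch of how the added case might look in the test table; the name and nodes fields and the newNode helper are assumptions based on the visible want/wantErr fields:)

{
	name: "two nodes, only one with the control plane label",
	nodes: []*corev1.Node{
		newNode("master-0", map[string]string{"node-role.kubernetes.io/master": ""}),
		newNode("worker-0", nil), // second node lacks the control plane label
	},
	want:    ClusterConfig{},
	wantErr: true,
},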

@clobrano
Contributor

/lgtm

Not sure if the other threads have been resolved already, so:
/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 24, 2025
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 24, 2025
slintes added 3 commits June 26, 2025 09:36
When installing with the assisted installer, the auth job is already
created while only 1 cp node is available. Wait with the job creation
until both nodes exist, ensure jobs are created only once, and
double-check the node count inside the jobs.

Signed-off-by: Marc Sluiter <[email protected]>
- use lister
- log added and handled node names

Signed-off-by: Marc Sluiter <[email protected]>
Signed-off-by: Marc Sluiter <[email protected]>
@slintes slintes force-pushed the tnf-wait-2-cp-nodes branch from d4fde97 to 7084f69 Compare June 26, 2025 07:50
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 26, 2025
@slintes
Member Author

slintes commented Jun 26, 2025

rebased after #1421 was merged

@clobrano do you mind lgtm'ing again, please? :)

@clobrano
Contributor

Sure :)

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 26, 2025
@openshift-ci
Contributor

openshift-ci bot commented Jun 26, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano, slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@slintes
Member Author

slintes commented Jun 26, 2025

/retest

1 similar comment
@slintes
Member Author

slintes commented Jun 27, 2025

/retest

@slintes
Member Author

slintes commented Jun 30, 2025

/hold cancel
/cherry-pick release-4.19

@dusk125 may I ask for overrides again? :)

@openshift-cherrypick-robot

@slintes: once the present PR merges, I will cherry-pick it on top of release-4.19 in a new PR and assign it to you.


In response to this:

/hold cancel
/cherry-pick release-4.19

@dusk125 may I ask for overrides again? :)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 30, 2025
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 4d2134b and 2 for PR HEAD 7084f69 in total

1 similar comment
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 4d2134b and 2 for PR HEAD 7084f69 in total

@dusk125
Contributor

dusk125 commented Jun 30, 2025

/override ci/prow/e2e-aws-cpms
/override ci/prow/e2e-aws-ovn-etcd-scaling

@openshift-ci
Contributor

openshift-ci bot commented Jun 30, 2025

@dusk125: Overrode contexts on behalf of dusk125: ci/prow/e2e-aws-cpms, ci/prow/e2e-aws-ovn-etcd-scaling


In response to this:

/override ci/prow/e2e-aws-cpms
/override ci/prow/e2e-aws-ovn-etcd-scaling

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 4d2134b and 2 for PR HEAD 7084f69 in total

@openshift-ci
Contributor

openshift-ci bot commented Jul 1, 2025

@slintes: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown 7084f69 link false /test e2e-metal-ovn-ha-cert-rotation-shutdown
ci/prow/configmap-scale 7084f69 link false /test configmap-scale
ci/prow/e2e-aws-etcd-certrotation 7084f69 link false /test e2e-aws-etcd-certrotation
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown 7084f69 link false /test e2e-metal-ovn-sno-cert-rotation-shutdown
ci/prow/e2e-metal-ovn-two-node-fencing 7084f69 link false /test e2e-metal-ovn-two-node-fencing
ci/prow/e2e-azure-ovn-etcd-scaling 7084f69 link false /test e2e-azure-ovn-etcd-scaling
ci/prow/e2e-gcp-ovn-etcd-scaling 7084f69 link false /test e2e-gcp-ovn-etcd-scaling
ci/prow/e2e-gcp-disruptive-ovn 7084f69 link false /test e2e-gcp-disruptive-ovn
ci/prow/e2e-aws-disruptive-ovn 7084f69 link false /test e2e-aws-disruptive-ovn
ci/prow/e2e-gcp-disruptive 7084f69 link false /test e2e-gcp-disruptive
ci/prow/e2e-vsphere-ovn-etcd-scaling 7084f69 link false /test e2e-vsphere-ovn-etcd-scaling
ci/prow/e2e-aws-etcd-recovery 7084f69 link false /test e2e-aws-etcd-recovery
ci/prow/e2e-aws-disruptive 7084f69 link false /test e2e-aws-disruptive

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 4d2134b and 2 for PR HEAD 7084f69 in total

@openshift-merge-bot openshift-merge-bot bot merged commit d5259e4 into openshift:main Jul 1, 2025
20 of 33 checks passed
@openshift-ci-robot

@slintes: Jira Issue OCPBUGS-57372: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-57372 has been moved to the MODIFIED state.


In response to this:

When installing with the assisted installer, the first auth job is already created while only one control plane (cp) node is available. Wait with the job creation until both nodes exist, ensure the jobs are created only once, and double-check the node count inside the jobs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-cherrypick-robot

@slintes: #1431 failed to apply on top of branch "release-4.19":

Applying: TNF: Wait for 2 cp nodes
Using index info to reconstruct a base tree...
M	pkg/tnf/operator/starter.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/tnf/operator/starter.go
CONFLICT (content): Merge conflict in pkg/tnf/operator/starter.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 TNF: Wait for 2 cp nodes


In response to this:

/hold cancel
/cherry-pick release-4.19

@dusk125 may I ask for overrides again? :)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: cluster-etcd-operator
This PR has been included in build cluster-etcd-operator-container-v4.20.0-202507010150.p0.gd5259e4.assembly.stream.el9.
All builds following this will include this PR.

@openshift-merge-robot
Contributor

Fix included in accepted release 4.20.0-0.nightly-2025-09-08-182033
