Skip to content

Conversation

@slintes
Copy link
Member

@slintes slintes commented May 6, 2025

This PR adds pacemaker fencing setup:

  • create new fencing job, which waits for the setup job to finish
  • read fencing configuration from secrets
  • validate configs with status command
  • configure STONITH devices
  • in case of any error, disable stonith

Blocked by:
openshift/installer#9635

and for testing:
metal3-io/metal3-dev-env#1528
openshift-metal3/dev-scripts#1751
openshift-metal3/dev-scripts#1757

TODO

  • redact password in logs
  • reconfigure fencing on secret change
  • test with updated installer which creates secrets
  • delete existing stonith devices on start and error
  • do a ungraceful shutdown test

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 6, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 6, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 6, 2025
@slintes slintes force-pushed the tnf-add-fencing branch from d6e0d8b to 92cbed1 Compare May 7, 2025 15:17
@slintes slintes changed the title [WIP] Add TNF pacemaker fencing setup [WIP] CNTRLPLANE-805: Add TNF pacemaker fencing setup May 16, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 16, 2025
@openshift-ci-robot
Copy link

openshift-ci-robot commented May 16, 2025

@slintes: This pull request references CNTRLPLANE-805 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

Details

In response to this:

This PR adds pacemaker fencing setup:

  • create new fencing job, which waits for the setup job to finish
  • read fencing configuration from secrets
  • validate configs with status command
  • configure STONITH devices
  • in case of any error, disable stonith

Blocked by:
openshift/installer#9635

and for testing:
metal3-io/metal3-dev-env#1528
openshift-metal3/dev-scripts#1751
openshift-metal3/dev-scripts#1757

TODO

  • redact password in logs
  • reconfigure fencing on secret change
  • test with updated installer which creates secrets
  • delete existing stonith devices on start and error
  • do a ungraceful shutdown test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented May 16, 2025

@slintes: This pull request references CNTRLPLANE-805 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

Details

In response to this:

This PR adds pacemaker fencing setup:

  • create new fencing job, which waits for the setup job to finish
  • read fencing configuration from secrets
  • validate configs with status command
  • configure STONITH devices
  • in case of any error, disable stonith

Blocked by:
openshift/installer#9635

and for testing:
metal3-io/metal3-dev-env#1528
openshift-metal3/dev-scripts#1751
openshift-metal3/dev-scripts#1757

TODO

  • redact password in logs
  • reconfigure fencing on secret change
  • test with updated installer which creates secrets
  • delete existing stonith devices on start and error
  • do a ungraceful shutdown test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented May 21, 2025

@slintes: This pull request references CNTRLPLANE-805 which is a valid jira issue.

Details

In response to this:

This PR adds pacemaker fencing setup:

  • create new fencing job, which waits for the setup job to finish
  • read fencing configuration from secrets
  • validate configs with status command
  • configure STONITH devices
  • in case of any error, disable stonith

Blocked by:
openshift/installer#9635

and for testing:
metal3-io/metal3-dev-env#1528
openshift-metal3/dev-scripts#1751
openshift-metal3/dev-scripts#1757

TODO

  • redact password in logs
  • reconfigure fencing on secret change
  • test with updated installer which creates secrets
  • delete existing stonith devices on start and error
  • do a ungraceful shutdown test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@slintes slintes force-pushed the tnf-add-fencing branch from 92cbed1 to 6cebd6d Compare May 21, 2025 15:21
@openshift-ci-robot
Copy link

openshift-ci-robot commented May 21, 2025

@slintes: This pull request references CNTRLPLANE-805 which is a valid jira issue.

Details

In response to this:

This PR adds pacemaker fencing setup:

  • create new fencing job, which waits for the setup job to finish
  • read fencing configuration from secrets
  • validate configs with status command
  • configure STONITH devices
  • in case of any error, disable stonith

Blocked by:
openshift/installer#9635

and for testing:
metal3-io/metal3-dev-env#1528
openshift-metal3/dev-scripts#1751
openshift-metal3/dev-scripts#1757

TODO

  • redact password in logs
  • reconfigure fencing on secret change
  • test with updated installer which creates secrets
  • delete existing stonith devices on start and error
  • do a ungraceful shutdown test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@slintes slintes changed the title [WIP] CNTRLPLANE-805: Add TNF pacemaker fencing setup CNTRLPLANE-805: Add TNF pacemaker fencing setup May 21, 2025
@slintes slintes force-pushed the tnf-add-fencing branch from 6cebd6d to 4e7c5b9 Compare June 4, 2025 11:07
@slintes slintes force-pushed the tnf-add-fencing branch from 4e7c5b9 to 3dee9a6 Compare June 4, 2025 12:27
@slintes slintes marked this pull request as ready for review June 4, 2025 12:28
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 4, 2025
@openshift-ci-robot
Copy link

openshift-ci-robot commented Jun 4, 2025

@slintes: This pull request references CNTRLPLANE-805 which is a valid jira issue.

Details

In response to this:

This PR adds pacemaker fencing setup:

  • create new fencing job, which waits for the setup job to finish
  • read fencing configuration from secrets
  • validate configs with status command
  • configure STONITH devices
  • in case of any error, disable stonith

Blocked by:
openshift/installer#9635

and for testing:
metal3-io/metal3-dev-env#1528
openshift-metal3/dev-scripts#1751
openshift-metal3/dev-scripts#1757

TODO

  • redact password in logs
  • reconfigure fencing on secret change
  • test with updated installer which creates secrets
  • delete existing stonith devices on start and error
  • do a ungraceful shutdown test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from clobrano and jaypoulz June 4, 2025 12:31
Copy link
Contributor

@clobrano clobrano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a PR! :)
I added a couple of minor notes, and I will have a look again tomorrow, but it looks good to me


const replaceWith = "<REDACTED>"

var (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

func handleFencingSecretChange(ctx context.Context, client kubernetes.Interface, oldObj, obj interface{}) {
secret, ok := obj.(*corev1.Secret)
if !ok {
klog.Warningf("failed to convert added / modifed / deleted object to Secret %+v", obj)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo s/modifed/modified

Comment on lines 330 to 331
klog.Infof("fencing job not found, skipping recreation")
return true, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a little to understand this because the function name isFencingJobRunning and the returned boolean done are kind of the opposite :D.
isFencingJobRunning returns true when is not running.
Not fundamental, but might we consider changing the name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, no clue why I named it this way 🤔 🤣

- fixed typo
- renamed and clarified job wait function

Signed-off-by: Marc Sluiter <[email protected]>
@slintes
Copy link
Member Author

slintes commented Jun 13, 2025

Feedback addressed

I prefer to wait for #1430 (and maybe #1431) and do a rebase, in order to prevent merge conflicts when backporting the latter...

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 13, 2025
@clobrano
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 13, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 13, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano, slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@slintes
Copy link
Member Author

slintes commented Jun 22, 2025

let's not wait longer for #1431
/hold cancel

@dusk125 @sandeepknd @tjungblu
Do you want to review? And / or do you mind to override the failing tests please? Thanks!

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 22, 2025
@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD b45f392 and 2 for PR HEAD 707f0ae in total

1 similar comment
@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD b45f392 and 2 for PR HEAD 707f0ae in total

@dusk125
Copy link
Contributor

dusk125 commented Jun 23, 2025

/override ci/prow/e2e-aws-cpms
/override ci/prow/e2e-aws-ovn-etcd-scaling

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 23, 2025

@dusk125: Overrode contexts on behalf of dusk125: ci/prow/e2e-aws-cpms, ci/prow/e2e-aws-ovn-etcd-scaling

Details

In response to this:

/override ci/prow/e2e-aws-cpms
/override ci/prow/e2e-aws-ovn-etcd-scaling

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD f3dc708 and 1 for PR HEAD 707f0ae in total

@slintes
Copy link
Member Author

slintes commented Jun 24, 2025

@dusk125 can you override ci/prow/e2e-aws-cpms again please, somehow that didn't last long 🤷🏼‍♂️

@dusk125
Copy link
Contributor

dusk125 commented Jun 24, 2025

/override ci/prow/e2e-aws-cpms

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 24, 2025

@dusk125: Overrode contexts on behalf of dusk125: ci/prow/e2e-aws-cpms

Details

In response to this:

/override ci/prow/e2e-aws-cpms

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@slintes
Copy link
Member Author

slintes commented Jun 24, 2025

and now ci/prow/e2e-aws-ovn-etcd-scaling is running again... :/

@dusk125
Copy link
Contributor

dusk125 commented Jun 24, 2025

/override ci/prow/e2e-azure-ovn-etcd-scaling

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 24, 2025

@dusk125: Overrode contexts on behalf of dusk125: ci/prow/e2e-azure-ovn-etcd-scaling

Details

In response to this:

/override ci/prow/e2e-azure-ovn-etcd-scaling

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@slintes
Copy link
Member Author

slintes commented Jun 24, 2025

/test ci/prow/unit

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 24, 2025

@slintes: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test e2e-agnostic-ovn
/test e2e-agnostic-ovn-upgrade
/test e2e-aws-cpms
/test e2e-aws-ovn-etcd-scaling
/test e2e-aws-ovn-serial
/test e2e-aws-ovn-single-node
/test e2e-metal-assisted
/test e2e-metal-ipi-ovn-ipv6
/test e2e-operator
/test images
/test unit
/test verify
/test verify-deps

The following commands are available to trigger optional jobs:

/test configmap-scale
/test e2e-aws
/test e2e-aws-disruptive
/test e2e-aws-disruptive-ovn
/test e2e-aws-etcd-certrotation
/test e2e-aws-etcd-recovery
/test e2e-azure
/test e2e-azure-ovn-etcd-scaling
/test e2e-gcp
/test e2e-gcp-disruptive
/test e2e-gcp-disruptive-ovn
/test e2e-gcp-ovn-etcd-scaling
/test e2e-metal-ovn-ha-cert-rotation-shutdown
/test e2e-metal-ovn-sno-cert-rotation-shutdown
/test e2e-metal-ovn-two-node-fencing
/test e2e-metal-single-node-live-iso
/test e2e-operator-fips
/test e2e-vsphere-ovn-etcd-scaling
/test okd-scos-e2e-aws-ovn
/test okd-scos-images

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-cluster-etcd-operator-main-configmap-scale
pull-ci-openshift-cluster-etcd-operator-main-e2e-agnostic-ovn
pull-ci-openshift-cluster-etcd-operator-main-e2e-agnostic-ovn-upgrade
pull-ci-openshift-cluster-etcd-operator-main-e2e-aws
pull-ci-openshift-cluster-etcd-operator-main-e2e-aws-cpms
pull-ci-openshift-cluster-etcd-operator-main-e2e-aws-disruptive
pull-ci-openshift-cluster-etcd-operator-main-e2e-aws-disruptive-ovn
pull-ci-openshift-cluster-etcd-operator-main-e2e-aws-etcd-certrotation
pull-ci-openshift-cluster-etcd-operator-main-e2e-aws-etcd-recovery
pull-ci-openshift-cluster-etcd-operator-main-e2e-aws-ovn-etcd-scaling
pull-ci-openshift-cluster-etcd-operator-main-e2e-aws-ovn-serial
pull-ci-openshift-cluster-etcd-operator-main-e2e-aws-ovn-single-node
pull-ci-openshift-cluster-etcd-operator-main-e2e-azure
pull-ci-openshift-cluster-etcd-operator-main-e2e-azure-ovn-etcd-scaling
pull-ci-openshift-cluster-etcd-operator-main-e2e-gcp
pull-ci-openshift-cluster-etcd-operator-main-e2e-gcp-disruptive
pull-ci-openshift-cluster-etcd-operator-main-e2e-gcp-disruptive-ovn
pull-ci-openshift-cluster-etcd-operator-main-e2e-gcp-ovn-etcd-scaling
pull-ci-openshift-cluster-etcd-operator-main-e2e-metal-assisted
pull-ci-openshift-cluster-etcd-operator-main-e2e-metal-ipi-ovn-ipv6
pull-ci-openshift-cluster-etcd-operator-main-e2e-metal-ovn-ha-cert-rotation-shutdown
pull-ci-openshift-cluster-etcd-operator-main-e2e-metal-ovn-sno-cert-rotation-shutdown
pull-ci-openshift-cluster-etcd-operator-main-e2e-metal-ovn-two-node-fencing
pull-ci-openshift-cluster-etcd-operator-main-e2e-metal-single-node-live-iso
pull-ci-openshift-cluster-etcd-operator-main-e2e-operator
pull-ci-openshift-cluster-etcd-operator-main-e2e-operator-fips
pull-ci-openshift-cluster-etcd-operator-main-e2e-vsphere-ovn-etcd-scaling
pull-ci-openshift-cluster-etcd-operator-main-images
pull-ci-openshift-cluster-etcd-operator-main-okd-scos-e2e-aws-ovn
pull-ci-openshift-cluster-etcd-operator-main-unit
pull-ci-openshift-cluster-etcd-operator-main-verify
pull-ci-openshift-cluster-etcd-operator-main-verify-deps
Details

In response to this:

/test ci/prow/unit

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@slintes
Copy link
Member Author

slintes commented Jun 24, 2025

/test unit

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD f3dc708 and 2 for PR HEAD 707f0ae in total

1 similar comment
@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD f3dc708 and 2 for PR HEAD 707f0ae in total

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 25, 2025

@slintes: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-disruptive-ovn 707f0ae link false /test e2e-gcp-disruptive-ovn
ci/prow/okd-scos-e2e-aws-ovn 707f0ae link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-vsphere-ovn-etcd-scaling 707f0ae link false /test e2e-vsphere-ovn-etcd-scaling
ci/prow/e2e-metal-ovn-two-node-fencing 707f0ae link false /test e2e-metal-ovn-two-node-fencing
ci/prow/configmap-scale 707f0ae link false /test configmap-scale
ci/prow/e2e-aws-etcd-certrotation 707f0ae link false /test e2e-aws-etcd-certrotation
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown 707f0ae link false /test e2e-metal-ovn-ha-cert-rotation-shutdown
ci/prow/e2e-gcp-ovn-etcd-scaling 707f0ae link false /test e2e-gcp-ovn-etcd-scaling
ci/prow/e2e-aws-disruptive-ovn 707f0ae link false /test e2e-aws-disruptive-ovn
ci/prow/e2e-aws-etcd-recovery 707f0ae link false /test e2e-aws-etcd-recovery
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown 707f0ae link false /test e2e-metal-ovn-sno-cert-rotation-shutdown
ci/prow/e2e-gcp-disruptive 707f0ae link false /test e2e-gcp-disruptive
ci/prow/e2e-aws-disruptive 707f0ae link false /test e2e-aws-disruptive

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD f3dc708 and 2 for PR HEAD 707f0ae in total

@dusk125
Copy link
Contributor

dusk125 commented Jun 25, 2025

/override ci/prow/e2e-aws-ovn-etcd-scaling

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 25, 2025

@dusk125: Overrode contexts on behalf of dusk125: ci/prow/e2e-aws-ovn-etcd-scaling

Details

In response to this:

/override ci/prow/e2e-aws-ovn-etcd-scaling

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot bot merged commit e1880b7 into openshift:main Jun 25, 2025
20 of 33 checks passed
@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: cluster-etcd-operator
This PR has been included in build cluster-etcd-operator-container-v4.20.0-202506251513.p0.ge1880b7.assembly.stream.el9.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants