CNF-17181: Extend Core reference configuration with NHC/SNR operators #659

Open
rdiscala wants to merge 1 commit into openshift-kni:main from rdiscala:CNF-17181-add-nhc-snr-operators

Conversation

@openshift-ci-robot
Collaborator

@rdiscala: This pull request references CNF-17181 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

In response to this:

Related tests:

@openshift-ci openshift-ci Bot requested review from MarSik and sabbir-47 March 20, 2026 22:17
openshift-ci Bot commented Mar 20, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rdiscala
Once this PR has been reviewed and has the lgtm label, please assign irinamihai for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines +30 to +36
unhealthyConditions:
  - duration: $duration # eg 60s
    status: 'False'
    type: Ready
  - duration: $duration # eg 60s
    status: 'Unknown'
    type: Ready
Collaborator

Are these logically "AND" or "OR"? (would be helpful to add this to the note)

Contributor Author

The conditions are combined with a logical OR; see the for loop in conditions.go. I'll add this information to the comment.
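
The OR semantics described in this reply could be recorded next to the field roughly as follows. This is a sketch of the fragment under review with its indentation restored; the wording of the comment is an assumption, not the text that was eventually committed.

```yaml
# A node is treated as unhealthy when ANY one of the entries below
# matches (logical OR), not only when all of them match.
# (Comment wording is illustrative.)
unhealthyConditions:
  - duration: $duration # eg 60s
    status: 'False'
    type: Ready
  - duration: $duration # eg 60s
    status: 'Unknown'
    type: Ready
```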

# If the node cannot reach the API server within this time, it is
# considered to have lost connectivity.
# Tested value: 15s
apiServerTimeout: $apiServerTimeout # eg 15s
Collaborator

There are a lot of configuration values here to set the various timeouts. Between the testing we have done (plus the test lanes we put in place) and the recommendations from the NHC/SNR team, are we sufficiently confident in these values that we can set the tested values here as defaults, and then provide guidance in the comments and the RDS doc for any values that users may need to update? Without defaults here, a significant amount of per-partner/customer tuning is required.
(same here and in NodeHealthCheck config)

Contributor Author

Good point. The values provided here as comments are the same ones that were provided in CNF-17181. They were also used successfully in the TCFE, using this playbook (the only difference is that minHealthy was set to 75%, as the test was conducted on a cluster with 4 workers).

I've asked in #forum-dragonfly for an additional review.

I can amend this file and define the setting values based on the comments.

@jnunyez: thoughts?

Contributor

The values for CNF-17181 were used in a specific setup, but they could be different in another environment with, for instance, a different number of nodes. It would be beneficial to hear feedback from #forum-dragonfly to understand the relevant trade-offs for these parameters and to give a proper recommendation in the RDS documentation.
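
If the tested values were adopted as defaults, the fragment under review might look like the sketch below. Only the 15s value comes from this thread; the wording of the guidance comment is an assumption, and whether this value suits other cluster sizes is exactly the open question raised above.

```yaml
# If the node cannot reach the API server within this time, it is
# considered to have lost connectivity.
# Default below is the tested value; revisit it per environment
# (illustrative guidance, pending feedback from the NHC/SNR team).
apiServerTimeout: 15s
```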

Comment on lines +116 to +118
- path: reference-crs/required/node-health-check/SelfNodeRemediationTemplate.yaml
- path: reference-crs/required/node-health-check/SelfNodeRemediationConfig.yaml
- path: reference-crs/required/node-health-check/NodeHealthCheck.yaml
Collaborator

Any required patches should be included here as well. Ideally we will have good default values in the reference-crs and won't need any (or at least relatively few) patches here that the user would update.

Contributor Author

Should I add this statement as a comment?

Collaborator

No, my intent was only that anything in the reference-crs above which requires user input (noted with field: $value) should be included here as a patch with the reference value. Ideally the defaults are considered correct for most use cases and we don't need anything here (the values are all fully defaulted in the base CRs). If there is a common update made here an example patch can be included but commented out.
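
A commented-out example patch of the kind described here might look like the following sketch. It assumes the manifest list accepts a Policy Generator style `patches` field, and the `minHealthy: 75%` value is borrowed from the four-worker test mentioned earlier in the thread purely as an illustration; neither is confirmed by this PR.

```yaml
- path: reference-crs/required/node-health-check/NodeHealthCheck.yaml
  # Example only: uncomment to override the default from the base CR.
  # patches:
  #   - spec:
  #       minHealthy: 75%
```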

Contributor Author

I am not aware of any additional patches we need to reference here.

@rdiscala rdiscala force-pushed the CNF-17181-add-nhc-snr-operators branch from b678aac to 3ba131b Compare March 24, 2026 17:47
openshift-ci Bot commented Mar 24, 2026

Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits.

📝 Please follow instructions in the contributing guide to update your commits with the DCO

Full details of the Developer Certificate of Origin can be found at developercertificate.org.

The list of commits missing DCO signoff:

Signed-off-by: Rigel Di Scala <rdiscala@redhat.com>
@rdiscala rdiscala force-pushed the CNF-17181-add-nhc-snr-operators branch from 3ba131b to 0e850e2 Compare April 27, 2026 08:40
coderabbitai Bot commented Apr 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 98e9e391-8d6f-47f9-a52e-1d9951f3afb3

📥 Commits

Reviewing files that changed from the base of the PR and between 94601bc and 0e850e2.

📒 Files selected for processing (16)
  • telco-core/configuration/core-baseline.yaml
  • telco-core/configuration/reference-crs-kube-compare/metadata.yaml
  • telco-core/configuration/reference-crs-kube-compare/required/node-health-check/NHCSubscription.yaml
  • telco-core/configuration/reference-crs-kube-compare/required/node-health-check/NHCSubscriptionNS.yaml
  • telco-core/configuration/reference-crs-kube-compare/required/node-health-check/NHCSubscriptionOperGroup.yaml
  • telco-core/configuration/reference-crs-kube-compare/required/node-health-check/NodeHealthCheck.yaml
  • telco-core/configuration/reference-crs-kube-compare/required/node-health-check/SNRSubscription.yaml
  • telco-core/configuration/reference-crs-kube-compare/required/node-health-check/SelfNodeRemediationConfig.yaml
  • telco-core/configuration/reference-crs-kube-compare/required/node-health-check/SelfNodeRemediationTemplate.yaml
  • telco-core/configuration/reference-crs/required/node-health-check/NHCSubscription.yaml
  • telco-core/configuration/reference-crs/required/node-health-check/NHCSubscriptionNS.yaml
  • telco-core/configuration/reference-crs/required/node-health-check/NHCSubscriptionOperGroup.yaml
  • telco-core/configuration/reference-crs/required/node-health-check/NodeHealthCheck.yaml
  • telco-core/configuration/reference-crs/required/node-health-check/SNRSubscription.yaml
  • telco-core/configuration/reference-crs/required/node-health-check/SelfNodeRemediationConfig.yaml
  • telco-core/configuration/reference-crs/required/node-health-check/SelfNodeRemediationTemplate.yaml

📝 Walkthrough

Adds Node Health Check and Self Node Remediation operator deployment and configuration manifests across telco-core policies and reference resources. Two sets of manifests are provided: templated versions for kube-compare with dynamic spec rendering, and fully configured versions for reference-crs with explicit specifications for health monitoring and node remediation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core Baseline Policy Updates: telco-core/configuration/core-baseline.yaml | Extends OLM subscription policies to include Node Health Check and Self Node Remediation subscriptions with manual install plan approval configured. |
| Kube-Compare Configuration: telco-core/configuration/reference-crs-kube-compare/metadata.yaml | Adds node-health-check comparison part with allOrNoneOf enforcement requiring related Node Health Check and Self Node Remediation resources to be present together. |
| Kube-Compare Node Health Check Manifests: telco-core/configuration/reference-crs-kube-compare/required/node-health-check/*.yaml | Introduces Helm-templated Kubernetes manifests for Node Health Check operator subscription, namespace, operator group, health check policy, Self Node Remediation subscription, configuration, and remediation template. Specs are dynamically rendered from .spec with empty dict defaults. |
| Reference CRS Node Health Check Manifests: telco-core/configuration/reference-crs/required/node-health-check/*.yaml | Adds fully configured Kubernetes manifests for Node Health Check operator setup including subscriptions (redhat-operators-disconnected source), namespace, operator group, and detailed NodeHealthCheck policy targeting worker nodes. Includes comprehensive SelfNodeRemediationConfig with reboot, watchdog, API timeout, and peer connectivity parameters, plus remediation template with OutOfServiceTaint strategy. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main change: extending the Core reference configuration with Node Health Check and Self Node Remediation operators. |
| Description check | ✅ Passed | The description is related to the changeset by referencing related tests that validate the addition of NHC and SNR operators to the Core reference configuration. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
