Skip to content

CNF-17181 Add a test: NHC and SNR operators perform no remediation on upgrade#1278

Draft
rdiscala wants to merge 3 commits intorh-ecosystem-edge:mainfrom
rdiscala:CNF-17181-nhc-snr-no-remediation-on-upgrade-test
Draft

CNF-17181 Add a test: NHC and SNR operators perform no remediation on upgrade#1278
rdiscala wants to merge 3 commits intorh-ecosystem-edge:mainfrom
rdiscala:CNF-17181-nhc-snr-no-remediation-on-upgrade-test

Conversation

@rdiscala
Copy link
Copy Markdown
Collaborator

@rdiscala rdiscala commented Mar 20, 2026

rhwa nhc: add planned-reboot-during-upgrade system test

Verify that the NHC operator does NOT trigger remediation when
a worker node reboots as part of a planned OCP cluster upgrade.
The test deploys a stateful app, initiates a cluster upgrade,
and polls throughout the upgrade to confirm no SelfNodeRemediation
resources are created. Post-upgrade checks verify all nodes are
healthy, no out-of-service taints remain, and the app survived.

Related: ECOPROJECT-2283

PR depends on #1272

Summary by CodeRabbit

  • Tests

    • Added comprehensive end-to-end suites for NHC and SNR: operator readiness, post-deployment checks, "sudden node loss" (BMC-driven fencing/failover) and "planned node reboot" (upgrade) scenarios, app/PVC rescheduling and recovery, and JUnit/XML reporting hooks.
  • Documentation

    • Added a README describing scenarios, prerequisites, configuration, run commands, and expected runtimes.
  • Chores

    • Introduced default test configuration fields, BMC credentials handling, and runtime tunables (target/failover workers, storage class, app/upgrade images, timeouts).

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ad2fb155-3406-47b7-841e-af80ff16dc21

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds explicit NHC configuration defaults and BMC decode support, new NHC/SNR parameters and reporting settings, Ginkgo suite entrypoints and reporting hooks, comprehensive NHC E2E tests (sudden-node-loss, planned-node-reboot), SNR tests, and a README describing scenarios and prerequisites.

Changes

Cohort / File(s) Summary
Configuration
tests/rhwa/internal/rhwaconfig/default.yaml, tests/rhwa/internal/rhwaconfig/rhwaconfig.go
Replaced YAML terminator with explicit NHC defaults (nhc_target_worker, nhc_failover_workers, nhc_storage_class, nhc_app_image, nhc_target_worker_bmc). Added BMCDetails type with a Decode method and extended RHWAConfig with fields for target/failover workers, storage/image, BMC details, and upgrade parameters.
NHC Parameters
tests/rhwa/nhc-operator/internal/nhcparams/const.go, tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go
Added scenario labels, resource/deployment identifiers, app/PVC/taint constants, operator pod label, reporter namespaces/CRDs, and multiple exported time.Duration tuning variables (timeouts, polling, BMC/upgrade).
NHC Test Infra & Reporting
tests/rhwa/nhc-operator/nhc_suite_test.go, tests/rhwa/nhc-operator/tests/nhc.go
Added Ginkgo suite entrypoint with JUnit/report hooks, JustAfterEach failure reporting, ReportAfterSuite XML output, and a post-deployment readiness test ensuring NHC deployment and controller pod readiness.
NHC E2E Scenarios
tests/rhwa/nhc-operator/tests/sudden-node-loss.go, tests/rhwa/nhc-operator/tests/planned-node-reboot.go
Introduced two large Ordered E2E suites: sudden-node-loss (BMC power control, NHC→SNR remediation flow, fencing/taint checks, PVC/VolumeAttachment and pod reschedule verification) and planned-node-reboot (upgrade-driven scenario asserting no SNR remediation and validating cluster/operator/app state). Includes node-label management and local helpers.
NHC Documentation
tests/rhwa/nhc-operator/README.md
Added README describing both test scenarios, control flows, prerequisites (cluster topology, BMC/Redfish), required config/env vars, run commands, and expected runtimes.
SNR Parameters & Tests
tests/rhwa/snr-operator/internal/snrparams/snrvars.go, tests/rhwa/snr-operator/snr_suite_test.go, tests/rhwa/snr-operator/tests/snr.go
Added SNR operator identifiers and controller pod label, reporter namespace/CRD config, Ginkgo suite entrypoint with reporting hooks, and a post-deployment pod readiness verification test.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding a test to verify NHC and SNR operators perform no remediation during cluster upgrades.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@rdiscala rdiscala changed the title Cnf 17181 nhc snr no remediation on upgrade test CNF-17181 Add a test: NHC and SNR operators perform no remediation on upgrade Mar 20, 2026
@rdiscala
Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (8)
tests/rhwa/internal/rhwaconfig/default.yaml (1)

4-11: Add upgrade defaults to keep config template complete.

RHWAConfig now supports upgrade inputs, but this default file does not expose them. Adding these keys improves operability for the planned-reboot-during-upgrade flow.

♻️ Suggested default.yaml update
 nhc_target_worker: ""       # e.g. "openshift-worker-0.ocp.example.org"
 nhc_failover_workers: []    # e.g. ["openshift-worker-1.ocp.example.org"]
 nhc_storage_class: ""       # e.g. "ocs-external-storagecluster-ceph-rbd"
 nhc_app_image: ""           # e.g. "registry.ocp.example.org:5000/test/ubi-minimal:latest"
+nhc_upgrade_image: ""       # e.g. "quay.io/openshift-release-dev/ocp-release:4.19.0-x86_64"
+nhc_upgrade_channel: ""     # e.g. "stable-4.19"
 nhc_target_worker_bmc:
   address: ""               # e.g. "10.1.0.2"
   username: ""              # e.g. "example-user"
   password: ""              # e.g. "example-pass"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/internal/rhwaconfig/default.yaml` around lines 4 - 11, The default
YAML is missing the RHWAConfig upgrade-related keys; update the config template
to include the upgrade inputs that RHWAConfig now expects (e.g., add default
keys such as upgrade_enabled, upgrade_target_version,
planned_reboot_during_upgrade and any other RHWAConfig upgrade_* fields) with
sensible empty/false defaults so the planned-reboot-during-upgrade flow is
configurable; ensure the keys match the RHWAConfig property names exactly so
code parsing functions (e.g., the RHWAConfig loader/validator) will recognize
them.
tests/rhwa/nhc-operator/README.md (1)

92-113: Document the planned-reboot-during-upgrade suite and its inputs.

This README currently documents only sudden-node-loss flow/inputs, while this PR also introduces planned reboot during upgrade. Please add the new suite entry and upgrade env vars for completeness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/README.md` around lines 92 - 113, Add a new
test-suite row for planned-reboot-during-upgrade (pointing to
tests/planned-reboot-during-upgrade.go) and document the new environment
variables used by that flow; specifically list ECO_RHWA_NHC_UPGRADE_IMAGE,
ECO_RHWA_NHC_UPGRADE_REPO and ECO_RHWA_NHC_UPGRADE_TARGETS (or the exact
upgrade-related env var names introduced in this PR) with short descriptions of
their meanings (upgrade image/repo and which nodes to reboot) so the README’s
"Test suites" and "Inputs" sections match the new planned-reboot-during-upgrade
implementation and the constants in internal/nhcparams/const.go.
tests/rhwa/nhc-operator/tests/sudden-node-loss.go (3)

447-460: Absolute minHealthy values are not validated.

The code only validates minHealthy when it's a percentage (ends with %). If minHealthy is specified as an absolute number (e.g., "2"), the validation is silently skipped. Consider adding handling for absolute values.

♻️ Handle absolute minHealthy values
 			if pctStr, ok := strings.CutSuffix(minHealthyStr, "%"); ok {
 				pct, parseErr := strconv.ParseFloat(pctStr, 64)
 				Expect(parseErr).ToNot(HaveOccurred(), "Failed to parse minHealthy percentage")

 				observed, observedErr := toFloat64(observedNodes)
 				Expect(observedErr).ToNot(HaveOccurred(), "Failed to convert observedNodes to float64")
 				healthy, healthyErr := toFloat64(healthyNodes)
 				Expect(healthyErr).ToNot(HaveOccurred(), "Failed to convert healthyNodes to float64")
 				minRequired := math.Ceil(observed * pct / 100)

 				Expect(healthy).To(BeNumerically(">=", minRequired),
 					fmt.Sprintf("minHealthy threshold not satisfied: %v/%v healthy, minimum %v required",
 						healthy, observed, minRequired))
+			} else {
+				// Absolute number case
+				minRequired, parseErr := strconv.ParseFloat(minHealthyStr, 64)
+				if parseErr == nil {
+					healthy, healthyErr := toFloat64(healthyNodes)
+					Expect(healthyErr).ToNot(HaveOccurred(), "Failed to convert healthyNodes to float64")
+
+					Expect(healthy).To(BeNumerically(">=", minRequired),
+						fmt.Sprintf("minHealthy threshold not satisfied: %v healthy, minimum %v required",
+							healthy, minRequired))
+				}
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go` around lines 447 - 460,
The current block only handles percentage minHealthy values (minHealthyStr
ending with "%"); add an else branch to validate absolute minHealthy numbers:
when strings.CutSuffix(minHealthyStr, "%") returns ok==false, parse
minHealthyStr as a float (reuse strconv.ParseFloat) into an absoluteMin value,
compute minRequired := math.Ceil(absoluteMin) (or just absoluteMin if you prefer
integer semantics), and then perform the same conversions of
observedNodes/healthyNodes via toFloat64 and the Expect check comparing healthy
>= minRequired (use the same error message formatting). Update the code around
minHealthyStr, pctStr, toFloat64, and minRequired to support both percentage and
absolute forms.

762-773: toFloat64 could handle additional numeric types for robustness.

The function handles float64, int64, and string, which covers most cases from JSON unmarshaling. Consider adding int for completeness, though it's unlikely to be needed in practice.

♻️ Optional: Handle int type
 func toFloat64(val any) (float64, error) {
 	switch v := val.(type) {
 	case float64:
 		return v, nil
 	case int64:
 		return float64(v), nil
+	case int:
+		return float64(v), nil
 	case string:
 		return strconv.ParseFloat(v, 64)
 	default:
 		return 0, fmt.Errorf("cannot convert %T to float64", val)
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go` around lines 762 - 773,
The toFloat64 function currently handles float64, int64 and string but should
also accept plain int values from various callers; update the type switch in
toFloat64 to add a case for int that returns float64(v), nil (and optionally
other integer types if desired), ensuring the function signature and error
behavior remain unchanged so calls to toFloat64 will correctly convert int
inputs.

306-309: Consider validating BMC connectivity during setup.

The BMC client is created but not validated until Step 4 when SystemPowerOff() is called. A connectivity check in BeforeAll could provide earlier feedback if BMC credentials or address are incorrect.

💡 Optional: Add BMC connectivity check
 			bmcClient = bmc.New(RHWAConfig.TargetWorkerBMC.Address).
 				WithRedfishUser(RHWAConfig.TargetWorkerBMC.Username,
 					RHWAConfig.TargetWorkerBMC.Password).
 				WithRedfishTimeout(nhcparams.BMCTimeout)
+
+			By("Verifying BMC connectivity")
+			powerState, err := bmcClient.SystemPowerState()
+			Expect(err).ToNot(HaveOccurred(), "Failed to connect to BMC - check address and credentials")
+			klog.Infof("BMC reports power state: %s", powerState)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go` around lines 306 - 309,
After creating the BMC client via
bmc.New(...).WithRedfishUser(...).WithRedfishTimeout(...) perform an immediate
connectivity validation in the BeforeAll setup by invoking a lightweight BMC
call on bmcClient (e.g., a Ping/GetSystems/GetManagers-style method or any
existing health/status call) and fail the setup early if it returns an error;
place this check in BeforeAll before proceeding to tests so you detect bad
address/credentials earlier instead of waiting for SystemPowerOff in Step 4.
tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go (1)

27-30: Consider expanding ReporterCRDsToDump for better debugging.

Currently only PodList is dumped. For NHC/SNR debugging, consider adding NodeHealthCheck and SelfNodeRemediation CRDs to capture their state on test failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go` around lines 27 - 30,
ReporterCRDsToDump currently only includes PodList; expand it to also dump
NodeHealthCheck and SelfNodeRemediation CRs by adding CRData entries with the
list types (e.g., Cr: &NodeHealthCheckList{} and Cr: &SelfNodeRemediationList{})
to the ReporterCRDsToDump slice, and add the necessary imports for the NHC and
SNR API packages so their list types are available; update the
ReporterCRDsToDump variable to include these additional CRData entries alongside
the existing PodList entry.
tests/rhwa/nhc-operator/tests/planned-node-reboot.go (2)

111-114: Duplicate patch byte slices can be consolidated.

addLabelPatch (line 113-114) and addLabelPatchFO (line 239-240) are identical. Consider reusing the same variable to reduce duplication.

♻️ Suggested consolidation
 			addLabelPatch := []byte(
 				fmt.Sprintf(`{"metadata":{"labels":{%q:""}}}`, nhcparams.AppWorkerLabel))
-
-			if len(RHWAConfig.FailoverWorkers) > 0 {
 ...
-			By("Labeling failover worker nodes (for post-upgrade app scheduling)")
-
-			addLabelPatchFO := []byte(
-				fmt.Sprintf(`{"metadata":{"labels":{%q:""}}}`, nhcparams.AppWorkerLabel))
-
 			for _, workerName := range RHWAConfig.FailoverWorkers {
 ...
 				_, patchErr := APIClient.K8sClient.CoreV1().Nodes().Patch(
 					context.TODO(), workerName, types.MergePatchType,
-					addLabelPatchFO, metav1.PatchOptions{})
+					addLabelPatch, metav1.PatchOptions{})

Also applies to: 239-240

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/planned-node-reboot.go` around lines 111 - 114,
The two identical byte-slice variables addLabelPatch and addLabelPatchFO should
be consolidated into a single reusable variable to remove duplication: define
one addLabelPatch byte slice (using nhcparams.AppWorkerLabel) and replace usages
of addLabelPatchFO with that single variable (keep removeLabelPatch as-is);
ensure references in both places that previously used addLabelPatchFO now point
to addLabelPatch so behavior and formatting remain unchanged.

31-42: ContinueOnFailure with shared state may lead to cascading errors.

With ContinueOnFailure, if an early It block fails, subsequent blocks still run but may operate on an inconsistent state. For example, if "Step 4: Initiates cluster upgrade" fails, "Step 5" and later will still execute against a cluster that may not be upgrading. Consider whether this is the intended behavior or if sequential dependency should cause early exit.

Additionally, initialVersion (line 38) is assigned in BeforeAll but never used in the test. If it's intended for post-upgrade verification, consider adding that check or removing the unused variable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/planned-node-reboot.go` around lines 31 - 42,
The test uses ContinueOnFailure which allows later It blocks to run against a
potentially inconsistent state—remove ContinueOnFailure from the Describe
options (leave Ordered and Label) or replace it with a fail-fast behavior so an
early failure stops subsequent steps (locate the Describe invocation around the
Planned reboot test). Also address the unused initialVersion variable set in
BeforeAll: either add a post-upgrade verification It that asserts the cluster
reached the expected version using initialVersion or remove the variable if not
needed (look for initialVersion and the BeforeAll that assigns it).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/rhwa/nhc-operator/README.md`:
- Around line 23-33: Update the README text: fix the JSON example for
ECO_RHWA_NHC_TARGET_WORKER_BMC by removing the stray closing parenthesis so the
JSON ends with } not }) and hyphenate compound adjectives like "bare-metal"
(change "bare metal nodes" to "bare-metal nodes"); apply the same typo/style
fixes to the other occurrence mentioned (lines ~134-144) so both instances of
the BMC JSON example and any compound-adjective uses are corrected.

In `@tests/rhwa/nhc-operator/tests/nhc.go`:
- Around line 38-45: The test currently ignores the boolean return from
pod.WaitForAllPodsInNamespaceRunning which can yield a vacuous true when no pods
are selected; update the call to capture its boolean result (e.g., ok, err :=
pod.WaitForAllPodsInNamespaceRunning(...)) and add an assertion that the boolean
is true (Expect(ok).To(BeTrue(), "expected pods to be found and running")) in
addition to the existing Expect(err).ToNot(HaveOccurred()) so the test fails
when no pods are found running.

In `@tests/rhwa/nhc-operator/tests/planned-node-reboot.go`:
- Around line 457-466: The test currently ignores when the NHC resource lacks a
status; change it to fail the test if status is missing by asserting
Expect(hasStatus).To(BeTrue(), "NHC resource must have status after upgrade")
before accessing nhcStatus, then proceed to check nhcStatus["healthyNodes"],
nhcStatus["observedNodes"], and the absence of nhcStatus["unhealthyNodes"];
reference the existing nhcResource.Object lookup and the local variables
nhcStatus and hasStatus when making this assertion so the test fails fast and
clearly when no status is present.

---

Nitpick comments:
In `@tests/rhwa/internal/rhwaconfig/default.yaml`:
- Around line 4-11: The default YAML is missing the RHWAConfig upgrade-related
keys; update the config template to include the upgrade inputs that RHWAConfig
now expects (e.g., add default keys such as upgrade_enabled,
upgrade_target_version, planned_reboot_during_upgrade and any other RHWAConfig
upgrade_* fields) with sensible empty/false defaults so the
planned-reboot-during-upgrade flow is configurable; ensure the keys match the
RHWAConfig property names exactly so code parsing functions (e.g., the
RHWAConfig loader/validator) will recognize them.

In `@tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go`:
- Around line 27-30: ReporterCRDsToDump currently only includes PodList; expand
it to also dump NodeHealthCheck and SelfNodeRemediation CRs by adding CRData
entries with the list types (e.g., Cr: &NodeHealthCheckList{} and Cr:
&SelfNodeRemediationList{}) to the ReporterCRDsToDump slice, and add the
necessary imports for the NHC and SNR API packages so their list types are
available; update the ReporterCRDsToDump variable to include these additional
CRData entries alongside the existing PodList entry.

In `@tests/rhwa/nhc-operator/README.md`:
- Around line 92-113: Add a new test-suite row for planned-reboot-during-upgrade
(pointing to tests/planned-reboot-during-upgrade.go) and document the new
environment variables used by that flow; specifically list
ECO_RHWA_NHC_UPGRADE_IMAGE, ECO_RHWA_NHC_UPGRADE_REPO and
ECO_RHWA_NHC_UPGRADE_TARGETS (or the exact upgrade-related env var names
introduced in this PR) with short descriptions of their meanings (upgrade
image/repo and which nodes to reboot) so the README’s "Test suites" and "Inputs"
sections match the new planned-reboot-during-upgrade implementation and the
constants in internal/nhcparams/const.go.

In `@tests/rhwa/nhc-operator/tests/planned-node-reboot.go`:
- Around line 111-114: The two identical byte-slice variables addLabelPatch and
addLabelPatchFO should be consolidated into a single reusable variable to remove
duplication: define one addLabelPatch byte slice (using
nhcparams.AppWorkerLabel) and replace usages of addLabelPatchFO with that single
variable (keep removeLabelPatch as-is); ensure references in both places that
previously used addLabelPatchFO now point to addLabelPatch so behavior and
formatting remain unchanged.
- Around line 31-42: The test uses ContinueOnFailure which allows later It
blocks to run against a potentially inconsistent state—remove ContinueOnFailure
from the Describe options (leave Ordered and Label) or replace it with a
fail-fast behavior so an early failure stops subsequent steps (locate the
Describe invocation around the Planned reboot test). Also address the unused
initialVersion variable set in BeforeAll: either add a post-upgrade verification
It that asserts the cluster reached the expected version using initialVersion or
remove the variable if not needed (look for initialVersion and the BeforeAll
that assigns it).

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go`:
- Around line 447-460: The current block only handles percentage minHealthy
values (minHealthyStr ending with "%"); add an else branch to validate absolute
minHealthy numbers: when strings.CutSuffix(minHealthyStr, "%") returns
ok==false, parse minHealthyStr as a float (reuse strconv.ParseFloat) into an
absoluteMin value, compute minRequired := math.Ceil(absoluteMin) (or just
absoluteMin if you prefer integer semantics), and then perform the same
conversions of observedNodes/healthyNodes via toFloat64 and the Expect check
comparing healthy >= minRequired (use the same error message formatting). Update
the code around minHealthyStr, pctStr, toFloat64, and minRequired to support
both percentage and absolute forms.
- Around line 762-773: The toFloat64 function currently handles float64, int64
and string but should also accept plain int values from various callers; update
the type switch in toFloat64 to add a case for int that returns float64(v), nil
(and optionally other integer types if desired), ensuring the function signature
and error behavior remain unchanged so calls to toFloat64 will correctly convert
int inputs.
- Around line 306-309: After creating the BMC client via
bmc.New(...).WithRedfishUser(...).WithRedfishTimeout(...) perform an immediate
connectivity validation in the BeforeAll setup by invoking a lightweight BMC
call on bmcClient (e.g., a Ping/GetSystems/GetManagers-style method or any
existing health/status call) and fail the setup early if it returns an error;
place this check in BeforeAll before proceeding to tests so you detect bad
address/credentials earlier instead of waiting for SystemPowerOff in Step 4.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: faf4e80f-e33e-4f24-9b5b-9bd9cdd3ec7c

📥 Commits

Reviewing files that changed from the base of the PR and between 88fe5ec and c1eef8a.

📒 Files selected for processing (12)
  • tests/rhwa/internal/rhwaconfig/default.yaml
  • tests/rhwa/internal/rhwaconfig/rhwaconfig.go
  • tests/rhwa/nhc-operator/README.md
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
  • tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go
  • tests/rhwa/nhc-operator/nhc_suite_test.go
  • tests/rhwa/nhc-operator/tests/nhc.go
  • tests/rhwa/nhc-operator/tests/planned-node-reboot.go
  • tests/rhwa/nhc-operator/tests/sudden-node-loss.go
  • tests/rhwa/snr-operator/internal/snrparams/snrvars.go
  • tests/rhwa/snr-operator/snr_suite_test.go
  • tests/rhwa/snr-operator/tests/snr.go

Comment thread tests/rhwa/nhc-operator/README.md Outdated
Comment thread tests/rhwa/nhc-operator/tests/nhc.go Outdated
Comment thread tests/rhwa/nhc-operator/tests/planned-node-reboot.go Outdated
@rdiscala rdiscala force-pushed the CNF-17181-nhc-snr-no-remediation-on-upgrade-test branch from c1eef8a to afa9fdd Compare March 20, 2026 18:06
@rdiscala
Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/rhwa/internal/rhwaconfig/rhwaconfig.go`:
- Around line 30-35: In BMCDetails.Decode, avoid unmarshalling directly into the
receiver (method Decode on BMCDetails) because json.Unmarshal merges fields and
can leave a mix of YAML and env values; instead, unmarshal into a temporary
BMCDetails value (e.g., tmp := BMCDetails{}), check for empty input as now, call
json.Unmarshal into that tmp, and only if unmarshal succeeds assign *b = tmp so
the replacement is atomic and no partial overrides from environment variables
remain.

In `@tests/rhwa/nhc-operator/README.md`:
- Around line 159-160: Replace the non-copyable shell examples that use angle
brackets (e.g., `export KUBECONFIG=</path/to/kubeconfig>`) with plain,
copy-pasteable paths (for example `export KUBECONFIG=/path/to/kubeconfig` or
`export KUBECONFIG=$HOME/.kube/config`) so the shell does not interpret `<` as
redirection; update both occurrences mentioned (the snippet at lines ~159-160
and the similar snippet at ~173-174) to use a plain path placeholder instead of
`<...>`.

In `@tests/rhwa/nhc-operator/tests/planned-node-reboot.go`:
- Around line 353-360: The closure passed to Eventually is currently swallowing
list errors and can falsely return true; modify the closure used with Eventually
(the anonymous func that calls
APIClient.Resource(snrGVR).Namespace(rhwaparams.RhwaOperatorNs).List(...)) so
that if listErr != nil it logs the error and returns false (do not set
snrCreated), ensuring the Eventually loop keeps retrying on List failures;
otherwise proceed with the existing check of len(snrList.Items) and snrCreated
as before.
- Around line 382-397: The current check only asserts the type via
nhcStatus["unhealthyNodes"].([]any) and treats any slice (including empty) as
“unhealthy”; change both locations that use the type assertion (the polling
block around the unhealthyNames build and the post-upgrade assertion) to also
check the slice length (len(unhealthyNodes) > 0) before treating nodes as
unhealthy — only build unhealthyNames, log the warning, and set
hasUnhealthy-logic when the slice is non-empty; otherwise fall through to the
healthy/info branch or treat as no unhealthy nodes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7f700315-84f6-4310-91c4-1711f906e27d

📥 Commits

Reviewing files that changed from the base of the PR and between c1eef8a and afa9fdd.

📒 Files selected for processing (6)
  • tests/rhwa/internal/rhwaconfig/rhwaconfig.go
  • tests/rhwa/nhc-operator/README.md
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
  • tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go
  • tests/rhwa/nhc-operator/tests/nhc.go
  • tests/rhwa/nhc-operator/tests/planned-node-reboot.go
✅ Files skipped from review due to trivial changes (1)
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/rhwa/nhc-operator/tests/nhc.go
  • tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go

Comment thread tests/rhwa/internal/rhwaconfig/rhwaconfig.go Outdated
Comment thread tests/rhwa/nhc-operator/README.md Outdated
Comment thread tests/rhwa/nhc-operator/tests/planned-node-reboot.go Outdated
Comment thread tests/rhwa/nhc-operator/tests/planned-node-reboot.go Outdated
@rdiscala rdiscala force-pushed the CNF-17181-nhc-snr-no-remediation-on-upgrade-test branch from afa9fdd to c7c9bff Compare March 20, 2026 19:02
@rdiscala
Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@rdiscala rdiscala force-pushed the CNF-17181-nhc-snr-no-remediation-on-upgrade-test branch from c7c9bff to f6e8a25 Compare March 20, 2026 19:12
@rdiscala
Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@rdiscala rdiscala force-pushed the CNF-17181-nhc-snr-no-remediation-on-upgrade-test branch from f6e8a25 to 036ec49 Compare March 20, 2026 19:29
@rdiscala
Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@rdiscala rdiscala force-pushed the CNF-17181-nhc-snr-no-remediation-on-upgrade-test branch from 036ec49 to 2412f58 Compare March 20, 2026 19:38
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@rdiscala
Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

1 similar comment
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
tests/rhwa/internal/rhwaconfig/rhwaconfig.go (1)

30-40: ⚠️ Potential issue | 🟠 Major

Clear BMCDetails when the env override is blank.

Decode currently returns on empty/whitespace input without touching *b. If ECO_RHWA_NHC_TARGET_WORKER_BMC is explicitly set empty, the YAML/default BMC settings survive, so the sudden-loss test can still use stale credentials or the wrong endpoint.

🛠️ Suggested fix
 func (b *BMCDetails) Decode(value string) error {
-	if strings.TrimSpace(value) == "" {
-		return nil
+	trimmed := strings.TrimSpace(value)
+	if trimmed == "" {
+		*b = BMCDetails{}
+		return nil
 	}
 
 	var tmp BMCDetails
-	if err := json.Unmarshal([]byte(value), &tmp); err != nil {
+	if err := json.Unmarshal([]byte(trimmed), &tmp); err != nil {
 		return err
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/internal/rhwaconfig/rhwaconfig.go` around lines 30 - 40, The
Decode method should clear the receiver when the input is empty: in
BMCDetails.Decode, when strings.TrimSpace(value) == "" assign an empty
BMCDetails to *b (e.g., *b = BMCDetails{}) before returning nil so any
previous/default values are removed when the env override is explicitly set
blank.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/rhwa/nhc-operator/README.md`:
- Around line 36-46: Update the README text to clarify that the test requires a
Redfish endpoint because tests/rhwa/nhc-operator/tests/sudden-node-loss.go only
constructs a Redfish client: explicitly state that VirtualBMC (vbmc) by itself
is IPMI-only and must be paired with a Redfish front-end such as sushy-emulator
(or an equivalent Redfish proxy) for the test to work; keep the existing example
for ECO_RHWA_NHC_TARGET_WORKER_BMC but add a note that its address must point to
a Redfish endpoint, not plain vbmc/IPMI.

In `@tests/rhwa/nhc-operator/tests/planned-node-reboot.go`:
- Around line 36-42: The test never verifies the worker actually rebooted:
capture the target node's boot identifier before the upgrade and assert it
changes afterward. Specifically, read Node.Status.NodeInfo.BootID (or another
stable boot marker) for targetNode prior to triggering the upgrade (using the
variables targetNode/upgradeImage flow), then after the ClusterVersion reaches
upgradeImage and the node is Ready again, re-read Node.Status.NodeInfo.BootID
and fail the test if it is unchanged; add this check alongside the existing
SNR/ClusterVersion waits so the planned-reboot path is actually exercised.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go`:
- Around line 516-548: The test currently treats any existing
SelfNodeRemediation (snrGVR via APIClient) for targetNode as proof of
remediation; to avoid false positives you must record and ignore pre-existing
SNRs created before the power-off. Before triggering the power-off, list current
SelfNodeRemediation objects (using
APIClient.Resource(snrGVR).Namespace(rhwaparams.RhwaOperatorNs).List(...)) and
build a set (e.g., preExistingSNRs by name or UID). Then in the Eventual polling
loop that searches snrList.Items, skip any SNR whose name/UID is in that
preExisting set and only treat a match as success when it is not pre-existing
(assign snrResourceName and return true only for new SNRs). Use the existing
identifiers snrGVR, APIClient, targetNode and snrResourceName to locate and
implement the change.
- Around line 628-640: Replace the best-effort event-only check with a
definitive Node taint assertion: after listing events (the existing
APIClient.CoreV1Interface.Events call and targetNode variable), fetch the Node
object via APIClient.CoreV1Interface.Nodes().Get(context.TODO(), targetNode,
metav1.GetOptions{}) and inspect node.Spec.Taints (or node.GetTaints()) for a
taint with Key "node.kubernetes.io/out-of-service" and Effect matching expected
value; if not present, fail the test (return an error or use the test
framework's Expect/Fail). Keep the existing event log as informational only, but
make the test outcome depend on the Node taint check instead of the event
presence.

---

Duplicate comments:
In `@tests/rhwa/internal/rhwaconfig/rhwaconfig.go`:
- Around line 30-40: The Decode method should clear the receiver when the input
is empty: in BMCDetails.Decode, when strings.TrimSpace(value) == "" assign an
empty BMCDetails to *b (e.g., *b = BMCDetails{}) before returning nil so any
previous/default values are removed when the env override is explicitly set
blank.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 219c99ca-237f-4aa5-85af-15d0aadef9d8

📥 Commits

Reviewing files that changed from the base of the PR and between afa9fdd and f6e8a25.

📒 Files selected for processing (7)
  • tests/rhwa/internal/rhwaconfig/rhwaconfig.go
  • tests/rhwa/nhc-operator/README.md
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
  • tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go
  • tests/rhwa/nhc-operator/tests/nhc.go
  • tests/rhwa/nhc-operator/tests/planned-node-reboot.go
  • tests/rhwa/nhc-operator/tests/sudden-node-loss.go
✅ Files skipped from review due to trivial changes (1)
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/rhwa/nhc-operator/tests/nhc.go
  • tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go

Comment thread tests/rhwa/nhc-operator/README.md
Comment thread tests/rhwa/nhc-operator/tests/planned-node-reboot.go
Comment thread tests/rhwa/nhc-operator/tests/sudden-node-loss.go Outdated
Comment thread tests/rhwa/nhc-operator/tests/sudden-node-loss.go Outdated
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@rdiscala rdiscala force-pushed the CNF-17181-nhc-snr-no-remediation-on-upgrade-test branch from 2412f58 to 0a29c4c Compare March 20, 2026 20:12
@rdiscala
Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

2 similar comments
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (4)
tests/rhwa/nhc-operator/README.md (1)

56-58: ⚠️ Potential issue | 🟡 Minor

Clarify that plain IPMI is not enough here.

sudden-node-loss.go creates a Redfish client only, so "BMC/Redfish (or iLO/IPMI)" overstates what the test can talk to. Please say the endpoint must expose Redfish; iLO is fine via Redfish, but raw IPMI/vbmc alone is not.

📝 Suggested wording
-* The target worker node must have **BMC/Redfish** (or iLO/IPMI) access for power control.
+* The target worker node must have **Redfish-capable BMC** access for power control
+  (for example, iLO via Redfish). Plain IPMI/vbmc alone is not sufficient here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/README.md` around lines 56 - 58, Update the README
wording to require a Redfish-accessible BMC endpoint: change the phrase
"BMC/Redfish (or iLO/IPMI)" to state that the target worker node must expose
Redfish (iLO is OK when it exposes Redfish) and explicitly note that raw
IPMI/vbmc alone is not sufficient because sudden-node-loss.go creates a Redfish
client and only speaks Redfish.
tests/rhwa/nhc-operator/tests/sudden-node-loss.go (2)

516-548: ⚠️ Potential issue | 🟠 Major

Ignore pre-existing SelfNodeRemediation objects.

This poll treats any leftover SNR for targetNode as proof that the current power-off triggered remediation. A stale CR from a previous run can satisfy Step 5 immediately, and Step 6/7 then reason over old status instead of the object created by this test. Snapshot existing SNR names/UIDs before Step 4 and only accept a new resource here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go` around lines 516 - 548,
Before polling in the Eventually block that lists SelfNodeRemediation (the code
calling
APIClient.Resource(snrGVR).Namespace(rhwaparams.RhwaOperatorNs).List(...)),
snapshot existing SNR identifiers (names and/or UIDs) into a set (e.g.
preexistingSNRs) prior to Step 4; then change the polling loop that iterates
over snrList.Items to ignore any SNR whose name or UID is present in
preexistingSNRs and only treat an SNR for targetNode as matching if it is not in
that snapshot (set the snrResourceName from that new SNR). This ensures the
Eventually/WithTimeout block only accepts a newly created SelfNodeRemediation
rather than stale resources.

628-640: ⚠️ Potential issue | 🟠 Major

Assert the out-of-service taint on the Node.

This block only does a best-effort event lookup, so Step 6 can pass without proving node.kubernetes.io/out-of-service was ever set. Since the node stays powered off until cleanup, an Eventually on node.Spec.Taints is the definitive check.

🛠️ Suggested fix
-			By("Verifying out-of-service taint was applied (via events)")
+			By("Verifying out-of-service taint was applied")
+
+			Eventually(func() bool {
+				node, pullErr := nodes.Pull(APIClient, targetNode)
+				if pullErr != nil {
+					return false
+				}
+
+				for _, taint := range node.Object.Spec.Taints {
+					if taint.Key == nhcparams.OutOfServiceTaintKey {
+						return true
+					}
+				}
+
+				return false
+			}).WithTimeout(time.Minute).
+				WithPolling(nhcparams.PollingInterval).
+				Should(BeTrue(),
+					fmt.Sprintf("Node %s was not tainted with %s",
+						targetNode, nhcparams.OutOfServiceTaintKey))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go` around lines 628 - 640,
The test only queries Events via APIClient.CoreV1Interface.Events(...) for an
AddOutOfService event and does not assert the actual node taint; update the test
to explicitly check the Node object's taints (for node name targetNode) by using
an Eventually that polls the Node (e.g., via
APIClient.CoreV1Interface.Nodes().Get or helper used elsewhere) and asserts
node.Spec.Taints contains the key "node.kubernetes.io/out-of-service" (and
desired effect/value), so the test fails if the taint was never applied rather
than relying solely on events or SNR phase for confirmation.
tests/rhwa/nhc-operator/tests/planned-node-reboot.go (1)

37-39: ⚠️ Potential issue | 🟠 Major

Prove that the target worker actually rebooted.

Right now this spec turns green once ClusterVersion reaches upgradeImage and no SNRs show up. A worker excluded from the rollout, or already on the desired config, can satisfy that without exercising the planned-reboot path at all. Capture the target node’s pre-upgrade BootID (or another reboot marker) and assert it changes before the test passes.

Also applies to: 155-158, 365-460

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/planned-node-reboot.go` around lines 37 - 39,
Capture the target node’s pre-upgrade reboot marker and assert it changed:
before triggering the update (where targetNode, initialVersion, upgradeImage are
set), call a helper like getNodeBootID(targetNode) and save it to preBootID;
after waiting for ClusterVersion to reach upgradeImage and no SNRs, call
getNodeBootID(targetNode) again (postBootID) and assert preBootID != postBootID.
Add or reuse a small helper (e.g., getNodeBootID or fetchBootIDFromNode) that
reads the node's /proc/sys/kernel/random/boot_id (or equivalent via SSH/K8s
exec) and return the string; add the assertion near the existing success checks
so the test only passes if the boot ID changed. Ensure similar changes are
applied to the other indicated sections (the duplicate ranges) that perform the
upgrade check.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go`:
- Around line 38-39: The SNRFenceTimeout constant (SNRFenceTimeout in
nhcvars.go) is set to 5 minutes but README and worst-case fencing take 5–8
minutes; increase the constant to exceed that worst case (e.g., set
SNRFenceTimeout to 10 * time.Minute) and update the inline comment to reflect
the new timeout so the sudden-loss test waiting on this value won't time out on
slower clusters.

In `@tests/rhwa/nhc-operator/tests/planned-node-reboot.go`:
- Around line 395-396: The NodeHealthCheck reads using
APIClient.Resource(nhcGVR).Get(...) are missing namespace scoping, so update
both calls (the ones fetching nhcResource with nhcGVR and
nhcparams.NHCResourceName) to include .Namespace(rhwaparams.RhwaOperatorNs)
before .Get(...); this ensures the namespaced NodeHealthCheck resource
(nhc-worker-self) is fetched from the openshift-workload-availability namespace
rather than the cluster scope.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go`:
- Around line 429-430: The NodeHealthCheck lookups currently call
APIClient.Resource(nhcGVR).Get without a namespace, hitting the cluster-scoped
endpoint; change these Get calls (the one using nhcparams.NHCResourceName and
the other at the similar location) to use the namespaced variant by passing
rhwaparams.RhwaOperatorNs as the namespace argument (i.e., use
APIClient.Resource(nhcGVR).Namespace(rhwaparams.RhwaOperatorNs).Get or the
equivalent namespaced Get overload) so Step 5 reads the NHC status from the
correct namespace.

---

Duplicate comments:
In `@tests/rhwa/nhc-operator/README.md`:
- Around line 56-58: Update the README wording to require a Redfish-accessible
BMC endpoint: change the phrase "BMC/Redfish (or iLO/IPMI)" to state that the
target worker node must expose Redfish (iLO is OK when it exposes Redfish) and
explicitly note that raw IPMI/vbmc alone is not sufficient because
sudden-node-loss.go creates a Redfish client and only speaks Redfish.

In `@tests/rhwa/nhc-operator/tests/planned-node-reboot.go`:
- Around line 37-39: Capture the target node’s pre-upgrade reboot marker and
assert it changed: before triggering the update (where targetNode,
initialVersion, upgradeImage are set), call a helper like
getNodeBootID(targetNode) and save it to preBootID; after waiting for
ClusterVersion to reach upgradeImage and no SNRs, call getNodeBootID(targetNode)
again (postBootID) and assert preBootID != postBootID. Add or reuse a small
helper (e.g., getNodeBootID or fetchBootIDFromNode) that reads the node's
/proc/sys/kernel/random/boot_id (or equivalent via SSH/K8s exec) and return the
string; add the assertion near the existing success checks so the test only
passes if the boot ID changed. Ensure similar changes are applied to the other
indicated sections (the duplicate ranges) that perform the upgrade check.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go`:
- Around line 516-548: Before polling in the Eventually block that lists
SelfNodeRemediation (the code calling
APIClient.Resource(snrGVR).Namespace(rhwaparams.RhwaOperatorNs).List(...)),
snapshot existing SNR identifiers (names and/or UIDs) into a set (e.g.
preexistingSNRs) prior to Step 4; then change the polling loop that iterates
over snrList.Items to ignore any SNR whose name or UID is present in
preexistingSNRs and only treat an SNR for targetNode as matching if it is not in
that snapshot (set the snrResourceName from that new SNR). This ensures the
Eventually/WithTimeout block only accepts a newly created SelfNodeRemediation
rather than stale resources.
- Around line 628-640: The test only queries Events via
APIClient.CoreV1Interface.Events(...) for an AddOutOfService event and does not
assert the actual node taint; update the test to explicitly check the Node
object's taints (for node name targetNode) by using an Eventually that polls the
Node (e.g., via APIClient.CoreV1Interface.Nodes().Get or helper used elsewhere)
and asserts node.Spec.Taints contains the key
"node.kubernetes.io/out-of-service" (and desired effect/value), so the test
fails if the taint was never applied rather than relying solely on events or SNR
phase for confirmation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ad809f7e-7b9f-4b78-b15c-1dc6a98f91f5

📥 Commits

Reviewing files that changed from the base of the PR and between f6e8a25 and 2412f58.

📒 Files selected for processing (7)
  • tests/rhwa/internal/rhwaconfig/rhwaconfig.go
  • tests/rhwa/nhc-operator/README.md
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
  • tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go
  • tests/rhwa/nhc-operator/tests/nhc.go
  • tests/rhwa/nhc-operator/tests/planned-node-reboot.go
  • tests/rhwa/nhc-operator/tests/sudden-node-loss.go
✅ Files skipped from review due to trivial changes (2)
  • tests/rhwa/nhc-operator/tests/nhc.go
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/rhwa/internal/rhwaconfig/rhwaconfig.go

Comment thread tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go Outdated
Comment thread tests/rhwa/nhc-operator/tests/planned-node-reboot.go Outdated
Comment thread tests/rhwa/nhc-operator/tests/sudden-node-loss.go Outdated
@rdiscala rdiscala force-pushed the CNF-17181-nhc-snr-no-remediation-on-upgrade-test branch from 0a29c4c to 85b8ea1 Compare March 20, 2026 20:37
@rdiscala
Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

2 similar comments
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
tests/rhwa/nhc-operator/README.md (1)

209-209: ⚠️ Potential issue | 🟡 Minor

Minor: Hyphenate "bare-metal" for consistency.

The compound adjective "bare metal" modifying "boot" should be hyphenated for consistency with other usages in this document (lines 36, 199).

📝 Suggested fix
-| AfterAll: Power on node & wait for Ready | ~5 min | Bare metal boot + kubelet registration |
+| AfterAll: Power on node & wait for Ready | ~5 min | Bare-metal boot + kubelet registration |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/README.md` at line 209, Update the table row string
that currently reads "| AfterAll: Power on node & wait for Ready | ~5 min | Bare
metal boot + kubelet registration |" to hyphenate the compound adjective so it
reads "| AfterAll: Power on node & wait for Ready | ~5 min | Bare-metal boot +
kubelet registration |"; this keeps the "Bare-metal" phrasing consistent with
other instances in the README (e.g., lines referencing "bare-metal") and only
requires replacing "Bare metal" with "Bare-metal" in that table cell.
🧹 Nitpick comments (2)
tests/rhwa/nhc-operator/tests/sudden-node-loss.go (1)

269-273: Redundant reassignment of targetNode after verification.

Line 269 reassigns targetNode from the pod's node name, but line 272-273 immediately asserts it equals RHWAConfig.TargetWorker. The reassignment is unnecessary since the assertion would fail if they differ. Consider removing line 269 to avoid confusion.

♻️ Suggested simplification
-			targetNode = appPods[0].Object.Spec.NodeName
-			klog.Infof("Stateful app is running on node %s", targetNode)
+			actualNode := appPods[0].Object.Spec.NodeName
+			klog.Infof("Stateful app is running on node %s", actualNode)

-			Expect(targetNode).To(Equal(RHWAConfig.TargetWorker),
+			Expect(actualNode).To(Equal(RHWAConfig.TargetWorker),
 				"App pod should be pinned to the configured target node")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go` around lines 269 - 273,
The assignment targetNode = appPods[0].Object.Spec.NodeName is redundant because
you immediately assert that value equals RHWAConfig.TargetWorker; remove the
standalone reassignment and use the pod's NodeName directly in the assertion (or
set targetNode once where it's first established) to avoid confusion—locate the
assignment near usages of targetNode and delete it, keeping the
Expect(...To(Equal(RHWAConfig.TargetWorker))) assertion as the verification.
tests/rhwa/nhc-operator/tests/planned-node-reboot.go (1)

249-253: Same reassignment pattern as sudden-node-loss.

Similar to the sudden-node-loss test, targetNode is reassigned from the pod and then immediately asserted to match the config. This is stylistically inconsistent but functionally correct.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/planned-node-reboot.go` around lines 249 - 253,
The reassignment of targetNode from appPods[0].Object.Spec.NodeName is
stylistically inconsistent; instead avoid mutating the previously-declared
targetNode by either asserting directly against the pod's node name or using a
new local variable. Replace the lines that set targetNode =
appPods[0].Object.Spec.NodeName and the Expect call with either
Expect(appPods[0].Object.Spec.NodeName).To(Equal(RHWAConfig.TargetWorker)) or
nodeName := appPods[0].Object.Spec.NodeName followed by
Expect(nodeName).To(Equal(RHWAConfig.TargetWorker)), keeping
RHWAConfig.TargetWorker and appPods as the referenced symbols.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/rhwa/nhc-operator/README.md`:
- Line 209: Update the table row string that currently reads "| AfterAll: Power
on node & wait for Ready | ~5 min | Bare metal boot + kubelet registration |" to
hyphenate the compound adjective so it reads "| AfterAll: Power on node & wait
for Ready | ~5 min | Bare-metal boot + kubelet registration |"; this keeps the
"Bare-metal" phrasing consistent with other instances in the README (e.g., lines
referencing "bare-metal") and only requires replacing "Bare metal" with
"Bare-metal" in that table cell.

---

Nitpick comments:
In `@tests/rhwa/nhc-operator/tests/planned-node-reboot.go`:
- Around line 249-253: The reassignment of targetNode from
appPods[0].Object.Spec.NodeName is stylistically inconsistent; instead avoid
mutating the previously-declared targetNode by either asserting directly against
the pod's node name or using a new local variable. Replace the lines that set
targetNode = appPods[0].Object.Spec.NodeName and the Expect call with either
Expect(appPods[0].Object.Spec.NodeName).To(Equal(RHWAConfig.TargetWorker)) or
nodeName := appPods[0].Object.Spec.NodeName followed by
Expect(nodeName).To(Equal(RHWAConfig.TargetWorker)), keeping
RHWAConfig.TargetWorker and appPods as the referenced symbols.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go`:
- Around line 269-273: The assignment targetNode =
appPods[0].Object.Spec.NodeName is redundant because you immediately assert that
value equals RHWAConfig.TargetWorker; remove the standalone reassignment and use
the pod's NodeName directly in the assertion (or set targetNode once where it's
first established) to avoid confusion—locate the assignment near usages of
targetNode and delete it, keeping the
Expect(...To(Equal(RHWAConfig.TargetWorker))) assertion as the verification.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b73d0c8b-0422-4f1e-85d4-a050810ba66c

📥 Commits

Reviewing files that changed from the base of the PR and between 2412f58 and 85b8ea1.

📒 Files selected for processing (7)
  • tests/rhwa/internal/rhwaconfig/rhwaconfig.go
  • tests/rhwa/nhc-operator/README.md
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
  • tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go
  • tests/rhwa/nhc-operator/tests/nhc.go
  • tests/rhwa/nhc-operator/tests/planned-node-reboot.go
  • tests/rhwa/nhc-operator/tests/sudden-node-loss.go
✅ Files skipped from review due to trivial changes (1)
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go

@rdiscala rdiscala force-pushed the CNF-17181-nhc-snr-no-remediation-on-upgrade-test branch from 85b8ea1 to 84a1ea1 Compare March 20, 2026 22:27
@Demostenes777 Demostenes777 self-requested a review March 31, 2026 13:32
Demostenes777
Demostenes777 previously approved these changes Mar 31, 2026
Copy link
Copy Markdown
Collaborator

@Demostenes777 Demostenes777 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: NHC & SNR System Tests

Approve and merge - The duplication is acceptable for initial implementation. Consider refactoring in a follow-up PR once PR #1272 is merged and the shared patterns become clearer.

Code review assisted by Cursor.

Refactorization suggestions

Stateful App Deployment Logic

Duplication: Lines 190-292 in sudden-node-loss.go and lines 170-292 in planned-node-reboot.go

Both tests create identical stateful apps with:

  • Namespace creation
  • PVC creation with the same configuration
  • Deployment with identical container spec, volume mount, readiness probe
  • Recreate deployment strategy

Recommendation: Extract to shared helper in tests/rhwa/internal/rhwahelpers/:

// CreateStatefulTestApp deploys a test application with persistent storage.
func CreateStatefulTestApp(
    apiClient *clients.Settings,
    targetNode string,
    storageClass string,
    appImage string,
) (*deployment.Builder, error)

Locations:

  • sudden-node-loss.go:190-292
  • planned-node-reboot.go:170-292

Node Label Management

Duplication: Complex label management logic appears 4 times across both files:

  1. Remove labels from failover nodes (BeforeAll) - ~20 lines each

    • sudden-node-loss.go:156-174
    • planned-node-reboot.go:134-154
  2. Add label to target node (BeforeAll) - ~15 lines each

    • sudden-node-loss.go:176-188
    • planned-node-reboot.go:156-168
  3. Label failover nodes (BeforeAll) - ~28 lines each

    • sudden-node-loss.go:275-302
    • planned-node-reboot.go:265-292
  4. Restore labels (AfterAll) - ~26 lines, IDENTICAL

    • sudden-node-loss.go:350-375
    • planned-node-reboot.go:305-330

You could reuse the suggested code in PR #1272

Helper Functions Only in One File

Location: sudden-node-loss.go:742-775

Three helper functions (isPodReady, filterRunningPods, toFloat64) are defined in sudden-node-loss.go but would be useful across tests:

  • isPodReady() - checks pod readiness condition
  • filterRunningPods() - filters pods by Running phase
  • toFloat64() - type conversion helper

You could reuse the suggested code in PR #1272

Node Readiness Checking

Duplication: Node readiness verification appears 5+ times with identical logic:

  • sudden-node-loss.go:121-139 (BeforeAll)
  • planned-node-reboot.go:72-90 (BeforeAll)
  • planned-node-reboot.go:545-563 (post-upgrade check)
  • More occurrences in Eventually blocks

You could reuse the suggested code in PR #1272

Function Length

Issue: BeforeAll blocks are 250+ lines, making them hard to review and test.

Locations:

  • sudden-node-loss.go:59-310 (251 lines)
  • planned-node-reboot.go:45-293 (248 lines)

Recommendation: Extract setup logic into well-named private functions within the test file:

func setupNodeLabels() (labeledWorkers, unlabeledWorkers []string)
func deployStatefulApp() (targetNode string)
func verifyPrerequisites()

Magic Numbers

Some values are hardcoded that could be constants:

  • sudden-node-loss.go:237: 5 * time.Minute for deployment wait
  • planned-node-reboot.go:343: 5 * time.Minute for namespace deletion

Recommendation: Add to nhcparams:

DeploymentReadyTimeout = 5 * time.Minute
NamespaceDeleteTimeout = 5 * time.Minute

Upgrade Rollback Not Handled ✅

Location: planned-node-reboot.go:350-374

The test initiates a cluster upgrade but has no rollback mechanism if the test fails mid-upgrade.

@rdiscala rdiscala marked this pull request as draft March 31, 2026 16:02
rdiscala added 2 commits April 1, 2026 12:47
Add a test suite that runs through the following scenario.

A healthy OpenShift cluster with a MNO topology experiences the sudden
shutdown of a worker node that is running a stateful application
(a simple Pod writing to a PV). The application is evicted and
rescheduled to another healthy node in order to provide business continuity.

Assisted-by: Claude
Add a test suite that runs through the following scenario.

A healthy OpenShift cluster with a MNO topology experiences the sudden
shutdown of a worker node that is running a stateful application
(a simple Pod writing to a PV). The application is evicted and
rescheduled to another healthy node in order to provide business continuity.

Assisted-by: Claude
@rdiscala rdiscala force-pushed the CNF-17181-nhc-snr-no-remediation-on-upgrade-test branch 2 times, most recently from c4aad8f to f062293 Compare April 1, 2026 12:03
Verify that the NHC operator does NOT trigger remediation when
a worker node reboots as part of a planned OCP cluster upgrade.
The test deploys a stateful app, initiates a cluster upgrade,
and polls throughout the upgrade to confirm no SelfNodeRemediation
resources are created. Post-upgrade checks verify all nodes are
healthy, no out-of-service taints remain, and the app survived.

Related: ECOPROJECT-2283

Co-Authored-By: Claude
@rdiscala rdiscala force-pushed the CNF-17181-nhc-snr-no-remediation-on-upgrade-test branch from f062293 to b4f0458 Compare April 1, 2026 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants