
rhwa nhc: add NHC & SNR sudden-loss system test #1272

Draft
rdiscala wants to merge 1 commit into rh-ecosystem-edge:main from rdiscala:CNF-17181-nhc-snr-sudden-loss-test

Conversation

@rdiscala
Collaborator

@rdiscala rdiscala commented Mar 19, 2026

Add a test suite that runs through the following scenario.

A healthy OpenShift cluster with an MNO topology experiences the sudden shutdown of a worker node that is running a stateful application (a simple Pod writing to a PV). The application is evicted and rescheduled to another healthy node to provide business continuity.

Assisted-by: Claude

Summary by CodeRabbit

  • Tests

    • Added NHC and SNR readiness suites with lifecycle reporter hooks, JUnit/XML outputs, and ordered e2e coverage.
    • Added a comprehensive "sudden node loss" workflow exercising BMC power off, NHC detection, SNR fencing, PVC reattachment, and app failover.
  • Chores

    • Expanded test configuration with explicit defaults for target worker, failover workers, storage class, app image, and BMC credentials.
    • Added operator identifiers, GVRs, reporter targets, and tuned timeouts/intervals for tests.
  • Documentation

    • Added README detailing the sudden-node-loss scenario, prerequisites, env vars, and run instructions.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 19, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b99d5748-2b1f-41fc-a812-bd384ab3e9f1

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Add NHC/SNR test infrastructure: explicit YAML defaults and BMC config, new BMCDetails type and RHWAConfig fields, operator constants and timeouts, Ginkgo suite entrypoints, reporter config, and new NHC/SNR e2e tests including a BMC-driven "sudden node loss" scenario.

Changes

Changes by cohort:

  • Configuration & config types (tests/rhwa/internal/rhwaconfig/default.yaml, tests/rhwa/internal/rhwaconfig/rhwaconfig.go): Replace the YAML placeholder with explicit NHC defaults and nested BMC keys. Add an exported BMCDetails type with a Decode(string) error method and extend RHWAConfig with TargetWorker, FailoverWorkers, StorageClass, AppImage, and TargetWorkerBMC (envconfig/yaml tags).
  • NHC operator constants & vars (tests/rhwa/nhc-operator/internal/nhcparams/const.go, tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go): Add scenario/operator constants, app/PVC identifiers, reporter config, NhcGVR, multiple timeout/polling duration variables, and operator identity labels.
  • NHC test suites & tests (tests/rhwa/nhc-operator/nhc_suite_test.go, tests/rhwa/nhc-operator/tests/nhc.go): Add a Ginkgo v2 suite entrypoint with JUnit/report hooks and a post-deployment readiness smoke test for the NHC operator.
  • NHC sudden-node-loss e2e (tests/rhwa/nhc-operator/tests/sudden-node-loss.go): Add a large ordered e2e implementing the sudden-node-loss workflow: config gating, node labeling, namespace/PVC/stateful app creation, BMC power-off/on, NHC/SNR remediation checks (SelfNodeRemediation, fencing), rescheduling verification, and PV/VolumeAttachment inspections; includes helpers and cleanup.
  • SNR operator vars & tests (tests/rhwa/snr-operator/internal/snrparams/snrvars.go, tests/rhwa/snr-operator/snr_suite_test.go, tests/rhwa/snr-operator/tests/snr.go): Add SNR operator deployment/pod label vars, reporter CRD/ns dump config, a Ginkgo v2 suite entrypoint with reporting hooks, and a post-deployment readiness test for the SNR operator.
  • Docs (tests/rhwa/nhc-operator/README.md): New README documenting the sudden-node-loss scenario, prerequisites, required operator config, storage expectations, env vars, run commands, and timing guidance.
  • RHWA GVR addition (tests/rhwa/internal/rhwaparams/rhwavars.go): Add an exported SnrGVR GroupVersionResource for the SelfNodeRemediation CRD.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled).
  • Title Check: ✅ Passed. The title 'rhwa nhc: add NHC & SNR sudden-loss system test' clearly and specifically describes the main change: adding a new system test for the sudden-loss scenario involving the NHC and SNR operators.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.

@rdiscala
Collaborator Author

Tested on kni-qe-3. An example wrapper script to run it would be:

#!/usr/bin/env bash
# Run the "Sudden loss of a node" Ginkgo test from eco-gotests on the bastion.
#
# This script sets all required environment variables for the NHC sudden-loss
# test and invokes `go test` against the nhc-operator test suite.
#
# Prerequisites:
#   - Go toolchain available in PATH
#   - eco-gotests repo cloned on the bastion
#   - KUBECONFIG pointing to a running OCP cluster
#   - NHC and SNR operators deployed on the cluster
#
# Usage:
#   ./run-sudden-node-loss-test.sh
#
#   # Override the target worker:
#   ./run-sudden-node-loss-test.sh --worker=openshift-worker-2.my-cluster.example.redhat.com
#
#   # Override BMC address:
#   ./run-sudden-node-loss-test.sh --bmc-address=10.1.0.2
#
#   # Pass extra arguments to go test / ginkgo:
#   ./run-sudden-node-loss-test.sh -- -ginkgo.v -ginkgo.timeout=30m
set -euo pipefail

# ── Defaults ────────────────────────────────────────────────────────────────

KUBECONFIG="${KUBECONFIG:-/home/kni/rdiscala/kubeconfig}"
ECO_GOTESTS_DIR="${ECO_GOTESTS_DIR:-${HOME}/rdiscala/src/eco-gotests}"

TARGET_WORKER="openshift-worker-0.my-cluster.example.redhat.com"
FAILOVER_WORKERS="openshift-worker-1.my-cluster.example.redhat.com"
STORAGE_CLASS="ocs-external-storagecluster-ceph-rbd"
APP_IMAGE="registry.example.redhat.com:5000/test/ubi-minimal:latest"
BMC_ADDRESS="10.1.0.2"
BMC_USERNAME="<myname>"
BMC_PASSWORD="<mypass>"

# ── Parse arguments ─────────────────────────────────────────────────────────

PASSTHROUGH_ARGS=()
# Use a while loop over the positional parameters: inside `for arg in "$@"`
# the list is expanded up front, so `shift` cannot strip the already-consumed
# options before `--`, and the separator itself would leak into the
# passthrough arguments.
while [[ $# -gt 0 ]]; do
  case "$1" in
    --worker=*)        TARGET_WORKER="${1#--worker=}" ;;
    --failover=*)      FAILOVER_WORKERS="${1#--failover=}" ;;
    --storage-class=*) STORAGE_CLASS="${1#--storage-class=}" ;;
    --app-image=*)     APP_IMAGE="${1#--app-image=}" ;;
    --bmc-address=*)   BMC_ADDRESS="${1#--bmc-address=}" ;;
    --bmc-user=*)      BMC_USERNAME="${1#--bmc-user=}" ;;
    --bmc-password=*)  BMC_PASSWORD="${1#--bmc-password=}" ;;
    --eco-gotests=*)   ECO_GOTESTS_DIR="${1#--eco-gotests=}" ;;
    --kubeconfig=*)    KUBECONFIG="${1#--kubeconfig=}" ;;
    --)                shift; PASSTHROUGH_ARGS+=("$@"); break ;;
    *)                 PASSTHROUGH_ARGS+=("$1") ;;
  esac
  shift
done

# ── Validate ────────────────────────────────────────────────────────────────

if [[ ! -f "${KUBECONFIG}" ]]; then
  echo "ERROR: KUBECONFIG not found at ${KUBECONFIG}" >&2
  exit 1
fi

if [[ ! -d "${ECO_GOTESTS_DIR}" ]]; then
  echo "ERROR: eco-gotests directory not found at ${ECO_GOTESTS_DIR}" >&2
  exit 1
fi

# ── Export environment ──────────────────────────────────────────────────────

export KUBECONFIG
export ECO_RHWA_NHC_TARGET_WORKER="${TARGET_WORKER}"
export ECO_RHWA_NHC_FAILOVER_WORKERS="${FAILOVER_WORKERS}"
export ECO_RHWA_NHC_STORAGE_CLASS="${STORAGE_CLASS}"
export ECO_RHWA_NHC_APP_IMAGE="${APP_IMAGE}"
export ECO_RHWA_NHC_TARGET_WORKER_BMC="{\"address\":\"${BMC_ADDRESS}\",\"username\":\"${BMC_USERNAME}\",\"password\":\"${BMC_PASSWORD}\"}"

# ── Run ─────────────────────────────────────────────────────────────────────

echo "=== Sudden loss of a node — Ginkgo test ==="
echo "  KUBECONFIG:       ${KUBECONFIG}"
echo "  eco-gotests:      ${ECO_GOTESTS_DIR}"
echo "  Target worker:    ${TARGET_WORKER}"
echo "  Failover workers: ${FAILOVER_WORKERS}"
echo "  Storage class:    ${STORAGE_CLASS}"
echo "  App image:        ${APP_IMAGE}"
echo "  BMC address:      ${BMC_ADDRESS}"
echo ""

cd "${ECO_GOTESTS_DIR}"

go test ./tests/rhwa/nhc-operator/... \
  -ginkgo.label-filter="sudden-loss" \
  -ginkgo.timeout=20m \
  -v \
  "${PASSTHROUGH_ARGS[@]}"

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/rhwa/internal/rhwaconfig/rhwaconfig.go`:
- Around line 29-31: The Decode method on BMCDetails should defensively handle
empty or whitespace-only input before calling json.Unmarshal: in
BMCDetails.Decode(value string) check if strings.TrimSpace(value) == "" and if
so return nil (no-op) to avoid json EOF errors from envconfig; otherwise proceed
to json.Unmarshal into b and return that error. Ensure you import/qualify
strings if needed and do not change the existing unmarshalling behavior for
non-empty inputs.
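
The defensive Decode described above could be sketched as follows. This is a minimal standalone illustration, assuming the field names and JSON tags from the PR summary; the actual struct in rhwaconfig.go may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// BMCDetails is a sketch of the exported type the walkthrough describes;
// the JSON tags here are assumptions, not the PR's actual tags.
type BMCDetails struct {
	Address  string `json:"address"`
	Username string `json:"username"`
	Password string `json:"password"`
}

// Decode satisfies the envconfig Decoder interface. Treating empty or
// whitespace-only input as "not set" is a no-op, so envconfig never
// surfaces a json "unexpected end of JSON input" error for an absent
// env var; non-empty input is unmarshalled unchanged.
func (b *BMCDetails) Decode(value string) error {
	if strings.TrimSpace(value) == "" {
		return nil // no-op: leave existing values untouched
	}
	return json.Unmarshal([]byte(value), b)
}

func main() {
	var b BMCDetails
	fmt.Println(b.Decode("   "))                               // whitespace: no error, no change
	fmt.Println(b.Decode(`{"address":"10.1.0.2"}`), b.Address) // decoded normally
}
```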

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go`:
- Around line 449-451: The code uses an unchecked type assertion when reading
snrResource.Object["spec"] into snrSpec (snrSpec :=
snrResource.Object["spec"].(map[string]any)), which can panic; change this to a
guarded assertion (e.g., spec, ok :=
snrResource.Object["spec"].(map[string]any)) and assert the ok value before
using snrSpec (use Expect(ok).To(BeTrue(), "expected spec to be map[string]any")
or similar) so that the subsequent
Expect(snrSpec["remediationStrategy"]).To(Equal("OutOfServiceTaint"), ...) is
only executed when the spec has the correct type.
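
The guarded assertion pattern this comment asks for might look like the helper below. The function name and object shape are illustrative, extracted from the Ginkgo context into plain Go so the ok-flag handling is visible.

```go
package main

import "fmt"

// remediationStrategy is a hypothetical helper showing the guarded type
// assertion: check the ok flag before using the asserted value, so a
// malformed unstructured object produces a clear error instead of a panic.
func remediationStrategy(obj map[string]any) (string, error) {
	spec, ok := obj["spec"].(map[string]any)
	if !ok {
		return "", fmt.Errorf("expected spec to be map[string]any, got %T", obj["spec"])
	}
	strategy, ok := spec["remediationStrategy"].(string)
	if !ok {
		return "", fmt.Errorf("remediationStrategy missing or not a string")
	}
	return strategy, nil
}

func main() {
	obj := map[string]any{
		"spec": map[string]any{"remediationStrategy": "OutOfServiceTaint"},
	}
	s, err := remediationStrategy(obj)
	fmt.Println(s, err)
}
```

In the test itself, the error path would be asserted with Expect(ok).To(BeTrue(), ...) rather than returned.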
- Around line 331-352: The code performs unchecked type assertions on
nhcResource.Object["spec"] and ["status"] and ignores errors from toFloat64,
which can cause panics or silent failures; update the block to validate
existence and types before asserting (check nhcResource.Object has "spec" and
"status" and that they cast to map[string]any), handle the case where minHealthy
is missing, and when using toFloat64 on observedNodes and healthyNodes capture
and handle the returned errors (fail the test with a clear message if conversion
fails) so that minHealthyStr parsing and the Expect comparison only run on
validated values.
- Around line 176-214: The test can schedule the pod on a failover worker
because
deployment.NewBuilder(...).WithNodeSelector(map[string]string{nhcparams.AppWorkerLabel:
""}) allows any worker, while the BMC client is created for
RHWAConfig.TargetWorkerBMC.Address; update the deployment to pin the pod to the
specific target worker (e.g., set WithNodeSelector to
map[string]string{nhcparams.AppWorkerLabel: RHWAConfig.TargetWorker} or add a
NodeName/NodeAffinity using the target) or, after pod.List and assigning
targetNode from appPods[0].Object.Spec.NodeName, assert/guard that targetNode ==
RHWAConfig.TargetWorker and call Skip(...) if it differs before creating
bmc.New(...), so the BMC address and powered-off node always match the pod
location.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a0e46c39-eef7-4c39-ab7b-db642b400112

📥 Commits

Reviewing files that changed from the base of the PR and between 504a0cb and 5cf83a0.

📒 Files selected for processing (10)
  • tests/rhwa/internal/rhwaconfig/default.yaml
  • tests/rhwa/internal/rhwaconfig/rhwaconfig.go
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
  • tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go
  • tests/rhwa/nhc-operator/nhc_suite_test.go
  • tests/rhwa/nhc-operator/tests/nhc.go
  • tests/rhwa/nhc-operator/tests/sudden-node-loss.go
  • tests/rhwa/snr-operator/internal/snrparams/snrvars.go
  • tests/rhwa/snr-operator/snr_suite_test.go
  • tests/rhwa/snr-operator/tests/snr.go

@rdiscala force-pushed the CNF-17181-nhc-snr-sudden-loss-test branch from 5cf83a0 to 741384b on March 19, 2026 at 16:51.
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (2)
tests/rhwa/nhc-operator/tests/sudden-node-loss.go (2)

343-354: ⚠️ Potential issue | 🟠 Major

Handle both conversion failures and the non-% threshold path.

This block still ignores the toFloat64 errors, and it silently skips the threshold check whenever minHealthy is not stored as a percentage string. That weakens the precondition exactly where the unstructured payload changes type.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go` around lines 343 - 354,
The minHealthy check currently ignores toFloat64 errors and skips the non-%
branch; update the block around strings.CutSuffix(minHealthyStr, "%") and the
uses of toFloat64 to validate conversions: call toFloat64(observedNodes) and
toFloat64(healthyNodes) and assert no error via Expect(parseErr). If
strings.CutSuffix returns ok, parse pct with strconv.ParseFloat as existing,
compute minRequired := math.Ceil(observed * pct / 100) and assert healthy >=
minRequired; else attempt to convert minHealthyStr to a numeric threshold by
calling toFloat64(minHealthyStr) (or strconv.ParseFloat) and Expect no error,
then assert healthy >= minRequiredAbs. Ensure each conversion failure uses
Expect(...).ToNot(HaveOccurred(), ...) with a clear message so neither
conversion nor the non-% path is silently ignored.
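
The threshold logic with both branches validated could be sketched like this. It is a standalone illustration of the suggested fix, not the PR's code; observed and healthy are passed in already converted, and a test would assert the returned error instead of propagating it.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// minHealthySatisfied checks an NHC-style minHealthy threshold, which may
// be either a percentage string ("51%") or an absolute count ("2").
// Both parse paths return an error instead of silently falling through.
func minHealthySatisfied(minHealthy string, observed, healthy float64) (bool, error) {
	if pctStr, ok := strings.CutSuffix(minHealthy, "%"); ok {
		pct, err := strconv.ParseFloat(pctStr, 64)
		if err != nil {
			return false, fmt.Errorf("parsing percentage %q: %w", minHealthy, err)
		}
		// Round the required node count up, as a fractional node cannot be healthy.
		return healthy >= math.Ceil(observed*pct/100), nil
	}
	abs, err := strconv.ParseFloat(minHealthy, 64)
	if err != nil {
		return false, fmt.Errorf("parsing absolute threshold %q: %w", minHealthy, err)
	}
	return healthy >= abs, nil
}

func main() {
	ok, err := minHealthySatisfied("51%", 5, 3) // ceil(5*0.51) = 3 required
	fmt.Println(ok, err)
}
```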

176-178: ⚠️ Potential issue | 🔴 Critical

Keep the BMC target and the watched node aligned.

Line 205 rewrites targetNode to wherever the pod actually landed, while Lines 210-213 still build the BMC client from RHWAConfig.TargetWorkerBMC. If the pod starts on a failover worker, Step 3 becomes self-fulfilling, Step 4 powers off one node and watches another, and recovery/cleanup target the wrong machine.

🔧 Minimal guard
 			targetNode = appPods[0].Object.Spec.NodeName
+			if targetNode != RHWAConfig.TargetWorker {
+				Skip(fmt.Sprintf(
+					"stateful app landed on %s, expected %s for the configured BMC target",
+					targetNode, RHWAConfig.TargetWorker,
+				))
+			}
 			klog.Infof("Stateful app is running on node %s", targetNode)

Also applies to: 205-213, 274-289

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go` around lines 176 - 178,
The BMC client is being created from RHWAConfig.TargetWorkerBMC while targetNode
may be rewritten after pod scheduling, causing the watched node and the BMC
target to diverge; update the code that builds the BMC client to use the BMC
entry that corresponds to the final targetNode (e.g., look up
RHWAConfig.TargetWorkerBMC[targetNode.Name] or overwrite
RHWAConfig.TargetWorkerBMC with the BMC spec for targetNode) before creating the
client (affecting the code that constructs the BMC client and any uses around
targetNode, RHWAConfig.TargetWorkerBMC, and the deployment.NewBuilder flow), and
apply the same fix in the other occurrences mentioned (around the blocks at
205-213 and 274-289) so the powered-off machine and watched node always match.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go`:
- Around line 107-124: The cleanup currently removes nhcparams.AppWorkerLabel
from every worker unconditionally; change the logic to record which nodes the
suite actually modified and only remove the label from those. In the setup loop
around allWorkers/workerNode (using nodes.Pull, WithNewLabel, Update) capture
the original label state (existence and value) or append workerName to a slice
like labeledWorkers when you call WithNewLabel/Update, and then in the cleanup
(the block corresponding to lines 254-270) iterate only over that recorded slice
to remove the label; if you stored the previous value, restore it instead of
deleting to avoid touching labels the suite did not create. Ensure references
remain to nhcparams.AppWorkerLabel, allWorkers, and the Update() call sites so
the change is easy to locate.
- Around line 166-174: The test advances as soon as the replacement pod is
PodRunning which can be before the PVC is usable; change the wait logic to wait
for the pod to be Ready (respecting the containerSpec.ReadinessProbe on
/data/heartbeat.log) instead of just Running. Locate the code that waits for the
replacement pod status (the helper currently checking for PodRunning) and
replace it with a wait for the PodReady condition (e.g., using the k8s Pod
condition "Ready" true or kubectl/wait-style polling), or call the existing
helper that checks readiness if present, so the test only proceeds when
readiness (not just running) is observed. Ensure the replacement pod check
references the same pod name/labels used when creating the replacement pod and
times out consistently with other waits.
- Around line 604-622: The check for old-node attachment sets foundOnOldNode for
any VolumeAttachment referencing the old node even if detached; update the loop
over vaList.Items (inspect va := &vaList.Items[idx] and
va.Spec.Source.PersistentVolumeName) so that foundOnOldNode is only set when
va.Spec.NodeName == targetNode AND va.Status.Attached is true, leaving the
foundOnNewNode logic unchanged, so the subsequent Expect(...) that asserts
foundOnOldNode is false only fails when the volume is actually still attached to
the old node.
- Around line 57-73: The preflight block that checks RHWAConfig fields must also
validate that RHWAConfig.FailoverWorkers is non-empty and that
RHWAConfig.TargetWorkerBMC.Username and RHWAConfig.TargetWorkerBMC.Password are
provided; add Skip() guards alongside the existing checks (the same section
around the By("Checking required configuration is provided") checks) to bail out
early if FailoverWorkers has length 0 or if
TargetWorkerBMC.Username/TargetWorkerBMC.Password are empty so the suite never
proceeds with missing inputs that would cause mid-test failures.

---

Duplicate comments:
In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go`:
- Around line 343-354: The minHealthy check currently ignores toFloat64 errors
and skips the non-% branch; update the block around
strings.CutSuffix(minHealthyStr, "%") and the uses of toFloat64 to validate
conversions: call toFloat64(observedNodes) and toFloat64(healthyNodes) and
assert no error via Expect(parseErr). If strings.CutSuffix returns ok, parse pct
with strconv.ParseFloat as existing, compute minRequired := math.Ceil(observed *
pct / 100) and assert healthy >= minRequired; else attempt to convert
minHealthyStr to a numeric threshold by calling toFloat64(minHealthyStr) (or
strconv.ParseFloat) and Expect no error, then assert healthy >= minRequiredAbs.
Ensure each conversion failure uses Expect(...).ToNot(HaveOccurred(), ...) with
a clear message so neither conversion nor the non-% path is silently ignored.
- Around line 176-178: The BMC client is being created from
RHWAConfig.TargetWorkerBMC while targetNode may be rewritten after pod
scheduling, causing the watched node and the BMC target to diverge; update the
code that builds the BMC client to use the BMC entry that corresponds to the
final targetNode (e.g., look up RHWAConfig.TargetWorkerBMC[targetNode.Name] or
overwrite RHWAConfig.TargetWorkerBMC with the BMC spec for targetNode) before
creating the client (affecting the code that constructs the BMC client and any
uses around targetNode, RHWAConfig.TargetWorkerBMC, and the
deployment.NewBuilder flow), and apply the same fix in the other occurrences
mentioned (around the blocks at 205-213 and 274-289) so the powered-off machine
and watched node always match.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4bc6dfb0-91b8-45c9-91f1-9b90185ffe8c

📥 Commits

Reviewing files that changed from the base of the PR and between 5cf83a0 and 741384b.

📒 Files selected for processing (10)
  • tests/rhwa/internal/rhwaconfig/default.yaml
  • tests/rhwa/internal/rhwaconfig/rhwaconfig.go
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
  • tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go
  • tests/rhwa/nhc-operator/nhc_suite_test.go
  • tests/rhwa/nhc-operator/tests/nhc.go
  • tests/rhwa/nhc-operator/tests/sudden-node-loss.go
  • tests/rhwa/snr-operator/internal/snrparams/snrvars.go
  • tests/rhwa/snr-operator/snr_suite_test.go
  • tests/rhwa/snr-operator/tests/snr.go
✅ Files skipped from review due to trivial changes (2)
  • tests/rhwa/internal/rhwaconfig/default.yaml
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • tests/rhwa/nhc-operator/nhc_suite_test.go
  • tests/rhwa/nhc-operator/tests/nhc.go
  • tests/rhwa/snr-operator/tests/snr.go
  • tests/rhwa/snr-operator/internal/snrparams/snrvars.go
  • tests/rhwa/internal/rhwaconfig/rhwaconfig.go

@rdiscala force-pushed the CNF-17181-nhc-snr-sudden-loss-test branch from 741384b to 08d727b on March 19, 2026 at 17:11.
Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (4)
tests/rhwa/nhc-operator/tests/sudden-node-loss.go (4)

619-621: ⚠️ Potential issue | 🟡 Minor

Only flag old-node attachment when it is still attached.

foundOnOldNode is set for any VolumeAttachment referencing the old node, even if va.Status.Attached is false. CSI controllers may briefly leave a stale object with attached=false, causing a false failure after a healthy failover.

💽 Proposed fix
-			if va.Spec.NodeName == targetNode {
+			if va.Spec.NodeName == targetNode && va.Status.Attached {
 				foundOnOldNode = true
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go` around lines 619 - 621,
The test currently sets foundOnOldNode whenever a VolumeAttachment's
va.Spec.NodeName equals targetNode; change the check to only set foundOnOldNode
when the VolumeAttachment is still attached by also verifying va.Status.Attached
is true (i.e., require both va.Spec.NodeName == targetNode and
va.Status.Attached == true) so stale/detached attachments on the old node are
ignored; update the logic where foundOnOldNode is assigned and any related
comments or variable names in the sudden-node-loss test to reflect this stricter
condition.

56-73: ⚠️ Potential issue | 🟡 Minor

Validate remaining required inputs before starting the suite.

The configuration validation is missing checks for FailoverWorkers (needed for pod rescheduling), and TargetWorkerBMC.Username/Password (used when creating BMC client). Missing these checks turns configuration mistakes into mid-test failures.

⚙️ Suggested guardrails
 			if RHWAConfig.TargetWorker == "" {
 				Skip("ECO_RHWA_NHC_TARGET_WORKER not set")
 			}
+
+			if len(RHWAConfig.FailoverWorkers) == 0 {
+				Skip("ECO_RHWA_NHC_FAILOVER_WORKERS not set")
+			}

 			if RHWAConfig.StorageClass == "" {
 				Skip("ECO_RHWA_NHC_STORAGE_CLASS not set")
 			}

 			if RHWAConfig.AppImage == "" {
 				Skip("ECO_RHWA_NHC_APP_IMAGE not set")
 			}

 			if RHWAConfig.TargetWorkerBMC.Address == "" {
 				Skip("ECO_RHWA_NHC_TARGET_WORKER_BMC address not set")
 			}
+
+			if RHWAConfig.TargetWorkerBMC.Username == "" || RHWAConfig.TargetWorkerBMC.Password == "" {
+				Skip("ECO_RHWA_NHC_TARGET_WORKER_BMC credentials not set")
+			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go` around lines 56 - 73, The
BeforeAll validation currently checks only TargetWorker, StorageClass, AppImage
and TargetWorkerBMC.Address; add checks for RHWAConfig.FailoverWorkers (ensure
it's non-empty) and for RHWAConfig.TargetWorkerBMC.Username and
RHWAConfig.TargetWorkerBMC.Password (ensure both are non-empty) inside the same
BeforeAll block so missing config causes Skip with clear messages (e.g.,
"ECO_RHWA_NHC_FAILOVER_WORKERS not set",
"ECO_RHWA_NHC_TARGET_WORKER_BMC_USERNAME not set",
"ECO_RHWA_NHC_TARGET_WORKER_BMC_PASSWORD not set"), keeping the validation logic
consistent with existing checks for RHWAConfig and TargetWorkerBMC fields.

352-353: ⚠️ Potential issue | 🟡 Minor

Handle toFloat64 conversion errors.

The errors from toFloat64 are discarded. If conversion fails, observed and healthy will be 0, potentially causing incorrect threshold validation or misleading pass/fail results.

🛡️ Proposed fix
-			observed, _ := toFloat64(observedNodes)
-			healthy, _ := toFloat64(healthyNodes)
+			observed, obsErr := toFloat64(observedNodes)
+			Expect(obsErr).ToNot(HaveOccurred(), "Failed to convert observedNodes to float64")
+			healthy, healthyErr := toFloat64(healthyNodes)
+			Expect(healthyErr).ToNot(HaveOccurred(), "Failed to convert healthyNodes to float64")
 			minRequired := math.Ceil(observed * pct / 100)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go` around lines 352 - 353,
The code currently ignores errors returned by toFloat64 when converting
observedNodes and healthyNodes; update the conversion calls to check the
returned errors and handle them (e.g., fail the test or return an error) instead
of letting zero values proceed—replace the unchecked conversions for observed,
healthy from toFloat64(observedNodes) and toFloat64(healthyNodes) with
error-aware logic that asserts or t.Fatalf (or returns) including the conversion
error and context ("failed to convert observedNodes" / "failed to convert
healthyNodes") so any conversion failure surfaces immediately and prevents
incorrect threshold evaluation.

259-276: ⚠️ Potential issue | 🟡 Minor

Cleanup removes node labels the suite may not have created.

Setup skips labeling nodes that already have AppWorkerLabel (lines 113-117), but cleanup unconditionally removes the label from all workers. This can strip shared-cluster state the suite never owned.

Consider tracking which nodes the suite actually labeled and only removing from those.

🔧 Suggested approach
+		// Track nodes we actually labeled
+		var labeledByTest []string
+
 		BeforeAll(func() {
 			// ... existing code ...
 			for _, workerName := range allWorkers {
 				// ...
 				if _, exists := workerNode.Object.Labels[nhcparams.AppWorkerLabel]; exists {
 					klog.Infof("Node %s already has label %s, skipping", workerName, nhcparams.AppWorkerLabel)
 					continue
 				}

 				workerNode = workerNode.WithNewLabel(nhcparams.AppWorkerLabel, "")
 				_, updateErr := workerNode.Update()
 				Expect(updateErr).ToNot(HaveOccurred(), ...)
+				labeledByTest = append(labeledByTest, workerName)
 			}
 			// ...
 		})

 		AfterAll(func() {
 			// ...
 			By("Removing appworker label from nodes")
-			allWorkers := append([]string{targetNode}, RHWAConfig.FailoverWorkers...)
-			for _, workerName := range allWorkers {
+			for _, workerName := range labeledByTest {
 				// ... remove label ...
 			}
 		})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go` around lines 259 - 276,
The cleanup loop indiscriminately removes nhcparams.AppWorkerLabel from
allWorkers (built from targetNode and RHWAConfig.FailoverWorkers) even if the
suite didn’t create that label; modify the code to record which nodes the suite
actually labeled during setup (e.g., append node names to a labeledWorkers slice
when you successfully call workerNode.AddLabel or similar) and then in this
teardown use that labeledWorkers list instead of allWorkers and call nodes.Pull
+ workerNode.RemoveLabel + workerNode.Update only for entries in labeledWorkers;
ensure the tracking variable is accessible to both setup and teardown
(suite-scoped or attached to RHWAConfig) and skip removal for nodes not in that
tracked list.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go`:
- Around line 619-621: The test currently sets foundOnOldNode whenever a
VolumeAttachment's va.Spec.NodeName equals targetNode; change the check to only
set foundOnOldNode when the VolumeAttachment is still attached by also verifying
va.Status.Attached is true (i.e., require both va.Spec.NodeName == targetNode
and va.Status.Attached == true) so stale/detached attachments on the old node
are ignored; update the logic where foundOnOldNode is assigned and any related
comments or variable names in the sudden-node-loss test to reflect this stricter
condition.
- Around line 56-73: The BeforeAll validation currently checks only
TargetWorker, StorageClass, AppImage and TargetWorkerBMC.Address; add checks for
RHWAConfig.FailoverWorkers (ensure it's non-empty) and for
RHWAConfig.TargetWorkerBMC.Username and RHWAConfig.TargetWorkerBMC.Password
(ensure both are non-empty) inside the same BeforeAll block so missing config
causes Skip with clear messages (e.g., "ECO_RHWA_NHC_FAILOVER_WORKERS not set",
"ECO_RHWA_NHC_TARGET_WORKER_BMC_USERNAME not set",
"ECO_RHWA_NHC_TARGET_WORKER_BMC_PASSWORD not set"), keeping the validation logic
consistent with existing checks for RHWAConfig and TargetWorkerBMC fields.
- Around line 352-353: The code currently ignores errors returned by toFloat64
when converting observedNodes and healthyNodes; update the conversion calls to
check the returned errors and handle them (e.g., fail the test or return an
error) instead of letting zero values proceed—replace the unchecked conversions
for observed, healthy from toFloat64(observedNodes) and toFloat64(healthyNodes)
with error-aware logic that asserts or t.Fatalf (or returns) including the
conversion error and context ("failed to convert observedNodes" / "failed to
convert healthyNodes") so any conversion failure surfaces immediately and
prevents incorrect threshold evaluation.
- Around line 259-276: The cleanup loop indiscriminately removes
nhcparams.AppWorkerLabel from allWorkers (built from targetNode and
RHWAConfig.FailoverWorkers) even if the suite didn’t create that label; modify
the code to record which nodes the suite actually labeled during setup (e.g.,
append node names to a labeledWorkers slice when you successfully call
workerNode.AddLabel or similar) and then in this teardown use that
labeledWorkers list instead of allWorkers and call nodes.Pull +
workerNode.RemoveLabel + workerNode.Update only for entries in labeledWorkers;
ensure the tracking variable is accessible to both setup and teardown
(suite-scoped or attached to RHWAConfig) and skip removal for nodes not in that
tracked list.
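The error-aware conversion that the toFloat64 findings above call for can be sketched with the standard library alone. This is an illustrative model, not the PR's code: `toFloat64` and `minHealthySatisfied` are re-sketched here under the assumption that the NHC status fields arrive as `int64` or `float64` via the unstructured API.

```go
package main

import (
	"fmt"
	"math"
)

// toFloat64 models the helper referenced in the review: it converts the
// dynamic values read from the NHC status and reports an error instead of
// silently yielding zero for an unexpected type.
func toFloat64(v interface{}) (float64, error) {
	switch n := v.(type) {
	case int64:
		return float64(n), nil
	case float64:
		return n, nil
	default:
		return 0, fmt.Errorf("unexpected type %T", v)
	}
}

// minHealthySatisfied fails fast on conversion errors before evaluating the
// percentage threshold, as the review comments request, so a bad type can
// never feed a zero into the minHealthy calculation.
func minHealthySatisfied(observedNodes, healthyNodes interface{}, pct float64) (bool, error) {
	observed, err := toFloat64(observedNodes)
	if err != nil {
		return false, fmt.Errorf("failed to convert observedNodes: %w", err)
	}

	healthy, err := toFloat64(healthyNodes)
	if err != nil {
		return false, fmt.Errorf("failed to convert healthyNodes: %w", err)
	}

	minRequired := math.Ceil(observed * pct / 100)

	return healthy >= minRequired, nil
}

func main() {
	// 4 observed nodes, 51% threshold -> ceil(2.04) = 3 required.
	ok, err := minHealthySatisfied(int64(4), int64(3), 51)
	fmt.Println(ok, err) // true <nil>
}
```

With this shape, a caller in the test can assert on the returned error instead of letting a silent zero skew the threshold comparison.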

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0cfb0112-8602-4389-8a74-deac551982f9

📥 Commits

Reviewing files that changed from the base of the PR and between 741384b and 08d727b.

📒 Files selected for processing (10)
  • tests/rhwa/internal/rhwaconfig/default.yaml
  • tests/rhwa/internal/rhwaconfig/rhwaconfig.go
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
  • tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go
  • tests/rhwa/nhc-operator/nhc_suite_test.go
  • tests/rhwa/nhc-operator/tests/nhc.go
  • tests/rhwa/nhc-operator/tests/sudden-node-loss.go
  • tests/rhwa/snr-operator/internal/snrparams/snrvars.go
  • tests/rhwa/snr-operator/snr_suite_test.go
  • tests/rhwa/snr-operator/tests/snr.go
✅ Files skipped from review due to trivial changes (2)
  • tests/rhwa/internal/rhwaconfig/default.yaml
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/rhwa/snr-operator/tests/snr.go
  • tests/rhwa/nhc-operator/tests/nhc.go
  • tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go

@rdiscala rdiscala force-pushed the CNF-17181-nhc-snr-sudden-loss-test branch 2 times, most recently from 1100fe0 to 92c89ca Compare March 19, 2026 17:49
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/rhwa/nhc-operator/tests/sudden-node-loss.go (1)

358-367: ⚠️ Potential issue | 🟡 Minor

Don't swallow toFloat64 failures in the minHealthy check.

Lines 362-363 still ignore conversion errors, so an unexpected observedNodes or healthyNodes type becomes 0 and this assertion can evaluate the wrong threshold. Fail fast if either conversion fails.

🔧 Suggested fix
-				observed, _ := toFloat64(observedNodes)
-				healthy, _ := toFloat64(healthyNodes)
+				observed, obsErr := toFloat64(observedNodes)
+				Expect(obsErr).ToNot(HaveOccurred(), "Failed to convert observedNodes")
+				healthy, healthyErr := toFloat64(healthyNodes)
+				Expect(healthyErr).ToNot(HaveOccurred(), "Failed to convert healthyNodes")
 				minRequired := math.Ceil(observed * pct / 100)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go` around lines 358 - 367,
The conversions for observedNodes and healthyNodes currently ignore errors from
toFloat64; change the code to capture both returned values (e.g., observed,
obsErr := toFloat64(observedNodes) and healthy, healthyErr :=
toFloat64(healthyNodes)) and assert they did not fail (use
Expect(obsErr).ToNot(HaveOccurred(), ...) and
Expect(healthyErr).ToNot(HaveOccurred(), ...)) before computing minRequired and
performing the Expect(healthy).To(BeNumerically(...)) check so conversion
failures fail fast; keep the existing variables (observed, healthy, minRequired)
and error messages descriptive.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go`:
- Around line 72-74: The check only ensures RHWAConfig.FailoverWorkers is
non-empty but doesn't verify a usable failover target; update the preflight in
the Sudden Node Loss test (and similar guard around the Step 7 region) to
require at least one entry in RHWAConfig.FailoverWorkers that is not equal to
TargetWorker and that is currently healthy/Ready (use the existing readiness
helper or inspect the worker's Ready condition); if no such distinct, healthy
failover worker exists, call Skip("no distinct healthy failover worker
available") so the outage isn't started.

---

Duplicate comments:
In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go`:
- Around line 358-367: The conversions for observedNodes and healthyNodes
currently ignore errors from toFloat64; change the code to capture both returned
values (e.g., observed, obsErr := toFloat64(observedNodes) and healthy,
healthyErr := toFloat64(healthyNodes)) and assert they did not fail (use
Expect(obsErr).ToNot(HaveOccurred(), ...) and
Expect(healthyErr).ToNot(HaveOccurred(), ...)) before computing minRequired and
performing the Expect(healthy).To(BeNumerically(...)) check so conversion
failures fail fast; keep the existing variables (observed, healthy, minRequired)
and error messages descriptive.
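The failover-worker preflight described in the inline comment can be sketched with plain structs standing in for the `corev1.Node` fields it inspects; `nodeState` and `pickFailoverWorker` are illustrative names, not eco-goinfra API.

```go
package main

import "fmt"

// nodeState is a stand-in for the node fields the preflight inspects:
// the Ready condition and Spec.Unschedulable.
type nodeState struct {
	Name          string
	Ready         bool
	Unschedulable bool
}

// pickFailoverWorker returns the first candidate that is distinct from the
// target, Ready, and schedulable. ok == false is the case in which the suite
// should Skip before powering off the target node.
func pickFailoverWorker(target string, candidates []nodeState) (string, bool) {
	for _, c := range candidates {
		if c.Name == target {
			continue // the target itself is never a valid failover destination
		}

		if c.Ready && !c.Unschedulable {
			return c.Name, true
		}
	}

	return "", false
}

func main() {
	workers := []nodeState{
		{Name: "worker-0", Ready: true},                      // the target node
		{Name: "worker-1", Ready: true, Unschedulable: true}, // cordoned: rejected
		{Name: "worker-2", Ready: true},                      // valid failover target
	}

	name, ok := pickFailoverWorker("worker-0", workers)
	fmt.Println(name, ok) // worker-2 true
}
```

A fuller version would also reject nodes carrying untolerated `NoSchedule` taints, as the later review round points out.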

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ab840eb3-b3de-46cc-9db3-55133bcd972f

📥 Commits

Reviewing files that changed from the base of the PR and between 08d727b and 1100fe0.

📒 Files selected for processing (10)
  • tests/rhwa/internal/rhwaconfig/default.yaml
  • tests/rhwa/internal/rhwaconfig/rhwaconfig.go
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
  • tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go
  • tests/rhwa/nhc-operator/nhc_suite_test.go
  • tests/rhwa/nhc-operator/tests/nhc.go
  • tests/rhwa/nhc-operator/tests/sudden-node-loss.go
  • tests/rhwa/snr-operator/internal/snrparams/snrvars.go
  • tests/rhwa/snr-operator/snr_suite_test.go
  • tests/rhwa/snr-operator/tests/snr.go
✅ Files skipped from review due to trivial changes (5)
  • tests/rhwa/internal/rhwaconfig/default.yaml
  • tests/rhwa/nhc-operator/tests/nhc.go
  • tests/rhwa/snr-operator/tests/snr.go
  • tests/rhwa/snr-operator/internal/snrparams/snrvars.go
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/rhwa/snr-operator/snr_suite_test.go
  • tests/rhwa/nhc-operator/nhc_suite_test.go
  • tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go

Comment thread tests/rhwa/nhc-operator/tests/sudden-node-loss.go
@rdiscala rdiscala force-pushed the CNF-17181-nhc-snr-sudden-loss-test branch 4 times, most recently from 8e5c0b2 to 95bd946 Compare March 20, 2026 14:36
@rdiscala rdiscala marked this pull request as draft March 20, 2026 14:37
@rdiscala rdiscala force-pushed the CNF-17181-nhc-snr-sudden-loss-test branch 6 times, most recently from b19bbea to b77b02a Compare March 20, 2026 16:38
@rdiscala
Collaborator Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Mar 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@rdiscala rdiscala marked this pull request as ready for review March 20, 2026 17:11
@rdiscala
Collaborator Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Mar 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
tests/rhwa/nhc-operator/tests/sudden-node-loss.go (2)

80-103: ⚠️ Potential issue | 🟠 Major

Require a schedulable failover worker, not just a Ready one.

Ready=True is not sufficient here. A cordoned node still passes this gate, but Kubernetes will not place new pods on an unschedulable node, so Step 7 can still dead-end after the target is powered off. Fold !failoverNode.Object.Spec.Unschedulable into this preflight check before starting the outage; if you want full coverage, also reject untolerated NoSchedule taints. (docs.okd.io)

⚙️ Minimal guard
 				failoverNode, pullErr := nodes.Pull(APIClient, failoverName)
 				if pullErr != nil {
 					klog.Warningf("Failed to pull failover node %s: %v", failoverName, pullErr)

 					continue
 				}
+
+				if failoverNode.Object.Spec.Unschedulable {
+					continue
+				}

 				for _, condition := range failoverNode.Object.Status.Conditions {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go` around lines 80 - 103, The
check for a healthy failover worker currently only verifies Ready condition via
iterating failoverNode.Object.Status.Conditions and matching
corev1.NodeReady==corev1.ConditionTrue; update the preflight to also ensure the
node is schedulable by checking failoverNode.Object.Spec.Unschedulable is false
and optionally that the node does not have untolerated NoSchedule taints
(inspect failoverNode.Object.Spec.Taints and ensure either no NoSchedule taint
or that RHWA's pod tolerations would match). Modify the loop that processes
RHWAConfig.FailoverWorkers (and uses nodes.Pull) to fold these additional checks
before setting hasHealthyFailover and breaking.

156-174: ⚠️ Potential issue | 🟠 Major

Fail fast if another appworker node is already eligible for the initial placement.

This block only strips AppWorkerLabel from RHWAConfig.FailoverWorkers. If some other worker already carries the same empty-string label, the deployment on Lines 240-243 is still eligible to land there and the suite aborts before the outage scenario starts. A cheap guard here is to list all nodes matching AppWorkerLabel="" and Skip(...) unless the target is the only initial candidate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go` around lines 156 - 174,
The loop only removes nhcparams.AppWorkerLabel from RHWAConfig.FailoverWorkers
but does not guard against other nodes already having AppWorkerLabel="" which
can accept the initial placement; before removing labels, list nodes using
APIClient.K8sClient.CoreV1().Nodes().List with a label selector for
nhcparams.AppWorkerLabel="" and if the returned candidate list length is >1 and
the current target (workerName from RHWAConfig.FailoverWorkers via nodes.Pull)
is not the sole candidate, call Skip(...) to fail fast; update the logic around
nodes.Pull / the unlabeledWorkers append to perform this check and only proceed
to Patch (APIClient.K8sClient.CoreV1().Nodes().Patch) when the target is the
only initial eligible node.
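The sole-candidate guard described above reduces to a small check on the result of a label-selector List. A hedged sketch (the helper name is illustrative, and the `labeled` slice stands in for the node names returned by listing on `AppWorkerLabel=""`):

```go
package main

import "fmt"

// soleEligibleTarget implements the guard suggested in the review: given the
// node names that currently carry the app-worker label, proceed with the
// deployment only when the intended target is the single eligible candidate
// for the initial placement; otherwise the suite should Skip.
func soleEligibleTarget(target string, labeled []string) bool {
	if len(labeled) != 1 {
		return false
	}

	return labeled[0] == target
}

func main() {
	fmt.Println(soleEligibleTarget("worker-0", []string{"worker-0"}))             // true
	fmt.Println(soleEligibleTarget("worker-0", []string{"worker-0", "worker-3"})) // false: Skip
}
```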
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/rhwa/nhc-operator/README.md`:
- Around line 118-126: The fenced code block in README.md that contains the
commented export lines and "make run-tests" is missing a language tag; edit that
block by adding "bash" immediately after the opening triple backticks (so the
block starts with ```bash) to satisfy MD040 and enable shell syntax highlighting
for the exported variable sample.

In `@tests/rhwa/snr-operator/tests/snr.go`:
- Around line 38-44: List pods matching listOptions first using the existing
pod-listing helper (call pod.ListPodsInNamespace or the equivalent helper with
APIClient, rhwaparams.RhwaOperatorNs and listOptions), assert the returned pod
list is non-empty (fail the test if zero pods matched, pointing at
rhwaparams.OperatorControllerPodLabel), then call
pod.WaitForAllPodsInNamespaceRunning (or use its boolean return) to wait for
readiness; do not discard the boolean result so a misconfigured
OperatorControllerPodLabel cannot silently pass.

---

Duplicate comments:
In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go`:
- Around line 80-103: The check for a healthy failover worker currently only
verifies Ready condition via iterating failoverNode.Object.Status.Conditions and
matching corev1.NodeReady==corev1.ConditionTrue; update the preflight to also
ensure the node is schedulable by checking
failoverNode.Object.Spec.Unschedulable is false and optionally that the node
does not have untolerated NoSchedule taints (inspect
failoverNode.Object.Spec.Taints and ensure either no NoSchedule taint or that
RHWA's pod tolerations would match). Modify the loop that processes
RHWAConfig.FailoverWorkers (and uses nodes.Pull) to fold these additional checks
before setting hasHealthyFailover and breaking.
- Around line 156-174: The loop only removes nhcparams.AppWorkerLabel from
RHWAConfig.FailoverWorkers but does not guard against other nodes already having
AppWorkerLabel="" which can accept the initial placement; before removing
labels, list nodes using APIClient.K8sClient.CoreV1().Nodes().List with a label
selector for nhcparams.AppWorkerLabel="" and if the returned candidate list
length is >1 and the current target (workerName from RHWAConfig.FailoverWorkers
via nodes.Pull) is not the sole candidate, call Skip(...) to fail fast; update
the logic around nodes.Pull / the unlabeledWorkers append to perform this check
and only proceed to Patch (APIClient.K8sClient.CoreV1().Nodes().Patch) when the
target is the only initial eligible node.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a7c567c8-ad53-4eb1-8fee-a7e869445531

📥 Commits

Reviewing files that changed from the base of the PR and between 1100fe0 and b77b02a.

📒 Files selected for processing (11)
  • tests/rhwa/internal/rhwaconfig/default.yaml
  • tests/rhwa/internal/rhwaconfig/rhwaconfig.go
  • tests/rhwa/nhc-operator/README.md
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
  • tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go
  • tests/rhwa/nhc-operator/nhc_suite_test.go
  • tests/rhwa/nhc-operator/tests/nhc.go
  • tests/rhwa/nhc-operator/tests/sudden-node-loss.go
  • tests/rhwa/snr-operator/internal/snrparams/snrvars.go
  • tests/rhwa/snr-operator/snr_suite_test.go
  • tests/rhwa/snr-operator/tests/snr.go
✅ Files skipped from review due to trivial changes (2)
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
  • tests/rhwa/nhc-operator/tests/nhc.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • tests/rhwa/internal/rhwaconfig/default.yaml
  • tests/rhwa/nhc-operator/nhc_suite_test.go
  • tests/rhwa/snr-operator/snr_suite_test.go
  • tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go

Comment thread tests/rhwa/nhc-operator/README.md Outdated
Comment thread tests/rhwa/snr-operator/tests/snr.go Outdated
Comment on lines +37 to +38
// OutOfServiceTaintKey is the taint key applied by SNR to fence a node.
OutOfServiceTaintKey = "node.kubernetes.io/out-of-service"
Collaborator

Please remove unused constants

Comment on lines +13 to +14
// SNRDeploymentName is the name of the SNR operator deployment.
SNRDeploymentName = "self-node-remediation-controller-manager"
Collaborator

Please remove unused constants

Collaborator Author

Done.

Type: appsv1.RecreateDeploymentStrategyType,
}

_, err = deploy.CreateAndWaitUntilReady(5 * time.Minute)
Collaborator

Please make this a constant, avoid hardcoded values inside test code

Collaborator Author

Done.

By("Deleting test namespace")

testNS := namespace.NewBuilder(APIClient, nhcparams.AppNamespace)
err := testNS.DeleteAndWait(5 * time.Minute)
Collaborator

please extract this as a constant

Collaborator Author

Done.

Comment on lines +151 to +174
removeLabelPatch := []byte(
fmt.Sprintf(`{"metadata":{"labels":{%q:null}}}`, nhcparams.AppWorkerLabel))
addLabelPatch := []byte(
fmt.Sprintf(`{"metadata":{"labels":{%q:""}}}`, nhcparams.AppWorkerLabel))

for _, workerName := range RHWAConfig.FailoverWorkers {
workerNode, pullErr := nodes.Pull(APIClient, workerName)
if pullErr != nil {
continue
}

if _, exists := workerNode.Object.Labels[nhcparams.AppWorkerLabel]; exists {
klog.Infof("Temporarily removing %s label from failover node %s before deployment",
nhcparams.AppWorkerLabel, workerName)

_, patchErr := APIClient.K8sClient.CoreV1().Nodes().Patch(
context.TODO(), workerName, types.MergePatchType,
removeLabelPatch, metav1.PatchOptions{})
Expect(patchErr).ToNot(HaveOccurred(),
fmt.Sprintf("Failed to remove label from node %s", workerName))

unlabeledWorkers = append(unlabeledWorkers, workerName)
}
}
Collaborator

@Demostenes777 Demostenes777 Mar 31, 2026

You can simplify this code by leveraging functionality already provided in the eco-goinfra nodes package. For example:

		workerNode, err := nodes.Pull(APIClient, workerName)
        Expect(err).ToNot(HaveOccurred())

        // To add label:
        workerNode.WithNewLabel(nhcparams.AppWorkerLabel, "")
        _, err = workerNode.Update()

        // To remove label:
        workerNode.RemoveLabel(nhcparams.AppWorkerLabel, "")
        _, err = workerNode.Update()

Comment on lines +117 to +139
targetNode = RHWAConfig.TargetWorker

By("Waiting for target worker node to be Ready")

Eventually(func() bool {
targetNodeObj, pullErr := nodes.Pull(APIClient, targetNode)
if pullErr != nil {
klog.Infof(" waiting for node %s: %v", targetNode, pullErr)

return false
}

for _, condition := range targetNodeObj.Object.Status.Conditions {
if condition.Type == corev1.NodeReady {
return condition.Status == corev1.ConditionTrue
}
}

return false
}).WithTimeout(nhcparams.NodeRecoveryTimeout).
WithPolling(nhcparams.PollingInterval).
Should(BeTrue(),
fmt.Sprintf("Target worker node %s did not become Ready", targetNode))
Collaborator

@Demostenes777 Demostenes777 Mar 31, 2026

You can reuse functionality already available in the eco-goinfra nodes package. For example:

	// Wait for node to become Ready:
    targetNodeObj, _ := nodes.Pull(APIClient, targetNode)
    err := targetNodeObj.WaitUntilReady(nhcparams.NodeRecoveryTimeout)
    Expect(err).ToNot(HaveOccurred())

    // Wait for node to become NotReady (Unknown or False):
    err := targetNodeObj.WaitUntilNotReady(nhcparams.NodeReadyTimeout)
    Expect(err).ToNot(HaveOccurred())

    // Wait for specific condition to be Unknown:
    err := targetNodeObj.WaitUntilConditionUnknown(corev1.NodeReady, nhcparams.NodeReadyTimeout)
    Expect(err).ToNot(HaveOccurred())

Comment on lines +742 to +750
func isPodReady(podBuilder *pod.Builder) bool {
for _, condition := range podBuilder.Object.Status.Conditions {
if condition.Type == corev1.PodReady {
return condition.Status == corev1.ConditionTrue
}
}

return false
}
Collaborator

you can use the podBuilder.isHealthy() method.

Comment on lines +752 to +762
func filterRunningPods(pods []*pod.Builder) []*pod.Builder {
var running []*pod.Builder

for _, p := range pods {
if p.Object.Status.Phase == corev1.PodRunning {
running = append(running, p)
}
}

return running
}
Collaborator

There are alternative ways to achieve this with functionality already available in the eco-goinfra pod package.

Option 1: Use field selector in the initial List call

// Already partially done with FieldSelector, but could eliminate the filter function
  appPods, err := pod.List(APIClient, nhcparams.AppNamespace, metav1.ListOptions{
      LabelSelector: fmt.Sprintf("%s=%s", nhcparams.AppLabelKey, nhcparams.AppLabelValue),
      FieldSelector: "status.phase=Running",
  })

Option 2: Use eco-goinfra's waiting functions

// For waiting until pods are running
allRunning, err := pod.WaitForAllPodsInNamespaceRunning(
    APIClient,
    nhcparams.AppNamespace,
    timeout,
    metav1.ListOptions{LabelSelector: "app=stateful-app"},
)

Comment on lines +33 to +45
var (
nhcGVR = schema.GroupVersionResource{
Group: "remediation.medik8s.io",
Version: "v1alpha1",
Resource: "nodehealthchecks",
}

snrGVR = schema.GroupVersionResource{
Group: "self-node-remediation.medik8s.io",
Version: "v1alpha1",
Resource: "selfnoderemediations",
}
)
Collaborator

could these variables be extracted to the vars file?

for _, workerName := range RHWAConfig.FailoverWorkers {
workerNode, pullErr := nodes.Pull(APIClient, workerName)
if pullErr != nil {
continue
Collaborator

this should be logged, and it might be worth retrying to make it more reliable

Collaborator

@Demostenes777 Demostenes777 left a comment

Good job Rigel! I have a few refactoring suggestions, mostly avoiding hardcoded values, keeping variable definitions in the *vars.go files, and reusing as much as possible the functionality already provided in the eco-goinfra packages.

Moreover, if you find functionality missing for nodes, pods, etc., I strongly suggest contributing it to the eco-goinfra project instead of adding it here.

WithRedfishTimeout(nhcparams.BMCTimeout)
})

AfterAll(func() {
Collaborator

why not use DeferCleanup?

@rdiscala rdiscala self-assigned this Mar 31, 2026
@rdiscala rdiscala force-pushed the CNF-17181-nhc-snr-sudden-loss-test branch 4 times, most recently from 4256b0d to 3cc4874 Compare March 31, 2026 14:08
@rdiscala rdiscala marked this pull request as draft March 31, 2026 16:02
@rdiscala
Collaborator Author

rdiscala commented Apr 1, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 1, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/rhwa/nhc-operator/README.md (1)

23-23: Minor: Hyphenate "bare-metal" when used as a compound adjective.

Static analysis flagged this: "bare metal" modifying "nodes" should be hyphenated as "bare-metal nodes". Same applies to lines 134 and 144.

📝 Proposed fix
-It has been tested on bare metal nodes. For virtualised infrastructure, a virtual BMC must be used,
+It has been tested on bare-metal nodes. For virtualised infrastructure, a virtual BMC must be used,

Line 134:

-broken down as follows (observed on a 4-worker bare metal cluster with `unhealthyConditions.duration=60s`
+broken down as follows (observed on a 4-worker bare-metal cluster with `unhealthyConditions.duration=60s`

Line 144:

-| AfterAll: Power on node & wait for Ready | ~5 min | Bare metal boot + kubelet registration |
+| AfterAll: Power on node & wait for Ready | ~5 min | Bare-metal boot + kubelet registration |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/README.md` at line 23, Replace the unhyphenated
phrase "bare metal" used as a compound adjective with "bare-metal" in the
README; specifically locate the occurrences that modify "nodes" (the instance at
the context around line 23 plus the other occurrences referenced) and update
them to "bare-metal nodes" to ensure consistent compound-adjective hyphenation
throughout the document.
tests/rhwa/nhc-operator/tests/sudden-node-loss.go (1)

786-796: Remove unused filterRunningPods function.

This function is never called in the codebase and should be removed to keep the code clean.

♻️ Proposed fix
-func filterRunningPods(pods []*pod.Builder) []*pod.Builder {
-	var running []*pod.Builder
-
-	for _, p := range pods {
-		if p.Object.Status.Phase == corev1.PodRunning {
-			running = append(running, p)
-		}
-	}
-
-	return running
-}
-
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go` around lines 786 - 796,
Remove the unused helper function filterRunningPods (which takes []*pod.Builder
and references corev1.PodRunning) from the file; locate the function definition
and delete it along with any unused imports that were only required by it (e.g.,
corev1 or pod.Builder if now unused) to keep the test file clean.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/rhwa/nhc-operator/tests/nhc.go`:
- Around line 34-45: The test currently discards the boolean return from
pod.WaitForAllPodsInNamespaceRunning which can be true with no pods matched;
update the call in the It block to capture the boolean (e.g., ok, err :=
pod.WaitForAllPodsInNamespaceRunning(APIClient, rhwaparams.RhwaOperatorNs,
rhwaparams.DefaultTimeout, listOptions)) and assert both that err is nil and ok
is true (e.g., Expect(err).ToNot(HaveOccurred()) and Expect(ok).To(BeTrue(), "no
controller pod matched selector for OperatorControllerPodLabel")), referencing
the OperatorControllerPodLabel, APIClient, rhwaparams.RhwaOperatorNs and
pod.WaitForAllPodsInNamespaceRunning symbols.

---

Nitpick comments:
In `@tests/rhwa/nhc-operator/README.md`:
- Line 23: Replace the unhyphenated phrase "bare metal" used as a compound
adjective with "bare-metal" in the README; specifically locate the occurrences
that modify "nodes" (the instance at the context around line 23 plus the other
occurrences referenced) and update them to "bare-metal nodes" to ensure
consistent compound-adjective hyphenation throughout the document.

In `@tests/rhwa/nhc-operator/tests/sudden-node-loss.go`:
- Around line 786-796: Remove the unused helper function filterRunningPods
(which takes []*pod.Builder and references corev1.PodRunning) from the file;
locate the function definition and delete it along with any unused imports that
were only required by it (e.g., corev1 or pod.Builder if now unused) to keep the
test file clean.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 936ce2e3-f861-4d01-b074-034b3f638ff4

📥 Commits

Reviewing files that changed from the base of the PR and between 1100fe0 and 3cc4874.

📒 Files selected for processing (12)
  • tests/rhwa/internal/rhwaconfig/default.yaml
  • tests/rhwa/internal/rhwaconfig/rhwaconfig.go
  • tests/rhwa/internal/rhwaparams/rhwavars.go
  • tests/rhwa/nhc-operator/README.md
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
  • tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go
  • tests/rhwa/nhc-operator/nhc_suite_test.go
  • tests/rhwa/nhc-operator/tests/nhc.go
  • tests/rhwa/nhc-operator/tests/sudden-node-loss.go
  • tests/rhwa/snr-operator/internal/snrparams/snrvars.go
  • tests/rhwa/snr-operator/snr_suite_test.go
  • tests/rhwa/snr-operator/tests/snr.go
✅ Files skipped from review due to trivial changes (4)
  • tests/rhwa/snr-operator/internal/snrparams/snrvars.go
  • tests/rhwa/internal/rhwaconfig/default.yaml
  • tests/rhwa/nhc-operator/nhc_suite_test.go
  • tests/rhwa/nhc-operator/internal/nhcparams/const.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/rhwa/snr-operator/snr_suite_test.go
  • tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go

Comment thread tests/rhwa/nhc-operator/tests/nhc.go
@rdiscala rdiscala force-pushed the CNF-17181-nhc-snr-sudden-loss-test branch 2 times, most recently from ee44ec2 to 89b4045 Compare April 1, 2026 09:34
@rdiscala
Collaborator Author

rdiscala commented Apr 1, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 1, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@rdiscala rdiscala force-pushed the CNF-17181-nhc-snr-sudden-loss-test branch 3 times, most recently from 55e2dd1 to ab9350a Compare April 1, 2026 10:31
Add a test suite that runs through the following scenario.

A healthy OpenShift cluster with a MNO topology experiences the sudden
shutdown of a worker node that is running a stateful application
(a simple Pod writing to a PV). The application is evicted and
rescheduled to another healthy node in order to provide business continuity.

Assisted-by: Claude
@rdiscala rdiscala force-pushed the CNF-17181-nhc-snr-sudden-loss-test branch from ab9350a to 290575b Compare April 1, 2026 10:48

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants