
Commit afa9fdd

rhwa nhc: add planned-reboot-during-upgrade system test
Verify that the NHC operator does NOT trigger remediation when a worker node
reboots as part of a planned OCP cluster upgrade.

The test deploys a stateful app, initiates a cluster upgrade, and polls
throughout the upgrade to confirm no SelfNodeRemediation resources are
created. Post-upgrade checks verify all nodes are healthy, no out-of-service
taints remain, and the app survived.

Related: ECOPROJECT-2283

Co-Authored-By: Claude
1 parent f1793f3 commit afa9fdd

6 files changed

Lines changed: 683 additions & 32 deletions


tests/rhwa/internal/rhwaconfig/rhwaconfig.go

Lines changed: 4 additions & 0 deletions
```diff
@@ -45,6 +45,10 @@ type RHWAConfig struct {
 	StorageClass string `yaml:"nhc_storage_class" envconfig:"ECO_RHWA_NHC_STORAGE_CLASS"`
 	AppImage string `yaml:"nhc_app_image" envconfig:"ECO_RHWA_NHC_APP_IMAGE"`
 	TargetWorkerBMC BMCDetails `yaml:"nhc_target_worker_bmc" envconfig:"ECO_RHWA_NHC_TARGET_WORKER_BMC"`
+
+	// NHC planned-reboot (upgrade) test configuration.
+	UpgradeImage string `yaml:"nhc_upgrade_image" envconfig:"ECO_RHWA_NHC_UPGRADE_IMAGE"`
+	UpgradeChannel string `yaml:"nhc_upgrade_channel" envconfig:"ECO_RHWA_NHC_UPGRADE_CHANNEL"`
 }

 // NewRHWAConfig returns instance of RHWA config type.
```
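As an aside for reviewers, a minimal sketch of how `envconfig` tags like the two added above map environment variables onto struct fields, assuming the `github.com/kelseyhightower/envconfig` library that the tag syntax suggests (the struct below is a stand-in, not the real `RHWAConfig`):

```go
package main

import (
	"fmt"
	"log"

	"github.com/kelseyhightower/envconfig"
)

// upgradeConfig is a stand-in for the two fields added to RHWAConfig.
type upgradeConfig struct {
	UpgradeImage   string `envconfig:"ECO_RHWA_NHC_UPGRADE_IMAGE"`
	UpgradeChannel string `envconfig:"ECO_RHWA_NHC_UPGRADE_CHANNEL"`
}

func main() {
	var cfg upgradeConfig
	// Process reads each tagged field from its environment variable.
	if err := envconfig.Process("", &cfg); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("upgrade image %q on channel %q\n", cfg.UpgradeImage, cfg.UpgradeChannel)
}
```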

tests/rhwa/nhc-operator/README.md

Lines changed: 106 additions & 31 deletions
```diff
@@ -4,13 +4,23 @@

 NHC operator tests validate that the Node Health Check (NHC) and Self Node Remediation (SNR) operators
 work together to detect unhealthy nodes and remediate them by fencing and evicting stateful workloads
-to healthy nodes.
+to healthy nodes — and, equally important, that they do **not** interfere with planned maintenance
+operations such as cluster upgrades.

-The first test scenario is **Sudden loss of a node**: a healthy MNO cluster experiences the unexpected
-shutdown of a worker node running a stateful application. The NHC operator detects the node failure,
-creates a `SelfNodeRemediation` resource, and the SNR operator applies an `out-of-service` taint to
-fence the node. Kubernetes then force-evicts the stateful pod and reschedules it on a healthy node,
-reattaching its persistent storage.
+There are two test scenarios:
+
+1. **Sudden loss of a node**: a healthy MNO cluster experiences the unexpected shutdown of a worker
+   node running a stateful application. The NHC operator detects the node failure, creates a
+   `SelfNodeRemediation` resource, and the SNR operator applies an `out-of-service` taint to fence
+   the node. Kubernetes then force-evicts the stateful pod and reschedules it on a healthy node,
+   reattaching its persistent storage.
+
+2. **Planned reboot of a node during cluster upgrade**: a cluster upgrade is initiated while a
+   stateful application is running on a worker node. Worker nodes reboot as part of the
+   MachineConfigPool rollout. The NHC operator detects the ongoing upgrade (by observing the
+   difference between `currentConfig` and `desiredConfig` in the MCP) and does **not** trigger
+   remediation. The test verifies that no `SelfNodeRemediation` resources are created during the
+   entire upgrade process, and that the stateful application survives the upgrade.

 ### Prerequisites for running these tests:

```
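To make the detection mechanism in scenario 2 concrete, a simplified sketch of that kind of planned-reboot check follows. It is not NHC's actual implementation; it assumes the standard MCO node annotations (`machineconfiguration.openshift.io/currentConfig` and `machineconfiguration.openshift.io/desiredConfig`) and plain client-go:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// nodeRebootIsPlanned reports whether the node is mid-rollout: during an MCP
// update the Machine Config Operator bumps desiredConfig ahead of
// currentConfig, so a mismatch indicates a planned reboot, not a failure.
func nodeRebootIsPlanned(ctx context.Context, client kubernetes.Interface, nodeName string) (bool, error) {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}

	current := node.Annotations["machineconfiguration.openshift.io/currentConfig"]
	desired := node.Annotations["machineconfiguration.openshift.io/desiredConfig"]

	return current != desired, nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	planned, err := nodeRebootIsPlanned(context.TODO(),
		kubernetes.NewForConfigOrDie(cfg), "openshift-worker-0.example.com")
	if err != nil {
		panic(err)
	}

	fmt.Println("planned reboot in progress:", planned)
}
```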
```diff
@@ -19,31 +29,37 @@ and configuration.

 It has been run successfully on these OCP versions:
 - 4.19
+- 4.21

-It has been tested on bare metal nodes. For virtualised infrastructure, a virtual BMC must be used,
-such as:
+#### Notes about the infrastructure
+
+Both scenarios have been tested on bare-metal nodes. To run the **Sudden loss of a node** in a
+virtualised infrastructure, a virtual BMC must be used, such as:

 - sushy-emulator (from the sushy project) — exposes a Redfish API that maps to libvirt VM power
   operations
 - VirtualBMC (vbmc) — maps IPMI commands to libvirt, though the test uses Redfish not IPMI

 With the sushy-emulator running on the hypervisor, the ECO_RHWA_NHC_TARGET_WORKER_BMC environment
-variable must point at the sushy endpoint,
-e.g. `{"address":"hypervisor:8000","username":"admin","password":"password"}`). The VMs must have a
+variable must point at the sushy endpoint,
+e.g. `{"address":"hypervisor:8000","username":"admin","password":"password"}`. The VMs must have a
 watchdog device configured (e.g. i6300esb in libvirt), or set `isSoftwareRebootEnabled: true` as a
 fallback.

 #### Cluster topology

-* A Multi-Node OpenShift (MNO) cluster with **bare metal** or **virtualised** worker nodes
+* A Multi-Node OpenShift (MNO) cluster with **bare-metal** or **virtualised** worker nodes
 * At least **2 worker nodes** that will be used by the test (a target node and one or more
   failover nodes). The test labels the target node with `node-role.kubernetes.io/appworker`
   first to guarantee initial pod placement, then labels the failover nodes after the app is
   deployed. All labels are removed at the end
 * The target worker node must have **BMC/Redfish** (or iLO/IPMI) access for power control.
-  The test powers it off via BMC to simulate sudden power loss and powers it back on at the end
+  This is required by the **sudden-loss** test only (powers off the node via BMC to simulate
+  sudden power loss and powers it back on at the end)
+
+#### Sudden-loss remediation lifecycle

-The test observes the full remediation lifecycle:
+The sudden-loss test observes the full remediation lifecycle:

 1. Node `Ready` condition transitions to `Unknown` (~40s after power-off)
 2. NHC detects the unhealthy condition and creates a `SelfNodeRemediation` CR (~60s after condition change)
```
```diff
@@ -52,6 +68,18 @@ The test observes the full remediation lifecycle:
 5. The PVC is reattached and the pod becomes Ready on the new node
 6. The node is powered back on via BMC and returns to `Ready` state

+#### Planned-reboot non-remediation lifecycle
+
+The planned-reboot test observes the **absence** of remediation during a cluster upgrade:
+
+1. A stateful application is deployed on a target worker node
+2. A cluster upgrade is initiated by patching the `ClusterVersion` resource
+3. Throughout the upgrade (~1.5–2.5 hours), the test polls every 30s to verify that no
+   `SelfNodeRemediation` resources are created for any worker node
+4. After the upgrade completes, the test verifies that NHC reports all nodes healthy,
+   no `out-of-service` taints exist, all cluster operators are available, and the stateful
+   application survived
+
 #### Operators

 * **Node Health Check operator** (namespace: `openshift-workload-availability`)
```
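A rough sketch of what step 3 of the planned-reboot lifecycle above amounts to: list `SelfNodeRemediation` resources on a fixed interval and fail fast if any appear. The GVR below is an assumption based on the medik8s SNR API group (verify it against the CRDs on your cluster), and `done` is a hypothetical callback reporting upgrade completion:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// snrGVR is assumed from the medik8s SNR API group.
var snrGVR = schema.GroupVersionResource{
	Group:    "self-node-remediation.medik8s.io",
	Version:  "v1alpha1",
	Resource: "selfnoderemediations",
}

// assertNoRemediations polls every interval, returning an error as soon as
// any SelfNodeRemediation exists, or nil once done() reports completion.
func assertNoRemediations(ctx context.Context, client dynamic.Interface,
	interval time.Duration, done func() bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			list, err := client.Resource(snrGVR).List(ctx, metav1.ListOptions{})
			if err != nil {
				return err
			}
			if n := len(list.Items); n > 0 {
				return fmt.Errorf("unexpected remediation: %d SelfNodeRemediation resource(s) found", n)
			}
			if done() {
				return nil
			}
		}
	}
}
```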
```diff
@@ -91,9 +119,10 @@ can provide an up-to-date configuration and values for the settings above.

 ### Test suites:

-| Name | Description |
-|------|-------------|
-| [sudden-node-loss](tests/sudden-node-loss.go) | Powers off a worker node via BMC and verifies NHC/SNR remediation and pod rescheduling |
+| Name | Label | Description |
+|------|-------|-------------|
+| [sudden-node-loss](tests/sudden-node-loss.go) | `sudden-loss` | Powers off a worker node via BMC and verifies NHC/SNR remediation and pod rescheduling |
+| [planned-node-reboot](tests/planned-node-reboot.go) | `planned-reboot` | Initiates a cluster upgrade and verifies NHC does **not** remediate during planned node reboots |

 ### Internal pkgs

```
````diff
@@ -105,34 +134,67 @@ can provide an up-to-date configuration and values for the settings above.

 Environment variables for test configuration:

-- `ECO_RHWA_NHC_TARGET_WORKER`: FQDN of the worker node to power off (must match the BMC address)
+#### Common (both tests)
+
+- `ECO_RHWA_NHC_TARGET_WORKER`: FQDN of the worker node to target
 - `ECO_RHWA_NHC_FAILOVER_WORKERS`: comma-separated list of worker FQDNs eligible for pod rescheduling
 - `ECO_RHWA_NHC_STORAGE_CLASS`: StorageClass name for the test PVC (e.g. `standard`)
 - `ECO_RHWA_NHC_APP_IMAGE`: container image for the stateful test application
+
+#### Sudden-loss only
+
 - `ECO_RHWA_NHC_TARGET_WORKER_BMC`: JSON object with BMC connection details, e.g. `{"address":"10.1.29.13","username":"user","password":"pass"}`

+#### Planned-reboot only
+
+- `ECO_RHWA_NHC_UPGRADE_IMAGE`: the target OCP release image for the upgrade (must be pre-mirrored in disconnected environments)
+- `ECO_RHWA_NHC_UPGRADE_CHANNEL`: the update channel (e.g. `stable-4.22`)
+
 Please refer to the project README for a list of global inputs - [How to run](../../../README.md#how-to-run)

 ### Running NHC Test Suites

+#### Running the sudden-loss test
+
 ```bash
-# export KUBECONFIG=</path/to/kubeconfig>
-# export ECO_RHWA_NHC_TARGET_WORKER=openshift-worker-0.example.com
-# export ECO_RHWA_NHC_FAILOVER_WORKERS=openshift-worker-1.example.com
-# export ECO_RHWA_NHC_STORAGE_CLASS=standard
-# export ECO_RHWA_NHC_APP_IMAGE=registry.example.com:5000/test/ubi-minimal:latest
-# export ECO_RHWA_NHC_TARGET_WORKER_BMC='{"address":"10.1.29.13","username":"admin","password":"secret"}'
-# make run-tests
+export KUBECONFIG=</path/to/kubeconfig>
+export ECO_RHWA_NHC_TARGET_WORKER=openshift-worker-0.example.com
+export ECO_RHWA_NHC_FAILOVER_WORKERS=openshift-worker-1.example.com
+export ECO_RHWA_NHC_STORAGE_CLASS=standard
+export ECO_RHWA_NHC_APP_IMAGE=registry.example.com:5000/test/ubi-minimal:latest
+export ECO_RHWA_NHC_TARGET_WORKER_BMC='{"address":"10.1.29.13","username":"admin","password":"secret"}'
+
+go test ./tests/rhwa/nhc-operator/... -timeout=30m -ginkgo.label-filter="sudden-loss" -ginkgo.timeout=20m -v
 ```

-**Note on timeouts:** The `go test` command must use `-timeout` greater than the ginkgo timeout
-(e.g. `-timeout=30m` with `-ginkgo.timeout=20m`). If `go test` uses its default of 10 minutes,
-the Go test harness will kill the process before ginkgo can complete the test and run cleanup
-(AfterAll), which includes powering the node back on.
+#### Running the planned-reboot test
+
+```bash
+export KUBECONFIG=</path/to/kubeconfig>
+export ECO_RHWA_NHC_TARGET_WORKER=openshift-worker-0.example.com
+export ECO_RHWA_NHC_FAILOVER_WORKERS=openshift-worker-1.example.com
+export ECO_RHWA_NHC_STORAGE_CLASS=standard
+export ECO_RHWA_NHC_APP_IMAGE=registry.example.com:5000/test/ubi-minimal:latest
+export ECO_RHWA_NHC_UPGRADE_IMAGE=registry.example.com:5000/ocp/release:4.22.1
+export ECO_RHWA_NHC_UPGRADE_CHANNEL=stable-4.22
+
+go test ./tests/rhwa/nhc-operator/... -timeout=180m -ginkgo.label-filter="planned-reboot" -ginkgo.timeout=170m -v
+```
+
+**Note on timeouts:** The `go test` command must use `-timeout` greater than the ginkgo timeout.
+If `go test` uses its default of 10 minutes, the Go test harness will kill the process before
+ginkgo can complete the test and run cleanup (AfterAll).
+
+**Important:** The planned-reboot test **upgrades the cluster** and this operation is
+**irreversible**. The upgrade target image must be pre-mirrored to the local registry in
+disconnected environments. Plan for 1.5–2.5 hours of runtime.
+
+### Expected durations

-**Expected duration:** A full sudden-node-loss run typically takes **11–15 minutes** end-to-end,
-broken down as follows (observed on a 4-worker bare metal cluster with `unhealthyConditions.duration=60s`
-and `safeTimeToAssumeNodeRebootedSeconds=180`):
+#### Sudden-loss test: ~11–15 minutes
+
+Observed on a 4-worker bare-metal cluster with `unhealthyConditions.duration=60s`
+and `safeTimeToAssumeNodeRebootedSeconds=180`:

 | Phase | Typical duration | Notes |
 |-------|-----------------|-------|
````
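The "Patches ClusterVersion" step in the duration table below boils down to patching the cluster-scoped `version` object. A hedged sketch using the dynamic client: the image and channel reuse the placeholder values from the example above, and `force: true` (which skips release verification, as typically needed for pre-mirrored images) is illustrative, not necessarily what the test does:

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

var cvGVR = schema.GroupVersionResource{
	Group:    "config.openshift.io",
	Version:  "v1",
	Resource: "clusterversions",
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}

	// Merge-patch the channel and the desired update image on the single
	// ClusterVersion object, which is named "version".
	patch := []byte(`{"spec":{"channel":"stable-4.22","desiredUpdate":{` +
		`"image":"registry.example.com:5000/ocp/release:4.22.1","force":true}}}`)

	_, err = dynamic.NewForConfigOrDie(cfg).Resource(cvGVR).Patch(
		context.TODO(), "version", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}

	fmt.Println("upgrade requested")
}
```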
```diff
@@ -150,3 +212,16 @@ node can take several additional minutes to terminate, pushing Step 6 to 5–8 m
 case. Combined with the AfterAll node recovery, the total can reach ~17–20 minutes, which is why
 the ginkgo timeout is set to 20 minutes and the Go test timeout to 30 minutes.

+#### Planned-reboot test: ~1.5–2.5 hours
+
+The test is dominated by the cluster upgrade time. The test itself polls every 30 seconds and
+adds minimal overhead:
+
+| Phase | Typical duration | Notes |
+|-------|-----------------|-------|
+| BeforeAll: Deploy app & verify placement | ~1 min | Same as sudden-loss |
+| Step 4: Initiate upgrade & wait for start | ~5 min | Patches ClusterVersion, waits for Progressing |
+| Step 5: Poll during upgrade | 1–2 hours | Polls every 30s for SNR resources (fail-fast) and upgrade completion |
+| Steps 6–7: Post-upgrade verification | ~5 min | NHC/SNR clean, cluster operators available, app healthy |
+| AfterAll: Namespace cleanup | ~1 min | Labels restored |
+
```

tests/rhwa/nhc-operator/internal/nhcparams/const.go

Lines changed: 3 additions & 0 deletions
```diff
@@ -7,6 +7,9 @@ const (
 	// LabelSuddenLoss is the label for the sudden-loss test scenario.
 	LabelSuddenLoss = "sudden-loss"

+	// LabelPlannedReboot is the label for the planned-reboot test scenario.
+	LabelPlannedReboot = "planned-reboot"
+
 	// NHCResourceName is the name of the NodeHealthCheck CR.
 	NHCResourceName = "nhc-worker-self"

```
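Illustrative only: how a Ginkgo suite attaches one of these labels so that `-ginkgo.label-filter="planned-reboot"` selects it. The import path is inferred from the file tree in this commit:

```go
package tests

import (
	. "github.com/onsi/ginkgo/v2"

	"github.com/openshift-kni/eco-gotests/tests/rhwa/nhc-operator/internal/nhcparams"
)

// The Label decorator wires the spec into -ginkgo.label-filter selection.
var _ = Describe("NHC planned node reboot", Label(nhcparams.LabelPlannedReboot), func() {
	It("does not remediate workers during a cluster upgrade", func() {
		// test body elided
	})
})
```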

tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go

Lines changed: 9 additions & 0 deletions
```diff
@@ -49,4 +49,13 @@ var (

 	// BMCTimeout is the Redfish operation timeout.
 	BMCTimeout = 6 * time.Minute
+
+	// UpgradeStartTimeout is how long to wait for a cluster upgrade to start.
+	UpgradeStartTimeout = 5 * time.Minute
+
+	// UpgradeCompleteTimeout is how long to wait for a cluster upgrade to complete.
+	UpgradeCompleteTimeout = 150 * time.Minute
+
+	// UpgradePollingInterval is the polling interval for upgrade observation.
+	UpgradePollingInterval = 30 * time.Second
 )
```
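A sketch of how these knobs might drive a Gomega polling loop; `upgradeCompleted` is a hypothetical helper standing in for the real ClusterVersion check, not part of this commit:

```go
package tests

import (
	. "github.com/onsi/gomega"

	"github.com/openshift-kni/eco-gotests/tests/rhwa/nhc-operator/internal/nhcparams"
)

// upgradeCompleted is a hypothetical stand-in for the real completion check.
func upgradeCompleted() bool {
	return false // e.g. inspect ClusterVersion conditions here
}

// waitForUpgrade polls every UpgradePollingInterval until the upgrade
// completes or UpgradeCompleteTimeout elapses.
func waitForUpgrade() {
	Eventually(upgradeCompleted).
		WithTimeout(nhcparams.UpgradeCompleteTimeout).
		WithPolling(nhcparams.UpgradePollingInterval).
		Should(BeTrue(), "cluster upgrade did not complete in time")
}
```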

tests/rhwa/nhc-operator/tests/nhc.go

Lines changed: 2 additions & 1 deletion
```diff
@@ -35,12 +35,13 @@ var _ = Describe(
 		listOptions := metav1.ListOptions{
 			LabelSelector: fmt.Sprintf("app.kubernetes.io/name=%s", nhcparams.OperatorControllerPodLabel),
 		}
-		_, err := pod.WaitForAllPodsInNamespaceRunning(
+		ok, err := pod.WaitForAllPodsInNamespaceRunning(
 			APIClient,
 			rhwaparams.RhwaOperatorNs,
 			rhwaparams.DefaultTimeout,
 			listOptions,
 		)
 		Expect(err).ToNot(HaveOccurred(), "Pod is not ready")
+		Expect(ok).To(BeTrue(), "expected pods to be found and running")
 	})
 })
```
