rhwa nhc: add planned-reboot-during-upgrade system test

Verify that the NHC operator does NOT trigger remediation when
a worker node reboots as part of a planned OCP cluster upgrade.
The test deploys a stateful app, initiates a cluster upgrade,
and polls throughout the upgrade to confirm no SelfNodeRemediation
resources are created. Post-upgrade checks verify all nodes are
healthy, no out-of-service taints remain, and the app survived.

Related: ECOPROJECT-2283
Co-Authored-By: Claude

NHC operator tests validate that the Node Health Check (NHC) and Self Node Remediation (SNR) operators
work together to detect unhealthy nodes and remediate them by fencing and evicting stateful workloads
to healthy nodes — and, equally important, that they do **not** interfere with planned maintenance
operations such as cluster upgrades.

There are two test scenarios:

1. **Sudden loss of a node**: a healthy MNO cluster experiences the unexpected shutdown of a worker
   node running a stateful application. The NHC operator detects the node failure, creates a
   `SelfNodeRemediation` resource, and the SNR operator applies an `out-of-service` taint to fence
   the node. Kubernetes then force-evicts the stateful pod and reschedules it on a healthy node,
   reattaching its persistent storage.

2. **Planned reboot of a node during cluster upgrade**: a cluster upgrade is initiated while a
   stateful application is running on a worker node. Worker nodes reboot as part of the
   MachineConfigPool rollout. The NHC operator detects the ongoing upgrade (by observing the
   difference between `currentConfig` and `desiredConfig` in the MCP) and does **not** trigger
   remediation (a sketch of this check follows the list). The test verifies that no
   `SelfNodeRemediation` resources are created during the entire upgrade process, and that the
   stateful application survives the upgrade.
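
For illustration, the config-drift signal described in scenario 2 can be sketched with client-go by
comparing the `machineconfiguration.openshift.io/currentConfig` and
`machineconfiguration.openshift.io/desiredConfig` annotations that the machine-config daemon keeps
on each node. This is a minimal sketch of the documented signal, not the NHC operator's actual
code, and `worker-0` is a placeholder node name:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// isNodeMidRollout reports whether the machine-config daemon is still moving
// the node toward its desired rendered MachineConfig, i.e. the node is part
// of an in-progress rollout rather than genuinely unhealthy.
func isNodeMidRollout(ctx context.Context, client kubernetes.Interface, nodeName string) (bool, error) {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	annotations := node.GetAnnotations()
	current := annotations["machineconfiguration.openshift.io/currentConfig"]
	desired := annotations["machineconfiguration.openshift.io/desiredConfig"]
	return current != desired, nil
}

func main() {
	// Load the kubeconfig from its default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	midRollout, err := isNodeMidRollout(context.Background(), client, "worker-0")
	if err != nil {
		panic(err)
	}
	fmt.Printf("node is mid-rollout (remediation should be suppressed): %v\n", midRollout)
}
```

The same drift is reflected in the MachineConfigPool status during an upgrade, which is the signal
the scenario description refers to.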
### Prerequisites for running these tests:

The test suite has been run successfully on these OCP versions:
- 4.19
- 4.21

#### Notes about the infrastructure

Both scenarios have been tested on bare-metal nodes. To run the **Sudden loss of a node** scenario in a
virtualised infrastructure, a virtual BMC must be used, such as:

- sushy-emulator (from the sushy project) — exposes a Redfish API that maps to libvirt VM power
  operations
- VirtualBMC (vbmc) — maps IPMI commands to libvirt, though the test uses Redfish rather than IPMI

With the sushy-emulator running on the hypervisor, the `ECO_RHWA_NHC_TARGET_WORKER_BMC` environment
variable must point at the sushy endpoint,
e.g. `{"address":"hypervisor:8000","username":"admin","password":"password"}`. The VMs must have a
watchdog device configured (e.g. i6300esb in libvirt), or set `isSoftwareRebootEnabled: true` as a
fallback.
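
The BMC details are provided as a single JSON object. A minimal sketch of decoding such a value,
where the `bmcDetails` struct and `loadBMCDetails` helper are illustrative assumptions rather than
the test's actual types:

```go
package nhcsketch

import (
	"encoding/json"
	"fmt"
	"os"
)

// bmcDetails mirrors the JSON shape shown above.
type bmcDetails struct {
	Address  string `json:"address"`
	Username string `json:"username"`
	Password string `json:"password"`
}

// loadBMCDetails reads and decodes ECO_RHWA_NHC_TARGET_WORKER_BMC.
func loadBMCDetails() (bmcDetails, error) {
	var details bmcDetails
	raw, ok := os.LookupEnv("ECO_RHWA_NHC_TARGET_WORKER_BMC")
	if !ok {
		return details, fmt.Errorf("ECO_RHWA_NHC_TARGET_WORKER_BMC is not set")
	}
	if err := json.Unmarshal([]byte(raw), &details); err != nil {
		return details, fmt.Errorf("invalid BMC JSON: %w", err)
	}
	return details, nil
}
```

The same JSON shape is used for the `ECO_RHWA_NHC_TARGET_WORKER_BMC` value listed under the
environment variables below.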
#### Cluster topology

* A Multi-Node OpenShift (MNO) cluster with **bare-metal** or **virtualised** worker nodes
* At least **2 worker nodes** that will be used by the test (a target node and one or more
  failover nodes). The test labels the target node with `node-role.kubernetes.io/appworker`
  first to guarantee initial pod placement, then labels the failover nodes after the app is
  deployed. All labels are removed at the end (see the labelling sketch after this list)
* The target worker node must have **BMC/Redfish** (or iLO/IPMI) access for power control.
  This is required by the **sudden-loss** test only (the test powers off the node via BMC to
  simulate sudden power loss and powers it back on at the end)
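
The labelling flow above can be pictured with a small client-go sketch; the `setAppWorkerLabel`
helper and its merge-patch form are illustrative assumptions, not the test's actual code:

```go
package nhcsketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// setAppWorkerLabel adds (on=true) or removes (on=false) the placement label
// on a node. In a JSON merge patch, a null value deletes the key.
func setAppWorkerLabel(ctx context.Context, client kubernetes.Interface, nodeName string, on bool) error {
	value := `""`
	if !on {
		value = "null"
	}
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"labels":{"node-role.kubernetes.io/appworker":%s}}}`, value))
	_, err := client.CoreV1().Nodes().Patch(
		ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```

Because a JSON merge patch deletes a key when its value is `null`, the same helper covers both the
initial labelling and the cleanup at the end of the test.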
#### Sudden-loss remediation lifecycle

The sudden-loss test observes the full remediation lifecycle:

1. Node `Ready` condition transitions to `Unknown` (~40s after power-off)
2. NHC detects the unhealthy condition and creates a `SelfNodeRemediation` CR (~60s after condition change)
3. SNR fences the node by applying the `out-of-service` taint (see the taint check sketched after this list)
4. The stateful pod is force-evicted and rescheduled on a healthy failover node
5. The PVC is reattached and the pod becomes Ready on the new node
6. The node is powered back on via BMC and returns to `Ready` state
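
The `out-of-service` taint applied in step 3 is the standard Kubernetes non-graceful-shutdown
taint, `node.kubernetes.io/out-of-service`. A check for it might look like the sketch below; the
`hasOutOfServiceTaint` helper is illustrative, not the test's actual code:

```go
package nhcsketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// hasOutOfServiceTaint reports whether SNR has fenced the node with the
// standard non-graceful-shutdown taint that triggers forced pod eviction.
func hasOutOfServiceTaint(ctx context.Context, client kubernetes.Interface, nodeName string) (bool, error) {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, taint := range node.Spec.Taints {
		if taint.Key == "node.kubernetes.io/out-of-service" {
			return true, nil
		}
	}
	return false, nil
}
```

The planned-reboot test can use the same predicate for its post-upgrade check that no
`out-of-service` taints remain.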
#### Planned-reboot non-remediation lifecycle

The planned-reboot test observes the **absence** of remediation during a cluster upgrade:

1. A stateful application is deployed on a target worker node
2. A cluster upgrade is initiated by patching the `ClusterVersion` resource
3. Throughout the upgrade (~1.5–2.5 hours), the test polls every 30s to verify that no
   `SelfNodeRemediation` resources are created for any worker node (a polling sketch follows
   this list)
4. After the upgrade completes, the test verifies that NHC reports all nodes healthy,
   no `out-of-service` taints exist, all cluster operators are available, and the stateful
   application survived
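
A minimal sketch of the polling assertion in step 3, using a dynamic client; the
`self-node-remediation.medik8s.io` group and version for the `SelfNodeRemediation` CRD follow the
upstream medik8s project and are an assumption here:

```go
package nhcsketch

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// snrGVR identifies SelfNodeRemediation resources across all namespaces.
var snrGVR = schema.GroupVersionResource{
	Group:    "self-node-remediation.medik8s.io",
	Version:  "v1alpha1",
	Resource: "selfnoderemediations",
}

// assertNoRemediation polls at the given interval until the timeout elapses,
// failing fast if any SelfNodeRemediation resource appears.
func assertNoRemediation(ctx context.Context, client dynamic.Interface, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		list, err := client.Resource(snrGVR).List(ctx, metav1.ListOptions{})
		if err != nil {
			return err
		}
		if n := len(list.Items); n > 0 {
			return fmt.Errorf("unexpected remediation: %d SelfNodeRemediation resource(s) found", n)
		}
		time.Sleep(interval)
	}
	return nil
}
```

Called as, for example, `assertNoRemediation(ctx, client, 30*time.Second, 150*time.Minute)` to
cover the upgrade window described above.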
#### Operators

* **Node Health Check operator** (namespace: `openshift-workload-availability`)
### Test suites:

| Name | Label | Description |
|------|-------|-------------|
|[sudden-node-loss](tests/sudden-node-loss.go)|`sudden-loss`| Powers off a worker node via BMC and verifies NHC/SNR remediation and pod rescheduling |
|[planned-node-reboot](tests/planned-node-reboot.go)|`planned-reboot`| Initiates a cluster upgrade and verifies NHC does **not** remediate during planned node reboots |
### Internal pkgs
Environment variables for test configuration:

#### Common (both tests)

- `ECO_RHWA_NHC_TARGET_WORKER`: FQDN of the worker node to target
- `ECO_RHWA_NHC_FAILOVER_WORKERS`: comma-separated list of worker FQDNs eligible for pod rescheduling
- `ECO_RHWA_NHC_STORAGE_CLASS`: StorageClass name for the test PVC (e.g. `standard`)
- `ECO_RHWA_NHC_APP_IMAGE`: container image for the stateful test application

#### Sudden-loss only

- `ECO_RHWA_NHC_TARGET_WORKER_BMC`: JSON object with BMC connection details, e.g. `{"address":"10.1.29.13","username":"user","password":"pass"}`

#### Planned-reboot only

- `ECO_RHWA_NHC_UPGRADE_IMAGE`: the target OCP release image for the upgrade (must be pre-mirrored in disconnected environments)
- `ECO_RHWA_NHC_UPGRADE_CHANNEL`: the update channel (e.g. `stable-4.22`)
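
For illustration, initiating the upgrade from these two variables could look like the sketch below.
The merge patch against the cluster-scoped `ClusterVersion` object named `version` is an assumed
approach, and `"force":true` (needed only for release images outside the update graph) may not
match what the test actually sets:

```go
package nhcsketch

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

var clusterVersionGVR = schema.GroupVersionResource{
	Group:    "config.openshift.io",
	Version:  "v1",
	Resource: "clusterversions",
}

// startUpgrade patches ClusterVersion with the channel and release image
// taken from the test's environment variables.
func startUpgrade(ctx context.Context, client dynamic.Interface) error {
	image := os.Getenv("ECO_RHWA_NHC_UPGRADE_IMAGE")
	channel := os.Getenv("ECO_RHWA_NHC_UPGRADE_CHANNEL")
	patch := fmt.Sprintf(
		`{"spec":{"channel":%q,"desiredUpdate":{"image":%q,"force":true}}}`,
		channel, image)
	_, err := client.Resource(clusterVersionGVR).Patch(
		ctx, "version", types.MergePatchType, []byte(patch), metav1.PatchOptions{})
	return err
}
```

Once the patch lands, the Cluster Version Operator drives the rollout, including the
MachineConfigPool-driven worker reboots described above.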
Please refer to the project README for a list of global inputs - [How to run](../../../README.md#how-to-run)