
Commit afa9fdd

rhwa nhc: add planned-reboot-during-upgrade system test
Verify that the NHC operator does NOT trigger remediation when a worker node
reboots as part of a planned OCP cluster upgrade.

The test deploys a stateful app, initiates a cluster upgrade, and polls
throughout the upgrade to confirm no SelfNodeRemediation resources are
created. Post-upgrade checks verify all nodes are healthy, no out-of-service
taints remain, and the app survived.

Related: ECOPROJECT-2283

Co-Authored-By: Claude
1 parent f1793f3 commit afa9fdd

6 files changed

Lines changed: 683 additions & 32 deletions


tests/rhwa/internal/rhwaconfig/rhwaconfig.go

Lines changed: 4 additions & 0 deletions
```diff
@@ -45,6 +45,10 @@ type RHWAConfig struct {
 	StorageClass string `yaml:"nhc_storage_class" envconfig:"ECO_RHWA_NHC_STORAGE_CLASS"`
 	AppImage string `yaml:"nhc_app_image" envconfig:"ECO_RHWA_NHC_APP_IMAGE"`
 	TargetWorkerBMC BMCDetails `yaml:"nhc_target_worker_bmc" envconfig:"ECO_RHWA_NHC_TARGET_WORKER_BMC"`
+
+	// NHC planned-reboot (upgrade) test configuration.
+	UpgradeImage string `yaml:"nhc_upgrade_image" envconfig:"ECO_RHWA_NHC_UPGRADE_IMAGE"`
+	UpgradeChannel string `yaml:"nhc_upgrade_channel" envconfig:"ECO_RHWA_NHC_UPGRADE_CHANNEL"`
 }

 // NewRHWAConfig returns instance of RHWA config type.
```
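As an aside for reviewers, a minimal sketch of how `envconfig` tags like the two added above map environment variables onto struct fields, assuming the `github.com/kelseyhightower/envconfig` library that the tag syntax suggests (the struct below is a stand-in, not the real `RHWAConfig`):

```go
package main

import (
	"fmt"
	"log"

	"github.com/kelseyhightower/envconfig"
)

// upgradeConfig is a stand-in for the two fields added to RHWAConfig.
type upgradeConfig struct {
	UpgradeImage   string `envconfig:"ECO_RHWA_NHC_UPGRADE_IMAGE"`
	UpgradeChannel string `envconfig:"ECO_RHWA_NHC_UPGRADE_CHANNEL"`
}

func main() {
	var cfg upgradeConfig
	// Process reads each tagged field from its environment variable.
	if err := envconfig.Process("", &cfg); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("upgrade image %q on channel %q\n", cfg.UpgradeImage, cfg.UpgradeChannel)
}
```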

tests/rhwa/nhc-operator/README.md

Lines changed: 106 additions & 31 deletions
```diff
@@ -4,13 +4,23 @@

 NHC operator tests validate that the Node Health Check (NHC) and Self Node Remediation (SNR) operators
 work together to detect unhealthy nodes and remediate them by fencing and evicting stateful workloads
-to healthy nodes.
+to healthy nodes — and, equally important, that they do **not** interfere with planned maintenance
+operations such as cluster upgrades.

-The first test scenario is **Sudden loss of a node**: a healthy MNO cluster experiences the unexpected
-shutdown of a worker node running a stateful application. The NHC operator detects the node failure,
-creates a `SelfNodeRemediation` resource, and the SNR operator applies an `out-of-service` taint to
-fence the node. Kubernetes then force-evicts the stateful pod and reschedules it on a healthy node,
-reattaching its persistent storage.
+There are two test scenarios:
+
+1. **Sudden loss of a node**: a healthy MNO cluster experiences the unexpected shutdown of a worker
+   node running a stateful application. The NHC operator detects the node failure, creates a
+   `SelfNodeRemediation` resource, and the SNR operator applies an `out-of-service` taint to fence
+   the node. Kubernetes then force-evicts the stateful pod and reschedules it on a healthy node,
+   reattaching its persistent storage.
+
+2. **Planned reboot of a node during cluster upgrade**: a cluster upgrade is initiated while a
+   stateful application is running on a worker node. Worker nodes reboot as part of the
+   MachineConfigPool rollout. The NHC operator detects the ongoing upgrade (by observing the
+   difference between `currentConfig` and `desiredConfig` in the MCP) and does **not** trigger
+   remediation. The test verifies that no `SelfNodeRemediation` resources are created during the
+   entire upgrade process, and that the stateful application survives the upgrade.

 ### Prerequisites for running these tests:

```
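To make the detection mechanism in scenario 2 concrete, a simplified sketch of that kind of planned-reboot check follows. It is not NHC's actual implementation; it assumes the standard MCO node annotations (`machineconfiguration.openshift.io/currentConfig` and `machineconfiguration.openshift.io/desiredConfig`) and plain client-go:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// nodeRebootIsPlanned reports whether the node is mid-rollout: during an MCP
// update the Machine Config Operator bumps desiredConfig ahead of
// currentConfig, so a mismatch indicates a planned reboot, not a failure.
func nodeRebootIsPlanned(ctx context.Context, client kubernetes.Interface, nodeName string) (bool, error) {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}

	current := node.Annotations["machineconfiguration.openshift.io/currentConfig"]
	desired := node.Annotations["machineconfiguration.openshift.io/desiredConfig"]

	return current != desired, nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	planned, err := nodeRebootIsPlanned(context.TODO(),
		kubernetes.NewForConfigOrDie(cfg), "openshift-worker-0.example.com")
	if err != nil {
		panic(err)
	}

	fmt.Println("planned reboot in progress:", planned)
}
```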
```diff
@@ -19,31 +29,37 @@ and configuration.

 It has been run successfully on these OCP versions:
 - 4.19
+- 4.21

-It has been tested on bare metal nodes. For virtualised infrastructure, a virtual BMC must be used,
-such as:
+#### Notes about the infrastructure
+
+Both scenarios have been tested on bare-metal nodes. To run the **Sudden loss of a node** in a
+virtualised infrastructure, a virtual BMC must be used, such as:

 - sushy-emulator (from the sushy project) — exposes a Redfish API that maps to libvirt VM power
   operations
 - VirtualBMC (vbmc) — maps IPMI commands to libvirt, though the test uses Redfish not IPMI

 With the sushy-emulator running on the hypervisor, the ECO_RHWA_NHC_TARGET_WORKER_BMC environment
-variable must point at the sushy endpoint,
-e.g. `{"address":"hypervisor:8000","username":"admin","password":"password"}`). The VMs must have a
+variable must point at the sushy endpoint,
+e.g. `{"address":"hypervisor:8000","username":"admin","password":"password"}`. The VMs must have a
 watchdog device configured (e.g. i6300esb in libvirt), or set `isSoftwareRebootEnabled: true` as a
 fallback.

 #### Cluster topology

-* A Multi-Node OpenShift (MNO) cluster with **bare metal** or **virtualised** worker nodes
+* A Multi-Node OpenShift (MNO) cluster with **bare-metal** or **virtualised** worker nodes
 * At least **2 worker nodes** that will be used by the test (a target node and one or more
   failover nodes). The test labels the target node with `node-role.kubernetes.io/appworker`
   first to guarantee initial pod placement, then labels the failover nodes after the app is
   deployed. All labels are removed at the end
 * The target worker node must have **BMC/Redfish** (or iLO/IPMI) access for power control.
-  The test powers it off via BMC to simulate sudden power loss and powers it back on at the end
+  This is required by the **sudden-loss** test only (powers off the node via BMC to simulate
+  sudden power loss and powers it back on at the end)
+
+#### Sudden-loss remediation lifecycle

-The test observes the full remediation lifecycle:
+The sudden-loss test observes the full remediation lifecycle:

 1. Node `Ready` condition transitions to `Unknown` (~40s after power-off)
 2. NHC detects the unhealthy condition and creates a `SelfNodeRemediation` CR (~60s after condition change)
```
```diff
@@ -52,6 +68,18 @@ The test observes the full remediation lifecycle:
 5. The PVC is reattached and the pod becomes Ready on the new node
 6. The node is powered back on via BMC and returns to `Ready` state

+#### Planned-reboot non-remediation lifecycle
+
+The planned-reboot test observes the **absence** of remediation during a cluster upgrade:
+
+1. A stateful application is deployed on a target worker node
+2. A cluster upgrade is initiated by patching the `ClusterVersion` resource
+3. Throughout the upgrade (~1.5–2.5 hours), the test polls every 30s to verify that no
+   `SelfNodeRemediation` resources are created for any worker node
+4. After the upgrade completes, the test verifies that NHC reports all nodes healthy,
+   no `out-of-service` taints exist, all cluster operators are available, and the stateful
+   application survived
+
 #### Operators

 * **Node Health Check operator** (namespace: `openshift-workload-availability`)
```
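A rough sketch of what step 3 of the planned-reboot lifecycle above amounts to: list `SelfNodeRemediation` resources on a fixed interval and fail fast if any appear. The GVR below is an assumption based on the medik8s SNR API group (verify it against the CRDs on your cluster), and `done` is a hypothetical callback reporting upgrade completion:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// snrGVR is assumed from the medik8s SNR API group.
var snrGVR = schema.GroupVersionResource{
	Group:    "self-node-remediation.medik8s.io",
	Version:  "v1alpha1",
	Resource: "selfnoderemediations",
}

// assertNoRemediations polls every interval, returning an error as soon as
// any SelfNodeRemediation exists, or nil once done() reports completion.
func assertNoRemediations(ctx context.Context, client dynamic.Interface,
	interval time.Duration, done func() bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			list, err := client.Resource(snrGVR).List(ctx, metav1.ListOptions{})
			if err != nil {
				return err
			}
			if n := len(list.Items); n > 0 {
				return fmt.Errorf("unexpected remediation: %d SelfNodeRemediation resource(s) found", n)
			}
			if done() {
				return nil
			}
		}
	}
}
```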
```diff
@@ -91,9 +119,10 @@ can provide an up-to-date configuration and values for the settings above.

 ### Test suites:

-| Name | Description |
-|------|-------------|
-| [sudden-node-loss](tests/sudden-node-loss.go) | Powers off a worker node via BMC and verifies NHC/SNR remediation and pod rescheduling |
+| Name | Label | Description |
+|------|-------|-------------|
+| [sudden-node-loss](tests/sudden-node-loss.go) | `sudden-loss` | Powers off a worker node via BMC and verifies NHC/SNR remediation and pod rescheduling |
+| [planned-node-reboot](tests/planned-node-reboot.go) | `planned-reboot` | Initiates a cluster upgrade and verifies NHC does **not** remediate during planned node reboots |

 ### Internal pkgs

```
````diff
@@ -105,34 +134,67 @@ can provide an up-to-date configuration and values for the settings above.

 Environment variables for test configuration:

-- `ECO_RHWA_NHC_TARGET_WORKER`: FQDN of the worker node to power off (must match the BMC address)
+#### Common (both tests)
+
+- `ECO_RHWA_NHC_TARGET_WORKER`: FQDN of the worker node to target
 - `ECO_RHWA_NHC_FAILOVER_WORKERS`: comma-separated list of worker FQDNs eligible for pod rescheduling
 - `ECO_RHWA_NHC_STORAGE_CLASS`: StorageClass name for the test PVC (e.g. `standard`)
 - `ECO_RHWA_NHC_APP_IMAGE`: container image for the stateful test application
+
+#### Sudden-loss only
+
 - `ECO_RHWA_NHC_TARGET_WORKER_BMC`: JSON object with BMC connection details, e.g. `{"address":"10.1.29.13","username":"user","password":"pass"}`

+#### Planned-reboot only
+
+- `ECO_RHWA_NHC_UPGRADE_IMAGE`: the target OCP release image for the upgrade (must be pre-mirrored in disconnected environments)
+- `ECO_RHWA_NHC_UPGRADE_CHANNEL`: the update channel (e.g. `stable-4.22`)
+
 Please refer to the project README for a list of global inputs - [How to run](../../../README.md#how-to-run)

 ### Running NHC Test Suites

+#### Running the sudden-loss test
+
 ```bash
-# export KUBECONFIG=</path/to/kubeconfig>
-# export ECO_RHWA_NHC_TARGET_WORKER=openshift-worker-0.example.com
-# export ECO_RHWA_NHC_FAILOVER_WORKERS=openshift-worker-1.example.com
-# export ECO_RHWA_NHC_STORAGE_CLASS=standard
-# export ECO_RHWA_NHC_APP_IMAGE=registry.example.com:5000/test/ubi-minimal:latest
-# export ECO_RHWA_NHC_TARGET_WORKER_BMC='{"address":"10.1.29.13","username":"admin","password":"secret"}'
-# make run-tests
+export KUBECONFIG=</path/to/kubeconfig>
+export ECO_RHWA_NHC_TARGET_WORKER=openshift-worker-0.example.com
+export ECO_RHWA_NHC_FAILOVER_WORKERS=openshift-worker-1.example.com
+export ECO_RHWA_NHC_STORAGE_CLASS=standard
+export ECO_RHWA_NHC_APP_IMAGE=registry.example.com:5000/test/ubi-minimal:latest
+export ECO_RHWA_NHC_TARGET_WORKER_BMC='{"address":"10.1.29.13","username":"admin","password":"secret"}'
+
+go test ./tests/rhwa/nhc-operator/... -timeout=30m -ginkgo.label-filter="sudden-loss" -ginkgo.timeout=20m -v
 ```

-**Note on timeouts:** The `go test` command must use `-timeout` greater than the ginkgo timeout
-(e.g. `-timeout=30m` with `-ginkgo.timeout=20m`). If `go test` uses its default of 10 minutes,
-the Go test harness will kill the process before ginkgo can complete the test and run cleanup
-(AfterAll), which includes powering the node back on.
+#### Running the planned-reboot test
+
+```bash
+export KUBECONFIG=</path/to/kubeconfig>
+export ECO_RHWA_NHC_TARGET_WORKER=openshift-worker-0.example.com
+export ECO_RHWA_NHC_FAILOVER_WORKERS=openshift-worker-1.example.com
+export ECO_RHWA_NHC_STORAGE_CLASS=standard
+export ECO_RHWA_NHC_APP_IMAGE=registry.example.com:5000/test/ubi-minimal:latest
+export ECO_RHWA_NHC_UPGRADE_IMAGE=registry.example.com:5000/ocp/release:4.22.1
+export ECO_RHWA_NHC_UPGRADE_CHANNEL=stable-4.22
+
+go test ./tests/rhwa/nhc-operator/... -timeout=180m -ginkgo.label-filter="planned-reboot" -ginkgo.timeout=170m -v
+```
+
+**Note on timeouts:** The `go test` command must use `-timeout` greater than the ginkgo timeout.
+If `go test` uses its default of 10 minutes, the Go test harness will kill the process before
+ginkgo can complete the test and run cleanup (AfterAll).
+
+**Important:** The planned-reboot test **upgrades the cluster** and this operation is
+**irreversible**. The upgrade target image must be pre-mirrored to the local registry in
+disconnected environments. Plan for 1.5–2.5 hours of runtime.
+
+### Expected durations

-**Expected duration:** A full sudden-node-loss run typically takes **11–15 minutes** end-to-end,
-broken down as follows (observed on a 4-worker bare metal cluster with `unhealthyConditions.duration=60s`
-and `safeTimeToAssumeNodeRebootedSeconds=180`):
+#### Sudden-loss test: ~11–15 minutes
+
+Observed on a 4-worker bare-metal cluster with `unhealthyConditions.duration=60s`
+and `safeTimeToAssumeNodeRebootedSeconds=180`:

 | Phase | Typical duration | Notes |
 |-------|-----------------|-------|
````
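The "Patches ClusterVersion" step in the duration table below boils down to patching the cluster-scoped `version` object. A hedged sketch using the dynamic client: the image and channel reuse the placeholder values from the example above, and `force: true` (which skips release verification, as typically needed for pre-mirrored images) is illustrative, not necessarily what the test does:

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

var cvGVR = schema.GroupVersionResource{
	Group:    "config.openshift.io",
	Version:  "v1",
	Resource: "clusterversions",
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}

	// Merge-patch the channel and the desired update image on the single
	// ClusterVersion object, which is named "version".
	patch := []byte(`{"spec":{"channel":"stable-4.22","desiredUpdate":{` +
		`"image":"registry.example.com:5000/ocp/release:4.22.1","force":true}}}`)

	_, err = dynamic.NewForConfigOrDie(cfg).Resource(cvGVR).Patch(
		context.TODO(), "version", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}

	fmt.Println("upgrade requested")
}
```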
```diff
@@ -150,3 +212,16 @@ node can take several additional minutes to terminate, pushing Step 6 to 5–8 m
 case. Combined with the AfterAll node recovery, the total can reach ~17–20 minutes, which is why
 the ginkgo timeout is set to 20 minutes and the Go test timeout to 30 minutes.

+#### Planned-reboot test: ~1.5–2.5 hours
+
+The test is dominated by the cluster upgrade time. The test itself polls every 30 seconds and
+adds minimal overhead:
+
+| Phase | Typical duration | Notes |
+|-------|-----------------|-------|
+| BeforeAll: Deploy app & verify placement | ~1 min | Same as sudden-loss |
+| Step 4: Initiate upgrade & wait for start | ~5 min | Patches ClusterVersion, waits for Progressing |
+| Step 5: Poll during upgrade | 1–2 hours | Polls every 30s for SNR resources (fail-fast) and upgrade completion |
+| Steps 6–7: Post-upgrade verification | ~5 min | NHC/SNR clean, cluster operators available, app healthy |
+| AfterAll: Namespace cleanup | ~1 min | Labels restored |
+
```

tests/rhwa/nhc-operator/internal/nhcparams/const.go

Lines changed: 3 additions & 0 deletions
```diff
@@ -7,6 +7,9 @@ const (
 	// LabelSuddenLoss is the label for the sudden-loss test scenario.
 	LabelSuddenLoss = "sudden-loss"

+	// LabelPlannedReboot is the label for the planned-reboot test scenario.
+	LabelPlannedReboot = "planned-reboot"
+
 	// NHCResourceName is the name of the NodeHealthCheck CR.
 	NHCResourceName = "nhc-worker-self"

```
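Illustrative only: how a Ginkgo suite attaches one of these labels so that `-ginkgo.label-filter="planned-reboot"` selects it. The import path is inferred from the file tree in this commit:

```go
package tests

import (
	. "github.com/onsi/ginkgo/v2"

	"github.com/openshift-kni/eco-gotests/tests/rhwa/nhc-operator/internal/nhcparams"
)

// The Label decorator wires the spec into -ginkgo.label-filter selection.
var _ = Describe("NHC planned node reboot", Label(nhcparams.LabelPlannedReboot), func() {
	It("does not remediate workers during a cluster upgrade", func() {
		// test body elided
	})
})
```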

tests/rhwa/nhc-operator/internal/nhcparams/nhcvars.go

Lines changed: 9 additions & 0 deletions
```diff
@@ -49,4 +49,13 @@ var (

 	// BMCTimeout is the Redfish operation timeout.
 	BMCTimeout = 6 * time.Minute
+
+	// UpgradeStartTimeout is how long to wait for a cluster upgrade to start.
+	UpgradeStartTimeout = 5 * time.Minute
+
+	// UpgradeCompleteTimeout is how long to wait for a cluster upgrade to complete.
+	UpgradeCompleteTimeout = 150 * time.Minute
+
+	// UpgradePollingInterval is the polling interval for upgrade observation.
+	UpgradePollingInterval = 30 * time.Second
 )
```
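A sketch of how these knobs might drive a Gomega polling loop; `upgradeCompleted` is a hypothetical helper standing in for the real ClusterVersion check, not part of this commit:

```go
package tests

import (
	. "github.com/onsi/gomega"

	"github.com/openshift-kni/eco-gotests/tests/rhwa/nhc-operator/internal/nhcparams"
)

// upgradeCompleted is a hypothetical stand-in for the real completion check.
func upgradeCompleted() bool {
	return false // e.g. inspect ClusterVersion conditions here
}

// waitForUpgrade polls every UpgradePollingInterval until the upgrade
// completes or UpgradeCompleteTimeout elapses.
func waitForUpgrade() {
	Eventually(upgradeCompleted).
		WithTimeout(nhcparams.UpgradeCompleteTimeout).
		WithPolling(nhcparams.UpgradePollingInterval).
		Should(BeTrue(), "cluster upgrade did not complete in time")
}
```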

tests/rhwa/nhc-operator/tests/nhc.go

Lines changed: 2 additions & 1 deletion
```diff
@@ -35,12 +35,13 @@ var _ = Describe(
 		listOptions := metav1.ListOptions{
 			LabelSelector: fmt.Sprintf("app.kubernetes.io/name=%s", nhcparams.OperatorControllerPodLabel),
 		}
-		_, err := pod.WaitForAllPodsInNamespaceRunning(
+		ok, err := pod.WaitForAllPodsInNamespaceRunning(
 			APIClient,
 			rhwaparams.RhwaOperatorNs,
 			rhwaparams.DefaultTimeout,
 			listOptions,
 		)
 		Expect(err).ToNot(HaveOccurred(), "Pod is not ready")
+		Expect(ok).To(BeTrue(), "expected pods to be found and running")
 	})
 })
```
