|
| 1 | +# RHWA Team - Node Health Check Operator |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +NHC operator tests validate that the Node Health Check (NHC) and Self Node Remediation (SNR) operators |
| 6 | +work together to detect unhealthy nodes and remediate them by fencing and evicting stateful workloads |
| 7 | +to healthy nodes. |
| 8 | + |
| 9 | +The first test scenario is **Sudden loss of a node**: a healthy Multi-Node OpenShift (MNO) cluster experiences the unexpected |
| 10 | +shutdown of a worker node running a stateful application. The NHC operator detects the node failure, |
| 11 | +creates a `SelfNodeRemediation` resource, and the SNR operator applies an `out-of-service` taint to |
| 12 | +fence the node. Kubernetes then force-evicts the stateful pod and reschedules it on a healthy node, |
| 13 | +reattaching its persistent storage. |
| 14 | + |
| 15 | +### Prerequisites for running these tests: |
| 16 | + |
| 17 | +The test suite is designed to run on an OCP cluster version 4.19+ with the following components |
| 18 | +and configuration. |
| 19 | + |
| 20 | +It has been run successfully on these OCP versions: |
| 21 | +- 4.19 |
| 22 | + |
| 23 | +It has been tested on bare-metal nodes. For virtualised infrastructure, a virtual BMC must be used, |
| 24 | +such as: |
| 25 | + |
| 26 | + - sushy-emulator (from the sushy project) — exposes a Redfish API that maps to libvirt VM power |
| 27 | + operations |
| 28 | + - VirtualBMC (vbmc) — maps IPMI commands to libvirt, though the test uses Redfish not IPMI |
| 29 | + |
| 30 | +With the sushy-emulator running on the hypervisor, the `ECO_RHWA_NHC_TARGET_WORKER_BMC` environment |
| 31 | +variable must point at the sushy endpoint |
| 32 | +(e.g. `{"address":"hypervisor:8000","username":"admin","password":"password"}`). The VMs must have a |
| 33 | +watchdog device configured (e.g. i6300esb in libvirt), or `isSoftwareRebootEnabled: true` must be set |
| 34 | +in the `SelfNodeRemediationConfig` CR as a fallback. |
| 35 | + |
| 36 | +#### Cluster topology |
| 37 | + |
| 38 | +* A Multi-Node OpenShift (MNO) cluster with **bare-metal** or **virtualised** worker nodes |
| 39 | +* At least **2 worker nodes** that will be used by the test (a target node and one or more |
| 40 | + failover nodes). The test labels the target node with `node-role.kubernetes.io/appworker` |
| 41 | + first to guarantee initial pod placement, then labels the failover nodes after the app is |
| 42 | + deployed. All labels are removed at the end |
| 43 | +* The target worker node must have **BMC/Redfish** (or iLO/IPMI) access for power control. |
| 44 | + The test powers it off via BMC to simulate sudden power loss and powers it back on at the end |
| 45 | + |
| 46 | +The test observes the full remediation lifecycle: |
| 47 | + |
| 48 | +1. Node `Ready` condition transitions to `Unknown` (~40s after power-off) |
| 49 | +2. NHC detects the unhealthy condition and creates a `SelfNodeRemediation` CR (~60s after condition change) |
| 50 | +3. SNR fences the node with an `out-of-service` taint (after the ~180s `safeTimeToAssumeNodeRebootedSeconds` delay) |
| 51 | +4. The stateful pod is evicted and rescheduled on a healthy node |
| 52 | +5. The PVC is reattached and the pod becomes Ready on the new node |
| 53 | +6. The node is powered back on via BMC and returns to `Ready` state |
| 54 | + |
| 55 | +#### Operators |
| 56 | + |
| 57 | +* **Node Health Check operator** (namespace: `openshift-workload-availability`) |
| 58 | +* **Self Node Remediation operator** (installed as default remediation provider by NHC) |
| 59 | + |
| 60 | +#### Operator configuration |
| 61 | + |
| 62 | +* A `SelfNodeRemediationTemplate` CR with `remediationStrategy: OutOfServiceTaint` |
| 63 | +* A `NodeHealthCheck` CR (named `nhc-worker-self`) configured with: |
| 64 | + * A `selector` matching the worker nodes monitored by NHC (e.g. `node-role.kubernetes.io/worker`). |
| 65 | + The selector must match the target and failover nodes |
| 66 | + * `minHealthy` set to a value that is **still satisfied** when one node goes down. |
| 67 | + For example, with 4 workers under NHC, use `75%` — losing 1 node leaves 3/4 = 75% healthy, |
| 68 | + which meets the threshold. If `minHealthy` is too high (e.g. `90%` with 4 nodes requires |
| 69 | + all 4 healthy), NHC will not remediate |
| 70 | + * `unhealthyConditions` with `duration: 60s` for `Ready` in `False` and `Unknown` status |
| 71 | + * A `remediationTemplate` pointing to the `SelfNodeRemediationTemplate` above |
| 72 | +* A `SelfNodeRemediationConfig` CR with `safeTimeToAssumeNodeRebootedSeconds: 180` |
| 73 | + |
| 74 | +The [Telco Reference CRs](https://github.com/openshift-kni/telco-reference/) |
| 75 | +can provide an up-to-date configuration and values for the settings above. |
| 76 | + |
| 77 | +#### Storage |
| 78 | + |
| 79 | +* A **StorageClass** capable of dynamically provisioning `ReadWriteOnce` PersistentVolumes |
| 80 | + (e.g. NFS-based). The test creates a 1Gi PVC for the stateful application. The storage |
| 81 | + must support volume reattachment to a different node after the original node is fenced |
| 82 | +* The test verifies `VolumeAttachment` resources for CSI-backed storage. For non-CSI storage |
| 83 | + (e.g. NFS), this check is skipped; a `Bound` PVC and a `Running` pod on the new node |
| 84 | + are sufficient verification |
| 85 | + |
| 86 | +#### Container image |
| 87 | + |
| 88 | +* A container image accessible from the cluster (e.g. `ubi-minimal`). In disconnected |
| 89 | + environments, mirror it to the local registry. The test uses this image to run a simple |
| 90 | + heartbeat loop as the stateful application |
| 91 | + |
| 92 | +### Test suites: |
| 93 | + |
| 94 | +| Name | Description | |
| 95 | +|------|-------------| |
| 96 | +| [sudden-node-loss](tests/sudden-node-loss.go) | Powers off a worker node via BMC and verifies NHC/SNR remediation and pod rescheduling | |
| 97 | + |
| 98 | +### Internal pkgs |
| 99 | + |
| 100 | +| Name | Description | |
| 101 | +|------|-------------| |
| 102 | +| [nhcparams](internal/nhcparams/const.go) | Constants, labels, timeouts, and reporter configuration for NHC tests | |
| 103 | + |
| 104 | +### Inputs |
| 105 | + |
| 106 | +Environment variables for test configuration: |
| 107 | + |
| 108 | +- `ECO_RHWA_NHC_TARGET_WORKER`: FQDN of the worker node to power off (must be the node controlled by the BMC in `ECO_RHWA_NHC_TARGET_WORKER_BMC`) |
| 109 | +- `ECO_RHWA_NHC_FAILOVER_WORKERS`: comma-separated list of worker FQDNs eligible for pod rescheduling |
| 110 | +- `ECO_RHWA_NHC_STORAGE_CLASS`: StorageClass name for the test PVC (e.g. `standard`) |
| 111 | +- `ECO_RHWA_NHC_APP_IMAGE`: container image for the stateful test application |
| 112 | +- `ECO_RHWA_NHC_TARGET_WORKER_BMC`: JSON object with BMC connection details, e.g. `{"address":"10.1.29.13","username":"user","password":"pass"}` |
| 113 | + |
| 114 | +Please refer to the project README for a list of global inputs - [How to run](../../../README.md#how-to-run) |
| 115 | + |
| 116 | +### Running NHC Test Suites |
| 117 | + |
| 118 | +```bash |
| 119 | +# export KUBECONFIG=</path/to/kubeconfig> |
| 120 | +# export ECO_RHWA_NHC_TARGET_WORKER=openshift-worker-0.example.com |
| 121 | +# export ECO_RHWA_NHC_FAILOVER_WORKERS=openshift-worker-1.example.com |
| 122 | +# export ECO_RHWA_NHC_STORAGE_CLASS=standard |
| 123 | +# export ECO_RHWA_NHC_APP_IMAGE=registry.example.com:5000/test/ubi-minimal:latest |
| 124 | +# export ECO_RHWA_NHC_TARGET_WORKER_BMC='{"address":"10.1.29.13","username":"admin","password":"secret"}' |
| 125 | +# make run-tests |
| 126 | +``` |
| 127 | + |
| 128 | +**Note on timeouts:** The `go test` command must use `-timeout` greater than the ginkgo timeout |
| 129 | +(e.g. `-timeout=30m` with `-ginkgo.timeout=20m`). If `go test` uses its default of 10 minutes, |
| 130 | +the Go test harness will kill the process before ginkgo can complete the test and run cleanup |
| 131 | +(AfterAll), which includes powering the node back on. |
| 132 | + |
| 133 | +**Expected duration:** A full sudden-node-loss run typically takes **11–15 minutes** end-to-end, |
| 134 | +broken down as follows (observed on a 4-worker bare-metal cluster with `unhealthyConditions.duration=60s` |
| 135 | +and `safeTimeToAssumeNodeRebootedSeconds=180`): |
| 136 | + |
| 137 | +| Phase | Typical duration | Notes | |
| 138 | +|-------|-----------------|-------| |
| 139 | +| Step 3: Deploy app & verify placement | ~10s | PVC binding + pod scheduling | |
| 140 | +| Step 4: Power off node & detect failure | ~50s | ~40s for kubelet heartbeat timeout | |
| 141 | +| Step 5: NHC marks unhealthy & creates SNR | ~60s | Matches `unhealthyConditions.duration` | |
| 142 | +| Step 6: SNR fences node (out-of-service taint) | 3–5 min | 180s fence timer + SNR waits for all pods on the dead node to finish terminating; system pods like `dns-default` can extend this | |
| 143 | +| Step 7: Verify rescheduling | < 1s | Pod is rescheduled as soon as taint is applied | |
| 144 | +| AfterAll: Power on node & wait for Ready | ~5 min | Bare metal boot + kubelet registration | |
| 145 | + |
| 146 | +Step 6 is the most variable: after the 180s `safeTimeToAssumeNodeRebootedSeconds` timer expires, |
| 147 | +the SNR operator waits for all terminating pods on the fenced node to complete deletion before |
| 148 | +marking fencing as complete. System pods (e.g. `dns-default`, `ingress-canary`) on an unreachable |
| 149 | +node can take several additional minutes to terminate, pushing Step 6 to 5–8 minutes in the worst |
| 150 | +case. Combined with the AfterAll node recovery, the total can reach ~17–20 minutes, which is why |
| 151 | +the ginkgo timeout is set to 20 minutes and the Go test timeout to 30 minutes. |
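Because the AfterAll cleanup relies on BMC access to power the node back on, it can help to validate `ECO_RHWA_NHC_TARGET_WORKER_BMC` before starting a run. The following is a minimal sketch of such a check. The `bmcDetails` type and `parseBMC` function are hypothetical names introduced for illustration and are not taken from the test suite; only the JSON keys (`address`, `username`, `password`) come from the Inputs section.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

// bmcDetails is a hypothetical struct mirroring the JSON keys documented
// for ECO_RHWA_NHC_TARGET_WORKER_BMC.
type bmcDetails struct {
	Address  string `json:"address"`
	Username string `json:"username"`
	Password string `json:"password"`
}

// parseBMC decodes the JSON value and rejects objects with missing fields.
func parseBMC(raw string) (bmcDetails, error) {
	var details bmcDetails
	if err := json.Unmarshal([]byte(raw), &details); err != nil {
		return bmcDetails{}, fmt.Errorf("invalid BMC JSON: %w", err)
	}
	if details.Address == "" || details.Username == "" || details.Password == "" {
		return bmcDetails{}, errors.New("address, username and password must all be set")
	}
	return details, nil
}

func main() {
	raw := os.Getenv("ECO_RHWA_NHC_TARGET_WORKER_BMC")
	if raw == "" {
		// Fall back to the example value from the Inputs section.
		raw = `{"address":"10.1.29.13","username":"user","password":"pass"}`
	}
	details, err := parseBMC(raw)
	if err != nil {
		fmt.Println("BMC config error:", err)
		return
	}
	// Print only the address so credentials never reach the logs.
	fmt.Println("BMC address:", details.Address)
}
```

Failing fast on a malformed value avoids a run that powers nothing off, or worse, one that cannot power the node back on.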
| 152 | + |
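The `minHealthy` guidance in the operator configuration section comes down to a small piece of arithmetic. The sketch below reproduces the two README examples (75% and 90% with 4 workers). It assumes the percentage is rounded up to a whole number of nodes, which is consistent with the README's statement that 90% of 4 nodes requires all 4 healthy; it is not a copy of the NHC operator's own logic, and `requiredHealthy` and `remediationAllowed` are hypothetical helper names.

```go
package main

import "fmt"

// requiredHealthy returns how many nodes must remain healthy for the given
// minHealthy percentage, rounding up (assumption, see the lead-in).
func requiredHealthy(totalNodes, minHealthyPercent int) int {
	return (totalNodes*minHealthyPercent + 99) / 100
}

// remediationAllowed reports whether the threshold is still satisfied
// when unhealthyNodes nodes are down.
func remediationAllowed(totalNodes, unhealthyNodes, minHealthyPercent int) bool {
	return totalNodes-unhealthyNodes >= requiredHealthy(totalNodes, minHealthyPercent)
}

func main() {
	const workers = 4
	const down = 1
	for _, percent := range []int{75, 90} {
		fmt.Printf("minHealthy=%d%%: need %d of %d healthy, remediation allowed with %d down: %t\n",
			percent, requiredHealthy(workers, percent), workers, down,
			remediationAllowed(workers, down, percent))
	}
}
```

Running the same check against the actual number of nodes matched by the `NodeHealthCheck` selector shows whether losing the target worker still leaves the threshold satisfied.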