| 1 | +# Medik8s - Kubernetes Node Remediation |
| 2 | + |
| 3 | +Medik8s is a project consisting of several Kubernetes operators that provide automatic node remediation and high availability for singleton workloads. |
| 4 | + |
| 5 | +Hardware is imperfect, and software contains bugs. When node-level failures such as kernel hangs or dead NICs occur, the work required from the cluster does not decrease - workloads from affected nodes need to be restarted somewhere. |
| 6 | + |
| 7 | +However, some workloads, such as RWO volumes and StatefulSets, may require at-most-one semantics. Failures affecting these kinds of workloads risk data loss and/or corruption if nodes (and the workloads running on them) are assumed to be dead whenever we stop hearing from them. For this reason, it is important to know that the node has reached a safe state before initiating recovery of the workload. |
| 8 | + |
| 9 | +Unfortunately, it is not always practical to require admin intervention in order to confirm the node’s true status. In order to automate the recovery of exclusive workloads, Medik8s presents a collection of projects that can be installed on any Kubernetes-based cluster to automate: |
| 10 | +* The detection of failures |
| 11 | +* Putting nodes into a safe state |
| 12 | +* Allowing the scheduler to recover affected workloads |
| 13 | +* Attempting to restore cluster capacity |
| 14 | + |
| 15 | + |
| 16 | +## Prerequisites |
| 17 | + |
| 18 | +- A Kubernetes cluster. |
| 19 | + |
| 20 | + |
| 21 | +## Parameters |
| 22 | + |
| 23 | +To deploy the Medik8s pack, review the following parameters in the pack's YAML and adjust them if necessary. |
| 24 | + |
| 25 | +| Name | Description | Type | Default Value | Required | |
| 26 | +| --- | --- | --- | --- | --- | |
| 27 | +| `medik8s.olm.install` | Whether to install the Operator Lifecycle Manager (OLM). If OLM is already installed in your cluster by other means, set this to `false` | Boolean | `true` | Yes | |
| 28 | +| `medik8s.olm.catalog.image` | Where to get the OperatorHub Catalog from. Adjust this if you use a private registry | String | quay.io/operatorhubio/catalog:latest | Yes | |
| 29 | +| `medik8s.nodeHealthChecks[*].spec.minHealthy` | The minimum percentage of nodes that must be healthy for the operator to act. | String | 51% | Yes | |
| 30 | +| `medik8s.nodeHealthChecks[*].spec.unhealthyConditions[*].duration` | Time that nodes can be unhealthy before the operator acts. | String | 300s | Yes | |
| 31 | +| `medik8s.oselfNodeRemediationConfigs[*].spec.apiCheckInterval` | Frequency to check connectivity with each API server | String | 15s | Yes | |
| 32 | +| `medik8s.oselfNodeRemediationConfigs[*].spec.apiServerTimeout` | Timeout to check connectivity with each API server. When this timeout elapses, the Operator starts remediation | String | 5s | Yes | |
| 33 | +| `medik8s.oselfNodeRemediationConfigs[*].spec.hostPort` | Host port used for internal communication between SNR agents. Do not change this value. | Integer | 30001 | Yes | |
| 34 | +| `medik8s.oselfNodeRemediationConfigs[*].spec.isSoftwareRebootEnabled` | Whether to enable software reboot of unhealthy nodes | Boolean | `true` | Yes | |
| 35 | +| `medik8s.oselfNodeRemediationConfigs[*].spec.maxApiErrorThreshold` | Number of API check errors after which the node starts contacting its peers | Integer | 3 | Yes | |
| 36 | +| `medik8s.oselfNodeRemediationConfigs[*].spec.peerApiServerTimeout` | Timeout for the peer to connect to the API server | String | 5s | Yes | |
| 37 | +| `medik8s.oselfNodeRemediationConfigs[*].spec.peerDialTimeout` | Timeout for establishing connection with the peer | String | 5s | Yes | |
| 38 | +| `medik8s.oselfNodeRemediationConfigs[*].spec.peerRequestTimeout` | Timeout to get a response from the peer | String | 5s | Yes | |
| 39 | +| `medik8s.oselfNodeRemediationConfigs[*].spec.peerUpdateInterval` | Frequency to update peer information, such as IP address | String | 15m | Yes | |
| 40 | +| `medik8s.oselfNodeRemediationConfigs[*].spec.watchdogFilePath` | File path of the watchdog device in the nodes. If a watchdog device is unavailable, the SelfNodeRemediationConfig CR uses a software reboot | String | /dev/watchdog | Yes | |
| 41 | + |
| 42 | + |
| 43 | +Review the node failure detection information at the [Medik8s website](https://www.medik8s.io/failure_detection/) for more details on healthcheck behavior. |
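As a rough illustration, the parameter paths in the table above map onto nested YAML along these lines. This is a minimal sketch using only the documented defaults; the exact layout of the pack's values file, and any other fields each list entry carries (for example the type and status of each unhealthy condition), are assumptions and should be kept as they appear in the pack itself.

```yaml
# Sketch only: structure inferred from the parameter paths in the table.
medik8s:
  olm:
    install: true            # set to false if OLM is already installed
    catalog:
      image: quay.io/operatorhubio/catalog:latest
  nodeHealthChecks:
    - spec:
        minHealthy: "51%"    # the operator acts only while at least 51% of nodes are healthy
        unhealthyConditions:
          - duration: 300s   # how long a node may stay unhealthy before the operator acts
            # other fields of this entry are omitted; keep the pack's own values
```

With these defaults, a node that stays unhealthy for more than 300 seconds becomes a remediation candidate, provided that at least 51% of the nodes are still healthy.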
| 44 | + |
| 45 | +## Upgrade |
| 46 | + |
| 47 | +This pack deploys an operator, which takes care of phased upgrades. |
| 48 | + |
| 49 | + |
| 50 | +## Usage |
| 51 | + |
| 52 | +To use the Medik8s pack, first create a new [add-on cluster profile](https://docs.spectrocloud.com/profiles/cluster-profiles/create-cluster-profiles/create-addon-profile/), then add a pack and search for the **Medik8s** pack in the Palette Community Registry. Then either accept the defaults or modify them as needed. |
| 53 | + |
| 54 | +In its default configuration, the Node Healthcheck Controller will detect node failures based on Kubernetes `unhealthyConditions` (if they endure for longer than the maximum `duration`) and timeouts set in the `selfNodeRemediationConfig`. |
| 55 | + |
| 56 | +Once a failure has been detected, remediation will start. You can fine-tune the validation and remediation behavior by adjusting the `selfNodeRemediationConfig` section in the pack. Review the information about the [Self Node Remediation Configuration options](https://www.medik8s.io/remediation/self-node-remediation/configuration/) on the Medik8s website. |
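To make the tuning options more concrete, the following sketch shows the `spec` block of a single Self Node Remediation configuration entry, filled in with the defaults from the Parameters table. Only the `spec` portion is shown; the parent key and any surrounding metadata are deliberately left out and should be taken from the pack's YAML. The comments paraphrase the table and are not an authoritative description of the operator's behavior.

```yaml
# Sketch only: one entry's spec, using the documented default values.
spec:
  apiCheckInterval: 15s          # how often connectivity to each API server is checked
  apiServerTimeout: 5s           # timeout for each API server connectivity check
  maxApiErrorThreshold: 3        # after this many errors the node starts contacting its peers
  peerApiServerTimeout: 5s       # timeout for a peer to connect to the API server
  peerDialTimeout: 5s            # timeout for establishing a connection with a peer
  peerRequestTimeout: 5s         # timeout for getting a response from a peer
  peerUpdateInterval: 15m        # how often peer information (such as IP address) is refreshed
  isSoftwareRebootEnabled: true  # allow software reboot of unhealthy nodes
  watchdogFilePath: /dev/watchdog
  hostPort: 30001                # internal SNR agent communication; do not change
```

Shortening `apiCheckInterval` or lowering `maxApiErrorThreshold` would presumably make a node consult its peers sooner, at the cost of reacting to brief API server outages, so changes are best tested on a non-production cluster first.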
| 57 | + |
| 58 | +Once you have configured the pack, you can deploy it to a cluster. |
| 59 | + |
| 60 | + |
| 61 | +## References |
| 62 | + |
| 63 | +- [Medik8s website](https://www.medik8s.io/failure_detection/) |
| 64 | +- [Self Node Remediation Configuration options](https://www.medik8s.io/remediation/self-node-remediation/configuration/) |