Commit 68119ea

MediK8s node remediation (#113)
* MediK8s node remediation
  Signed-off-by: Kevin Reeuwijk <[email protected]>
* Update packa and add Readme
* Add CRD for SelfNodeRemediationConfig
* Add pack images content

---------

Signed-off-by: Kevin Reeuwijk <[email protected]>
1 parent 414e178 commit 68119ea

20 files changed: 15,884 additions and 0 deletions

packs/medik8s-1.0.0/README.md

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
# Medik8s - Kubernetes Node Remediation

Medik8s is a project consisting of several Kubernetes operators that provide automatic node remediation and high availability for singleton workloads.

Hardware is imperfect, and software contains bugs. When node-level failures such as kernel hangs or dead NICs occur, the work required from the cluster does not decrease: workloads from affected nodes need to be restarted somewhere.

However, some workloads, such as RWO volumes and StatefulSets, may require at-most-one semantics. Failures affecting these kinds of workloads risk data loss or corruption if nodes (and the workloads running on them) are assumed to be dead whenever we stop hearing from them. For this reason, it is important to know that the node has reached a safe state before initiating recovery of the workload.

Unfortunately, it is not always practical to require admin intervention to confirm the node's true status. To automate the recovery of exclusive workloads, Medik8s presents a collection of projects that can be installed on any Kubernetes-based cluster to automate:

* Detecting failures
* Putting nodes into a safe state
* Allowing the scheduler to recover affected workloads
* Attempting to restore cluster capacity

## Prerequisites

- A Kubernetes cluster.

## Parameters

To deploy the Medik8s pack, review the following parameters in the pack's YAML and adjust them if necessary.

| Name | Description | Type | Default Value | Required |
| --- | --- | --- | --- | --- |
| `medik8s.olm.install` | Whether to install the Operator Lifecycle Manager (OLM). If OLM is already installed in your cluster by other means, set this to `false`. | Boolean | `true` | Yes |
| `medik8s.olm.catalog.image` | Where to get the OperatorHub Catalog from. Adjust this if you use a private registry. | String | `quay.io/operatorhubio/catalog:latest` | Yes |
| `medik8s.nodeHealthChecks[*].spec.minHealthy` | The minimum percentage of nodes that must be healthy for the operator to act. | String | `51%` | Yes |
| `medik8s.nodeHealthChecks[*].spec.unhealthyConditions[*].duration` | Time that nodes can be unhealthy before the operator acts. | String | `300s` | Yes |
| `medik8s.oselfNodeRemediationConfigs[*].spec.apiCheckInterval` | Frequency of connectivity checks with each API server. | String | `15s` | Yes |
| `medik8s.oselfNodeRemediationConfigs[*].spec.apiServerTimeout` | Timeout for the connectivity check with each API server. When this timeout elapses, the operator starts remediation. | String | `5s` | Yes |
| `medik8s.oselfNodeRemediationConfigs[*].spec.hostPort` | Host port used for internal communication between SNR agents. Do not change. | Integer | `30001` | Yes |
| `medik8s.oselfNodeRemediationConfigs[*].spec.isSoftwareRebootEnabled` | Whether to enable software reboot of unhealthy nodes. | Boolean | `true` | Yes |
| `medik8s.oselfNodeRemediationConfigs[*].spec.maxApiErrorThreshold` | When this threshold is reached, the node starts contacting its peers. | Integer | `3` | Yes |
| `medik8s.oselfNodeRemediationConfigs[*].spec.peerApiServerTimeout` | Timeout for the peer to connect to the API server. | String | `5s` | Yes |
| `medik8s.oselfNodeRemediationConfigs[*].spec.peerDialTimeout` | Timeout for establishing a connection with the peer. | String | `5s` | Yes |
| `medik8s.oselfNodeRemediationConfigs[*].spec.peerRequestTimeout` | Timeout to get a response from the peer. | String | `5s` | Yes |
| `medik8s.oselfNodeRemediationConfigs[*].spec.peerUpdateInterval` | Frequency of peer information updates, such as IP address. | String | `15m` | Yes |
| `medik8s.oselfNodeRemediationConfigs[*].spec.watchdogFilePath` | File path of the watchdog device on the nodes. If a watchdog device is unavailable, the SelfNodeRemediationConfig CR uses a software reboot. | String | `/dev/watchdog` | Yes |

Review the node failure detection information on the [Medik8s website](https://www.medik8s.io/failure_detection/) for more details on health check behavior.

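As a rough illustration, the parameter paths in the table above suggest a values layout along the following lines. This is a sketch inferred only from the table's key paths and default values, not a copy of the pack's actual YAML: the real file may contain additional fields (for example resource names or selectors) that are omitted here, and the key spellings are reproduced exactly as the table lists them.

```yaml
# Sketch only: nesting inferred from the parameter paths in the table.
# Fields not listed in the table are intentionally omitted.
medik8s:
  olm:
    install: true            # set to false if OLM is already installed
    catalog:
      image: quay.io/operatorhubio/catalog:latest
  nodeHealthChecks:
    - spec:
        minHealthy: 51%
        unhealthyConditions:
          - duration: 300s   # other condition fields omitted
  oselfNodeRemediationConfigs:   # key spelled as it appears in the table
    - spec:
        apiCheckInterval: 15s
        apiServerTimeout: 5s
        hostPort: 30001      # per the table, do not change
        isSoftwareRebootEnabled: true
        maxApiErrorThreshold: 3
        peerApiServerTimeout: 5s
        peerDialTimeout: 5s
        peerRequestTimeout: 5s
        peerUpdateInterval: 15m
        watchdogFilePath: /dev/watchdog
```

Before editing, compare this against the values shipped with the pack, since the `[*]` in each path only indicates that the entry is a list item, not how many items exist.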
## Upgrade

This pack deploys an operator, which takes care of phased upgrades.

## Usage

To use the Medik8s pack, first create a new [add-on cluster profile](https://docs.spectrocloud.com/profiles/cluster-profiles/create-cluster-profiles/create-addon-profile/), add a pack, and search for the **Medik8s** pack in the Palette Community Registry. Then either accept the defaults or modify them as needed.

In its default configuration, the Node Healthcheck Controller detects node failures based on Kubernetes `unhealthyConditions` (if they persist for longer than the maximum `duration`) and the timeouts set in the `selfNodeRemediationConfig`.

Once a failure has been detected, remediation starts. You can fine-tune the validation and remediation behavior by adjusting the `selfNodeRemediationConfig` section in the pack. Review the information about the [Self Node Remediation Configuration options](https://www.medik8s.io/remediation/self-node-remediation/configuration/) on the Medik8s website.

Once you have configured the pack, you can deploy it to a cluster.

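To make the detection settings described in this section more concrete, the following is a minimal sketch of what a `NodeHealthCheck` resource with `minHealthy` and `unhealthyConditions` might look like, based on the general shape of the upstream Medik8s examples. The resource name, template name, namespace, and node selector are hypothetical placeholders, and the condition `type` and `status` values are assumptions; the resources this pack actually creates may differ.

```yaml
# Sketch only: names, namespace, and selector below are hypothetical.
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: example-nodehealthcheck          # hypothetical name
spec:
  minHealthy: 51%                        # default from the parameters table
  remediationTemplate:
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    name: example-snr-template           # hypothetical; use the template in your cluster
    namespace: example-namespace         # hypothetical
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/worker   # assumed label; adjust to your nodes
        operator: Exists
  unhealthyConditions:
    # Assumed conditions: a node counts as unhealthy when Ready is
    # False or Unknown for longer than the configured duration.
    - type: Ready
      status: "False"
      duration: 300s
    - type: Ready
      status: Unknown
      duration: 300s
```

With settings like these, remediation would only be triggered while at least 51% of the selected nodes remain healthy, which guards against acting on a cluster-wide outage.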
## References

- [Medik8s website](https://www.medik8s.io/failure_detection/)
- [Self Node Remediation Configuration options](https://www.medik8s.io/remediation/self-node-remediation/configuration/)
116 KB
Binary file not shown.
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
apiVersion: v2
name: medik8s
description: Kubernetes Node Remediation
# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 1.0.0

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "1.0.0"
