Commit 97698cb

Update pack and add Readme
1 parent 6f4847a commit 97698cb

6 files changed

+109
-3
lines changed

packs/medik8s-1.0.0/README.md

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
1+
# Medik8s - Kubernetes Node Remediation
2+
3+
Medik8s is a project consisting of several Kubernetes operators that provide automatic node remediation and high availability for singleton workloads.
4+
5+
Hardware is imperfect, and software contains bugs. When node-level failures such as kernel hangs or dead NICs occur, the work required from the cluster does not decrease: workloads from affected nodes need to be restarted somewhere.
6+
7+
However, some workloads, such as workloads with ReadWriteOnce (RWO) volumes and StatefulSets, may require at-most-one semantics. Failures affecting these kinds of workloads risk data loss or corruption if nodes (and the workloads running on them) are assumed to be dead whenever we stop hearing from them. For this reason, it is important to know that the node has reached a safe state before initiating recovery of the workload.
8+
9+
Unfortunately, it is not always practical to require admin intervention to confirm the node's true status. To enable recovery of exclusive workloads without manual intervention, Medik8s provides a collection of projects that can be installed on any Kubernetes-based cluster to automate:
10+
* Detecting failures
11+
* Putting nodes into a safe state
12+
* Allowing the scheduler to recover affected workloads
13+
* Attempting to restore cluster capacity
14+
15+
16+
## Prerequisites
17+
18+
- A Kubernetes cluster.
19+
20+
21+
## Parameters
22+
23+
To deploy the Medik8s pack, review the following parameters in the pack's YAML and adjust them if necessary.
24+
25+
| Name | Description | Type | Default Value | Required |
26+
| --- | --- | --- | --- | --- |
27+
| `medik8s.olm.install` | Whether to install the Operator Lifecycle Manager (OLM). If OLM is already installed in your cluster by other means, set this to `false` | Boolean | `true` | Yes |
28+
| `medik8s.olm.catalog.image` | Where to get the OperatorHub catalog from. Adjust this if you use a private registry. | String | `quay.io/operatorhubio/catalog:latest` | Yes |
29+
| `medik8s.nodeHealthChecks[*].spec.minHealthy` | The minimum percentage of nodes that must be healthy for the operator to act. | String | `51%` | Yes |
30+
| `medik8s.nodeHealthChecks[*].spec.unhealthyConditions[*].duration` | Time that nodes can be unhealthy before the operator acts. | String | `300s` | Yes |
31+
| `medik8s.selfNodeRemediationConfigs[*].spec.apiCheckInterval` | Frequency of connectivity checks against each API server. | String | `15s` | Yes |
32+
| `medik8s.selfNodeRemediationConfigs[*].spec.apiServerTimeout` | Timeout for the connectivity check against each API server. When this timeout elapses, the operator starts remediation. | String | `5s` | Yes |
33+
| `medik8s.selfNodeRemediationConfigs[*].spec.hostPort` | Host port used for internal communication between Self Node Remediation (SNR) agents. Do not change. | Integer | `30001` | Yes |
34+
| `medik8s.selfNodeRemediationConfigs[*].spec.isSoftwareRebootEnabled` | Whether to enable software reboot of unhealthy nodes. | Boolean | `true` | Yes |
35+
| `medik8s.selfNodeRemediationConfigs[*].spec.maxApiErrorThreshold` | When this threshold is reached, the node starts contacting its peers. | Integer | `3` | Yes |
36+
| `medik8s.selfNodeRemediationConfigs[*].spec.peerApiServerTimeout` | Timeout for the peer to connect to the API server. | String | `5s` | Yes |
37+
| `medik8s.selfNodeRemediationConfigs[*].spec.peerDialTimeout` | Timeout for establishing a connection with the peer. | String | `5s` | Yes |
38+
| `medik8s.selfNodeRemediationConfigs[*].spec.peerRequestTimeout` | Timeout for getting a response from the peer. | String | `5s` | Yes |
39+
| `medik8s.selfNodeRemediationConfigs[*].spec.peerUpdateInterval` | Frequency of peer information updates, such as IP address. | String | `15m` | Yes |
40+
| `medik8s.selfNodeRemediationConfigs[*].spec.watchdogFilePath` | File path of the watchdog device on the nodes. If a watchdog device is unavailable, the SelfNodeRemediationConfig CR uses a software reboot. | String | `/dev/watchdog` | Yes |
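As a rough illustration of how the parameter paths in the table map onto the pack values, the fragment below is a minimal sketch, not a complete values file. It assumes the keys sit under `charts.medik8s` (inferred from the `charts:` context of the `values.yaml` diff), uses a hypothetical private registry host, and omits any other keys a `nodeHealthChecks` entry may need. Merge the relevant keys into the pack's existing values rather than pasting the fragment as-is.

```yaml
# Sketch only: nesting under charts.medik8s is an assumption inferred from
# the parameter names in the table and the values.yaml diff context.
charts:
  medik8s:
    olm:
      # OLM is already installed in the cluster by other means.
      install: false
      catalog:
        # Hypothetical private mirror of the OperatorHub catalog image.
        image: registry.example.com/operatorhubio/catalog:latest
    nodeHealthChecks:
      # Other keys of this entry are omitted here; keep the ones already
      # present in the pack values.
      - spec:
          # Act only while at least 51% of nodes are healthy (pack default).
          minHealthy: 51%
          unhealthyConditions:
            - type: Ready
              status: Unknown
              # Illustrative value: wait 10 minutes instead of the default 300s.
              duration: 600s
```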
41+
42+
43+
Review the node failure detection information at the [Medik8s website](https://www.medik8s.io/failure_detection/) for more details on healthcheck behavior.
44+
45+
## Upgrade
46+
47+
This pack deploys an operator, which takes care of phased upgrades.
48+
49+
50+
## Usage
51+
52+
To use the Medik8s pack, first create a new [add-on cluster profile](https://docs.spectrocloud.com/profiles/cluster-profiles/create-cluster-profiles/create-addon-profile/), add a pack, and search for the **Medik8s** pack in the Palette Community Registry. Then either accept the defaults or modify them as needed.
53+
54+
In its default configuration, the Node Healthcheck Controller will detect node failures based on Kubernetes `unhealthyConditions` (if they endure for longer than the maximum `duration`) and timeouts set in the `selfNodeRemediationConfig`.
55+
56+
Once a failure has been detected, remediation will start. You can fine-tune the validation and remediation behavior by adjusting the `selfNodeRemediationConfig` section in the pack. Review the information about the [Self Node Remediation Configuration options](https://www.medik8s.io/remediation/self-node-remediation/configuration/) on the Medik8s website.
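As one possible tuning, the sketch below loosens the API checks so that a node tolerates a slower API server before it starts consulting its peers. The specific values (`10s`, `5`) are illustrative assumptions, not recommendations, and the `charts.medik8s` nesting is inferred from the pack values. The full `spec` is repeated because the chart template renders `spec` verbatim from the values.

```yaml
# Sketch only: illustrative values; verify against the Self Node Remediation
# configuration documentation before using them.
charts:
  medik8s:
    selfNodeRemediationConfigs:
      - name: self-node-remediation-config
        namespace: operators
        spec:
          apiCheckInterval: 15s
          # Raised from the default 5s to tolerate a slower API server.
          apiServerTimeout: 10s
          # Do not change: used for communication between SNR agents.
          hostPort: 30001
          isSoftwareRebootEnabled: true
          # Raised from the default 3, so more failed checks are tolerated
          # before the node starts contacting its peers.
          maxApiErrorThreshold: 5
          peerApiServerTimeout: 5s
          peerDialTimeout: 5s
          peerRequestTimeout: 5s
          peerUpdateInterval: 15m
          watchdogFilePath: /dev/watchdog
```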
57+
58+
Once you have configured the pack, you can deploy it to a cluster.
59+
60+
61+
## References
62+
63+
- [Medik8s website](https://www.medik8s.io/failure_detection/)
64+
- [Self Node Remediation Configuration options](https://www.medik8s.io/remediation/self-node-remediation/configuration/)
219 Bytes
Binary file not shown.
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
{{- range $snr := .Values.selfNodeRemediationConfigs }}
2+
---
3+
apiVersion: self-node-remediation.medik8s.io/v1alpha1
4+
kind: SelfNodeRemediationConfig
5+
metadata:
6+
  name: {{ $snr.name }}
7+
  namespace: {{ $snr.namespace }}
8+
spec: {{ toYaml $snr.spec | nindent 2 }}
9+
{{ end }}
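
For reference, with the default `selfNodeRemediationConfigs` entry from the pack values, the template above should render roughly the following manifest. This is a hand-derived sketch rather than captured `helm template` output. Key order follows `toYaml`, which emits map keys alphabetically.

```yaml
---
apiVersion: self-node-remediation.medik8s.io/v1alpha1
kind: SelfNodeRemediationConfig
metadata:
  name: self-node-remediation-config
  namespace: operators
spec:
  # Everything under spec is copied verbatim from the values entry.
  apiCheckInterval: 15s
  apiServerTimeout: 5s
  hostPort: 30001
  isSoftwareRebootEnabled: true
  maxApiErrorThreshold: 3
  peerApiServerTimeout: 5s
  peerDialTimeout: 5s
  peerRequestTimeout: 5s
  peerUpdateInterval: 15m
  watchdogFilePath: /dev/watchdog
```

Because the template ranges over a list, each additional entry in `selfNodeRemediationConfigs` would produce one more `SelfNodeRemediationConfig` document separated by `---`.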

packs/medik8s-1.0.0/charts/medik8s/values_lint.yaml

Lines changed: 16 additions & 1 deletion
@@ -64,4 +64,19 @@ nodeHealthChecks:
6464
duration: 300s
6565
- type: Ready
6666
status: Unknown
67-
duration: 300s
67+
duration: 300s
68+
69+
selfNodeRemediationConfigs:
70+
  - name: self-node-remediation-config
71+
    namespace: operators
72+
    spec:
73+
      apiCheckInterval: 15s
74+
      apiServerTimeout: 5s
75+
      hostPort: 30001
76+
      isSoftwareRebootEnabled: true
77+
      maxApiErrorThreshold: 3
78+
      peerApiServerTimeout: 5s
79+
      peerDialTimeout: 5s
80+
      peerRequestTimeout: 5s
81+
      peerUpdateInterval: 15m
82+
      watchdogFilePath: /dev/watchdog

packs/medik8s-1.0.0/pack.json

Lines changed: 4 additions & 1 deletion
@@ -1,6 +1,9 @@
11
{
22
"addonType":"monitoring",
3-
"annotations": {},
3+
"annotations": {
4+
"source": "community",
5+
"contributor": "spectrocloud"
6+
},
47
"cloudTypes": [
58
"all"
69
],

packs/medik8s-1.0.0/values.yaml

Lines changed: 16 additions & 1 deletion
@@ -68,4 +68,19 @@ charts:
6868
duration: 300s
6969
- type: Ready
7070
status: Unknown
71-
duration: 300s
71+
duration: 300s
72+
73+
selfNodeRemediationConfigs:
74+
- name: self-node-remediation-config
75+
namespace: operators
76+
spec:
77+
apiCheckInterval: 15s
78+
apiServerTimeout: 5s
79+
hostPort: 30001
80+
isSoftwareRebootEnabled: true
81+
maxApiErrorThreshold: 3
82+
peerApiServerTimeout: 5s
83+
peerDialTimeout: 5s
84+
peerRequestTimeout: 5s
85+
peerUpdateInterval: 15m
86+
watchdogFilePath: /dev/watchdog
