| 1 | +:_mod-docs-content-type: ASSEMBLY |
| 2 | +[id="replace-unhealthy-etcd-member"] |
| 3 | +include::_attributes/common-attributes.adoc[] |
| 4 | += Replacing an unhealthy etcd member |
| 5 | +:context: replace-unhealthy-etcd-member |
| 6 | + |
| 7 | +toc::[] |
| 8 | + |
| 9 | +The process for replacing a single unhealthy etcd member depends on why the member is unhealthy: either the machine is not running or the node is not ready, or the etcd pod is crashlooping. |
| 10 | + |
| 11 | +[NOTE] |
| 12 | +==== |
| 13 | +If you have lost the majority of your control plane hosts, follow the disaster recovery procedure to xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[restore to a previous cluster state] instead of this procedure. |
| 14 | + |
| 15 | +If the control plane certificates are not valid on the member being replaced, then you must follow the procedure to xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-3-expired-certs.adoc#dr-recovering-expired-certs[recover from expired control plane certificates] instead of this procedure. |
| 16 | + |
| 17 | +If a control plane node is lost and a new one is created, the etcd cluster Operator handles generating the new TLS certificates and adding the node as an etcd member. |
| 18 | +==== |
| 19 | + |
| 20 | +// Identifying an unhealthy etcd member |
| 21 | +include::modules/restore-identify-unhealthy-etcd-member.adoc[leveloffset=+1] |
| 22 | + |
| 23 | +[role="_additional-resources"] |
| 24 | +.Additional resources |
| 25 | +* xref:../../etcd/etcd-backup-restore/etcd-backup.adoc#etcd-backup[Backing up etcd data] |
| 26 | + |
| 27 | +// Determining the state of the unhealthy etcd member |
| 28 | +include::modules/restore-determine-state-etcd-member.adoc[leveloffset=+1] |
| 29 | + |
| 30 | +// Replacing an unhealthy etcd member whose machine is not running or whose node is not ready |
| 31 | +include::modules/restore-replace-stopped-etcd-member.adoc[leveloffset=+2] |
| 32 | + |
| 33 | +[role="_additional-resources"] |
| 34 | +.Additional resources |
| 35 | +* xref:../../machine_management/control_plane_machine_management/cpmso-troubleshooting.adoc#cpmso-ts-etcd-degraded_cpmso-troubleshooting[Recovering a degraded etcd Operator] |
| 36 | +* link:https://docs.redhat.com/en/documentation/assisted_installer_for_openshift_container_platform/2023/html/assisted_installer_for_openshift_container_platform/expanding-the-cluster#installing-primary-control-plane-node-unhealthy-cluster_expanding-the-cluster[Installing a primary control plane node on an unhealthy cluster] |
| 37 | + |
| 38 | +// Replacing an unhealthy etcd member whose etcd pod is crashlooping |
| 39 | +include::modules/restore-replace-crashlooping-etcd-member.adoc[leveloffset=+2] |
| 40 | + |
| 41 | +// Replacing an unhealthy bare metal etcd member whose machine is not running or whose node is not ready |
| 42 | +include::modules/restore-replace-stopped-baremetal-etcd-member.adoc[leveloffset=+2] |
| 43 | + |
| 44 | +[role="_additional-resources"] |
| 45 | +.Additional resources |
| 46 | +* xref:../../machine_management/deleting-machine.adoc#machine-lifecycle-hook-deletion-etcd_deleting-machine[Quorum protection with machine lifecycle hooks] |