Commit 96de1a8

[OSDOCS-12351]: Dividing etcd backup assembly
1 parent 7bb886f commit 96de1a8

10 files changed: +137 −153 lines

_topic_maps/_topic_map.yml

+8 −1

@@ -2520,7 +2520,14 @@ Topics:
 - Name: Performance considerations for etcd
   File: etcd-performance
 - Name: Backing up and restoring etcd data
-  File: etcd-backup
+  Dir: etcd-backup-restore
+  Topics:
+  - Name: Backing up etcd
+    File: etcd-backup
+  - Name: Replacing an unhealthy etcd member
+    File: replace-unhealthy-etcd-member
+  - Name: Disaster recovery
+    File: etcd-disaster-recovery
 - Name: Encrypting etcd data
   File: etcd-encrypt
 - Name: Setting up fault-tolerant control planes that span data centers

etcd/etcd-backup-restore/_attributes

+1
@@ -0,0 +1 @@
../_attributes/

etcd/etcd-backup-restore/etcd-backup.adoc

+36 (new file)

@@ -0,0 +1,36 @@
:_mod-docs-content-type: ASSEMBLY
[id="etcd-backup"]
include::_attributes/common-attributes.adoc[]
= Backing up and restoring etcd data
:context: etcd-backup

toc::[]

As the key-value store for {product-title}, etcd persists the state of all resource objects.

Back up the etcd data for your cluster regularly and store it in a secure location, ideally outside the {product-title} environment. Do not take an etcd backup before the first certificate rotation completes, which occurs 24 hours after installation; otherwise, the backup will contain expired certificates. Also take etcd backups during non-peak usage hours, because the etcd snapshot has a high I/O cost.

Be sure to take an etcd backup before you update your cluster. This is important because, when you restore your cluster, you must use an etcd backup that was taken from the same z-stream release. For example, an {product-title} 4.17.5 cluster must use an etcd backup that was taken from 4.17.5.

[IMPORTANT]
====
Back up your cluster's etcd data by performing a single invocation of the backup script on a control plane host. Do not take a backup for each control plane host.
====

After you have an etcd backup, you can xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[restore to a previous cluster state].
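
For orientation, the backup module included below runs the backup script once, from a debug shell on a single control plane node. A minimal sketch of that flow, assuming cluster-admin access and a placeholder node name:

[source,terminal]
----
$ oc debug node/<control_plane_node> <1>
sh-4.4# chroot /host
sh-4.4# /usr/local/bin/cluster-backup.sh /home/core/assets/backup <2>
----
<1> `<control_plane_node>` is a placeholder for any healthy control plane node.
<2> The script writes a `snapshot_<timestamp>.db` file and a `static_kuberesources_<timestamp>.tar.gz` file to the target directory.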

// Backing up etcd data
include::modules/backup-etcd.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources
* xref:../../hosted_control_planes/hcp_high_availability/hcp-recovering-etcd-cluster.adoc#hcp-recovering-etcd-cluster[Recovering an unhealthy etcd cluster for {hcp}]

// Creating automated etcd backups
include::modules/etcd-creating-automated-backups.adoc[leveloffset=+1]

// Creating a single etcd backup
include::modules/creating-single-etcd-backup.adoc[leveloffset=+2]

// Creating recurring etcd backups
include::modules/creating-recurring-etcd-backups.adoc[leveloffset=+2]
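
The single-backup and recurring-backup modules above cover backups requested through a custom resource rather than the backup script. A hedged sketch of a single-backup request, assuming the `operator.openshift.io/v1alpha1` `EtcdBackup` API and an existing PVC to receive the snapshot (`etcd-backup-pvc` is a placeholder):

[source,yaml]
----
apiVersion: operator.openshift.io/v1alpha1
kind: EtcdBackup
metadata:
  name: etcd-single-backup
  namespace: openshift-etcd
spec:
  pvcName: etcd-backup-pvc # placeholder: PVC that stores the resulting snapshot
----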

etcd/etcd-backup-restore/etcd-disaster-recovery.adoc

+41 (new file)

@@ -0,0 +1,41 @@
:_mod-docs-content-type: ASSEMBLY
[id="etcd-disaster-recovery"]
include::_attributes/common-attributes.adoc[]
= Disaster recovery
:context: etcd-disaster-recovery

toc::[]

The disaster recovery documentation provides information for administrators on how to recover from several disaster situations that might occur with their {product-title} cluster. As an administrator, you might need to follow one or more of the following procedures to return your cluster to a working state.

[IMPORTANT]
====
Disaster recovery requires you to have at least one healthy control plane host.
====

xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/quorum-restoration.adoc#dr-quorum-restoration[Quorum restoration]:: This solution handles situations where you have lost the majority of your control plane hosts, leading to etcd quorum loss and the cluster going offline. This solution does not require an etcd backup.
+
[NOTE]
====
If you still have a majority of your control plane nodes available and an etcd quorum, xref:../../etcd/etcd-backup-restore/replace-unhealthy-etcd-member.adoc#replace-unhealthy-etcd-member[replace a single unhealthy etcd member] instead.
====

xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[Restoring to a previous cluster state]:: This solution handles situations where you want to restore your cluster to a previous state, for example, if an administrator deletes something critical. If you have taken an etcd backup, you can restore your cluster to a previous state.
+
If applicable, you might also need to xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-3-expired-certs.adoc#dr-recovering-expired-certs[recover from expired control plane certificates].
+
[WARNING]
====
Restoring to a previous cluster state is a destructive and destabilizing action to take on a running cluster. Use this procedure only as a last resort.

Before performing a restore, see xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-scenario-2-restoring-cluster-state-about_dr-restoring-cluster-state[About restoring to a previous cluster state] for more information about the impact to the cluster.
====

xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-3-expired-certs.adoc#dr-recovering-expired-certs[Recovering from expired control plane certificates]:: This solution handles situations where your control plane certificates have expired. For example, if you shut down your cluster before the first certificate rotation, which occurs 24 hours after installation, your certificates will not be rotated and will expire. You can follow this procedure to recover from expired control plane certificates.
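
Before choosing among these procedures, it helps to establish how many control plane nodes and etcd members are actually healthy. A quick sketch using standard commands, assuming the default control plane node label and etcd pod label:

[source,terminal]
----
$ oc get nodes -l node-role.kubernetes.io/master <1>
$ oc get pods -n openshift-etcd -l app=etcd <2>
----
<1> Lists the control plane nodes and their readiness.
<2> Lists the etcd pods; crashlooping or missing pods indicate unhealthy members.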

// Testing restore procedures
include::modules/dr-testing-restore-procedures.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources
* xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc[Restoring to a previous cluster state]

etcd/etcd-backup-restore/images

+1
@@ -0,0 +1 @@
../images/

etcd/etcd-backup-restore/modules

+1
@@ -0,0 +1 @@
../modules/

etcd/etcd-backup-restore/replace-unhealthy-etcd-member.adoc

+46 (new file)

@@ -0,0 +1,46 @@
:_mod-docs-content-type: ASSEMBLY
[id="replace-unhealthy-etcd-member"]
include::_attributes/common-attributes.adoc[]
= Replacing an unhealthy etcd member
:context: replace-unhealthy-etcd-member

toc::[]

The process to replace a single unhealthy etcd member depends on why the member is unhealthy: the machine is not running, the node is not ready, or the etcd pod is crashlooping.

[NOTE]
====
If you have lost the majority of your control plane hosts, follow the disaster recovery procedure to xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[restore to a previous cluster state] instead of this procedure.

If the control plane certificates are not valid on the member being replaced, you must follow the procedure to xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-3-expired-certs.adoc#dr-recovering-expired-certs[recover from expired control plane certificates] instead of this procedure.

If a control plane node is lost and a new one is created, the etcd cluster Operator handles generating the new TLS certificates and adding the node as an etcd member.
====
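
As a quick orientation for the identification module that follows, the `EtcdMembersAvailable` status condition summarizes member health. A sketch, assuming cluster-admin access:

[source,terminal]
----
$ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="EtcdMembersAvailable")]}{.message}{"\n"}{end}'
----

A message such as `2 of 3 members are available, <node_name> is unhealthy` identifies the member to replace.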

// Identifying an unhealthy etcd member
include::modules/restore-identify-unhealthy-etcd-member.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources
* xref:../../etcd/etcd-backup-restore/etcd-backup.adoc#etcd-backup[Backing up etcd data]

// Determining the state of the unhealthy etcd member
include::modules/restore-determine-state-etcd-member.adoc[leveloffset=+1]

// Replacing an unhealthy etcd member whose machine is not running or whose node is not ready
include::modules/restore-replace-stopped-etcd-member.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources
* xref:../../machine_management/control_plane_machine_management/cpmso-troubleshooting.adoc#cpmso-ts-etcd-degraded_cpmso-troubleshooting[Recovering a degraded etcd Operator]
* link:https://docs.redhat.com/en/documentation/assisted_installer_for_openshift_container_platform/2023/html/assisted_installer_for_openshift_container_platform/expanding-the-cluster#installing-primary-control-plane-node-unhealthy-cluster_expanding-the-cluster[Installing a primary control plane node on an unhealthy cluster]

// Replacing an unhealthy etcd member whose etcd pod is crashlooping
include::modules/restore-replace-crashlooping-etcd-member.adoc[leveloffset=+2]

// Replacing an unhealthy stopped baremetal etcd member
include::modules/restore-replace-stopped-baremetal-etcd-member.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources
* xref:../../machine_management/deleting-machine.adoc#machine-lifecycle-hook-deletion-etcd_deleting-machine[Quorum protection with machine lifecycle hooks]

etcd/etcd-backup-restore/snippets

+1
@@ -0,0 +1 @@
../snippets

etcd/etcd-backup.adoc

-149
This file was deleted.

modules/dr-restoring-cluster-state-about.adoc

+2 −3

@@ -1,11 +1,10 @@
 // Module included in the following assemblies:
 //
-// * disaster_recovery/scenario-2-restoring-cluster-state.adoc
-// * etcd/etcd-backup.adoc
+// * backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc

 :_mod-docs-content-type: CONCEPT
 [id="dr-scenario-2-restoring-cluster-state-about_{context}"]
-= Restoring to a previous cluster state
+= About restoring to a previous cluster state

 To restore the cluster to a previous state, you must have previously backed up the `etcd` data by creating a snapshot. You will use this snapshot to restore the cluster state. For more information, see "Backing up etcd data".