[OSDOCS-12351]: Dividing etcd backup assembly #93389

Open
wants to merge 1 commit into
base: main
9 changes: 8 additions & 1 deletion _topic_maps/_topic_map.yml
Original file line number Diff line number Diff line change
@@ -2520,7 +2520,14 @@ Topics:
 - Name: Performance considerations for etcd
   File: etcd-performance
 - Name: Backing up and restoring etcd data
-  File: etcd-backup
+  Dir: etcd-backup-restore
+  Topics:
+  - Name: Backing up etcd
+    File: etcd-backup
+  - Name: Replacing an unhealthy etcd member
+    File: replace-unhealthy-etcd-member
+  - Name: Disaster recovery
+    File: etcd-disaster-recovery
 - Name: Encrypting etcd data
   File: etcd-encrypt
 - Name: Setting up fault-tolerant control planes that span data centers
@@ -1,3 +1,7 @@
//NOTE TO CONTRIBUTORS:
//
//If you update any of the content in this assembly file, be sure to also make the same changes in the following file: etcd/etcd-backup-restore/etcd-backup.adoc.

:_mod-docs-content-type: ASSEMBLY
[id="backup-etcd"]
= Backing up etcd
@@ -1,3 +1,7 @@
//NOTE TO CONTRIBUTORS:
//
//If you update any of the content in this assembly file, be sure to also make the same changes in the following file: etcd/etcd-backup-restore/etcd-disaster-recovery.adoc.

:_mod-docs-content-type: ASSEMBLY
[id="about-dr"]
= About disaster recovery
@@ -1,3 +1,7 @@
//NOTE TO CONTRIBUTORS:
//
//If you update any of the content in this assembly file, be sure to also make the same changes in the following file: etcd/etcd-backup-restore/etcd-disaster-recovery.adoc.

:_mod-docs-content-type: ASSEMBLY
[id="dr-quorum-restoration"]
= Quorum restoration
@@ -1,3 +1,7 @@
//NOTE TO CONTRIBUTORS:
//
//If you update any of the content in this assembly file, be sure to also make the same changes in the following file: etcd/etcd-backup-restore/etcd-disaster-recovery.adoc.

:_mod-docs-content-type: ASSEMBLY
[id="dr-restoring-cluster-state"]
= Restoring to a previous cluster state
@@ -1,3 +1,7 @@
//NOTE TO CONTRIBUTORS:
//
//If you update any of the content in this assembly file, be sure to also make the same changes in the following file: etcd/etcd-backup-restore/etcd-disaster-recovery.adoc.

:_mod-docs-content-type: ASSEMBLY
[id="dr-recovering-expired-certs"]
= Recovering from expired control plane certificates
@@ -1,3 +1,7 @@
//NOTE TO CONTRIBUTORS:
//
//If you update any of the content in this assembly file, be sure to also make the same changes in the following file: etcd/etcd-backup-restore/replace-unhealthy-etcd-member.adoc.

:_mod-docs-content-type: ASSEMBLY
[id="replacing-unhealthy-etcd-member"]
= Replacing an unhealthy etcd member
1 change: 1 addition & 0 deletions etcd/etcd-backup-restore/_attributes
36 changes: 36 additions & 0 deletions etcd/etcd-backup-restore/etcd-backup.adoc
@@ -0,0 +1,36 @@
//NOTE TO CONTRIBUTORS:
//
//If you update any of the content in this assembly file, be sure to also make the same changes in the assemblies in the following file: backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.adoc.

:_mod-docs-content-type: ASSEMBLY
[id="etcd-backup"]
include::_attributes/common-attributes.adoc[]
= Backing up and restoring etcd data
:context: etcd-backup

toc::[]

As the key-value store for {product-title}, etcd persists the state of all resource objects.

Back up the etcd data for your cluster regularly and store it in a secure location, ideally outside the {product-title} environment. Do not take an etcd backup before the first certificate rotation completes, which occurs 24 hours after installation; otherwise, the backup will contain expired certificates. Take etcd backups during non-peak usage hours, because taking an etcd snapshot has a high I/O cost.

Be sure to take an etcd backup before you update your cluster. Taking a backup before you update is important because when you restore your cluster, you must use an etcd backup that was taken from the same z-stream release. For example, an {product-title} 4.17.5 cluster must use an etcd backup that was taken from 4.17.5.
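
The two timing rules above (wait for the first certificate rotation, and restore only from a backup taken on the same z-stream release) can be sketched as simple checks. This is a minimal illustration, not part of the product: the function names are hypothetical, the 24-hour delay is taken from the text above, and version strings are assumed to be plain numeric `x.y.z` values.

```python
from datetime import datetime, timedelta

# Assumption: the first certificate rotation completes 24 hours after
# installation, as stated in the surrounding text.
FIRST_ROTATION_DELAY = timedelta(hours=24)


def backup_window_open(installed_at: datetime, now: datetime) -> bool:
    """Return True once the first certificate rotation should have completed."""
    return now - installed_at >= FIRST_ROTATION_DELAY


def parse_release(version: str) -> tuple:
    """Parse a plain numeric 'x.y.z' string into a tuple of three integers."""
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected an x.y.z version, got {version!r}")
    return tuple(int(part) for part in parts)


def same_z_stream(cluster_version: str, backup_version: str) -> bool:
    """Return True only if the backup was taken from the same x.y.z release."""
    return parse_release(cluster_version) == parse_release(backup_version)
```

For example, `same_z_stream("4.17.5", "4.17.5")` holds, while a backup taken from 4.17.4 would not qualify for restoring a 4.17.5 cluster.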

[IMPORTANT]
====
Back up your cluster's etcd data by performing a single invocation of the backup script on a control plane host. Do not take a backup for each control plane host.
====

After you have an etcd backup, you can xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[restore to a previous cluster state].

// Backing up etcd data
include::modules/backup-etcd.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources
* xref:../../hosted_control_planes/hcp_high_availability/hcp-recovering-etcd-cluster.adoc#hcp-recovering-etcd-cluster[Recovering an unhealthy etcd cluster]

// Creating automated etcd backups
include::modules/etcd-creating-automated-backups.adoc[leveloffset=+1]
include::modules/creating-single-etcd-backup.adoc[leveloffset=+2]
include::modules/creating-recurring-etcd-backups.adoc[leveloffset=+2]
81 changes: 81 additions & 0 deletions etcd/etcd-backup-restore/etcd-disaster-recovery.adoc
@@ -0,0 +1,81 @@
//NOTE TO CONTRIBUTORS:
//
//If you update any of the content in this assembly file, be sure to also make the same changes in the assemblies in the following directory: backup_and_restore/control_plane_backup_and_restore/disaster_recovery/.

:_mod-docs-content-type: ASSEMBLY
[id="etcd-disaster-recovery"]
include::_attributes/common-attributes.adoc[]
= Disaster recovery
:context: etcd-disaster-recovery

toc::[]

The disaster recovery documentation provides information for administrators on how to recover from several disaster situations that might occur with their {product-title} cluster. As an administrator, you might need to follow one or more of the following procedures to return your cluster to a working state.

[IMPORTANT]
====
Disaster recovery requires you to have at least one healthy control plane host.
====

[id="etcd-dr-quorum"]
== Quorum restoration

You can use the `quorum-restore.sh` script to restore etcd quorum on clusters that are offline due to quorum loss. When quorum is lost, the {product-title} API becomes read-only. After quorum is restored, the {product-title} API returns to read/write mode.

// Restoring etcd quorum for high availability clusters
include::modules/dr-restoring-etcd-quorum-ha.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources
* xref:../../installing/installing_bare_metal/upi/installing-bare-metal.adoc#installing-bare-metal[Installing a user-provisioned cluster on bare metal]
* xref:../../installing/installing_bare_metal/bare-metal-expanding-the-cluster.adoc#replacing-a-bare-metal-control-plane-node_bare-metal-expanding[Replacing a bare-metal control plane node]

[NOTE]
====
If you have a majority of your control plane nodes still available and have an etcd quorum, xref:../../backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.adoc#replacing-unhealthy-etcd-member[replace a single unhealthy etcd member].
====
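
As a rough illustration of the distinction drawn in this section, the following sketch computes the majority that an etcd cluster needs for quorum and picks a recovery path. It assumes the standard majority rule for etcd, `floor(n/2) + 1`; the function names and the returned labels are hypothetical and exist only for this example.

```python
def quorum_size(total_members: int) -> int:
    """Smallest number of members that forms a majority."""
    return total_members // 2 + 1


def has_quorum(healthy_members: int, total_members: int) -> bool:
    """Return True if the healthy members still form a majority."""
    return healthy_members >= quorum_size(total_members)


def recovery_path(healthy_members: int, total_members: int) -> str:
    """Pick a recovery path; the returned labels are illustrative only."""
    # Disaster recovery requires at least one healthy control plane host.
    if not 1 <= healthy_members <= total_members:
        raise ValueError("need between 1 and total_members healthy hosts")
    if healthy_members == total_members:
        return "no-action"
    if has_quorum(healthy_members, total_members):
        return "replace-unhealthy-etcd-member"
    return "quorum-restoration"
```

With three control plane nodes, losing one member leaves a majority of two, so replacing the unhealthy member applies. Losing two leaves a single member without quorum, which is the case that quorum restoration addresses.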

[id="etcd-dr-restore"]
== Restoring to a previous cluster state

To restore the cluster to a previous state, you must have previously backed up the `etcd` data by creating a snapshot. You will use this snapshot to restore the cluster state. For more information, see "Backing up etcd data".

If applicable, you might also need to xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-3-expired-certs.adoc#dr-recovering-expired-certs[recover from expired control plane certificates].

[WARNING]
====
Restoring to a previous cluster state is a destructive and destabilizing action to take on a running cluster. Use this procedure only as a last resort.

Before performing a restore, see "About restoring to a previous cluster state" for more information on the impact to the cluster.
====

// About restoring to a previous cluster state
include::modules/dr-restoring-cluster-state-about.adoc[leveloffset=+2]

// Restoring to a previous cluster state for a single node
include::modules/dr-restoring-cluster-state-sno.adoc[leveloffset=+2]

// Restoring to a previous cluster state
include::modules/dr-restoring-cluster-state.adoc[leveloffset=+2]

// Restoring a cluster from etcd backup manually
include::modules/manually-restoring-cluster-etcd-backup.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources
* xref:../../backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.adoc#backing-up-etcd-data_backup-etcd[Backing up etcd data]
* xref:../../installing/installing_bare_metal/upi/installing-bare-metal.adoc#installing-bare-metal[Installing a user-provisioned cluster on bare metal]
* xref:../../networking/accessing-hosts.adoc#accessing-hosts-on-aws_accessing-hosts[Accessing hosts on Amazon Web Services in an installer-provisioned infrastructure cluster]
* xref:../../installing/installing_bare_metal/bare-metal-expanding-the-cluster.adoc#replacing-a-bare-metal-control-plane-node_bare-metal-expanding[Replacing a bare-metal control plane node]

include::modules/dr-scenario-cluster-state-issues.adoc[leveloffset=+2]

// Recovering from expired control plane certificates
include::modules/dr-recover-expired-control-plane-certs.adoc[leveloffset=+1]

//Testing restore procedures
include::modules/dr-testing-restore-procedures.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources
* xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[Restoring to a previous cluster state]
1 change: 1 addition & 0 deletions etcd/etcd-backup-restore/images
1 change: 1 addition & 0 deletions etcd/etcd-backup-restore/modules
55 changes: 55 additions & 0 deletions etcd/etcd-backup-restore/replace-unhealthy-etcd-member.adoc
@@ -0,0 +1,55 @@
//NOTE TO CONTRIBUTORS:
//
//If you update any of the content in this assembly file, be sure to also make the same changes in the assemblies in the following file: backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.adoc.

:_mod-docs-content-type: ASSEMBLY
[id="replace-unhealthy-etcd-member"]
include::_attributes/common-attributes.adoc[]
= Replacing an unhealthy etcd member
:context: replace-unhealthy-etcd-member

toc::[]

The process to replace a single unhealthy etcd member depends on whether the etcd member is unhealthy because the machine is not running or the node is not ready, or because the etcd pod is crashlooping.

[NOTE]
====
If you have lost the majority of your control plane hosts, follow the disaster recovery procedure to xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[restore to a previous cluster state] instead of this procedure.

If the control plane certificates are not valid on the member being replaced, then you must follow the procedure to xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-3-expired-certs.adoc#dr-recovering-expired-certs[recover from expired control plane certificates] instead of this procedure.

If a control plane node is lost and a new one is created, the etcd cluster Operator handles generating the new TLS certificates and adding the node as an etcd member.
====

// Identifying an unhealthy etcd member
include::modules/restore-identify-unhealthy-etcd-member.adoc[leveloffset=+1]

// Determining the state of the unhealthy etcd member
include::modules/restore-determine-state-etcd-member.adoc[leveloffset=+1]

== Replacing the unhealthy etcd member

Depending on the state of your unhealthy etcd member, use one of the following procedures:

* Replacing an unhealthy etcd member whose machine is not running or whose node is not ready
* Installing a primary control plane node on an unhealthy cluster
* Replacing an unhealthy etcd member whose etcd pod is crashlooping
* Replacing an unhealthy stopped baremetal etcd member
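
The choice among these procedures can be sketched as a small decision function. This is an illustration under stated assumptions: the `MemberState` fields and the check order are hypothetical, the returned strings are the module names that this assembly includes, and the Assisted Installer procedure is left out because the text does not say when it applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemberState:
    """Hypothetical summary of what the state-determination step reports."""
    machine_running: bool
    node_ready: bool
    etcd_pod_crashlooping: bool
    bare_metal: bool = False


def select_procedure(state: MemberState) -> str:
    """Map the state of an unhealthy member to a module name in this assembly."""
    if not state.machine_running or not state.node_ready:
        if state.bare_metal:
            return "restore-replace-stopped-baremetal-etcd-member"
        return "restore-replace-stopped-etcd-member"
    if state.etcd_pod_crashlooping:
        return "restore-replace-crashlooping-etcd-member"
    raise ValueError("member state does not match a procedure in this assembly")
```

Checking the machine and node state before the pod state is a design choice for this sketch: a pod cannot be meaningfully evaluated on a machine that is not running or a node that is not ready.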

// Replacing an unhealthy etcd member whose machine is not running or whose node is not ready
include::modules/restore-replace-stopped-etcd-member.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources
* xref:../../machine_management/control_plane_machine_management/cpmso-troubleshooting.adoc#cpmso-ts-etcd-degraded_cpmso-troubleshooting[Recovering a degraded etcd Operator]
* link:https://docs.redhat.com/en/documentation/assisted_installer_for_openshift_container_platform/2024/html/installing_openshift_container_platform_with_the_assisted_installer/expanding-the-cluster#installing-primary-control-plane-node-unhealthy-cluster_expanding-the-cluster[Installing a primary control plane node on an unhealthy cluster]

// Replacing an unhealthy etcd member whose etcd pod is crashlooping
include::modules/restore-replace-crashlooping-etcd-member.adoc[leveloffset=+2]

// Replacing an unhealthy baremetal stopped etcd member
include::modules/restore-replace-stopped-baremetal-etcd-member.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources
* xref:../../machine_management/deleting-machine.adoc#machine-lifecycle-hook-deletion-etcd_deleting-machine[Quorum protection with machine lifecycle hooks]
1 change: 1 addition & 0 deletions etcd/etcd-backup-restore/snippets