This repository was archived by the owner on Sep 30, 2020. It is now read-only.

Commit 901d7b2

Merge pull request #533 from mumoshu/etcd-backup-and-restore-doc
Add documentation for administrating etcd cluster
2 parents e6de3dc + 4bcfbb1

File tree

4 files changed: +92, -2 lines changed

- Documentation/kubernetes-on-aws-backup-and-restore-for-etcd.md
- README.md
- e2e/run
- etcdadm/README.md

Documentation/kubernetes-on-aws-backup-and-restore-for-etcd.md

Lines changed: 89 additions & 0 deletions

@@ -0,0 +1,89 @@
# Backup and Restore for etcd

## Backup

### Manually taking an etcd snapshot

SSH into one of the etcd nodes and run the following command:

```bash
set -a; source /var/run/coreos/etcdadm-environment; set +a
/opt/bin/etcdadm save
```
The command takes an etcd snapshot by running an appropriate `etcdctl snapshot save` command.
The snapshot is then exported to the S3 URI: `s3://<your-bucket-name>/.../<your-cluster-name>/exported/etcd-snapshots/snapshot.db`.
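For reference, a minimal sketch of the equivalent manual flow using raw `etcdctl` and the AWS CLI; the endpoint and certificate paths here are illustrative assumptions, not the exact values `etcdadm` reads from `etcdadm-environment`:

```bash
# Take a snapshot locally (endpoint and certificate paths are assumptions):
ETCDCTL_API=3 etcdctl snapshot save /tmp/snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/certs/etcd-trusted-ca.pem \
  --cert=/etc/ssl/certs/etcd-client.pem \
  --key=/etc/ssl/certs/etcd-client-key.pem

# Export the snapshot to the S3 URI used for restores:
aws s3 cp /tmp/snapshot.db \
  "s3://<your-bucket-name>/.../<your-cluster-name>/exported/etcd-snapshots/snapshot.db"
```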
### Automatically taking an etcd snapshot

A feature to periodically take a snapshot of an etcd cluster can be enabled by specifying:

```yaml
etcd:
  snapshot:
    automated: true
```

in `cluster.yaml`.
When enabled, the command `etcdadm save` is called periodically (every 1 minute by default) via a systemd timer.
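To confirm the timer is active, you can inspect systemd directly on an etcd node; the unit names below are assumptions for illustration, not confirmed names from this project:

```bash
# Look for the snapshot timer among active timers (unit name is an assumption):
systemctl list-timers | grep -i etcdadm

# Inspect recent runs of the save command (unit name is an assumption):
journalctl -u etcdadm-save.service --since "10 minutes ago"
```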
## Restore

Please be aware that you must have taken an etcd snapshot beforehand to restore your cluster.
An etcd snapshot can be taken manually or automatically according to the steps described above.
### Manually restoring a permanently failed etcd node from etcd snapshot

It is impossible!
However, you can recover a permanently failed etcd node, without losing data, by "resetting" the node.
More concretely, you can run the following commands to remove the etcd member from the cluster, wipe the etcd data, and then re-add the member to the cluster:

```bash
sudo systemctl stop etcd-member.service

set -a; source /var/run/coreos/etcdadm-environment; set +a
/opt/bin/etcdadm replace

sudo systemctl start etcd-member.service
```
The reset member eventually catches up with the data from the etcd cluster, hence the recovery is done without losing data.

For more details, I'd suggest you read [the relevant upstream issue](https://github.com/kubernetes/kubernetes/issues/40027#issuecomment-283501556).
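Once the service is back up, you can verify that the replaced member has re-joined and become healthy; a minimal sketch, assuming connection flags (endpoints and certificates) are supplied as in the snapshot example above:

```bash
# The replaced node should be listed as a started member again:
ETCDCTL_API=3 etcdctl member list

# All endpoints should eventually report healthy once it has caught up:
ETCDCTL_API=3 etcdctl endpoint health
```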
### Manually restoring a cluster from etcd snapshot

SSH into every etcd node and stop the etcd3 process:

```bash
for h in $hosts; do
  ssh -i path/to/your/key core@$h sudo systemctl stop etcd-member.service
done
```

and then start the etcd3 process:

```bash
for h in $hosts; do
  ssh -i path/to/your/key core@$h sudo systemctl start etcd-member.service
done
```
Doing this triggers the automated disaster recovery process across etcd nodes by running `etcdadm-reconfigure.service`,
and your cluster will eventually be restored from the snapshot stored at `s3://<your-bucket-name>/.../<your-cluster-name>/exported/etcd-snapshots/snapshot.db`.
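To follow the recovery on a node, you can watch the logs of the reconfigure service mentioned above:

```bash
# Follow the disaster recovery progress on an etcd node:
journalctl -u etcdadm-reconfigure.service -f

# Afterwards, confirm the etcd member is running again:
systemctl status etcd-member.service
```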
### Automatic recovery

A feature to automatically restore a permanently failed etcd member or a cluster can be enabled by specifying:

```yaml
etcd:
  disasterRecovery:
    automated: true
```

in `cluster.yaml`.
When enabled:

- The command `etcdadm check` is called periodically by a systemd timer
- The etcd cluster and each etcd node (=member) is checked by running the `etcdctl endpoint health` command (see the sketch below)
- When `N/2` or fewer etcd nodes fail successive health checks, each failed node is removed as an etcd member and then added again as a new member
  - The new member eventually catches up with the data from the etcd cluster
- When more than `N/2` etcd nodes fail successive health checks, a disaster recovery process is executed to recover all the etcd nodes from the latest etcd snapshot
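For reference, the per-member health check looks roughly like the following; the endpoint is an illustrative placeholder, and the certificate flags are assumed to match the snapshot example above:

```bash
# Roughly the check run against each etcd member (endpoint is a placeholder;
# add --cacert/--cert/--key flags as in the snapshot example above):
ETCDCTL_API=3 etcdctl endpoint health --endpoints=https://<etcd-node>:2379
```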

README.md

Lines changed: 1 addition & 0 deletions

@@ -44,6 +44,7 @@ Check out our getting started tutorial on launching your first Kubernetes cluste
 * [Step 7: Destroy](/Documentation/kubernetes-on-aws-destroy.md)
   * Destroy the cluster
 * **Optional Features**
+  * [Backup and restore for etcd](/Documentation/kubernetes-on-aws-backup-and-restore-for-etcd.md)
   * [Backup Kubernetes resources](/Documentation/kubernetes-on-aws-backup-restore.md)

 ## Examples
e2e/run

Lines changed: 1 addition & 1 deletion

@@ -249,7 +249,7 @@ etcd:
   fi


-  if [ "${ETCD_DISASTER_RECOVERY_AUTOMATED}" != "" ]; then
+  if [ "${ETCD_SNAPSHOT_AUTOMATED}" != "" ]; then
     echo -e "  snapshot:
     automated: true" >> cluster.yaml
   fi

etcdadm/README.md

Lines changed: 1 addition & 1 deletion

@@ -36,7 +36,7 @@ save it in S3
 * `etcdadm reconfigure` reconfigures the etcd member on the same node as etcdadm so that it survives:
   * `N/2` or fewer permanently failed members, by automatically removing a permanently failed member and then re-adding it as a brand-new member with empty data according to ["Replace a failed etcd member on CoreOS Container Linux"](https://coreos.com/etcd/docs/latest/etcd-live-cluster-reconfiguration.html#replace-a-failed-etcd-member-on-coreos-container-linux)
   * `(N/2)+1` or more permanently failed members, by automatically initiating a new cluster, from a snapshot if it exists, according to ["etcd disaster recovery on CoreOS Container Linux"](https://coreos.com/etcd/docs/latest/etcd-live-cluster-reconfiguration.html#etcd-disaster-recovery-on-coreos-container-linux)
-* `etcdadm replace` is used to manually recover from an etcd memer from a permanent failure. It resets the etcd member running on the same node as etcdadm by:
+* `etcdadm replace` is used to manually recover an etcd member from a permanent failure. It resets the etcd member running on the same node as etcdadm by:
   1. clearing the contents of the etcd data dir
   2. removing and then re-adding the etcd member by running `etcdctl member remove` and then `etcdctl member add`
