# Backup and Restore for etcd

## Backup

### Manually taking an etcd snapshot

SSH into one of the etcd nodes and run the following commands:

```bash
set -a; source /var/run/coreos/etcdadm-environment; set +a
/opt/bin/etcdadm save
```

The command takes an etcd snapshot by running an appropriate `etcdctl snapshot save` command.
The snapshot is then exported to the S3 URI: `s3://<your-bucket-name>/.../<your-cluster-name>/exported/etcd-snapshots/snapshot.db`.
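
If you want to sanity-check a snapshot before relying on it, `etcdctl` can report its hash, revision, and size. This is a minimal sketch, assuming an etcd v3 `etcdctl` binary is available and that you have a local copy of the snapshot (the path below is a placeholder):

```bash
# Inspect a snapshot file; prints its hash, revision, total keys, and size.
# /path/to/snapshot.db is a placeholder for wherever you copied the snapshot.
ETCDCTL_API=3 etcdctl snapshot status /path/to/snapshot.db --write-out=table
```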

### Automatically taking an etcd snapshot

A feature to periodically take a snapshot of an etcd cluster can be enabled by specifying:
```yaml
etcd:
  snapshot:
    automated: true
```
in `cluster.yaml`.

When enabled, the command `etcdadm save` is called periodically (every 1 minute by default) via a systemd timer.
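
To confirm the timer is active on a node, you can list the systemd timers. The exact unit name is provisioning-specific and not spelled out in this document, so the `grep` pattern below is an assumption:

```bash
# Show scheduled timers; filtering on "etcdadm" is an assumed naming convention.
systemctl list-timers --all | grep -i etcdadm
```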

## Restore

Please be aware that you must have taken an etcd snapshot beforehand in order to restore your cluster.
An etcd snapshot can be taken manually or automatically according to the steps described above.

### Manually restoring a permanently failed etcd node from an etcd snapshot

Restoring a single node directly from an etcd snapshot is not possible.
However, you can recover a permanently failed etcd node, without losing data, by "resetting" the node.
More concretely, you can run the following commands to remove the etcd member from the cluster, wipe its etcd data, and then re-add the member to the cluster:

```bash
sudo systemctl stop etcd-member.service

set -a; source /var/run/coreos/etcdadm-environment; set +a
/opt/bin/etcdadm replace

sudo systemctl start etcd-member.service
```

The reset member eventually catches up with the data held by the rest of the etcd cluster, so the recovery completes without data loss.
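
To verify that the replaced member has rejoined and is healthy, you can query the cluster with `etcdctl`. This is a sketch, assuming the etcdadm environment file exports the endpoint and TLS settings `etcdctl` needs and that an etcd v3 `etcdctl` is on the `PATH`:

```bash
# List the cluster members and check endpoint health after the restart.
set -a; source /var/run/coreos/etcdadm-environment; set +a
etcdctl member list
etcdctl endpoint health
```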

For more details, see [the relevant upstream issue](https://github.com/kubernetes/kubernetes/issues/40027#issuecomment-283501556).

### Manually restoring a cluster from an etcd snapshot

SSH into every etcd node and stop the etcd-member service:
|
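The loops below assume `$hosts` holds the addresses of all your etcd nodes; how you populate it is up to you. For example (the addresses are placeholders):

```bash
# Placeholder addresses; replace with your etcd nodes' hostnames or IPs.
hosts="10.0.0.10 10.0.0.11 10.0.0.12"
```
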
```bash
for h in $hosts; do
  ssh -i path/to/your/key core@$h sudo systemctl stop etcd-member.service
done
```

and then start the etcd-member service on each node:

```bash
for h in $hosts; do
  ssh -i path/to/your/key core@$h sudo systemctl start etcd-member.service
done
```

Doing this triggers the automated disaster recovery process across the etcd nodes by running `etcdadm-reconfigure.service`,
and your cluster will eventually be restored from the snapshot stored at `s3://<your-bucket-name>/.../<your-cluster-name>/exported/etcd-snapshots/snapshot.db`.
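
To follow the recovery as it happens, you can tail the logs of the reconfigure unit mentioned above on any etcd node:

```bash
# Follow disaster recovery progress via the unit named in this document.
journalctl -fu etcdadm-reconfigure.service
```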

### Automatic recovery

A feature to automatically restore a permanently failed etcd member or a cluster can be enabled by specifying:

```yaml
etcd:
  disasterRecovery:
    automated: true
```
in `cluster.yaml`.

When enabled:
- The command `etcdadm check` is called periodically by a systemd timer (it can also be run manually; see the sketch below)
  - The etcd cluster and each etcd node (member) are checked by running the `etcdctl endpoint health` command
- When up to `1/N` of the etcd nodes fail successive health checks, each failed node is removed as an etcd member and then added again as a new member
  - The new member eventually catches up with the data from the etcd cluster
- When more than `1/N` of the etcd nodes fail successive health checks, a disaster recovery process is executed to recover all the etcd nodes from the latest etcd snapshot
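
To run the same health check by hand on an etcd node, the pattern used for the other `etcdadm` subcommands above should apply; this is a sketch assuming the same environment file is required:

```bash
set -a; source /var/run/coreos/etcdadm-environment; set +a
/opt/bin/etcdadm check
```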