
Commit 5d2ff60

Merge pull request #91132 from lahinson/ocpbugs-44113-cp-4.14
[enterprise-4.14][OCPBUGS-44113]: Revise etcd restore procedure
2 parents: d11a91d + 2e56c0f

1 file changed (+164, -8 lines)

Diff for: modules/restore-replace-stopped-etcd-member.adoc

@@ -28,14 +28,14 @@ You must wait if the other control plane nodes are powered off. The control plan
 +
 [IMPORTANT]
 ====
-It is important to take an etcd backup before performing this procedure so that your cluster can be restored if you encounter any issues.
+Before you perform this procedure, take an etcd backup so that you can restore your cluster if you experience any issues.
 ====

 .Procedure

 . Remove the unhealthy member.

-.. Choose a pod that is _not_ on the affected node:
+.. Choose a pod that is not on the affected node:
 +
 In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
 +
@@ -80,7 +80,7 @@ sh-4.2# etcdctl member list -w table
 +------------------+---------+------------------------------+---------------------------+---------------------------+
 ----
 +
-Take note of the ID and the name of the unhealthy etcd member, because these values are needed later in the procedure. The `$ etcdctl endpoint health` command will list the removed member until the procedure of replacement is finished and a new member is added.
+Take note of the ID and the name of the unhealthy etcd member because these values are needed later in the procedure. The `$ etcdctl endpoint health` command will list the removed member until the procedure of replacement is finished and a new member is added.

 .. Remove the unhealthy etcd member by providing the ID to the `etcdctl member remove` command:
 +
@@ -190,9 +190,16 @@ $ oc delete secret -n openshift-etcd etcd-serving-ip-10-0-131-183.ec2.internal
 $ oc delete secret -n openshift-etcd etcd-serving-metrics-ip-10-0-131-183.ec2.internal
 ----

-. Delete and re-create the control plane machine. After this machine is re-created, a new revision is forced and etcd scales up automatically.
+. Check whether a control plane machine set exists by entering the following command:
 +
-If you are running installer-provisioned infrastructure, or you used the Machine API to create your machines, follow these steps. Otherwise, you must create the new master by using the same method that was used to originally create it.
+[source,terminal]
+----
+$ oc -n openshift-machine-api get controlplanemachineset
+----
+
+* If the control plane machine set exists, delete and re-create the control plane machine. After this machine is re-created, a new revision is forced and etcd scales up automatically. For more information, see "Replacing an unhealthy etcd member whose machine is not running or whose node is not ready".
++
+If you are running installer-provisioned infrastructure, or you used the Machine API to create your machines, follow these steps. Otherwise, you must create the new control plane by using the same method that was used to originally create it.

 .. Obtain the machine for the unhealthy member.
 +
@@ -226,7 +233,7 @@ $ oc delete machine -n openshift-machine-api clustername-8qw5l-master-0 <1>
 +
 A new machine is automatically provisioned after deleting the machine of the unhealthy member.

-.. Verify that a new machine has been created:
+.. Verify that a new machine was created:
 +
 [source,terminal]
 ----
@@ -246,13 +253,162 @@ clustername-8qw5l-worker-us-east-1c-pkg26 Running m4.large us-east-1
 ----
 <1> The new machine, `clustername-8qw5l-master-3` is being created and is ready once the phase changes from `Provisioning` to `Running`.
 +
-It might take a few minutes for the new machine to be created. The etcd cluster Operator will automatically sync when the machine or node returns to a healthy state.
+It might take a few minutes for the new machine to be created. The etcd cluster Operator automatically syncs when the machine or node returns to a healthy state.
 +
 [NOTE]
 ====
 Verify the subnet IDs that you are using for your machine sets to ensure that they end up in the correct availability zone.
 ====

+* If the control plane machine set does not exist, delete and re-create the control plane machine. After this machine is re-created, a new revision is forced and etcd scales up automatically.
++
+If you are running installer-provisioned infrastructure, or you used the Machine API to create your machines, follow these steps. Otherwise, you must create the new control plane by using the same method that was used to originally create it.
+
+.. Obtain the machine for the unhealthy member.
++
+In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
++
+[source,terminal]
+----
+$ oc get machines -n openshift-machine-api -o wide
+----
++
+.Example output
+[source,terminal]
+----
+NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
+clustername-8qw5l-master-0 Running m4.xlarge us-east-1 us-east-1a 3h37m ip-10-0-131-183.ec2.internal aws:///us-east-1a/i-0ec2782f8287dfb7e stopped <1>
+clustername-8qw5l-master-1 Running m4.xlarge us-east-1 us-east-1b 3h37m ip-10-0-154-204.ec2.internal aws:///us-east-1b/i-096c349b700a19631 running
+clustername-8qw5l-master-2 Running m4.xlarge us-east-1 us-east-1c 3h37m ip-10-0-164-97.ec2.internal aws:///us-east-1c/i-02626f1dba9ed5bba running
+clustername-8qw5l-worker-us-east-1a-wbtgd Running m4.large us-east-1 us-east-1a 3h28m ip-10-0-129-226.ec2.internal aws:///us-east-1a/i-010ef6279b4662ced running
+clustername-8qw5l-worker-us-east-1b-lrdxb Running m4.large us-east-1 us-east-1b 3h28m ip-10-0-144-248.ec2.internal aws:///us-east-1b/i-0cb45ac45a166173b running
+clustername-8qw5l-worker-us-east-1c-pkg26 Running m4.large us-east-1 us-east-1c 3h28m ip-10-0-170-181.ec2.internal aws:///us-east-1c/i-06861c00007751b0a running
+----
+<1> This is the control plane machine for the unhealthy node, `ip-10-0-131-183.ec2.internal`.
+
+.. Save the machine configuration to a file on your file system:
++
+[source,terminal]
+----
+$ oc get machine clustername-8qw5l-master-0 \ <1>
+    -n openshift-machine-api \
+    -o yaml \
+    > new-master-machine.yaml
+----
+<1> Specify the name of the control plane machine for the unhealthy node.
+
+.. Edit the `new-master-machine.yaml` file that was created in the previous step to assign a new name and remove unnecessary fields.
+
+... Remove the entire `status` section:
++
+[source,yaml]
+----
+status:
+  addresses:
+  - address: 10.0.131.183
+    type: InternalIP
+  - address: ip-10-0-131-183.ec2.internal
+    type: InternalDNS
+  - address: ip-10-0-131-183.ec2.internal
+    type: Hostname
+  lastUpdated: "2020-04-20T17:44:29Z"
+  nodeRef:
+    kind: Node
+    name: ip-10-0-131-183.ec2.internal
+    uid: acca4411-af0d-4387-b73e-52b2484295ad
+  phase: Running
+  providerStatus:
+    apiVersion: awsproviderconfig.openshift.io/v1beta1
+    conditions:
+    - lastProbeTime: "2020-04-20T16:53:50Z"
+      lastTransitionTime: "2020-04-20T16:53:50Z"
+      message: machine successfully created
+      reason: MachineCreationSucceeded
+      status: "True"
+      type: MachineCreation
+    instanceId: i-0fdb85790d76d0c3f
+    instanceState: stopped
+    kind: AWSMachineProviderStatus
+----
+
+... Change the `metadata.name` field to a new name.
++
+Keep the same base name as the old machine and change the ending number to the next available number. In this example, `clustername-8qw5l-master-0` is changed to `clustername-8qw5l-master-3`.
++
+For example:
++
+[source,yaml]
+----
+apiVersion: machine.openshift.io/v1beta1
+kind: Machine
+metadata:
+  ...
+  name: clustername-8qw5l-master-3
+  ...
+----
+
+... Remove the `spec.providerID` field:
++
+[source,yaml]
+----
+providerID: aws:///us-east-1a/i-0fdb85790d76d0c3f
+----
+
+.. Delete the machine of the unhealthy member:
++
+[source,terminal]
+----
+$ oc delete machine -n openshift-machine-api clustername-8qw5l-master-0 <1>
+----
+<1> Specify the name of the control plane machine for the unhealthy node.
+
+.. Verify that the machine was deleted:
++
+[source,terminal]
+----
+$ oc get machines -n openshift-machine-api -o wide
+----
++
+.Example output
+[source,terminal]
+----
+NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
+clustername-8qw5l-master-1 Running m4.xlarge us-east-1 us-east-1b 3h37m ip-10-0-154-204.ec2.internal aws:///us-east-1b/i-096c349b700a19631 running
+clustername-8qw5l-master-2 Running m4.xlarge us-east-1 us-east-1c 3h37m ip-10-0-164-97.ec2.internal aws:///us-east-1c/i-02626f1dba9ed5bba running
+clustername-8qw5l-worker-us-east-1a-wbtgd Running m4.large us-east-1 us-east-1a 3h28m ip-10-0-129-226.ec2.internal aws:///us-east-1a/i-010ef6279b4662ced running
+clustername-8qw5l-worker-us-east-1b-lrdxb Running m4.large us-east-1 us-east-1b 3h28m ip-10-0-144-248.ec2.internal aws:///us-east-1b/i-0cb45ac45a166173b running
+clustername-8qw5l-worker-us-east-1c-pkg26 Running m4.large us-east-1 us-east-1c 3h28m ip-10-0-170-181.ec2.internal aws:///us-east-1c/i-06861c00007751b0a running
+----
+
+.. Create the new machine by using the `new-master-machine.yaml` file:
++
+[source,terminal]
+----
+$ oc apply -f new-master-machine.yaml
+----
+
+.. Verify that the new machine was created:
++
+[source,terminal]
+----
+$ oc get machines -n openshift-machine-api -o wide
+----
++
+.Example output
+[source,terminal]
+----
+NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
+clustername-8qw5l-master-1 Running m4.xlarge us-east-1 us-east-1b 3h37m ip-10-0-154-204.ec2.internal aws:///us-east-1b/i-096c349b700a19631 running
+clustername-8qw5l-master-2 Running m4.xlarge us-east-1 us-east-1c 3h37m ip-10-0-164-97.ec2.internal aws:///us-east-1c/i-02626f1dba9ed5bba running
+clustername-8qw5l-master-3 Provisioning m4.xlarge us-east-1 us-east-1a 85s ip-10-0-133-53.ec2.internal aws:///us-east-1a/i-015b0888fe17bc2c8 running <1>
+clustername-8qw5l-worker-us-east-1a-wbtgd Running m4.large us-east-1 us-east-1a 3h28m ip-10-0-129-226.ec2.internal aws:///us-east-1a/i-010ef6279b4662ced running
+clustername-8qw5l-worker-us-east-1b-lrdxb Running m4.large us-east-1 us-east-1b 3h28m ip-10-0-144-248.ec2.internal aws:///us-east-1b/i-0cb45ac45a166173b running
+clustername-8qw5l-worker-us-east-1c-pkg26 Running m4.large us-east-1 us-east-1c 3h28m ip-10-0-170-181.ec2.internal aws:///us-east-1c/i-06861c00007751b0a running
+----
+<1> The new machine, `clustername-8qw5l-master-3` is being created and is ready once the phase changes from `Provisioning` to `Running`.
++
+It might take a few minutes for the new machine to be created. The etcd cluster Operator automatically syncs when the machine or node returns to a healthy state.
+
 . Turn the quorum guard back on by entering the following command:
 +
 [source,terminal]
@@ -337,4 +493,4 @@ If the output from the previous command lists more than three etcd members, you
 [WARNING]
 ====
 Be sure to remove the correct etcd member; removing a good etcd member might lead to quorum loss.
-====
+====
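The branch that this revision introduces, first check for a `ControlPlaneMachineSet` and then follow one of two replacement paths, can be sketched as a small shell wrapper. This is a hypothetical illustration, not part of the commit: the `OC` variable, the function names, and the assumption that the control plane machine set is named `cluster` are mine; `OC` is injectable so the decision logic can be exercised without a cluster.

```shell
#!/bin/sh
# Hypothetical sketch of the decision the revised procedure adds.
# OC defaults to the real CLI but can be overridden for dry runs.
OC="${OC:-oc}"

cpms_exists() {
    # Requesting a *named* resource makes `oc get` exit non-zero when it is
    # absent. Assumes the conventional name `cluster` for the CPMS.
    "$OC" -n openshift-machine-api get controlplanemachineset cluster \
        >/dev/null 2>&1
}

choose_replacement_path() {
    if cpms_exists; then
        # CPMS present: delete the machine; the machine set re-creates it.
        echo "cpms-present"
    else
        # No CPMS: save the machine YAML, delete it, and re-apply manually.
        echo "cpms-absent"
    fi
}
```

Setting `OC=true` or `OC=false` stubs the cluster call (both commands ignore their arguments and exit 0 or 1), which is enough to see each branch taken.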
