---
title: Checking Pivotal Platform State after a Power Failure on vSphere
owner:
---
<% current_page.data.title = "Checking " + vars.product_name + " State after a Power Failure on vSphere" %>
This topic describes how to check <%= vars.first_product_name %> state after a power failure in an on-premises vSphere installation.
If your company has a procedure for handling power failure scenarios and you would like to add steps for checking that <%= vars.product_name %> is in a good state, you can use this procedure as a template.
## <a id="overview"></a> Overview
This section describes the process that <%= vars.product_name %> uses to recover from power failures, and the exceptions to that process.
### <a id="recovery-process"></a> Automatic Recovery Process
When power returns after a failure, vSphere and <%= vars.product_name %> automatically do the following to recover your environment:
1. vSphere High Availability (HA) recovers VMs.
2. BOSH ensures the processes on those VMs are healthy, with the exception of the Ops Manager VM and the BOSH VM itself. <%= vars.product_name %> uses BOSH to deploy and manage its VMs. For more information, see [BOSH](https://bosh.io/docs).
3. The Diego runtime of Pivotal Application Service (PAS) recovers apps that were running on the VMs. For more information, see [Diego](../concepts/diego/diego-architecture.html).
### <a id="manual-recovery-scenarios"></a> Scenarios that Require Manual Intervention
There are two scenarios that can require manual intervention when recovering your environment after a power failure:
* If PAS is configured to use a MySQL cluster instead of a single node, the cluster does not recover automatically.
* If you have Ops Manager v2.5.3 or earlier and encounter the following known issue in the BOSH Director: [Monit inaccurately reports the health of UAA](https://docs.pivotal.io/pivotalcf/2-5/pcf-release-notes/opsmanager-rn.html#monit).
The procedure in this topic includes more detail about addressing these scenarios.
## <a id="checklist"></a> Checklist
Use the checklist in this section to ensure <%= vars.product_name %> is in a good state after a power failure. It includes links to sections that contain more detail about each phase.
This checklist assumes your <%= vars.product_name %> on vSphere installation is set up for vSphere HA and you have the BOSH Resurrector enabled.
<table>
<tr>
<th>Phase</th>
<th>Component</th>
<th>Action</th>
</tr>
<tr>
<td>1</td>
<td>vSphere</td>
<td><a href="#check-vSphere">Ensure vSphere is Running</a></td>
</tr>
<tr>
<td>2</td>
<td>Ops Manager</td>
<td><a href="#check-ops-manager">Ensure Ops Manager is Running</a></td>
</tr>
<tr>
<td>3</td>
<td>BOSH Director</td>
<td><a href="#check-bosh">Ensure BOSH Director is Running</a></td>
</tr>
<tr>
<td>4</td>
<td>BOSH Director</td>
<td><a href="#resurrector">Ensure BOSH Resurrector Finished Recovering</a></td>
</tr>
<tr>
<td>5</td>
<td>PAS</td>
<td><a href="#check-pas">Ensure PAS VMs are Running</a> (This may include manually recovering the MySQL cluster)</td>
</tr>
<tr>
<td>6</td>
<td>PAS</td>
<td><a href="#check-apps">Ensure Apps Hosted on PAS are Running</a></td>
</tr>
<tr>
<td>7</td>
<td><%= vars.product_name %> Healthwatch</td>
<td><a href="#check-hw">Check the Healthwatch Dashboard</a></td>
</tr>
</table>
## <a id="check-vSphere"></a> Phase 1: Ensure vSphere is Running
Ensure that vSphere is running and has fully recovered from the power failure. Check your internal vSphere monitoring dashboard.
## <a id="check-ops-manager"></a> Phase 2: Ensure Ops Manager is Running
To ensure Ops Manager is running, do the following:
1. Open vCenter and navigate to the resource pool that hosts your <%= vars.product_name %> deployment.
1. Select **Related Objects**, and then **Virtual Machines**.
1. Locate the VM with the name `OpsMan-VERSION`, such as `OpsMan-2.6`.
1. Review the **State** and **Status** columns for the Ops Manager VM. If Ops Manager is running, they say **Powered On** and **Normal**. If this is not the case, restart the VM.
## <a id="check-bosh"></a> Phase 3: Ensure BOSH Director is Running
To ensure BOSH Director is running, do the following:
1. In a browser, navigate to the <%= vars.product_name %> Ops Manager UI and select the **BOSH Director for vSphere** tile.
<p class="note"><strong>Note</strong>: If you do not know the URL of the Ops Manager VM, you can use the IP address from vCenter.</p>
1. Select **Status**.
1. In the **BOSH Director** row, record the **CID**. The CID is the cloud ID and corresponds to the VM name in vSphere.
1. Navigate to the vCenter resource pool or cluster that hosts your <%= vars.product_name %> deployment.
1. Select **Related Objects**, and then **Virtual Machines**.
1. Locate the VM with the name that corresponds to the **CID** value you recorded.
1. Review the **State** and **Status** columns for the VM. If the **State** is not **Powered On**, restart the VM.
1. If the VM is **Powered On** but **Status** does not display **Normal**, it may be due to the following known issue: [Monit inaccurately reports the health of UAA](https://docs.pivotal.io/pivotalcf/2-5/pcf-release-notes/opsmanager-rn.html#monit). To resolve this issue, do the following:
1. SSH into the BOSH Director VM using the instructions in [SSH into the BOSH Director VM](../customizing/trouble-advanced.html#bosh-director-ssh).
1. Run the following command to see that all processes are running:
```
monit summary
```
1. If the `uaa` process is not running, run the following command:
```
monit restart uaa
```
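On a VM with many processes, a small filter can make non-running entries in the `monit summary` output easier to spot. The following is a minimal sketch, not part of the official procedure. It assumes that each process appears on a line of the form `Process 'NAME'  STATUS`, and the function name `monit_not_running` is hypothetical.

```shell
# Hypothetical helper: print every monit process whose status is
# anything other than "running".
# Assumes lines of the form:  Process 'uaa'    running
monit_not_running() {
  awk -v q="'" '$1 == "Process" {
    name = $2
    gsub(q, "", name)                 # strip the quotes around the name
    status = $3
    for (i = 4; i <= NF; i++) status = status " " $i   # multi-word statuses
    if (status != "running") print name ": " status
  }'
}

# Usage on the BOSH Director VM:
#   monit summary | monit_not_running
```

Empty output means that every listed process reports `running`. Lines that do not start with `Process`, such as the daemon banner and the `System` entry, are ignored.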
## <a id="resurrector"></a> Phase 4: Ensure BOSH Resurrector Finished Recovering
If enabled, the BOSH Resurrector re-creates any VMs that are still in a problematic state after vSphere HA recovers them.
To ensure BOSH Resurrector finished recovering, do the following:
1. Log in to the Ops Manager VM with SSH using the instructions in [Log in to the Ops Manager VM with SSH](../customizing/trouble-advanced.html#ssh).
1. Authenticate with the BOSH Director VM using the instructions in [Authenticate with the BOSH Director VM](../customizing/trouble-advanced.html#log-in).
1. Run the following command to see if there is any currently running or queued Resurrector activity:
```
bosh tasks --all -d ''
```
Look for `scan` and `fix` in the task description. If there are no such tasks running or queued, it is likely that the BOSH Resurrector has finished recovering. You can also run `bosh tasks --recent --all -d ''` to view finished tasks.
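To pick out Resurrector activity from a long task list, you can filter the output for the `scan` and `fix` keywords mentioned above. This is a minimal sketch that assumes one task per line of text output. The function name `resurrector_tasks` is hypothetical.

```shell
# Hypothetical helper: keep only task lines that mention "scan" or "fix"
# (case-insensitive), the keywords to look for in task descriptions.
resurrector_tasks() {
  grep -i -E 'scan|fix' || true     # no matches is not an error here
}

# Usage after authenticating with the BOSH Director:
#   bosh tasks --all -d '' | resurrector_tasks
```

Empty output suggests that no Resurrector task is running or queued. Because this is a plain keyword match, it can also catch unrelated descriptions that happen to contain either word, so review any matches by eye.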
## <a id="check-pas"></a> Phase 5: Ensure PAS VMs are Running
<p class="note"><strong>Note</strong>: You can also apply the steps in this section to any <%= vars.product_name %> services. To further ensure the health of <%= vars.product_name %> services, use the <%= vars.product_name %> Healthwatch dashboard and the documentation for each service.</p>
To ensure PAS VMs are running, do the following:
1. Run the following command to confirm that VMs are running:
```
bosh vms
```
BOSH lists VMs by deployment. The deployment with the `cf-` prefix is the PAS deployment.
1. If the `mysql` VM is not running, the likely cause is that MySQL is deployed as a cluster rather than a single node. Clusters require manual intervention after an outage. See [Manually Recover PAS MySQL (Clusters Only)](#check-mysql) to confirm and recover the cluster.
1. If any other VMs are not running, run the following command:
```
bosh cck -d DEPLOYMENT
```
Replace `DEPLOYMENT` with the name of the deployment that contains the failing VMs. This command scans for problems and provides options for recovering VMs. For more information, see [IaaS Reconciliation](https://bosh.io/docs/cck/) in the BOSH documentation.
1. If you cannot get all VMs running, contact [Pivotal Support](https://support.pivotal.io) for assistance. Provide the following information:
* You have started this checklist to recover from a power failure on vSphere
* A list of failing VMs
* Your <%= vars.product_name %> version
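When `bosh vms` returns many rows, a filter can list only the instances that need attention, which also gives you the list of failing VMs to send to support. The sketch below is an illustration under stated assumptions: it expects tab-separated rows with the instance name in the first column and the process state in the second. Confirm that layout against the output of your BOSH CLI version before relying on it. The function name `vms_not_running` is hypothetical.

```shell
# Hypothetical helper: print instances whose process state is not "running".
# Assumes tab-separated rows: column 1 = instance, column 2 = process state.
vms_not_running() {
  awk -F '\t' 'NF >= 2 {
    name = $1
    state = $2
    gsub(/^ +| +$/, "", name)        # trim padding spaces
    gsub(/^ +| +$/, "", state)
    if (name == "Instance") next     # skip a header row, if one is present
    if (state != "running") print name "\t" state
  }'
}

# Usage after authenticating with the BOSH Director:
#   bosh vms | vms_not_running
```

Empty output means that every parsed row reports `running`. Lines without a tab, such as status messages, are ignored.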
### <a id="check-mysql"></a> Manually Recover PAS MySQL (Clusters Only)
To manually recover PAS MySQL, do the following:
1. In a browser, navigate to the <%= vars.product_name %> Ops Manager UI and select the **Pivotal Application Service** tile.
1. Select the **Resource Config** pane.
1. Review the **INSTANCES** column of the **MySQL Server** job. If the number of instances is greater than `1`, manually recover MySQL by following this procedure: [Recovering From MySQL Cluster Downtime](../mysql/bootstrap-mysql.html).
## <a id="check-apps"></a> Phase 6: Ensure Apps Hosted on PAS are Running
To ensure apps hosted on PAS are running, do the following:
1. Check the status of an app your company runs on <%= vars.product_name %>. Run any health checks that the app has, or visit the URL of the app to confirm that it is working.
1. Push an app to <%= vars.product_name %>.
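For an app that exposes an HTTP endpoint, the first check above can be scripted. The following is a minimal sketch that assumes `curl` is available on your workstation. The function names `status_is_healthy` and `app_is_up` and the example URL are hypothetical.

```shell
# Hypothetical helper: succeed if the given HTTP status code is 2xx or 3xx.
status_is_healthy() {
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: request a URL and report whether the app responded.
# curl prints 000 as the status code when the request fails outright.
app_is_up() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1") || true
  if status_is_healthy "$code"; then
    echo "UP ($code): $1"
  else
    echo "DOWN ($code): $1"
    return 1
  fi
}

# Usage (replace the URL with one for an app your company runs):
#   app_is_up https://my-app.example.com/health
```

A successful response shows only that the route and at least one app instance are reachable. It does not replace any health checks that the app itself provides.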
## <a id="check-hw"></a> Phase 7: Check the Healthwatch Dashboard
You can use <%= vars.product_name %> Healthwatch to further assess the state of <%= vars.product_name %>. For more information, see [Using <%= vars.product_name %> Healthwatch](https://docs.pivotal.io/pcf-healthwatch/using.html).