monitoring-and-incident-response/incident-response.md (15 additions & 3 deletions)
@@ -42,6 +42,18 @@ Do also note that the on-call engineer and incident manager are roles - it can b
4. If there is an ETA for restoration, what is it?
5. When and where will the next update be?

## Alerts and priority levels

Incident priority levels are a measure of urgency. The level typically depends on the impact an incident has on users or the damage it causes to the organisation.

| Priority | Description | Examples |
| --- | --- | --- |
| P1 | A critical incident with great impact. Requires immediate attention by on-call. | A customer-facing service is down for all customers. Confidentiality or privacy is breached. User data loss. |
| P2 | A major incident with high impact. Requires immediate attention by on-call. | A customer-facing service is unavailable for a subset of customers. Core functionality is significantly impacted. |
| P3 | An incident with moderate impact. Does not require immediate attention by on-call. | Minor inconvenience to users, with a reasonable workaround available. Usable performance degradation. |
| P4 | A minor incident with low impact. Does not require immediate attention by on-call. | UI render bug that does not impede user functionality or cause brand damage. |
| P5 | Non-incidents, informational only. Does not require any action by on-call. | Used for informational purposes only. |

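
The on-call expectations in the priority table above can be sketched in code, for example as the routing rule an alerting setup might apply. This is a minimal illustration only: the names `Priority`, `OnCallResponse`, `RESPONSE_BY_PRIORITY`, `on_call_response` and `requires_immediate_attention` are hypothetical and do not come from any particular alerting tool.

```python
from enum import Enum, IntEnum


class Priority(IntEnum):
    """Priority levels from the table; a lower number means more urgent."""
    P1 = 1
    P2 = 2
    P3 = 3
    P4 = 4
    P5 = 5


class OnCallResponse(Enum):
    """What the table asks of the on-call engineer (hypothetical names)."""
    IMMEDIATE = "requires immediate attention by on-call"
    DEFERRED = "does not require immediate attention by on-call"
    NONE = "does not require any action by on-call"


# Direct transcription of the "Description" column of the table.
RESPONSE_BY_PRIORITY = {
    Priority.P1: OnCallResponse.IMMEDIATE,
    Priority.P2: OnCallResponse.IMMEDIATE,
    Priority.P3: OnCallResponse.DEFERRED,
    Priority.P4: OnCallResponse.DEFERRED,
    Priority.P5: OnCallResponse.NONE,
}


def on_call_response(priority: Priority) -> OnCallResponse:
    """Look up the expected on-call response for a priority level."""
    return RESPONSE_BY_PRIORITY[priority]


def requires_immediate_attention(priority: Priority) -> bool:
    """True only for the levels the table marks as needing immediate attention."""
    return on_call_response(priority) is OnCallResponse.IMMEDIATE


if __name__ == "__main__":
    for level in Priority:
        print(level.name, "->", on_call_response(level).value)
```

Keeping the mapping in a single dictionary means a change to the table only needs to be mirrored in one place.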
## Roles and responsibilities

### On-call Engineer
@@ -97,7 +109,7 @@ Do also note that the on-call engineer and incident manager are roles - it can b
@@ -220,7 +232,7 @@ A runbook is the documented form of a team’s procedures for conducting a task
A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and planned follow-up actions to prevent the incident from recurring.

Writing a postmortem is not punishment; it is a learning opportunity for the entire team.

For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger-pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment.

@@ -270,4 +282,4 @@ Fixing forward has significant downsides:
* Hotfixes are usually rushed out during high-stress situations, and code written in haste usually escapes the scrutiny and testing expected of regular changes
* The point fix usually incurs significant system entropy and technical debt
* There is a possibility of breaking something else