Commit 4da98e7

feat: define alert priority levels

Author: Yuanruo Liang
1 parent 9849617, commit 4da98e7

1 file changed: +15 -3 lines changed
monitoring-and-incident-response/incident-response.md

Lines changed: 15 additions & 3 deletions
@@ -42,6 +42,18 @@ Do also note that the on-call engineer and incident manager are roles - it can b
 4. If there is an ETA for restoration, what is it?
 5. When & where will the next update be?

+## Alerts and priority levels
+
+Incident priority levels are a measurement of urgency. Typically, this depends on the impact an incident has on the user or damage to the organisation.
+
+| Priority | Description | Examples |
+|----------|-------------|----------|
+| P1 | A critical incident with great impact. Requires immediate attention by on-call. | A customer-facing service is down for all customers. Confidentiality or privacy is breached. User data loss. |
+| P2 | A major incident with high impact. Requires immediate attention by on-call. | Customer-facing service is unavailable for a subset of customers. Core functionality is significantly impacted. |
+| P3 | An incident with moderate impact. Does not require immediate attention by on-call. | Minor inconvenience to users, with a reasonable workaround available. Usable performance degradation. |
+| P4 | A minor incident with low impact. Does not require immediate attention by on-call. | UI render bug that does not impede user functionality or cause brand damage. |
+| P5 | Non-incidents, informational only. Does not require any action by on-call. | Used for informational purposes only. |
+
 ## Roles and responsibilities

 ### On-call Engineer
@@ -97,7 +109,7 @@ Do also note that the on-call engineer and incident manager are roles - it can b

 ### RACI Matrix for Incident Response

-**R** - Responsible **A** - Accountable **C** - Consulted **I** - Informed
+**R** - Responsible **A** - Accountable **C** - Consulted **I** - Informed

 <table>
 <tr>
@@ -220,7 +232,7 @@ A runbook is the documented form of a team’s procedures for conducting a task

 A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and planned follow-up actions to prevent the incident from recurring.

-Writing a postmortem is not punishment—it is a learning opportunity for the entire team.
+Writing a postmortem is not punishment—it is a learning opportunity for the entire team.

 For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment.

@@ -270,4 +282,4 @@ Fixing forward has significant downsides:

 * Hotfixes are usually rushed out during high stress situations, and code written in haste usually escapes the scrutiny and testing expected of regular changes
 * The point fix usually incurs significant system entropy and technical debt
-* There is a possibility of breaking something else
+* There is a possibility of breaking something else
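
The priority table introduced in this commit can also be expressed as data, which may help when wiring alert routing to the documented levels. The following is a minimal sketch, not part of the commit: the names `Priority`, `PriorityPolicy`, `POLICIES`, and `should_page_on_call` are hypothetical, and treating P3 and P4 as "action required, but not immediately" is one reading of the table rather than something it states outright.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Priority levels from the table; a lower number means more urgent."""
    P1 = 1
    P2 = 2
    P3 = 3
    P4 = 4
    P5 = 5


@dataclass(frozen=True)
class PriorityPolicy:
    description: str
    immediate_attention: bool  # "Requires immediate attention by on-call"
    action_required: bool      # False only for informational P5 (assumed reading)


# Descriptions are copied from the table; the boolean flags are an interpretation.
POLICIES = {
    Priority.P1: PriorityPolicy("A critical incident with great impact.", True, True),
    Priority.P2: PriorityPolicy("A major incident with high impact.", True, True),
    Priority.P3: PriorityPolicy("An incident with moderate impact.", False, True),
    Priority.P4: PriorityPolicy("A minor incident with low impact.", False, True),
    Priority.P5: PriorityPolicy("Non-incidents, informational only.", False, False),
}


def should_page_on_call(priority: Priority) -> bool:
    """Return True when the level calls for immediate on-call attention."""
    return POLICIES[priority].immediate_attention


if __name__ == "__main__":
    for level in Priority:
        print(level.name, "page on-call:", should_page_on_call(level))
```

Keeping the mapping in one dictionary means a change to the documented levels only needs to be mirrored in a single place.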
