1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
.DS_Store
.dccache
18 changes: 15 additions & 3 deletions monitoring-and-incident-response/incident-response.md
@@ -42,6 +42,18 @@ Do also note that the on-call engineer and incident manager are roles - it can b
4. If there is an ETA for restoration, what is it?
5. When & where will the next update be?

## Alerts and priority levels

Incident priority levels are a measurement of urgency. Typically, this depends on the impact an incident has on the user or damage to the organisation.

| Priority | Description | Examples |
|---------- |------------------------------------------------------------------------------------ |----------------------------------------------------------------------------------------------------------------- |
| P1 | A critical incident with great impact. Requires immediate attention by on-call. | A customer-facing service is down for all customers. Confidentiality or privacy is breached. User data loss. |
| P2 | A major incident with high impact. Requires immediate attention by on-call. | Customer-facing service is unavailable for a subset of customers. Core functionality is significantly impacted. |
| P3 | An incident with moderate impact. Does not require immediate attention by on-call. | Minor inconvenience to users, with a reasonable workaround available. Usable performance degradation. |
| P4 | A minor incident with low impact. Does not require immediate attention by on-call. | UI render bug that does not impede user functionality or cause brand damage. |
| P5 | Non-incidents, informational only. Does not require any action by on-call. | Used for informational purposes only. |
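
As a rough illustration, the table above could be encoded so that alert routing follows it mechanically. This is only a sketch under stated assumptions: the names `Priority`, `pages_on_call`, and `needs_action` are hypothetical and do not come from this playbook or from any alerting tool.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Priority levels from the table; a lower number is more urgent."""
    P1 = 1  # critical incident, great impact
    P2 = 2  # major incident, high impact
    P3 = 3  # moderate impact
    P4 = 4  # minor incident, low impact
    P5 = 5  # non-incident, informational only


def pages_on_call(priority: Priority) -> bool:
    """True for P1 and P2, which the table says need immediate attention."""
    return priority <= Priority.P2


def needs_action(priority: Priority) -> bool:
    """False only for P5, which the table marks as requiring no action."""
    return priority != Priority.P5


if __name__ == "__main__":
    for p in Priority:
        print(p.name, "page now:", pages_on_call(p), "action:", needs_action(p))
```

Using an ordered enum keeps the "page now" rule to a single comparison, so adding or renaming a level only touches the enum definition.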

Comment on lines +45 to +56
Contributor

Slight nitpick on terminology: for incidents, I've seen folks refer to severity rather than priority.

Also, just for reference, Zendesk put "Confidentiality or privacy is breached" in a separate SEV0 category. That typically activated more folks to handle special communication, possibly perform a security audit to identify which exploit was used, and make a plan for remediation.

## Roles and responsibilities

### On-call Engineer
@@ -97,7 +109,7 @@ Do also note that the on-call engineer and incident manager are roles - it can b

### RACI Matrix for Incident Response

**R** - Responsible **A** - Accountable **C** - Consulted **I** - Informed
**R** - Responsible **A** - Accountable **C** - Consulted **I** - Informed

<table>
<tr>
@@ -220,7 +232,7 @@ A runbook is the documented form of a team’s procedures for conducting a task

A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and planned follow-up actions to prevent the incident from recurring.

Writing a postmortem is not punishment—it is a learning opportunity for the entire team.
Writing a postmortem is not punishment—it is a learning opportunity for the entire team.

For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment.

@@ -270,4 +282,4 @@ Fixing forward has significant downsides:

* Hotfixes are usually rushed out during high stress situations, and code written in haste usually escapes the scrutiny and testing expected of regular changes
* The point fix usually incurs significant system entropy and technical debt
* There is a possibility of breaking something else
* There is a possibility of breaking something else