From d6a89870a73601d752eb7894a87b932f67826a25 Mon Sep 17 00:00:00 2001
From: Yuanruo Liang
Date: Fri, 20 Aug 2021 12:29:24 +0800
Subject: [PATCH 1/2] fix: add gitignore, remove .dccache file

---
 .gitignore | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.gitignore b/.gitignore
index e43b0f9..ecb4e5b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,2 @@
 .DS_Store
+.dccache

From d08689307a770e370853a7a85d600b36a6757db6 Mon Sep 17 00:00:00 2001
From: Yuanruo Liang
Date: Fri, 20 Aug 2021 16:43:08 +0800
Subject: [PATCH 2/2] feat: define alert priority levels

---
 .../incident-response.md | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/monitoring-and-incident-response/incident-response.md b/monitoring-and-incident-response/incident-response.md
index 084bf34..0c0a824 100644
--- a/monitoring-and-incident-response/incident-response.md
+++ b/monitoring-and-incident-response/incident-response.md
@@ -42,6 +42,18 @@ Do also note that the on-call engineer and incident manager are roles - it can b
 4. If there is an ETA for restoration, what is it?
 5. When & where will the next update be?
 
+## Alerts and priority levels
+
+Incident priority levels are a measurement of urgency. Typically, this depends on the impact an incident has on the user or damage to the organisation.
+
+| Priority | Description | Examples |
+|---------- |------------------------------------------------------------------------------------ |----------------------------------------------------------------------------------------------------------------- |
+| P1 | A critical incident with great impact. Requires immediate attention by on-call. | A customer-facing service is down for all customers. Confidentiality or privacy is breached. User data loss. |
+| P2 | A major incident with high impact. Requires immediate attention by on-call. | Customer-facing service is unavailable for a subset of customers. Core functionality is significantly impacted. |
+| P3 | An incident with moderate impact. Does not require immediate attention by on-call. | Minor inconvenience to users, with a reasonable workaround available. Usable performance degradation. |
+| P4 | A minor incident with low impact. Does not require immediate attention by on-call. | UI render bug that does not impede user functionality or cause brand damage. |
+| P5 | Non-incidents, informational only. Does not require any action by on-call. | Used for informational purposes only. |
+
 ## Roles and responsibilities
 
 ### On-call Engineer
@@ -97,7 +109,7 @@ Do also note that the on-call engineer and incident manager are roles - it can b
 ### RACI Matrix for Incident Response
-**R** - Responsible **A** - Accountable **C** - Consulted **I** - Informed
+**R** - Responsible **A** - Accountable **C** - Consulted **I** - Informed
@@ -220,7 +232,7 @@ A runbook is the documented form of a team’s procedures for conducting a task
 
 A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and planned follow-up actions to prevent the incident from recurring.
 
-Writing a postmortem is not punishment—it is a learning opportunity for the entire team.
+Writing a postmortem is not punishment—it is a learning opportunity for the entire team.
 
 For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment.
 
@@ -270,4 +282,4 @@ Fixing forward has significant downsides:
 
 * Hotfixes are usually rushed out during high stress situations, and code written in haste usually escapes the scrutiny and testing expected of regular changes
 * The point fix usually incurs significant system entropy and technical debt
-* There is a possibility of breaking something else
\ No newline at end of file
+* There is a possibility of breaking something else