Our infrastructure is hosted using MoJ's Cloud Platform. Resources are defined in the cloud-platform-environments repository, on a per-environment basis.
This defines the infrastructure for the Express app. This includes:
- Ingress - defining web access to the app
- Autoscaling - provisioning the required amount of pods to manage the load
- Config - configurable values for the app
- Secrets - secret config values for the app
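To see what these definitions have produced in a running environment, the live objects can be listed with `kubectl`. The following is a minimal sketch, not part of the original setup: the helper name `list_app_resources` is hypothetical, and it assumes that autoscaling is implemented with a HorizontalPodAutoscaler and that config is held in ConfigMaps.

```shell
# Hypothetical helper: list the ingress, autoscaler, config and secret
# objects in a namespace. Secrets are listed by name only; their values
# are not printed.
# Assumes autoscaling uses a HorizontalPodAutoscaler (hpa).
list_app_resources() {
  kubectl -n "$1" get ingress,hpa,configmap,secret
}

# Usage:
#   list_app_resources care-arrangement-plan-prod
```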
The following sections are short runbooks that explain what each alert means and the actions that could be taken to remedy the issue.
This alert fires when, over a 3 minute period, more than 1% of requests were slower than 5 seconds. In other words, the p99 latency is above 5 seconds. This means that some users are experiencing slow loading of the website. Action could include:
- Reviewing the ingress dashboard - the link is in the Slack message.
- Checking the pods e.g.
NSP=care-arrangement-plan-prod
kubectl -n $NSP get pods -owide
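The threshold in this alert can be made concrete with a small worked example. This is an illustrative sketch only: the helper name `slow_request_percent` is hypothetical, and it assumes request durations are available as one value in seconds per line, which is not how the alerting rule itself reads its data.

```shell
# Hypothetical helper: print the percentage of requests slower than a
# threshold (in seconds). Reads one request duration per line on stdin.
slow_request_percent() {
  awk -v t="$1" '
    $1 > t { slow++ }
    END { if (NR > 0) printf "%.1f\n", 100 * slow / NR; else print "0.0" }
  '
}

# Example: 200 requests, 3 of which took longer than 5 seconds.
{
  awk 'BEGIN { for (i = 0; i < 197; i++) print 0.4 }'
  echo 6.2
  echo 5.5
  echo 9.0
} | slow_request_percent 5
# prints 1.5
```

Since 1.5% is above the 1% threshold, a 3 minute window like this one would trigger the alert.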
This alert fires when, over a 3 minute period, more than 10% of requests were slower than 7.5 seconds. In other words, the p90 latency is above 7.5 seconds. This means that a significant percentage of users are experiencing slow loading of the website. Start with the actions under the slow response section. Additional action could include:
- Restart the pods with
kubectl rollout restart deployment/$NSP-deployment -n $NSP
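After a restart it can help to wait until the new pods are ready before judging whether latency has recovered. A minimal sketch, assuming the deployment follows the `$NSP-deployment` naming used above; the helper name `restart_and_wait` and the 5 minute timeout are illustrative choices, not values from this runbook.

```shell
# Hypothetical helper: restart the deployment, then block until the
# rollout completes or the timeout is reached.
# Assumes the deployment is named "<namespace>-deployment".
restart_and_wait() {
  kubectl rollout restart "deployment/${1}-deployment" -n "$1" &&
    kubectl rollout status "deployment/${1}-deployment" -n "$1" --timeout=5m
}

# Usage:
#   restart_and_wait "$NSP"
```

`kubectl rollout status` exits with a non-zero status if the rollout does not finish within the timeout, so the helper's exit code shows whether the restart succeeded.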
This alert fires when, over a 3 minute period, x% of responses are 4xx responses. Note: as this alert was frequently being triggered by bots hitting 4xx pages outside working hours, it has been restricted to firing only on weekdays during working hours. This could indicate an issue with the database, or that a significant percentage of content has been deleted without adding a redirect. Action could include:
- Reviewing Grafana to understand the extent and timeline of the issue and identify the specific 4xx codes (e.g. is it mostly 400 or 404?).
- Checking the health of infrastructure.
- Verifying content was not unintentionally deleted.
- Investigating client request patterns to see if a new integration or client application is misbehaving.
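If Grafana is not enough to see which 4xx codes dominate, a quick tally can be made from any source that yields status codes. This is a sketch under assumptions: the helper name `count_4xx` is hypothetical, and it expects one HTTP status code per line on stdin, so extracting the codes from the actual log format is left to the reader.

```shell
# Hypothetical helper: count 4xx status codes, most frequent first.
# Reads one HTTP status code per line on stdin.
count_4xx() {
  awk '$1 >= 400 && $1 < 500 { n[$1]++ } END { for (c in n) print n[c], c }' |
    sort -rn
}

# Example with a handful of responses:
printf '404\n200\n404\n400\n500\n404\n' | count_4xx
# prints:
# 3 404
# 1 400
```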
This alert fires when, over a 3 minute period, 5% of responses are 5xx errors. This could indicate an issue with the codebase throwing a server error, or failing infrastructure like ElastiCache. Action could include:
- Reviewing logs for error messages.
- Reviewing recent deployments, looking for a bug.
- Checking the health of infrastructure.
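The log review can be started from the command line. A minimal sketch, assuming the deployment follows the `$NSP-deployment` naming used for the restart command and that error lines contain the word "error"; the helper name `recent_errors` and the 15 minute window are illustrative.

```shell
# Hypothetical helper: show log lines from the last 15 minutes that
# mention "error" (case-insensitive).
# Assumes the deployment is named "<namespace>-deployment".
# Note: "kubectl logs deployment/..." reads from a single pod of the
# deployment, so repeat per pod if the errors are not evenly spread.
recent_errors() {
  kubectl -n "$1" logs "deployment/${1}-deployment" --since=15m |
    grep -i "error"
}

# Usage:
#   recent_errors "$NSP"
```

To look for a recent deployment that might have introduced a bug, `kubectl -n $NSP rollout history deployment/$NSP-deployment` lists the deployment's revisions.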
This alert fires when the ElastiCache CPU utilisation is above 70%. Actions could include:
- Review and monitor the Grafana dashboard.
- Provision a larger instance.
This alert fires when the ElastiCache freeable memory is lower than 500MB. Actions could include:
- Review and monitor the Grafana dashboard.
- Provision a larger instance.