Description
Acceptance Criteria
Add Monitors for the following (this list is WIP):
- High number of HTTP 500s
- High number of instances (indicates autoscaling is working, but consuming too many resources)
- High latency on page load (indicates overall site performance degradation)
- High number of jobs enqueued in Redis (indicates Celery workers aren't keeping up with demand)
- Synthetic page-load tests failing (canary uptime test)
- Ensure read replica DBs are also monitored
- High memory on DBs
- High CPU on DBs (DB load is a known bottleneck; may need to find and kill long-running queries)
- High latency on DB queries (indicates inefficient queries or high DB load)
- High CPU on cluster (indicates autoscaling is lagging behind demand)
XD Links:
Tech Details:
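One possible way to track the WIP monitor list is as data, so each monitor has a name, a metric, a threshold, and a note on what a breach indicates. The sketch below is only an illustration: the metric keys (for example `http.5xx_per_min`), all threshold values, and the `breached` helper are hypothetical placeholders, not names from any particular monitoring tool, and would need to be replaced with real metric names and tuned baselines. DB monitors are generated per role so read replicas get the same coverage as the primary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Monitor:
    name: str         # monitor name, taken from the Acceptance Criteria list
    metric: str       # HYPOTHETICAL metric key; real keys depend on the tool in use
    threshold: float  # PLACEHOLDER value; tune against observed baselines
    indicates: str    # what a breach most likely means


# Assumption: DB monitors are duplicated per role so replicas are covered too.
DB_ROLES = ("primary", "replica")


def db_monitors() -> List[Monitor]:
    """Build the memory, CPU, and query-latency monitors for every DB role."""
    out: List[Monitor] = []
    for role in DB_ROLES:
        out.append(Monitor(f"High memory on DB ({role})",
                           f"db.{role}.mem_pct", 90.0, "memory pressure"))
        out.append(Monitor(f"High CPU on DB ({role})",
                           f"db.{role}.cpu_pct", 80.0,
                           "DB load bottleneck; check long-running queries"))
        out.append(Monitor(f"High latency on DB queries ({role})",
                           f"db.{role}.query_p95_ms", 500.0,
                           "inefficient queries or high DB load"))
    return out


MONITORS: List[Monitor] = [
    Monitor("High number of 500s", "http.5xx_per_min", 50.0,
            "application errors"),
    Monitor("High number of instances", "cluster.instance_count", 20.0,
            "autoscaling works but consumes too many resources"),
    Monitor("High latency on page load", "web.page_load_p95_ms", 3000.0,
            "overall site performance degradation"),
    Monitor("High number of jobs enqueued in Redis", "redis.queue_length", 1000.0,
            "Celery workers not keeping up with demand"),
    Monitor("Synthetic page-load tests failing", "synthetic.failures", 0.0,
            "canary uptime test failed"),
    Monitor("High CPU on cluster", "cluster.cpu_pct", 80.0,
            "autoscaling lagging behind demand"),
] + db_monitors()


def breached(monitors: List[Monitor], samples: Dict[str, float]) -> List[Monitor]:
    """Return monitors whose latest sample exceeds the threshold.

    Metrics with no sample are skipped rather than treated as breaches.
    """
    return [m for m in monitors
            if m.metric in samples and samples[m.metric] > m.threshold]
```

For example, `breached(MONITORS, {"http.5xx_per_min": 51, "db.replica.cpu_pct": 95})` would return the 500s monitor and the replica CPU monitor. Keeping the list as data makes it straightforward to review thresholds in one place while the list is still being finalized.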
Open Questions:
Notes/Assumptions: