Commit 9298220

DGUK-60: Update traffic alert threshold to 25 req/s rightsized capacity
Lower DataGovUkHighTrafficRate alert threshold from 44.6 to 20.0 req/s to reflect the rightsized capacity target of 25 req/s.

Previous: 80% of 55.8 req/s (100 VU load test) = 44.6 req/s
Updated: 80% of 25 req/s (rightsized, 45 VU load test) = 20.0 req/s

Update unit tests to match new threshold and description.
1 parent b299ee9 commit 9298220
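
The threshold arithmetic in the commit message can be checked with a short sketch. The helper name `alert_threshold` and the rounding to one decimal place are assumptions for illustration; they are not part of the commit.

```python
def alert_threshold(capacity_rps: float, fraction: float = 0.8) -> float:
    """Return the alert threshold as a fraction of capacity, rounded to 1 decimal place."""
    return round(capacity_rps * fraction, 1)


# Hypothetical helper applied to the two capacity figures quoted in the commit message.
previous = alert_threshold(55.8)  # peak rate from the 100 VU load test
updated = alert_threshold(25.0)   # rightsized capacity target (45 VU load test)
print(previous, updated)
```

Rounding is only needed for the previous figure, since 80% of 55.8 is 44.64 and the rule used 44.6.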

2 files changed

Lines changed: 20 additions & 20 deletions

charts/monitoring-config/rules/datagovuk_traffic.yaml

Lines changed: 8 additions & 8 deletions
@@ -2,23 +2,23 @@ groups:
   - name: DataGovUkTrafficAlerts
     rules:
       - alert: DataGovUkHighTrafficRate
-        # Fires when the origin request rate to data.gov.uk exceeds 80% of the peak
-        # capacity established by the NDL load test (100 VUs, 55.8 req/s, 0% errors).
-        # 80% of 55.8 req/s = 44.6 req/s. Threshold sustained for 5 minutes.
+        # Fires when the origin request rate to data.gov.uk exceeds 80% of the rightsized
+        # capacity target (25 req/s, validated by NDL load test at 45 VUs).
+        # 80% of 25 req/s = 20 req/s. Threshold sustained for 5 minutes.
         expr: |
-          sum(rate(fastly_rt_origin_fetches_total{service_name=~".*data.gov.uk"}[5m])) > 44.6
+          sum(rate(fastly_rt_origin_fetches_total{service_name=~".*data.gov.uk"}[5m])) > 20.0
         for: 5m
         labels:
           severity: warning
           destination: slack-datagovuk-technical
         annotations:
-          summary: data.gov.uk origin request rate exceeds 80% of load-tested capacity
+          summary: data.gov.uk origin request rate exceeds 80% of rightsized capacity
           description: >-
             The request rate to the data.gov.uk origin has exceeded 80% of the
-            peak capacity established by load testing (55.8 req/s at 100 virtual users).
+            rightsized capacity target (25 req/s, validated by load testing at 45 virtual users).

             Current rate: {{ $value | humanize }} req/s
-            Threshold (80% of 55.8 req/s): 44.6 req/s
+            Threshold (80% of 25 req/s): 20.0 req/s

-            Consider scaling the find and CKAN pods before traffic reaches 100% capacity.
+            Consider scaling the find and CKAN pods before traffic reaches 100% capacity (25 req/s).
           runbook_url: https://docs.publishing.service.gov.uk/manual/alerts/data-gov-uk-high-traffic-alert.html
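
The updated expression uses a strict `>` comparison, so a rate of exactly 20.0 req/s does not satisfy the condition. A minimal sketch of that comparison and of the headroom left before the 25 req/s target follows; the function and constant names are hypothetical and only mirror the numbers in the rule.

```python
CAPACITY_RPS = 25.0   # rightsized capacity target quoted in the rule comment
THRESHOLD_RPS = 20.0  # 80% of capacity, as in the updated expr


def condition_holds(rate_rps: float) -> bool:
    """Mirror the strict `>` comparison in the alert expression."""
    return rate_rps > THRESHOLD_RPS


def utilisation(rate_rps: float) -> float:
    """Fraction of the rightsized capacity currently in use."""
    return rate_rps / CAPACITY_RPS


for rate in (10.0, 20.0, 22.0):
    print(rate, condition_holds(rate), f"{utilisation(rate):.0%}")
```

This covers only the threshold comparison; the `for: 5m` clause additionally requires the condition to hold continuously before the alert fires.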

charts/monitoring-config/rules/datagovuk_traffic_tests.yaml

Lines changed: 12 additions & 12 deletions
@@ -5,31 +5,31 @@ evaluation_interval: 1m

 tests:
   ##
-  # Test 1: No alert when origin request rate is below the 80% threshold (20 req/s)
+  # Test 1: No alert when origin request rate is below the 80% threshold (10 req/s)
   ##
-  - name: No alert when traffic is below 80% of load-tested capacity
+  - name: No alert when traffic is below 80% of rightsized capacity
     interval: 1m
     input_series:
       - series: 'fastly_rt_origin_fetches_total{service_name="staging data.gov.uk"}'
-        values: '0+1200x30' # 1200 counter increment per minute = 20 req/s — below 44.6 threshold
+        values: '0+600x30' # 600 counter increment per minute = 10 req/s — below 20.0 threshold

     alert_rule_test:
       - alertname: DataGovUkHighTrafficRate
         eval_time: 30m
         exp_alerts: []

   ##
-  # Test 2: Alert fires when origin request rate is sustained above 80% threshold (50 req/s)
+  # Test 2: Alert fires when origin request rate is sustained above 80% threshold (22 req/s)
   #
-  # With rate([5m]) and a constant 3000/min (50 req/s) increment:
+  # With rate([5m]) and a constant 1320/min (22 req/s) increment:
   # - Condition first becomes TRUE at t=5m (full 5-minute window populated)
   # - With for: 5m the alert FIRES at t=10m
   ##
-  - name: Alert fires when traffic exceeds 80% of load-tested capacity for 5+ minutes
+  - name: Alert fires when traffic exceeds 80% of rightsized capacity for 5+ minutes
     interval: 1m
     input_series:
       - series: 'fastly_rt_origin_fetches_total{service_name="staging data.gov.uk"}'
-        values: '0+3000x30' # 3000 counter increment per minute = 50 req/s — exceeds 44.6 threshold
+        values: '0+1320x30' # 1320 counter increment per minute = 22 req/s — exceeds 20.0 threshold

     alert_rule_test:
       - alertname: DataGovUkHighTrafficRate
@@ -43,13 +43,13 @@ tests:
               severity: warning
               destination: slack-datagovuk-technical
             exp_annotations:
-              summary: data.gov.uk origin request rate exceeds 80% of load-tested capacity
+              summary: data.gov.uk origin request rate exceeds 80% of rightsized capacity
               description: >-
                 The request rate to the data.gov.uk origin has exceeded 80% of the
-                peak capacity established by load testing (55.8 req/s at 100 virtual users).
+                rightsized capacity target (25 req/s, validated by load testing at 45 virtual users).

-                Current rate: 50 req/s
-                Threshold (80% of 55.8 req/s): 44.6 req/s
+                Current rate: 22 req/s
+                Threshold (80% of 25 req/s): 20.0 req/s

-                Consider scaling the find and CKAN pods before traffic reaches 100% capacity.
+                Consider scaling the find and CKAN pods before traffic reaches 100% capacity (25 req/s).
               runbook_url: https://docs.publishing.service.gov.uk/manual/alerts/data-gov-uk-high-traffic-alert.html
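
The timing described in the Test 2 comments (condition true at t=5m, alert firing at t=10m) can be sketched roughly as below. This is a simplified model, not Prometheus itself: `simple_rate` divides the counter increase over the samples available in the window by the window length and ignores the extrapolation that the real `rate()` performs, and all function names are hypothetical.

```python
def expand_series(start: int, step: int, count: int) -> list[int]:
    """Expand promtool's 'start+stepxcount' notation: the start value plus `count` increments."""
    return [start + step * i for i in range(count + 1)]


def simple_rate(values: list[int], t: int, window: int = 5, interval_s: int = 60) -> float:
    """Counter increase over the last `window` minutes (or since t=0), per second of window."""
    first = max(0, t - window)
    return (values[t] - values[first]) / (window * interval_s)


def first_firing_minute(values: list[int], threshold: float = 20.0, for_minutes: int = 5):
    """Return the first minute at which the condition has held for `for_minutes`, else None."""
    active_since = None
    for t in range(1, len(values)):
        if simple_rate(values, t) > threshold:
            if active_since is None:
                active_since = t  # alert becomes pending
            if t - active_since >= for_minutes:
                return t          # alert fires
        else:
            active_since = None   # condition broken, pending state resets
    return None


print(first_firing_minute(expand_series(0, 1320, 30)))  # Test 2 input, 22 req/s
print(first_firing_minute(expand_series(0, 600, 30)))   # Test 1 input, 10 req/s
```

Under these assumptions the 1320/min series first exceeds the threshold at minute 5 and fires at minute 10, while the 600/min series never exceeds it, which matches the expectations encoded in the two tests.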
