Commit a1a872b

Disable SNS notifications for alarm that was noisier than intended
DTADFIC, a queue of 100 items can sometimes not send any work to hadoop
1 parent 5a84913 commit a1a872b

1 file changed

+6
-1
lines changed

modules/imputation-server/monitoring.tf

Lines changed: 6 additions & 1 deletion
@@ -1,3 +1,8 @@
+/// This alarm is a useful idea in theory, but it's noisy, and doesn't operate on the timescales we need
+// It also can't account for the dual-queue design of cloudgene (the "active" vs "queued" feature): if 15 jobs
+// are exporting, then even though there are 100 jobs in queue, hadoop won't be sent work, and will signal "all clear"
+// No amount of alarm cleverness can compensate for a webapp that hides information from the system, which makes it hard to fix alarm just from the AWS side.
+// We'll keep the alarm defined and tracking metrics, in case it aids future capacity planning. But it won't send alerts.
 resource "aws_cloudwatch_metric_alarm" "cluster_needs_resources" {
   # Warn if the system is unable to scale enough, after several hours of trying. Resolved by:
   # a) add spot capacity (if we're blitzed with lots of jobs),
@@ -13,7 +18,7 @@ resource "aws_cloudwatch_metric_alarm" "cluster_needs_resources" {
   datapoints_to_alarm = 24
   evaluation_periods = 24
 
-  actions_enabled = true
+  actions_enabled = false # Don't send alerts- see notes.
 
   # Notify when the alarm changes state- for good or bad
   alarm_actions = [var.alert_sns_arn]
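
For readers unfamiliar with `actions_enabled`, the effect of the change can be sketched as a standalone resource. This is a minimal illustration, not the actual contents of monitoring.tf: the alarm name, namespace, metric name, statistic, comparison operator, threshold, and period are all hypothetical placeholders, since the diff does not show them. Only `datapoints_to_alarm`, `evaluation_periods`, `actions_enabled`, and `alarm_actions` come from the commit.

```hcl
# Illustrative sketch only. Values marked "hypothetical" or "assumed" are
# placeholders and do not come from the commit.
variable "alert_sns_arn" {
  type        = string
  description = "SNS topic ARN that would receive alarm notifications."
}

resource "aws_cloudwatch_metric_alarm" "cluster_needs_resources" {
  alarm_name          = "cluster-needs-resources"  # hypothetical
  namespace           = "Example/ImputationServer" # hypothetical
  metric_name         = "PendingWork"              # hypothetical
  statistic           = "Maximum"                  # assumed
  comparison_operator = "GreaterThanThreshold"     # assumed
  threshold           = 0                          # assumed
  period              = 600                        # assumed: 24 periods of 10 minutes = 4 hours

  # From the commit: all 24 of the last 24 datapoints must breach.
  datapoints_to_alarm = 24
  evaluation_periods  = 24

  # From the commit: the alarm keeps evaluating and changing state, so its
  # history stays available, but CloudWatch does not invoke the actions below.
  actions_enabled = false

  # Left in place so that re-enabling alerts is a one-line change.
  alarm_actions = [var.alert_sns_arn]
}
```

Keeping `alarm_actions` populated while setting `actions_enabled = false` means the SNS wiring is preserved; flipping the flag back to `true` would restore notifications without touching anything else.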

0 commit comments