Commit f9d3bab

Add module code, CI, makefile, etc.

Author: Charlie Chen
1 parent 650c38a, commit f9d3bab

8 files changed

Lines changed: 609 additions & 0 deletions

.travis.yml

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
addons:
  apt:
    packages:
    - git
    - make
    - curl

install:
- make init

script:
- make terraform/install
- make terraform/get-plugins
- make terraform/get-modules
- make terraform/lint
- make terraform/validate

Makefile

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
SHELL := /bin/bash
TERRAFORM_VERSION ?= 0.11.11

-include $(shell curl -sSL -o .build-harness "https://git.io/build-harness"; echo .build-harness)

## Lint terraform code
lint:
	$(SELF) terraform/install terraform/get-modules terraform/get-plugins terraform/lint terraform/validate

README.md

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
# terraform-aws-elasticsearch-cloudwatch-sns-alarms

[![Build Status](https://travis-ci.org/dubiety/terraform-aws-elasticsearch-cloudwatch-sns-alarms.svg?branch=master)](https://travis-ci.org/dubiety/terraform-aws-elasticsearch-cloudwatch-sns-alarms)
[![Latest Release](https://img.shields.io/github/release/dubiety/terraform-aws-elasticsearch-cloudwatch-sns-alarms.svg)](https://github.com/dubiety/terraform-aws-elasticsearch-cloudwatch-sns-alarms/releases)

Terraform module that configures important Elasticsearch alerts using CloudWatch and sends them to an SNS topic.

Create a set of sane Elasticsearch CloudWatch alerts for monitoring the health of an Elasticsearch cluster.

This project is inspired by [CloudPosse](https://github.com/cloudposse).

It's 100% Open Source and licensed under the [APACHE2](LICENSE).

## Usage

| area | metric | comparison operator | threshold | rationale |
|------|--------|---------------------|-----------|-----------|
| Sharding | ClusterStatus.red | `>=` | 1 | At least one primary shard and its replicas are not allocated to a node for 1 minute 1 consecutive time. See [Red Cluster Status](https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-handling-errors.html#aes-handling-errors-red-cluster-status). |
| Sharding | ClusterStatus.yellow | `>=` | 1 | At least one replica shard is not allocated to a node for 1 minute 1 consecutive time. See [Yellow Cluster Status](https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-handling-errors.html#aes-handling-errors-yellow-cluster-status). |
| Storage | FreeStorageSpace | `<=` | 20480 MB | A node in your cluster is down to 20 GiB of free storage space for 1 minute 1 consecutive time. See Lack of Available Storage Space. This value is in MiB, so rather than 20480, we recommend setting it to 25% of the storage space for each node. |
| Storage | ClusterIndexWritesBlocked | `>=` | 1 | The cluster is blocking write requests for 5 minutes 1 consecutive time. See [ClusterBlockException](https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-handling-errors.html#troubleshooting-cluster-block). |
| Node Count | Nodes | `<` | `x` | `x` is the number of nodes in your cluster. This alarm indicates that at least one node in your cluster has been unreachable for one day. |
| Snapshot | AutomatedSnapshotFailure | `>=` | 1 | An automated snapshot failed for 1 minute 1 consecutive time. This failure is often the result of a red cluster health status. See [Red Cluster Status](https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-handling-errors.html#aes-handling-errors-red-cluster-status). |
| CPU | CPUUtilization | `>=` | 80 % | CPU utilization average is >= 80% for 15 minutes, 3 consecutive times for the node cluster. |
| Memory | JVMMemoryPressure | `>=` | 80 % | JVMMemoryPressure maximum is >= 80% for 15 minutes, 1 consecutive time. |
| CPU | MasterCPUUtilization | `>=` | 80 % | Dedicated master nodes' CPU utilization is >= 80% for 15 minutes, 3 consecutive times. |
| Memory | MasterJVMMemoryPressure | `>=` | 80 % | Dedicated master nodes' maximum JVM memory usage is >= 80% for 15 minutes, 1 consecutive time. |

## Examples

See the [`examples/`](examples/) directory for working examples.

```hcl
resource "aws_elasticsearch_domain" "es" {
  domain_name           = "example"
  elasticsearch_version = "6.3"

  cluster_config {
    instance_type = "r4.large.elasticsearch"
  }

  snapshot_options {
    automated_snapshot_start_hour = 23
  }

  tags = {
    Domain = "TestDomain"
  }
}

module "es_alarms" {
  source      = "git::https://github.com/dubiety/terraform-aws-elasticsearch-cloudwatch-sns-alarms.git?ref=master"
  domain_name = "example"
}
```

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|:----:|:-----:|:-----:|
| alarm_name_postfix | Alarm name postfix | string | `""` | no |
| alarm_name_prefix | Alarm name prefix | string | `""` | no |
| cpu_utilization_threshold | The maximum percentage of CPU utilization | string | `80` | no |
| domain_name | The Elasticsearch domain name you want to monitor. | string | - | yes |
| free_storage_space_threshold | The minimum amount of available storage space in bytes. | string | `21474836480` | no |
| jvm_memory_pressure_threshold | The maximum percentage of the Java heap used for all data nodes in the cluster | string | `80` | no |
| master_cpu_utilization_threshold | The maximum percentage of CPU utilization of master nodes | string | `""` | no |
| master_jvm_memory_pressure_threshold | The maximum percentage of the Java heap used for master nodes in the cluster | string | `""` | no |
| min_available_nodes | The minimum available (reachable) nodes to have | string | `1` | no |
| monitor_automated_snapshot_failure | Enable monitoring of automated snapshot failure | string | `true` | no |
| monitor_cluster_index_writes_blocked | Enable monitoring of cluster index writes being blocked | string | `true` | no |
| monitor_cluster_status_is_red | Enable monitoring of the cluster status being red | string | `true` | no |
| monitor_cluster_status_is_yellow | Enable monitoring of the cluster status being yellow | string | `true` | no |
| monitor_cpu_utilization_too_high | Enable monitoring of CPU utilization being too high | string | `true` | no |
| monitor_free_storage_space_too_low | Enable monitoring of cluster average free storage being too low | string | `true` | no |
| monitor_insufficient_available_nodes | Enable monitoring of insufficient available nodes | string | `false` | no |
| monitor_jvm_memory_pressure_too_high | Enable monitoring of JVM memory pressure being too high | string | `true` | no |
| monitor_master_cpu_utilization_too_high | Enable monitoring of CPU utilization of master nodes being too high. Only enable this when dedicated master is enabled | string | `false` | no |
| monitor_master_jvm_memory_pressure_too_high | Enable monitoring of JVM memory pressure of master nodes being too high. Only enable this when dedicated master is enabled | string | `false` | no |
| sns_topic | SNS topic you want to specify. If left empty, it will use a prefix with a timestamp appended | string | `""` | no |
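
As a rough sketch of how the inputs above combine, the module call below tightens the data-node thresholds and switches on the dedicated-master alarms. It uses Terraform 0.11 syntax; the `prod-` prefix and the numeric thresholds are illustrative assumptions, not recommended values.

```hcl
module "es_alarms" {
  source      = "git::https://github.com/dubiety/terraform-aws-elasticsearch-cloudwatch-sns-alarms.git?ref=master"
  domain_name = "example"

  # Hypothetical prefix, chosen only for illustration.
  alarm_name_prefix = "prod-"

  # Alarm earlier than the 80% defaults (illustrative values).
  cpu_utilization_threshold     = "70"
  jvm_memory_pressure_threshold = "75"

  # Only meaningful when the domain runs dedicated master nodes.
  monitor_master_cpu_utilization_too_high     = "true"
  monitor_master_jvm_memory_pressure_too_high = "true"
  master_cpu_utilization_threshold            = "60"

  # master_jvm_memory_pressure_threshold is left unset here, so the
  # coalesce() in alarms.tf falls back to jvm_memory_pressure_threshold.
}
```

The `monitor_*` flags are passed as strings because the inputs are typed `string` and are used directly as the `count` of each alarm resource.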

## Outputs

| Name | Description |
|------|-------------|
| sns_topic_arn | The ARN of the SNS topic |
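
One way to consume the `sns_topic_arn` output is to attach a subscription to it. The sketch below assumes the module is instantiated as `es_alarms` (as in the example above) and that an HTTPS webhook receives the notifications; the resource name and the endpoint URL are hypothetical placeholders.

```hcl
resource "aws_sns_topic_subscription" "es_alarms_webhook" {
  # ARN exported by the module instance named "es_alarms".
  topic_arn = "${module.es_alarms.sns_topic_arn}"
  protocol  = "https"

  # Hypothetical endpoint; replace with the URL of your own receiver.
  endpoint = "https://alerts.example.com/sns"

  # Assumes the receiver confirms the subscription automatically.
  endpoint_auto_confirms = true
}
```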

## Share the Love

Like this project? Please give it a ★ on [our GitHub](https://github.com/dubiety/terraform-aws-elasticsearch-cloudwatch-sns-alarms)!

## Help

**Got a question?**

File a GitHub [issue](https://github.com/dubiety/terraform-aws-elasticsearch-cloudwatch-sns-alarms/issues).

### Bug Reports & Feature Requests

Please use the [issue tracker](https://github.com/dubiety/terraform-aws-elasticsearch-cloudwatch-sns-alarms/issues) to report any bugs or file feature requests.

## License

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

See [LICENSE](LICENSE) for full details.

Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.

alarms.tf

Lines changed: 210 additions & 0 deletions
@@ -0,0 +1,210 @@
# Clamp thresholds: sizes and counts are floored at 0, percentages are kept
# within 0-100, and master thresholds fall back to the data-node values.
locals {
  thresholds = {
    FreeStorageSpaceThreshold        = "${max(var.free_storage_space_threshold, 0)}"
    MinimumAvailableNodes            = "${max(var.min_available_nodes, 0)}"
    CPUUtilizationThreshold          = "${min(max(var.cpu_utilization_threshold, 0), 100)}"
    JVMMemoryPressureThreshold       = "${min(max(var.jvm_memory_pressure_threshold, 0), 100)}"
    MasterCPUUtilizationThreshold    = "${min(max(coalesce(var.master_cpu_utilization_threshold, var.cpu_utilization_threshold), 0), 100)}"
    MasterJVMMemoryPressureThreshold = "${min(max(coalesce(var.master_jvm_memory_pressure_threshold, var.jvm_memory_pressure_threshold), 0), 100)}"
  }
}

resource "aws_cloudwatch_metric_alarm" "cluster_status_is_red" {
  count               = "${var.monitor_cluster_status_is_red}"
  alarm_name          = "${var.alarm_name_prefix}ElasticSearch-ClusterStatusIsRed${var.alarm_name_postfix}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "ClusterStatus.red"
  namespace           = "AWS/ES"
  period              = "60"
  statistic           = "Average"
  threshold           = "1"
  alarm_description   = "Average elasticsearch cluster status is in red over last 1 minute"
  alarm_actions       = ["${local.aws_sns_topic_arn}"]
  ok_actions          = ["${local.aws_sns_topic_arn}"]

  dimensions {
    DomainName = "${var.domain_name}"
    ClientId   = "${data.aws_caller_identity.default.account_id}"
  }
}

resource "aws_cloudwatch_metric_alarm" "cluster_status_is_yellow" {
  count               = "${var.monitor_cluster_status_is_yellow}"
  alarm_name          = "${var.alarm_name_prefix}ElasticSearch-ClusterStatusIsYellow${var.alarm_name_postfix}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "ClusterStatus.yellow"
  namespace           = "AWS/ES"
  period              = "60"
  statistic           = "Average"
  threshold           = "1"
  alarm_description   = "Average elasticsearch cluster status is in yellow over last 1 minute"
  alarm_actions       = ["${local.aws_sns_topic_arn}"]
  ok_actions          = ["${local.aws_sns_topic_arn}"]

  dimensions {
    DomainName = "${var.domain_name}"
    ClientId   = "${data.aws_caller_identity.default.account_id}"
  }
}

resource "aws_cloudwatch_metric_alarm" "free_storage_space_too_low" {
  count               = "${var.monitor_free_storage_space_too_low}"
  alarm_name          = "${var.alarm_name_prefix}ElasticSearch-FreeStorageSpaceTooLow${var.alarm_name_postfix}"
  comparison_operator = "LessThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "FreeStorageSpace"
  namespace           = "AWS/ES"
  period              = "60"
  statistic           = "Average"
  threshold           = "${local.thresholds["FreeStorageSpaceThreshold"]}"
  alarm_description   = "Average elasticsearch free storage space over last 1 minute is too low"
  alarm_actions       = ["${local.aws_sns_topic_arn}"]
  ok_actions          = ["${local.aws_sns_topic_arn}"]

  dimensions {
    DomainName = "${var.domain_name}"
    ClientId   = "${data.aws_caller_identity.default.account_id}"
  }
}

resource "aws_cloudwatch_metric_alarm" "cluster_index_writes_blocked" {
  count               = "${var.monitor_cluster_index_writes_blocked}"
  alarm_name          = "${var.alarm_name_prefix}ElasticSearch-ClusterIndexWritesBlocked${var.alarm_name_postfix}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "ClusterIndexWritesBlocked"
  namespace           = "AWS/ES"
  period              = "300"
  statistic           = "Average"
  threshold           = "1"
  alarm_description   = "Elasticsearch index writes being blocked over last 5 minutes"
  alarm_actions       = ["${local.aws_sns_topic_arn}"]
  ok_actions          = ["${local.aws_sns_topic_arn}"]

  dimensions {
    DomainName = "${var.domain_name}"
    ClientId   = "${data.aws_caller_identity.default.account_id}"
  }
}

resource "aws_cloudwatch_metric_alarm" "insufficient_available_nodes" {
  count               = "${var.monitor_insufficient_available_nodes}"
  alarm_name          = "${var.alarm_name_prefix}ElasticSearch-InsufficientAvailableNodes${var.alarm_name_postfix}"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = "1"
  metric_name         = "Nodes"
  namespace           = "AWS/ES"
  period              = "86400"
  statistic           = "Minimum"
  threshold           = "${local.thresholds["MinimumAvailableNodes"]}"
  alarm_description   = "Elasticsearch nodes minimum < ${local.thresholds["MinimumAvailableNodes"]} for 1 day"
  alarm_actions       = ["${local.aws_sns_topic_arn}"]
  ok_actions          = ["${local.aws_sns_topic_arn}"]

  dimensions {
    DomainName = "${var.domain_name}"
    ClientId   = "${data.aws_caller_identity.default.account_id}"
  }
}

resource "aws_cloudwatch_metric_alarm" "automated_snapshot_failure" {
  count               = "${var.monitor_automated_snapshot_failure}"
  alarm_name          = "${var.alarm_name_prefix}ElasticSearch-AutomatedSnapshotFailure${var.alarm_name_postfix}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "AutomatedSnapshotFailure"
  namespace           = "AWS/ES"
  period              = "600"
  statistic           = "Maximum"
  threshold           = "1"
  alarm_description   = "Elasticsearch automated snapshot failed over last 10 minutes"
  alarm_actions       = ["${local.aws_sns_topic_arn}"]
  ok_actions          = ["${local.aws_sns_topic_arn}"]

  dimensions {
    DomainName = "${var.domain_name}"
    ClientId   = "${data.aws_caller_identity.default.account_id}"
  }
}

resource "aws_cloudwatch_metric_alarm" "cpu_utilization_too_high" {
  count               = "${var.monitor_cpu_utilization_too_high}"
  alarm_name          = "${var.alarm_name_prefix}ElasticSearch-CPUUtilizationTooHigh${var.alarm_name_postfix}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "3"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/ES"
  period              = "900"
  statistic           = "Average"
  threshold           = "${local.thresholds["CPUUtilizationThreshold"]}"
  alarm_description   = "Average elasticsearch cluster CPU utilization over last 45 minutes too high"
  alarm_actions       = ["${local.aws_sns_topic_arn}"]
  ok_actions          = ["${local.aws_sns_topic_arn}"]

  dimensions {
    DomainName = "${var.domain_name}"
    ClientId   = "${data.aws_caller_identity.default.account_id}"
  }
}

resource "aws_cloudwatch_metric_alarm" "jvm_memory_pressure_too_high" {
  count               = "${var.monitor_jvm_memory_pressure_too_high}"
  alarm_name          = "${var.alarm_name_prefix}ElasticSearch-JVMMemoryPressure${var.alarm_name_postfix}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "JVMMemoryPressure"
  namespace           = "AWS/ES"
  period              = "900"
  statistic           = "Maximum"
  threshold           = "${local.thresholds["JVMMemoryPressureThreshold"]}"
  alarm_description   = "Elasticsearch JVM memory pressure is too high over last 15 minutes"
  alarm_actions       = ["${local.aws_sns_topic_arn}"]
  ok_actions          = ["${local.aws_sns_topic_arn}"]

  dimensions {
    DomainName = "${var.domain_name}"
    ClientId   = "${data.aws_caller_identity.default.account_id}"
  }
}

resource "aws_cloudwatch_metric_alarm" "master_cpu_utilization_too_high" {
  count               = "${var.monitor_master_cpu_utilization_too_high}"
  alarm_name          = "${var.alarm_name_prefix}ElasticSearch-MasterCPUUtilizationTooHigh${var.alarm_name_postfix}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "3"
  metric_name         = "MasterCPUUtilization"
  namespace           = "AWS/ES"
  period              = "900"
  statistic           = "Average"
  threshold           = "${local.thresholds["MasterCPUUtilizationThreshold"]}"
  alarm_description   = "Average elasticsearch master node CPU utilization over last 45 minutes too high"
  alarm_actions       = ["${local.aws_sns_topic_arn}"]
  ok_actions          = ["${local.aws_sns_topic_arn}"]

  dimensions {
    DomainName = "${var.domain_name}"
    ClientId   = "${data.aws_caller_identity.default.account_id}"
  }
}

resource "aws_cloudwatch_metric_alarm" "master_jvm_memory_pressure_too_high" {
  count               = "${var.monitor_master_jvm_memory_pressure_too_high}"
  # Distinct from the data-node alarm name so the two alarms cannot collide.
  alarm_name          = "${var.alarm_name_prefix}ElasticSearch-MasterJVMMemoryPressure${var.alarm_name_postfix}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "MasterJVMMemoryPressure"
  namespace           = "AWS/ES"
  period              = "900"
  statistic           = "Maximum"
  threshold           = "${local.thresholds["MasterJVMMemoryPressureThreshold"]}"
  alarm_description   = "Elasticsearch master node JVM memory pressure is too high over last 15 minutes"
  alarm_actions       = ["${local.aws_sns_topic_arn}"]
  ok_actions          = ["${local.aws_sns_topic_arn}"]

  dimensions {
    DomainName = "${var.domain_name}"
    ClientId   = "${data.aws_caller_identity.default.account_id}"
  }
}
