
Commit 311bd0c

authored
Merge pull request #19 from AndrewFarley/revert-and-fix-node-disk-space-alarm
Reverting #16, adding information and formatting README, adding new alarm for SUM low disk, standardizing variable names
2 parents 7f45ffe + a43d6ad commit 311bd0c

3 files changed

Lines changed: 185 additions & 130 deletions


README.md

Lines changed: 58 additions & 50 deletions
@@ -18,7 +18,8 @@ It's 100% Open Source and licensed under the [APACHE2](LICENSE).
1818
|------------|---------------------------|----------|-----------|----------------------------------------------------------------------------------------------------------------------------------------|
1919
| Sharding | ClusterStatus.red | `>=` | 1 | At least one primary shard and its replicas are not allocated to a node |
2020
| Sharding | ClusterStatus.yellow | `>=` | 1 | At least one replica shard is not allocated to a node |
21-
| Storage | FreeStorageSpace | `<=` | 20480 MB | A node in your cluster is down to low storage space. |
21+
| Storage | FreeStorageSpace | `<=` | 20480 MB | A node in your cluster is down to low storage space. Note, this alarm uses the aggregate `Minimum` which means this alarm triggers per-node in your cluster. This logic is based-on the [AWS Recommended Alarms](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/cloudwatch-alarms.html). It does not however alarm based on an aggregate of free space remaining. |
22+
| Storage | FreeStorageSpaceTotal | `<=` | 20480 MB | The overall disk space free is low. This alarm uses `Sum` across all your nodes, this can be useful on multi-node clusters. Disabled by default, to enable this you must set `monitor_free_storage_space_total_too_low` to true, and `free_storage_space_total_threshold`. Recommended to set the threshold to the number of nodes in your cluster multiplied by the free_storage_space_threshold |
2223
| Storage | ClusterIndexWritesBlocked | `>=` | 1 | Your cluster is blocking write requests. |
2324
| Node Count | Nodes | `<` | `x` | This alarm indicates that at least one node in your cluster has been unreachable for one day |
2425
| Snapshot | AutomatedSnapshotFailure | `>=` | 1 | An automated snapshot failed. This failure is often the result of a red cluster health status. |
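The `FreeStorageSpaceTotal` row above recommends a threshold equal to the node count multiplied by `free_storage_space_threshold`. A minimal sketch of that arithmetic in a module call follows. The `source` path, the domain name, the three-node count, and the `locals` names are illustrative assumptions, not values taken from this repository.

```hcl
locals {
  # Hypothetical values for illustration only.
  data_node_count        = 3     # assumed number of data nodes in the cluster
  per_node_threshold_mib = 20480 # matches the documented default of free_storage_space_threshold
}

module "es_alarms" {
  # Placeholder: replace with wherever you actually obtain this module.
  source = "./path/to/this/module"

  domain_name = "example-domain" # hypothetical domain name

  # Per-node alarm (statistic: Minimum), enabled by default.
  free_storage_space_threshold = local.per_node_threshold_mib

  # Cluster-wide alarm (statistic: Sum), disabled by default.
  monitor_free_storage_space_total_too_low = true
  free_storage_space_total_threshold       = local.data_node_count * local.per_node_threshold_mib # 3 * 20480 = 61440
}
```

With these assumed values the per-node alarm fires when any single node reports 20480 MiB or less free, while the total alarm fires when the free space summed across all three nodes is 61440 MiB or less.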
@@ -79,55 +80,62 @@ module "es_alarms" {
7980

8081
## Inputs
8182

82-
| Name | Description | Type | Default | Required |
83-
|-----------------------------------------------|-------------|:----:|:-------:|:--------:|
84-
| `domain_name` | The Elasticserach domain name you want to monitor. | string | - | yes |
85-
| `cluster_type` | The type of cluster, single or multi-node | string | `"single"` | no |
86-
| `monitor_cluster_status_is_red_periods` | The number of periods to alert that cluster status is red, raise this to be less noisy | number | `1` | no |
87-
| `alarm_cluster_status_is_yellow_periods` | The number of periods before triggering the cluster status is yellow, raise this to be less noisy | number | `1` | no |
88-
| `alarm_free_storage_space_too_low_periods` | The number of periods before triggering the disk space is low, raise this to be less noisy | number | `1` | no |
89-
| `monitor_cluster_index_writes_blocked_periods` | The number of periods to alert that cluster index writes are blocked, raise this if desired to make less noisy | number | `1` | no |
90-
| `monitor_min_available_nodes_periods` | The number of periods to alert that minimum number of available nodes dropped below a threshold, raise this if desired to make less noisy | number | `1` | no |
91-
| `monitor_automated_snapshot_failure_periods` | The number of periods to alert that automatic snapshots failed, raise this if desired to make less noisy | number | `1` | no |
92-
| `monitor_cpu_utilization_too_high_periods` | The number of periods to alert that CPU usage is too high, raise this if desired to make less noisy | number | `3` | no |
93-
| `monitor_jvm_memory_pressure_too_high_periods` | The number of periods which it must be in the alarmed state to alert, raise this if desired to make less noisy | number | `1` | no |
94-
| `monitor_master_cpu_utilization_too_high_periods` | The number of periods to alert that masters CPU usage is too high, raise this if desired to make less noisy | number | `3` | no |
95-
| `monitor_master_jvm_memory_pressure_too_high_periods` | The number of periods which it must be in the alarmed state to alert, raise this if desired to make less noisy | number | `1` | no |
96-
| `monitor_kms_periods` | The number of periods to alert that kms has failed, raise this if desired to make less noisy | number | `1` | no |
97-
| `alarm_name_postfix` | Alarm name postfix | string | `""` | no |
98-
| `alarm_name_prefix` | Alarm name prefix | string | `""` | no |
99-
| `cpu_utilization_threshold` | The maximum percentage of CPU utilization | string | `80` | no |
100-
| `free_storage_space_threshold` | The minimum amount of available storage space in MiB. | string | `20480` | no |
101-
| `jvm_memory_pressure_threshold` | The maximum percentage of the Java heap used for all data nodes in the cluster | string | `80` | no |
102-
| `master_cpu_utilization_threshold` | The maximum percentage of CPU utilization of master nodes | string | `""` | no |
103-
| `master_jvm_memory_pressure_threshold` | The maximum percentage of the Java heap used for master nodes in the cluster | string | `""` | no |
104-
| `min_available_nodes` | The minimum available (reachable) nodes to have, set to non-zero to enable alarm | string | `0` | no |
105-
| `monitor_automated_snapshot_failure` | Enable monitoring of automated snapshot failure | bool | `true` | no |
106-
| `monitor_cluster_index_writes_blocked` | Enable monitoring of cluster index writes being blocked | bool | `true` | no |
107-
| `monitor_cluster_status_is_red` | Enable monitoring of cluster status is in red | bool | `true` | no |
108-
| `monitor_cluster_status_is_yellow` | Enable monitoring of cluster status is in yellow | bool | `true` | no |
109-
| `monitor_cpu_utilization_too_high` | Enable monitoring of CPU utilization is too high | bool | `true` | no |
110-
| `monitor_free_storage_space_too_low` | Enable monitoring of cluster average free storage is to low | bool | `true` | no |
111-
| `monitor_jvm_memory_pressure_too_high` | Enable monitoring of JVM memory pressure is too high | bool | `true` | no |
112-
| `monitor_kms` | Enable monitoring of KMS-related metrics, enable if using KMS | bool | `false` | no |
113-
| `monitor_master_cpu_utilization_too_high` | Enable monitoring of CPU utilization of master nodes are too high. Only enable this when dedicated master is enabled | bool | `false` | no |
114-
| `monitor_master_jvm_memory_pressure_too_high` | Enable monitoring of JVM memory pressure of master nodes are too high. Only enable this wwhen dedicated master is enabled | bool | `false` | no |
115-
| `monitor_min_available_nodes_period` | The period of the minimum available nodes should the statistics be applied in seconds | string | `86400` | no |
116-
| `monitor_automated_snapshot_failure_period` | The period of the automated snapshot failure should the statistics be applied in seconds | string | `60` | no |
117-
| `monitor_cluster_index_writes_blocked_period` | The period of the cluster index writes being blocked should the statistics be applied in seconds | string | `300` | no |
118-
| `monitor_cluster_status_is_red_period` | The period of the cluster status is in red should the statistics be applied in seconds | string | `60` | no |
119-
| `monitor_cluster_status_is_yellow_period` | The period of the cluster status is in yellow should the statistics be applied in seconds | string | `60` | no |
120-
| `monitor_cpu_utilization_too_high_period` | The period of the CPU utilization is too high should the statistics be applied in seconds | string | `900` | no |
121-
| `monitor_free_storage_space_too_low_period` | The period of the cluster average free storage is too low should the statistics be applied in seconds | string | `60` | no |
122-
| `monitor_jvm_memory_pressure_too_high_period` | The period of the JVM memory pressure is too high should the statistics be applied in seconds | string | `900` | no |
123-
| `monitor_kms_period` | The period of the KMS-related metrics should the statistics be applied in seconds | string | `60` | no |
124-
| `monitor_master_cpu_utilization_too_high_period` | The period of the CPU utilization of master nodes are too high should the statistics be applied in seconds | string | `900` | no |
125-
| `monitor_master_jvm_memory_pressure_too_high_period` | The period of the JVM memory pressure of master nodes are too high should the statistics be applied in seconds | string | `900` | no |
126-
| `create_sns_topic` | Will create an SNS topic, if you set this to false you MUST set `sns_topic` to a FULL ARN | bool | `true` | no |
127-
| `sns_topic` | SNS topic you want to specify. If leave empty, it will use a prefix and a timestamp appended. If `create_sns_topic` is set to false, this MUST be a FULL ARN | string | `""` | no |
128-
| `sns_topic_postfix` | SNS topic postfix | string | `""` | no |
129-
| `sns_topic_prefix` | SNS topic prefix | string | `""` | no |
130-
| `tags` | Tags to associate with all created resources | map | `{}` | no |
83+
| Name | Description | Type | Default | Required |
84+
|------------------------------------------------------|-------------|:----:|:-------:|:--------:|
85+
| `domain_name` | The Elasticserach domain name you want to monitor. | string | - | yes |
86+
| `cluster_type` | The type of cluster, single or multi-node | string | `"single"` | no |
87+
| `alarm_name_postfix` | Alarm name postfix | string | `""` | no |
88+
| `alarm_name_prefix` | Alarm name prefix | string | `""` | no |
89+
| `create_sns_topic` | Will create an SNS topic, if you set this to false you MUST set `sns_topic` to a FULL ARN | bool | `true` | no |
90+
| `sns_topic` | SNS topic you want to specify. If leave empty, it will use a prefix and a timestamp appended. If `create_sns_topic` is set to false, this MUST be a FULL ARN | string | `""` | no |
91+
| `sns_topic_postfix` | SNS topic postfix | string | `""` | no |
92+
| `sns_topic_prefix` | SNS topic prefix | string | `""` | no |
93+
| `tags` | Tags to associate with all created resources | map | `{}` | no |
94+
| `cpu_utilization_threshold` | The maximum percentage of CPU utilization | string | `80` | no |
95+
| `free_storage_space_threshold` | The minimum amount of available storage space in MiB. | string | `20480` | no |
96+
| `jvm_memory_pressure_threshold` | The maximum percentage of the Java heap used for all data nodes in the cluster | string | `80` | no |
97+
| `master_cpu_utilization_threshold` | The maximum percentage of CPU utilization of master nodes | string | `""` | no |
98+
| `master_jvm_memory_pressure_threshold` | The maximum percentage of the Java heap used for master nodes in the cluster | string | `""` | no |
99+
| `min_available_nodes` | The minimum available (reachable) nodes to have, set to non-zero to enable alarm | string | `0` | no |
100+
101+
| `monitor_automated_snapshot_failure` | Enable monitoring of automated snapshot failure | bool | `true` | no |
102+
| `monitor_cluster_status_is_red` | Enable monitoring of cluster status is in red | bool | `true` | no |
103+
| `monitor_cluster_status_is_yellow` | Enable monitoring of cluster status is in yellow | bool | `true` | no |
104+
| `monitor_cluster_index_writes_blocked` | Enable monitoring of cluster index writes being blocked | bool | `true` | no |
105+
| `monitor_cpu_utilization_too_high` | Enable monitoring of CPU utilization is too high | bool | `true` | no |
106+
| `monitor_free_storage_space_too_low` | Enable monitoring of minimum per-node free storage is too low | bool | `true` | no |
107+
| `monitor_free_storage_space_total_too_low` | Enable monitoring of cluster total free storage is too low | bool | `false` | no |
108+
| `monitor_jvm_memory_pressure_too_high` | Enable monitoring of JVM memory pressure is too high | bool | `true` | no |
109+
| `monitor_kms` | Enable monitoring of KMS-related metrics, enable if using KMS | bool | `false` | no |
110+
| `monitor_master_cpu_utilization_too_high` | Enable monitoring of CPU utilization of master nodes are too high. Only enable this when dedicated master is enabled | bool | `false` | no |
111+
| `monitor_master_jvm_memory_pressure_too_high` | Enable monitoring of JVM memory pressure of master nodes are too high. Only enable this wwhen dedicated master is enabled | bool | `false` | no |
112+
| `monitor_min_available_nodes` | Enable monitoring of minimum available nodes | bool | `true` | no |
113+
114+
| `alarm_automated_snapshot_failure_periods` | The number of periods to alert that automatic snapshots failed, raise this if desired to make less noisy | number | `1` | no |
115+
| `alarm_cluster_status_is_red_periods` | The number of periods to alert that cluster status is red, raise this to be less noisy | number | `1` | no |
116+
| `alarm_cluster_status_is_yellow_periods` | The number of periods before triggering the cluster status is yellow, raise this to be less noisy | number | `1` | no |
117+
| `alarm_cluster_index_writes_blocked_periods` | The number of periods to alert that cluster index writes are blocked, raise this if desired to make less noisy | number | `1` | no |
118+
| `alarm_cpu_utilization_too_high_periods` | The number of periods to alert that CPU usage is too high, raise this if desired to make less noisy | number | `3` | no |
119+
| `alarm_free_storage_space_too_low_periods` | The number of periods before triggering the disk space is low, raise this to be less noisy | number | `1` | no |
120+
| `alarm_free_storage_space_total_too_low_periods` | The number of periods before triggering the total disk space is low, raise this to be less noisy | number | `1` | no |
121+
| `alarm_jvm_memory_pressure_too_high_periods` | The number of periods which it must be in the alarmed state to alert, raise this if desired to make less noisy | number | `1` | no |
122+
| `alarm_kms_periods` | The number of periods to alert that kms has failed, raise this if desired to make less noisy | number | `1` | no |
123+
| `alarm_master_cpu_utilization_too_high_periods` | The number of periods to alert that masters CPU usage is too high, raise this if desired to make less noisy | number | `3` | no |
124+
| `alarm_master_jvm_memory_pressure_too_high_periods` | The number of periods which it must be in the alarmed state to alert, raise this if desired to make less noisy | number | `1` | no |
125+
| `alarm_min_available_nodes_periods` | The number of periods to alert that minimum number of available nodes dropped below a threshold, raise this if desired to make less noisy | number | `1` | no |
126+
127+
| `alarm_min_available_nodes_period` | The period of the minimum available nodes should the statistics be applied in seconds | string | `86400` | no |
128+
| `alarm_automated_snapshot_failure_period` | The period of the automated snapshot failure should the statistics be applied in seconds | string | `60` | no |
129+
| `alarm_cluster_index_writes_blocked_period` | The period of the cluster index writes being blocked should the statistics be applied in seconds | string | `300` | no |
130+
| `alarm_cluster_status_is_red_period` | The period of the cluster status is in red should the statistics be applied in seconds | string | `60` | no |
131+
| `alarm_cluster_status_is_yellow_period` | The period of the cluster status is in yellow should the statistics be applied in seconds | string | `60` | no |
132+
| `alarm_cpu_utilization_too_high_period` | The period of the CPU utilization is too high should the statistics be applied in seconds | string | `900` | no |
133+
| `alarm_free_storage_space_too_low_period` | The period of the per-node minimum free storage is too low should the statistics be applied in seconds | string | `60` | no |
134+
| `alarm_free_storage_space_total_too_low_period` | The period of the cluster total free storage is too low should the statistics be applied in seconds | string | `60` | no |
135+
| `alarm_jvm_memory_pressure_too_high_period` | The period of the JVM memory pressure is too high should the statistics be applied in seconds | string | `900` | no |
136+
| `alarm_kms_period` | The period of the KMS-related metrics should the statistics be applied in seconds | string | `60` | no |
137+
| `alarm_master_cpu_utilization_too_high_period` | The period of the CPU utilization of master nodes are too high should the statistics be applied in seconds | string | `900` | no |
138+
| `alarm_master_jvm_memory_pressure_too_high_period` | The period of the JVM memory pressure of master nodes are too high should the statistics be applied in seconds | string | `900` | no |
131139

132140
## Outputs
133141
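The Inputs hunk above renames the `monitor_*_periods` and `monitor_*_period` variables to `alarm_*_periods` and `alarm_*_period`, while the boolean `monitor_*` toggles keep their names. A caller that set the old names would presumably need to update them. The sketch below shows what such a call might look like after the rename; the `source` path, the domain name, and the chosen values are assumptions for illustration.

```hcl
module "es_alarms" {
  # Placeholder: replace with wherever you actually obtain this module.
  source = "./path/to/this/module"

  domain_name = "example-domain" # hypothetical domain name

  # Boolean toggles keep the monitor_ prefix.
  monitor_cpu_utilization_too_high = true

  # Formerly monitor_cpu_utilization_too_high_periods.
  alarm_cpu_utilization_too_high_periods = 3

  # Formerly monitor_cpu_utilization_too_high_period (seconds).
  alarm_cpu_utilization_too_high_period = "900"

  # Formerly monitor_cluster_status_is_red_periods; raised here to reduce noise.
  alarm_cluster_status_is_red_periods = 2
}
```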
