## v0.4.0

Upgrade notes:

* `node_id` argument change: The `nodes` array now requires a `node_id` parameter. This ID is used to identify and name the node in the cluster and for naming the AWS resources.
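As a hedged sketch of the new argument (only `nodes` and `node_id` are taken from the upgrade note above; the module source and any other attributes are hypothetical placeholders):

```hcl
module "cluster" {
  source = "..." # placeholder: substitute the actual module source

  # Each node entry now requires a node_id; it identifies the node in
  # the cluster and is used when naming the per-node AWS resources.
  nodes = [
    { node_id = "node-1" },
    { node_id = "node-2" },
    { node_id = "node-3" },
  ]
}
```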
For a stateful application cluster, every node needs a **unique identity**. This identity is sometimes required to know which node is the leader and which nodes are followers; in other cases, it is required to know which node in the cluster holds which data. The node identity has to persist even when a node is destroyed and recreated. This is completely different from a stateless application, where it does not matter which node you are talking to, as all nodes are identical.
The node identity is generally provided with the help of **fixed node IPs** or **fixed hostnames**.
#### Cluster Quorum and rolling updates
For highly available clusters, the majority of nodes has to be running. This majority is called the quorum. For a cluster of n nodes, the quorum is n/2 + 1 nodes.

The cluster can remain available as long as at least a quorum of nodes is running; in other words, it can survive the crash of the nodes above the quorum size. For example, for a 3-node cluster the quorum size is 2, so it can survive 1 node crash. Similarly, a 5-node cluster can survive 2 simultaneous node crashes.
The cluster automation assumes a minimum cluster size of 3 so that a single node crash can be tolerated. This allows the automation to perform a rolling update one node at a time. Since the cluster can survive a single node failure, it remains fully operational while the rolling update is in progress.
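The quorum arithmetic above can be written down as Terraform locals (a sketch; the local names are illustrative and are not module inputs):

```hcl
locals {
  cluster_size = 5

  # Quorum: the majority of nodes, floor(n/2) + 1.
  quorum = floor(local.cluster_size / 2) + 1 # 3 for a 5-node cluster

  # Number of simultaneous node crashes the cluster can survive.
  tolerated_failures = local.cluster_size - local.quorum # 2
}
```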
For a stateful application, every decision, such as rolling updates or the replacement of unhealthy nodes, should be taken at the cluster level rather than locally at the node level.
### Module challenges
1. To solve the **identity problem** of the nodes, an external ENI is created per node. This ENI is then attached to every node launch template. The ENI retains its IP address even when the node is recreated, so the replacement instance resumes the same IP address and hence the same identity in the cluster.
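A minimal sketch of the first half of this pattern with the standard AWS provider resource (the variable, IP, and name values are hypothetical; the module's actual resource layout may differ). The module then references this ENI from the node's launch template so the instance comes up with the fixed address:

```hcl
# One ENI per node, created outside the instance lifecycle so that the
# IP address survives instance replacement.
resource "aws_network_interface" "node" {
  subnet_id   = var.subnet_id # assumed variable
  private_ips = ["10.0.1.10"] # the node's fixed IP (illustrative)

  tags = {
    Name = "cluster-node-1" # illustrative node_id-based name
  }
}
```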
The module will create cluster nodes depending on the cluster size requested.
The module mainly creates the following resources per node (as shown in the image above):

- Auto-scaling group (ASG) with min and max size set to 1.
- External Elastic Network Interface (ENI) with the IP address as requested.
- Elastic Block Storage (EBS) volume as requested.
- Launch template with a user data script to mount the EBS volume and to perform health checks (the user data script has two parts: one part has to be provided by the module user; the other part is maintained within the module and calls the user-provided script).
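The first resource in this list can be sketched as follows (assuming a launch template named `aws_launch_template.node` and a `var.subnet_id` variable, both illustrative):

```hcl
# One ASG per node with min = max = 1: the group never scales, it only
# replaces the single node if its instance is terminated.
resource "aws_autoscaling_group" "node" {
  name                = "cluster-node-1" # illustrative node_id-based name
  min_size            = 1
  max_size            = 1
  desired_capacity    = 1
  vpc_zone_identifier = [var.subnet_id]

  launch_template {
    id      = aws_launch_template.node.id
    version = "$Latest"
  }
}
```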
Apart from the above resources, the module also includes a rolling update script to update the cluster nodes one at a time.
When a cluster node is created or replaced (due to modifications), the ASG lifecycle hook puts the node in a `Pending:Wait` state. The instance remains in this state until the lifecycle action is marked as complete (with CONTINUE). At the end of the module user data script, the complete-lifecycle-action command is called on the instance lifecycle hook to complete the instance startup process. The user data script also performs a cluster health check to ensure that the node has joined the cluster successfully (this cluster health check function has to be provided by the module user and is called from the module user data script). The cluster health check happens before the lifecycle hook action.
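The tail end of this flow can be sketched as a user data fragment (the health-check path, hook name, and group name are illustrative placeholders; the `aws autoscaling complete-lifecycle-action` CLI call is the standard way to release a lifecycle hook):

```hcl
# Tail of the module user data: run the user-provided cluster health
# check, then release the lifecycle hook so the instance becomes InService.
locals {
  user_data_tail = <<-EOT
    #!/bin/bash
    set -euo pipefail

    # User-provided health check: must exit non-zero if the node failed
    # to join the cluster (runs before the hook is completed).
    /opt/cluster/health-check.sh   # illustrative path

    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    aws autoscaling complete-lifecycle-action \
      --lifecycle-action-result CONTINUE \
      --lifecycle-hook-name wait-for-cluster-join \
      --auto-scaling-group-name cluster-node-1 \
      --instance-id "$INSTANCE_ID"
  EOT
}
```

Because `set -e` aborts the script when the health check fails, the completion call is never reached and the instance stays in `Pending:Wait`, which is exactly the stuck state the rolling update script detects.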
This ensures that an instance is shown as `InService` only after the successful completion of the user data script and the cluster health check. The above-mentioned rolling update script waits for the instance to be `InService` before updating other instances in the cluster. The script will time out for any failed instance that is stuck in the `Pending:Wait` state due to a failure of the user data script (refer to the FAQs if this happens). That way, other cluster nodes are not updated with a failed change, preventing downtime (a single node failure generally does not cause cluster unavailability, due to quorum).
94
93
95
94
96
## FAQs
This should allow the instance to become `InService`. The rolling update script should eventually complete.
**3. I have more than one instance in a non-`InService` state. What should I do?**
Ideally, the cluster should never get into a state where multiple instances have failed, as this would cause the cluster to be unavailable.
If the issue has occurred during the terraform rolling update script, it can also be a bug in the script. Please report the issue.

If the failure has occurred at runtime (not during `terraform apply`), then instances should ideally be recovered automatically, unless the infrastructure has been manually changed in a way that causes the failure during instance recovery.
Try to follow FAQs 1 and 2 to debug and recover the infrastructure to the desired state.