
Commit a79fd00

Merge pull request #14 from prashantkalkar/ReleasePrep
Release prep
2 parents: ff1443b + 682f80e

File tree: 2 files changed (+10 −8 lines)


CHANGELOG.md (+1 −1)

```diff
@@ -1,4 +1,4 @@
-## Unreleased
+## v0.4.0
 
 Upgrade notes:
 * node_id argument change: The nodes array now requires a node_id parameter. This id is used to identify and name the node in the cluster and for naming the AWS resources.
```
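For illustration, a minimal `nodes` entry under the new scheme might look like this; all values here are placeholders, and `node-1` is only an example id:

```hcl
module "cluster" {
  # ... other module arguments ...

  nodes = [
    {
      node_id             = "node-1" # new in v0.4.0: a unique, stable id per node
      node_ip             = "10.0.1.10"
      node_subnet_id      = "subnet-0123456789abcdef0"
      node_files_toupload = [filebase64("${path.module}/config_file.cfg")]
    },
  ]
}
```

Since `node_id` identifies the node and names its AWS resources, it should be kept stable across applies.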

README.md (+9 −7)

```diff
@@ -22,16 +22,19 @@ module "cluster" {
   nodes = [
     {
       node_ip = "<InstanceIPToBeAllocated>"
+      node_id = "<NodeId>" # should be unique
       node_subnet_id = "<subnet_id>"
       node_files_toupload = [filebase64("${path.module}/config_file.cfg")]
     },
     {
       node_ip = "<InstanceIPToBeAllocated>"
+      node_id = "<NodeId>"
       node_subnet_id = "<subnet_id>"
       node_files_toupload = [filebase64("${path.module}/config_file.cfg")]
     },
     {
       node_ip = "<InstanceIPToBeAllocated>"
+      node_id = "<NodeId>"
       node_subnet_id = "<subnet_id>"
       node_files_toupload = [filebase64("${path.module}/config_file.cfg")]
     }
```
````diff
@@ -48,22 +51,21 @@ module "cluster" {
     type = "gp3"
     mount_params = ["noatime"]
   }
-  node_image = "<ami_id>"
 }
 ```
 
 ## Why this module exists
 ### A bit about stateful application
 
 #### Node identity
-For an stateful application cluster, every node needs to have an **unique identity**. This is required sometime to know which node is the leader and which nodes are followers. In other cases it is required to know which node in the cluster has what data. The node identity has to persist even when node is destroyed and recreated. This is completely different that stateless application when it does not matter which node you are talking to, as all nodes are identical.
+For a stateful application cluster, every node needs to have a **unique identity**. This is sometimes required to know which node is the leader and which nodes are followers; in other cases it is required to know which node in the cluster holds which data. The node identity has to persist even when a node is destroyed and recreated. This is completely different from a stateless application, where it does not matter which node you are talking to, as all nodes are identical.
 The node identity is generally provided with the help of **fixed node IPs** or **fixed hostnames**.
 
 #### Cluster Quorum and rolling updates
-For highly available clusters, majority of nodes has to be running. This majority is called as quorum. For an cluster of n nodes, the quorum is represented by n/2 + 1 nodes.
+For highly available clusters, a majority of nodes has to be running to provide high availability. This majority is called the quorum. For a cluster of n nodes, the quorum is n/2 + 1 nodes.
 The cluster can remain available as long as at least a quorum of nodes is running. In other words, the cluster can survive crashes of any nodes above the quorum size. For example, for a 3-node cluster the quorum size is 2, so it can survive 1 node crash. Similarly, a 5-node cluster can survive 2 simultaneous node crashes.
 Any cluster automation assumes a minimum cluster size of 3 to allow a single node crash. This allows the automation to perform a rolling update one node at a time. Since the cluster can survive a single node crash, it remains fully operational while the rolling update is performed.
-For stateful application every decision like rolling updates or replacement of unhealthy nodes etc should be taken at cluster level rather than locally at node level.
+For a stateful application, every decision (rolling updates, replacement of unhealthy nodes, etc.) should be taken at the cluster level rather than locally at the node level.
 
 ### Module challenges
 1. To solve the **identity problem** of the nodes, an external ENI is created. This ENI is attached to every node's launch template. The ENI retains its IP address even when the node is recreated, so the replacement instance resumes the same IP address and hence the same identity in the cluster.
````
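The ENI-based identity approach described in point 1 can be sketched roughly as follows. This is an illustrative fragment, not the module's actual source; all resource and variable names are assumptions:

```hcl
# One ENI per node, created outside the instance lifecycle. Because the ENI
# (and its private IP) is a standalone resource, it survives instance replacement.
resource "aws_network_interface" "node" {
  subnet_id   = var.node_subnet_id
  private_ips = [var.node_ip]

  tags = {
    Name = var.node_id
  }
}

# The launch template attaches the existing ENI at device index 0, so any
# replacement instance launched from it comes up with the same IP, and
# therefore the same identity in the cluster.
resource "aws_launch_template" "node" {
  name_prefix = "${var.node_id}-"
  image_id    = var.node_image

  network_interfaces {
    device_index         = 0
    network_interface_id = aws_network_interface.node.id
  }
}
```

Pinning a specific `network_interface_id` also implies at most one instance per launch template, which matches the module's one-ASG-per-node design.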
```diff
@@ -78,7 +80,7 @@ The module will create cluster nodes depending on the cluster size requested (ge
 ![alt text](https://github.com/prashantkalkar/stateful_application_module/blob/main/_docs/architecture.png?raw=true)
 
 The module mainly creates the following resources per node (as shown in the image above):
-- Auto scaling group with min max set to 1.
+- Auto-scaling group with min and max set to 1.
 - External Elastic Network Interface (ENI) with the requested IP address.
 - Elastic Block Storage (EBS) volume as requested.
 - Launch template with a user data script to mount the EBS volume and perform health checks (the user data script has 2 parts: one part is provided by the module user; the other is maintained within the module and calls the user-provided script).
```
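The per-node resource set listed above might be declared roughly like this; a sketch under assumed names, not the module's actual internals:

```hcl
# A single-instance auto-scaling group per node: min = max = 1, so a crashed
# or terminated instance is automatically replaced with the same configuration.
resource "aws_autoscaling_group" "node" {
  name                = var.node_id
  min_size            = 1
  max_size            = 1
  desired_capacity    = 1
  vpc_zone_identifier = [var.node_subnet_id]

  launch_template {
    id      = aws_launch_template.node.id
    version = "$Latest"
  }
}

# The EBS volume is likewise a standalone resource, so the node's data
# survives instance replacement; the user data script mounts it on boot.
resource "aws_ebs_volume" "node_data" {
  availability_zone = var.node_availability_zone
  size              = var.node_volume_size
  type              = "gp3" # matches the example configuration above
}
```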
```diff
@@ -88,7 +90,7 @@ Apart from above resources, the module also include a rolling update script to u
 
 When a cluster node is created or replaced (due to modifications), the ASG lifecycle hook puts the node in the `Pending:Wait` state. The instance remains in this state until the lifecycle action is marked as complete (with CONTINUE). At the end of the module user data script, the complete-lifecycle-action command (with CONTINUE) is called on the instance lifecycle hook to finish the instance startup process. The user data script also performs a cluster health check to ensure that the node has joined the cluster successfully (this health check function has to be provided by the module user and is called from the module user data script). The cluster health check happens before the lifecycle hook action is completed.
 
-This ensures that instance is shown as `InService` only after successful completion on user data script and also checking the cluster health status. The above mentioned rolling update script waits for the instance to be `InService` before updating other instances in the service. The script will timeout for any failed instance which is stuck in `Pending:Wait` state due to failure of the user data script. (Refer to FAQs if this happens). That way, other cluster nodes are not updated with a failed change preventing any downtime (single node failure generally does not cause cluster unavailability due to quorum)
+This ensures that an instance is shown as `InService` only after successful completion of the user data script and the cluster health check. The above-mentioned rolling update script waits for the instance to be `InService` before updating other instances. The script will time out for any failed instance stuck in the `Pending:Wait` state due to a failure of the user data script (refer to the FAQs if this happens). That way, other cluster nodes are not updated with a failed change, preventing any downtime (a single node failure generally does not cause cluster unavailability, due to quorum).
```
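The `Pending:Wait` behavior described above is produced by an ASG launch lifecycle hook. A minimal sketch of such a hook, with assumed names and timeout values (not necessarily the module's own):

```hcl
# Hold each new instance in Pending:Wait until the user data script finishes
# and the cluster health check passes. If nothing completes the action in
# time, ABANDON the launch instead of marking a bad node InService.
resource "aws_autoscaling_lifecycle_hook" "node_launch" {
  name                   = "${var.node_id}-launch-hook"
  autoscaling_group_name = aws_autoscaling_group.node.name
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_LAUNCHING"
  default_result         = "ABANDON"
  heartbeat_timeout      = 900 # seconds; assumed value
}
```

At the end of the user data script the instance completes the action with the AWS CLI, e.g. `aws autoscaling complete-lifecycle-action --lifecycle-action-result CONTINUE`, after which it transitions to `InService`.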
## FAQs
```diff
@@ -120,7 +122,7 @@ This should allow the instance to become InService. Rolling script should eventu
 **3. I have more than one instance in a non-InService state. What should I do?**
 Ideally, the cluster should never get into a state where multiple instances have failed. This will cause the cluster to be unavailable.
 If the issue has occurred during the terraform rolling update script, it can also be a bug in the script. Please report the issue.
-If the failure has occurred at runtime (not during terraform apply), then ideally instances should be automatically recovered unless infrastucture is manually changed to cause the failure during instance recovery.
+If the failure has occurred at runtime (not during terraform apply), then ideally instances should be automatically recovered, unless the infrastructure was manually changed in a way that causes the failure during instance recovery.
 Follow FAQs 1 and 2 to debug and recover the infrastructure to the desired state.
 
 ## References:
```

0 commit comments
