Description
When using instance refresh to update ASGs, the termination events appear to come through with a start time of "now", which triggers the node termination handler to start cordoning and draining the node immediately. This works correctly if the ASG's minimum healthy percentage is set to 100% and all pods have replicas and PDBs (for NTH we need #463 to satisfy this), but single-replica pods such as Prometheus will often be unschedulable for a short period while the new node boots up.
To make this whole process work without any downtime, a configurable duration to wait on ASG termination events could be adopted, defaulted to something like 90 seconds; a sketch of the idea follows below. Assuming this wait time is longer than the time it takes a replacement node to start and join the cluster, there would be no unschedulable pods, and a minimum healthy percentage below 100% could be used. Combined with the ASG lifecycle hook timeout, this would support a high level of customisation without much extra complexity.
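A minimal sketch of what this could look like, purely for illustration: the flag name `asg-termination-drain-delay`, the function names, and the wiring are all hypothetical and not part of NTH's current configuration.

```go
package main

import (
	"flag"
	"log"
	"time"
)

// drainDelay is a hypothetical flag; the real option name and default
// would be decided in NTH if this proposal were adopted.
var drainDelay = flag.Duration("asg-termination-drain-delay", 90*time.Second,
	"time to wait after an ASG termination event before cordoning and draining the node")

// handleASGTerminationEvent defers cordon/drain so the replacement node
// has time to boot and join the cluster before pods are evicted.
func handleASGTerminationEvent(nodeName string, cordonAndDrain func(string) error) error {
	log.Printf("ASG termination event for %s; waiting %s before draining", nodeName, *drainDelay)
	time.Sleep(*drainDelay)
	return cordonAndDrain(nodeName)
}

func main() {
	flag.Parse()
	// Stand-in for the real cordon/drain logic.
	err := handleASGTerminationEvent("ip-10-0-0-1.ec2.internal", func(node string) error {
		log.Printf("cordoning and draining %s", node)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

The delay would need to stay comfortably inside the ASG lifecycle hook timeout so the hook is still completed before the instance is forcibly terminated.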