Add custom delay for instance refresh actions #465

Open
@stevehipwell

Description

When using instance refresh to update ASGs, the termination events appear to arrive with a start time of now, which causes the node termination handler (NTH) to begin cordoning and draining the node immediately. This works correctly if the ASG minimum healthy percentage is set to 100% and all pods have multiple replicas and PDBs (for NTH itself we need #463 to satisfy this), but single-replica pods such as Prometheus will often be unschedulable for a short period while the new node boots.

To make the whole process work without downtime, NTH could wait for a configurable duration before acting on ASG termination events, defaulting to something like 90 seconds. Provided this wait is longer than the time a new node takes to start and join the cluster, there would be no unschedulable pods, and it would become possible to use an ASG healthy percentage below 100%. Combined with the ASG lifecycle hook timeout, this would allow a high level of customisation without much extra complexity.
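
To illustrate the timing involved, here is a minimal sketch in Python. It is not NTH code: the names `drain_start_time`, `drain_budget_seconds`, and `DEFAULT_DELAY_SECONDS` are hypothetical, and the 300-second hook timeout is only an example value. It assumes the delay is simply added to the event's start time, and that the delay has to be shorter than the lifecycle hook timeout so some time is left for draining.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical default, taken from the "something like 90 seconds" suggestion.
DEFAULT_DELAY_SECONDS = 90


def drain_start_time(event_start: datetime,
                     delay_seconds: int = DEFAULT_DELAY_SECONDS) -> datetime:
    """Return when cordon-and-drain would begin for an ASG termination event."""
    if delay_seconds < 0:
        raise ValueError("delay_seconds must be non-negative")
    # Shift the start so a replacement node has time to boot and join.
    return event_start + timedelta(seconds=delay_seconds)


def drain_budget_seconds(hook_timeout_seconds: int,
                         delay_seconds: int = DEFAULT_DELAY_SECONDS) -> int:
    """Return the seconds left for draining once the delay has elapsed."""
    if delay_seconds >= hook_timeout_seconds:
        # The lifecycle hook would expire before draining could start.
        raise ValueError("delay must be shorter than the lifecycle hook timeout")
    return hook_timeout_seconds - delay_seconds


if __name__ == "__main__":
    received = datetime(2021, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
    print("drain starts at:", drain_start_time(received).isoformat())
    # 300 is an example lifecycle hook timeout, not a recommended value.
    print("seconds left to drain:", drain_budget_seconds(300))
```

The second function shows why the two settings need to be tuned together: a longer delay gives the new node more time to join, but leaves less of the lifecycle hook timeout for the drain itself.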

Metadata

Assignees

No one assigned

Labels

Priority: Medium (this issue will be seen by about half of users)
Type: Enhancement (new feature or request)
stalebot-ignore (to NOT let the stalebot update or close the issue / PR)
