Extensible MachineDeployment Upgrade Controls #12046

Open
@ivelichkovich

Description

What would you like to be added (User Story)?

As an operator, I want to control the order in which nodes are upgraded during a rolling upgrade.

Detailed Description

We discussed this a bit at the maintainers summit while discussing upgrades with CAPI. It would be nice to have custom logic for choosing which nodes are removed from the MachineDeployment as a rolling upgrade progresses. This would likely require custom logic for the delete policy on MachineDeployments for regular upgrades, and custom selection logic for in-place upgrades (although I'm less familiar with the mechanism of in-place upgrades).

One concrete use case: I have an AI inference workload that runs in batches/gangs of size 6. Say I want to do an automated rolling upgrade, 6 nodes at a time, in a fully utilized 36-node cluster (6 gangs of 6 nodes each).

The best case here is that each batch of 6 nodes being upgraded exactly matches a single gang. In that case I only lose 1/6 of my gangs each time a batch of nodes is upgraded.

The worst case is that each batch takes 1 node from every gang, so I lose all 6 of my gangs at once and take a total 6/6 outage.
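To make the best-case and worst-case arithmetic concrete, here is a minimal Go sketch. Everything in it (`Node`, `buildCluster`, `alignedBatch`, `stripedBatch`, `disruptedGangs`) is a hypothetical name made up for illustration and is not part of CAPI. It only counts how many gangs lose at least one node for a given batch of 6.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of a machine: a name plus the
// gang (workload group) currently scheduled on it. Not a CAPI type.
type Node struct {
	Name string
	Gang int
}

// buildCluster returns gangs*size nodes, with `size` nodes per gang.
func buildCluster(gangs, size int) []Node {
	nodes := make([]Node, 0, gangs*size)
	for g := 0; g < gangs; g++ {
		for i := 0; i < size; i++ {
			nodes = append(nodes, Node{Name: fmt.Sprintf("node-%d-%d", g, i), Gang: g})
		}
	}
	return nodes
}

// alignedBatch picks every node of a single gang (the best case).
func alignedBatch(nodes []Node, gang int) []Node {
	var batch []Node
	for _, n := range nodes {
		if n.Gang == gang {
			batch = append(batch, n)
		}
	}
	return batch
}

// stripedBatch picks the first node seen from each gang (the worst case).
func stripedBatch(nodes []Node) []Node {
	seen := map[int]bool{}
	var batch []Node
	for _, n := range nodes {
		if !seen[n.Gang] {
			seen[n.Gang] = true
			batch = append(batch, n)
		}
	}
	return batch
}

// disruptedGangs counts the distinct gangs that lose at least one node.
func disruptedGangs(batch []Node) int {
	gangs := map[int]struct{}{}
	for _, n := range batch {
		gangs[n.Gang] = struct{}{}
	}
	return len(gangs)
}

func main() {
	nodes := buildCluster(6, 6) // 36 nodes, 6 gangs of 6

	fmt.Println("aligned batch disrupts gangs:", disruptedGangs(alignedBatch(nodes, 0)))
	fmt.Println("striped batch disrupts gangs:", disruptedGangs(stripedBatch(nodes)))
}
```

Both batches contain 6 nodes, but the aligned one disrupts 1 gang while the striped one disrupts all 6, which is the difference a selection hook would let operators control.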

This is just one use case; there can be countless others, and I don't think CAPI needs to support them all directly. Rather, it would be nice to have some mechanism to offload the selection criteria to user logic. A similar example is MachineHealthChecks, which allow an arbitrary CRD to be used for remediation: users create an instance of their own CRD and have their own controller code do the repair. Something similar for upgrade selection would be great.

Potentially something like a webhook? (just one idea)
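As a rough illustration of what "offloading selection to user logic" could look like, here is a small Go sketch. It is only an assumption about the shape of such a hook: `UpgradeSelector`, `SelectorFunc`, `GroupByLabel`, the simplified `Machine` struct, and the `example.com/gang` label key are all hypothetical and do not exist in CAPI today. Whether the real mechanism ends up as a webhook, a CRD, or something else is an open question; an in-process interface is used here just to keep the example runnable.

```go
package main

import (
	"fmt"
	"sort"
)

// Machine is a hypothetical stand-in for a CAPI Machine, reduced to the
// fields this sketch needs.
type Machine struct {
	Name   string
	Labels map[string]string
}

// UpgradeSelector is a hypothetical extension point: given the machines that
// are eligible for replacement, return the ones to upgrade in this batch.
type UpgradeSelector interface {
	SelectForUpgrade(candidates []Machine, batchSize int) []Machine
}

// SelectorFunc adapts a plain function to the UpgradeSelector interface.
type SelectorFunc func(candidates []Machine, batchSize int) []Machine

func (f SelectorFunc) SelectForUpgrade(candidates []Machine, batchSize int) []Machine {
	return f(candidates, batchSize)
}

// GroupByLabel returns a selector that keeps machines sharing the same label
// value together, so a batch drains one group before touching the next.
func GroupByLabel(key string) UpgradeSelector {
	return SelectorFunc(func(candidates []Machine, batchSize int) []Machine {
		// Copy so the caller's slice is not reordered.
		sorted := append([]Machine(nil), candidates...)
		// Stable sort keeps the original order within each group.
		sort.SliceStable(sorted, func(i, j int) bool {
			return sorted[i].Labels[key] < sorted[j].Labels[key]
		})
		if batchSize < 0 {
			batchSize = 0
		}
		if batchSize > len(sorted) {
			batchSize = len(sorted)
		}
		return sorted[:batchSize]
	})
}

func main() {
	const gangKey = "example.com/gang" // hypothetical label key

	machines := []Machine{
		{Name: "m1", Labels: map[string]string{gangKey: "b"}},
		{Name: "m2", Labels: map[string]string{gangKey: "a"}},
		{Name: "m3", Labels: map[string]string{gangKey: "b"}},
		{Name: "m4", Labels: map[string]string{gangKey: "a"}},
	}

	selector := GroupByLabel(gangKey)
	for _, m := range selector.SelectForUpgrade(machines, 2) {
		fmt.Println(m.Name, m.Labels[gangKey])
	}
}
```

With a batch size of 2, this selector returns `m2` and `m4`, both from gang `a`, instead of splitting the batch across gangs. A user-supplied implementation could encode any other policy behind the same interface.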

Anything else you would like to add?

I have folks who would be happy to help work on this.

Label(s) to be applied

/kind feature
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

Metadata

Labels: kind/feature, needs-priority, needs-triage
