Summary
We are implementing a maintenance flow and need alignment on the intended CAPI and Gardener/MCM design for workers that enter maintenance on bare metal.
Related:
Problem
On bare metal, maintenance is not the same as a cloud VM failure. We may need to keep the original machine available for debugging, power it on again, reboot it multiple times, and only return it to normal lifecycle handling once maintenance is explicitly finished.
The open question is how Gardener/MCM and CAPI should model that state.
User scenarios
- As a member of the baremetal support team, I would like to request hardware maintenance for a server and perform that maintenance safely.
- As a user of the workload cluster, I would like to ask the baremetal support team to look into a hardware issue.
- As a user of the workload cluster, I would like to see which nodes in the workload cluster are undergoing maintenance (see the sketch after this list).
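For the visibility scenario, one lightweight option is to surface maintenance through a well-known node label. A minimal sketch, assuming a hypothetical `maintenance.metal/state` label maintained by the maintenance-controller (the label key and value are assumptions, not an existing convention):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Connect to the workload cluster using the default kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// List all nodes carrying the (hypothetical) maintenance label.
	nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{
		LabelSelector: "maintenance.metal/state=active",
	})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		fmt.Println(node.Name)
	}
}
```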
Option 1: Delete the node from the workload cluster
When maintenance is requested:
- delete the node from the workload cluster
- transition the Server to Maintenance
- reprovision it as a fresh worker afterwards (the flow is sketched below)
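A minimal controller-side sketch of this flow, assuming a Server custom resource whose spec carries a state field that can be set to Maintenance; the GVK and field names below are assumptions for illustration, not an existing API. Reprovisioning as a fresh worker would then happen through the normal machine replacement path.

```go
// Option 1 sketch: delete the Node from the workload cluster, then move the
// backing Server to Maintenance. The Server GVK and spec.state are assumed.
package maintenance

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func RequestMaintenance(ctx context.Context, workload, mgmt client.Client, nodeName, serverName string) error {
	// 1. Delete the Node from the workload cluster.
	node := &corev1.Node{}
	node.Name = nodeName
	if err := workload.Delete(ctx, node); err != nil {
		return err
	}

	// 2. Transition the Server to Maintenance in the management cluster.
	server := &unstructured.Unstructured{}
	server.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "metal.example.org", // assumed group
		Version: "v1alpha1",
		Kind:    "Server",
	})
	if err := mgmt.Get(ctx, client.ObjectKey{Name: serverName}, server); err != nil {
		return err
	}
	if err := unstructured.SetNestedField(server.Object, "Maintenance", "spec", "state"); err != nil {
		return err
	}
	// 3. Reprovisioning as a fresh worker is left to the normal machine
	//    replacement flow once maintenance completes.
	return mgmt.Update(ctx, server)
}
```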
Pros:
- simple lifecycle model
- no need for MCM to retain the failed node
Cons:
- state on the node is lost (e.g. workload cluster node labels), and the maintenance-controller loses visibility into how many nodes are undergoing maintenance in parallel
- worker pool is under-replicated during maintenance
- the Shoot may become unhealthy due to missing workers (most likely the CAPI cluster as well)
- with incomplete root-disk cleanup, the node may rejoin unexpectedly and become Ready
- with proper root-disk cleanup, logs/state from the broken node may be lost
Option 2: Keep the node in the cluster as intentionally unavailable
When maintenance is requested:
- keep the node in the workload cluster
- keep it in a state like NotReady,SchedulingDisabled
- operators can power it on, inspect logs, and reboot during maintenance
- maintenance is explicitly released, after which normal lifecycle handling resumes (both steps are sketched below)
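A minimal sketch of entering and releasing maintenance on the workload-cluster side, assuming a hypothetical `maintenance.metal/active` annotation as the explicit marker (the annotation key is an assumption):

```go
// Option 2 sketch: cordon the Node and record the maintenance state;
// releasing reverses both. The annotation key is a hypothetical placeholder.
package maintenance

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const maintenanceAnnotation = "maintenance.metal/active" // assumed key

func EnterMaintenance(ctx context.Context, c client.Client, nodeName string) error {
	node := &corev1.Node{}
	if err := c.Get(ctx, client.ObjectKey{Name: nodeName}, node); err != nil {
		return err
	}
	patch := client.MergeFrom(node.DeepCopy())
	node.Spec.Unschedulable = true // node now reports SchedulingDisabled
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[maintenanceAnnotation] = "true"
	return c.Patch(ctx, node, patch)
}

func ReleaseMaintenance(ctx context.Context, c client.Client, nodeName string) error {
	node := &corev1.Node{}
	if err := c.Get(ctx, client.ObjectKey{Name: nodeName}, node); err != nil {
		return err
	}
	patch := client.MergeFrom(node.DeepCopy())
	node.Spec.Unschedulable = false // uncordon; normal lifecycle resumes
	delete(node.Annotations, maintenanceAnnotation)
	return c.Patch(ctx, node, patch)
}
```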
Pros:
- preserves access to the original node state for debugging
- supports repeated reboot / investigation cycles
- if the node comes back, it can remain cordoned during maintenance
- conceptually close to machine preservation in #1059
Cons:
- machine preservation needs to account for the maintenance state and ignore any health timeouts while maintenance is active (see the sketch after this list)
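The guard itself could be small. A sketch of the kind of check the preservation/health-timeout path would need, reusing the same hypothetical annotation as above (MCM's actual health-check code is different and would need a proper API):

```go
// Hypothetical guard for a machine health-check loop: while a machine is
// under maintenance, NotReady timeouts must not trigger replacement.
func healthTimeoutApplies(annotations map[string]string) bool {
	// maintenance.metal/active is an assumed key, not an existing MCM API.
	return annotations[maintenanceAnnotation] != "true"
}
```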