
Define Gardener/MCM and CAPI workflow for bare-metal worker maintenance #799

@defo89

Summary

We are implementing a maintenance flow and need alignment on the intended CAPI and Gardener/MCM design for workers that enter maintenance on bare metal.

Related:

Problem

On bare metal, maintenance is not the same as a cloud VM failure. We may need to keep the original machine available for debugging, power it on again, reboot it multiple times, and only return it to normal lifecycle handling once maintenance is explicitly finished.

The open question is how Gardener/MCM and CAPI should model that state.

User scenarios

  • As a member of the bare-metal support team, I would like to request hardware maintenance for a server and perform that maintenance safely.
  • As a user of the workload cluster, I would like to ask the bare-metal support team to look into a hardware issue.
  • As a user of the workload cluster, I would like to see which nodes in the workload cluster are undergoing maintenance (sketched below).
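
For the visibility scenario, a minimal client-go sketch. It assumes maintenance is surfaced directly on the Node, e.g. via a taint such as maintenance.example.org/active; that key is purely illustrative, and choosing the real marker (taint, label, or condition) is exactly what this issue needs to decide:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// maintenanceTaintKey is purely illustrative, not a confirmed API.
const maintenanceTaintKey = "maintenance.example.org/active"

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// List all nodes and report those carrying the maintenance taint.
	nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		for _, taint := range node.Spec.Taints {
			if taint.Key == maintenanceTaintKey {
				fmt.Printf("%s is under maintenance\n", node.Name)
			}
		}
	}
}
```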

Option 1: Delete the node from the workload cluster

When maintenance is requested (see the sketch after this list):

  • delete the node from the workload cluster
  • transition the Server to Maintenance
  • reprovision it as a fresh worker afterwards
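
A rough sketch of this flow with client-go and a dynamic client. The Server group/version/resource and the Maintenance field path below are placeholders for whatever the metal operator in use defines, not this project's confirmed API:

```go
package maintenance

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
)

// serverGVR is a placeholder for the bare-metal Server CRD.
var serverGVR = schema.GroupVersionResource{
	Group: "metal.example.org", Version: "v1alpha1", Resource: "servers",
}

// startMaintenanceOption1 deletes the Node from the workload cluster and
// marks the backing Server as in maintenance, per Option 1.
func startMaintenanceOption1(ctx context.Context, workload kubernetes.Interface,
	mgmt dynamic.Interface, nodeName, serverNS, serverName string) error {
	// 1. Remove the Node object; MCM/CAPI will see the worker as gone
	//    and reprovision a fresh one later.
	if err := workload.CoreV1().Nodes().Delete(ctx, nodeName, metav1.DeleteOptions{}); err != nil {
		return err
	}
	// 2. Transition the Server to Maintenance (field path is an assumption).
	patch := []byte(`{"spec":{"state":"Maintenance"}}`)
	_, err := mgmt.Resource(serverGVR).Namespace(serverNS).
		Patch(ctx, serverName, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```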

Pros:

  • simple lifecycle model
  • no need for MCM to retain the failed node

Cons:

  • state on the node is lost (e.g. workload-cluster node labels), and the maintenance-controller has no visibility into how many nodes are undergoing maintenance in parallel
  • the worker pool is under-replicated during maintenance
  • the Shoot may become unhealthy due to missing workers (and most likely the CAPI cluster as well)
  • if root-disk cleanup is incomplete, the node may rejoin unexpectedly and become Ready
  • if root-disk cleanup is thorough, logs and state from the broken node may be lost

Option 2: Keep the node in the cluster as intentionally unavailable

When maintenance is requested (see the sketch after this list):

  • keep the node in the workload cluster
  • keep it in a state like NotReady,SchedulingDisabled
  • operators can power it on, inspect logs, and reboot during maintenance
  • maintenance is explicitly released, after which normal lifecycle handling resumes
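
As a sketch, the entry into this state could be as small as cordoning the Node and recording the maintenance state on it, reusing the illustrative taint from the scenario sketch above (none of the names here are a confirmed design; releasing maintenance would be the inverse operation):

```go
package maintenance

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// markNodeInMaintenance keeps the Node in the workload cluster but makes
// it intentionally unavailable: cordoned, plus an illustrative taint so
// tooling and users can see the maintenance state.
func markNodeInMaintenance(ctx context.Context, workload kubernetes.Interface, nodeName string) error {
	node, err := workload.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true // equivalent of `kubectl cordon`
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "maintenance.example.org/active", // illustrative key, see above
		Effect: corev1.TaintEffectNoSchedule,
	})
	_, err = workload.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```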

Pros:

  • preserves access to the original node state for debugging
  • supports repeated reboot / investigation cycles
  • if the node comes back, it can remain cordoned during maintenance
  • conceptually close to machine preservation in #1059

Cons:

  • machine preservation needs to account for the maintenance state and ignore any timeouts while maintenance is in progress
