Summary
We are implementing a maintenance flow and need alignment on the intended CAPI and Gardener/MCM design for workers that enter maintenance on bare metal.
Related:
Problem
On bare metal, maintenance is not the same as a cloud VM failure. We may need to keep the original machine available for debugging, power it on again, reboot it multiple times, and only return it to normal lifecycle handling once maintenance is explicitly finished.
The open question is how Gardener/MCM and CAPI should model that state.
User scenarios
- As a member of the baremetal support team, I would like to request hardware maintenance for a server and perform that maintenance safely.
- As a user of the workload cluster, I would like to ask the baremetal support team to look into a hardware issue.
- As a user of the workload cluster, I would like to see which nodes in the workload cluster are undergoing maintenance (see the sketch after this list).
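For the visibility scenario, one lightweight option is to surface maintenance through a well-known node label. A minimal sketch, assuming a hypothetical `maintenance.metal/state` label maintained by the maintenance-controller (the label key and value are assumptions, not an existing convention):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Connect to the workload cluster using the default kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// List all nodes carrying the (hypothetical) maintenance label.
	nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{
		LabelSelector: "maintenance.metal/state=active",
	})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		fmt.Println(node.Name)
	}
}
```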
Option 1: Delete the node from the workload cluster
When maintenance is requested:
- delete the node from the workload cluster
- transition the Server to Maintenance
- reprovision it as a fresh worker afterwards (the flow is sketched below)
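A minimal controller-side sketch of this flow, assuming a Server custom resource whose spec carries a state field that can be set to Maintenance; the GVK and field names below are assumptions for illustration, not an existing API. Reprovisioning as a fresh worker would then happen through the normal machine replacement path.

```go
// Option 1 sketch: delete the Node from the workload cluster, then move the
// backing Server to Maintenance. The Server GVK and spec.state are assumed.
package maintenance

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func RequestMaintenance(ctx context.Context, workload, mgmt client.Client, nodeName, serverName string) error {
	// 1. Delete the Node from the workload cluster.
	node := &corev1.Node{}
	node.Name = nodeName
	if err := workload.Delete(ctx, node); err != nil {
		return err
	}

	// 2. Transition the Server to Maintenance in the management cluster.
	server := &unstructured.Unstructured{}
	server.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "metal.example.org", // assumed group
		Version: "v1alpha1",
		Kind:    "Server",
	})
	if err := mgmt.Get(ctx, client.ObjectKey{Name: serverName}, server); err != nil {
		return err
	}
	if err := unstructured.SetNestedField(server.Object, "Maintenance", "spec", "state"); err != nil {
		return err
	}
	// 3. Reprovisioning as a fresh worker is left to the normal machine
	//    replacement flow once maintenance completes.
	return mgmt.Update(ctx, server)
}
```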
Pros:
- simple lifecycle model
- no need for MCM to retain the failed node
Cons:
- state on the node is lost (e.g. workload cluster node labels), and the maintenance-controller loses visibility into how many nodes are undergoing maintenance in parallel
- worker pool is under-replicated during maintenance
- the Shoot may become unhealthy due to missing workers (most likely the CAPI cluster as well)
- with incomplete root-disk cleanup, the node may rejoin unexpectedly and become Ready
- with proper root-disk cleanup, logs/state from the broken node may be lost
Option 2: Keep the node in the cluster as intentionally unavailable
When maintenance is requested:
- keep the node in the workload cluster
- keep it in a state like NotReady,SchedulingDisabled
- operators can power it on, inspect logs, and reboot during maintenance
- maintenance is explicitly released, after which normal lifecycle handling resumes (both steps are sketched below)
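A minimal sketch of entering and releasing maintenance on the workload-cluster side, assuming a hypothetical `maintenance.metal/active` annotation as the explicit marker (the annotation key is an assumption):

```go
// Option 2 sketch: cordon the Node and record the maintenance state;
// releasing reverses both. The annotation key is a hypothetical placeholder.
package maintenance

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const maintenanceAnnotation = "maintenance.metal/active" // assumed key

func EnterMaintenance(ctx context.Context, c client.Client, nodeName string) error {
	node := &corev1.Node{}
	if err := c.Get(ctx, client.ObjectKey{Name: nodeName}, node); err != nil {
		return err
	}
	patch := client.MergeFrom(node.DeepCopy())
	node.Spec.Unschedulable = true // node now reports SchedulingDisabled
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[maintenanceAnnotation] = "true"
	return c.Patch(ctx, node, patch)
}

func ReleaseMaintenance(ctx context.Context, c client.Client, nodeName string) error {
	node := &corev1.Node{}
	if err := c.Get(ctx, client.ObjectKey{Name: nodeName}, node); err != nil {
		return err
	}
	patch := client.MergeFrom(node.DeepCopy())
	node.Spec.Unschedulable = false // uncordon; normal lifecycle resumes
	delete(node.Annotations, maintenanceAnnotation)
	return c.Patch(ctx, node, patch)
}
```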
Pros:
- preserves access to the original node state for debugging
- supports repeated reboot / investigation cycles
- if the node comes back, it can remain cordoned during maintenance
- conceptually close to machine preservation in #1059
Cons:
- machine preservation needs to account for the maintenance state and ignore any health timeouts while maintenance is active (see the sketch after this list)
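The guard itself could be small. A sketch of the kind of check the preservation/health-timeout path would need, reusing the same hypothetical annotation as above (MCM's actual health-check code is different and would need a proper API):

```go
// Hypothetical guard for a machine health-check loop: while a machine is
// under maintenance, NotReady timeouts must not trigger replacement.
func healthTimeoutApplies(annotations map[string]string) bool {
	// maintenance.metal/active is an assumed key, not an existing MCM API.
	return annotations[maintenanceAnnotation] != "true"
}
```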