## Problem

CAPMOX v0.8.0 introduced zones (`zoneConfig`) with per-zone IPAM pools, DNS, and network configuration. However, the CAPI failure domain contract is not implemented:

- `ProxmoxCluster.status.failureDomains` is not populated
- `ProxmoxMachine.spec.failureDomain` does not exist (InfraMachine contract gap)
- KubeadmControlPlane cannot automatically distribute control plane nodes across zones
- MachineDeployments cannot target a specific zone via `spec.failureDomain`

This means that even with multiple zones configured, operators must manually manage node placement and get no automatic HA distribution of control plane replicas.
## Proposal
Implement the minimal CAPI failure domain contract by mapping existing v1alpha2 zones to CAPI failure domains. This stays within a single Proxmox cluster — the multi-DC scenario in #370 is a separate, larger effort for v1alpha3.
## API changes (additive, backwards-compatible)
| Field | Type | Description |
| --- | --- | --- |
| `ZoneConfigSpec.nodes` | `[]string` | Proxmox nodes belonging to this zone. VMs assigned to this failure domain are scheduled only on these nodes. |
| `ZoneConfigSpec.controlPlane` | `*bool` | Whether this zone is eligible for control plane machines. Defaults to `true`. |
| `ProxmoxClusterStatus.failureDomains` | `[]clusterv1.FailureDomain` | CAPI v1beta2 contract field, populated from `zoneConfig`. |
| `ProxmoxMachineSpec.failureDomain` | `string` | CAPI InfrastructureMachine contract field. |
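For illustration, with the two-zone example further down, the cluster controller would publish a status shaped roughly like this (field names follow the CAPI v1beta2 `FailureDomain` type; the exact rendering here is a sketch, not output from the implementation):

```yaml
# Hypothetical ProxmoxCluster status after reconciliation (sketch).
status:
  failureDomains:
    - name: rack-a
      controlPlane: true    # defaulted: controlPlane unset in zoneConfig
    - name: rack-b
      controlPlane: false   # explicitly worker-only
```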
## Controller changes
- `reconcileFailureDomains` in the cluster controller populates `status.failureDomains` from `zoneConfig`, sorted by name to prevent reconcile loops
- The machine controller reads `Machine.Spec.FailureDomain` and scopes `AllowedNodes` and the IPAM pool to that zone
- In-memory `effectiveZone` / `effectiveAllowedNodes` on the MachineScope; the controller never mutates the spec
- A retryable `FailureDomainNotReady` condition on zone lookup failures (not terminal; requeues with backoff)
- Full conversion webhook support for the v1alpha1 round-trip
## Example

```yaml
spec:
  zoneConfig:
    - zone: "rack-a"
      nodes: ["pve1", "pve2"]
      ipv4Config:
        addresses: ["10.0.1.10-10.0.1.30"]
        prefix: 24
        gateway: "10.0.1.1"
      dnsServers: ["8.8.8.8"]
    - zone: "rack-b"
      nodes: ["pve3", "pve4"]
      controlPlane: false
      ipv4Config:
        addresses: ["10.0.2.10-10.0.2.30"]
        prefix: 24
        gateway: "10.0.2.1"
      dnsServers: ["8.8.8.8"]
```
A KCP with 3 replicas distributes automatically across the zones with `controlPlane: true`. Worker MachineDeployments with `failureDomain: "rack-b"` are placed on pve3/pve4 with IPs from the rack-b pool.
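A worker MachineDeployment targets a zone through the standard CAPI Machine template field (zone names match the example above; the surrounding manifest is abbreviated and the API version assumes the v1beta2 contract this proposal targets):

```yaml
apiVersion: cluster.x-k8s.io/v1beta2
kind: MachineDeployment
metadata:
  name: workers-rack-b
spec:
  # clusterName, replicas, selector, and the rest of the template elided
  template:
    spec:
      # Core CAPI field; with this proposal, CAPMOX scopes AllowedNodes
      # and the IPAM pool to the matching zone.
      failureDomain: "rack-b"
```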
## What this enables
- KubeadmControlPlane round-robin distribution of CP replicas across zones
- Per-zone worker MachineDeployments via `spec.failureDomain`
- Per-zone IPAM pool selection (already implemented, now wired to the failure domain contract)
- Worker-only zones via `controlPlane: false`
- Full backwards compatibility: clusters without `zoneConfig` work exactly as before
## Relationship to #370
This is the single-cluster foundation for the multi-DC effort. The multi-DC work in #370 (targeted for v1alpha3) would extend this with per-zone credentials and API endpoints via new CRDs, similar to CAPV's `VSphereDeploymentZone` / `VSphereFailureDomain` pattern. This implementation establishes the contract surface that the multi-DC work builds on.
## Implementation
We have a working implementation with unit tests and documentation, validated on a real Proxmox cluster (10 nodes, 2 zones) with 3 CP + 2 workers. KCP correctly distributes CP replicas round-robin across failure domains, workers are pinned to their target zone, and IPAM pools are selected per-zone.
Branch: https://github.com/ElmecOSS/cluster-api-provider-proxmox/tree/feat/failure-domains
Single commit on top of v0.8.0. Happy to open a PR against upstream if there's interest.
## Environment
- CAPMOX: v0.8.0 (v1alpha2)
- CAPI: v1.11.7+ (v1beta2)
- Proxmox VE: 9.1.4