feat: implement CAPI failure domain contract for v1alpha2 zones #717

@emanuelebosetti

Problem

CAPMOX v0.8.0 introduced zones (zoneConfig) with per-zone IPAM pools, DNS, and network configuration. However, the CAPI failure domain contract is not implemented:

  • ProxmoxCluster.status.failureDomains is not populated
  • ProxmoxMachine.spec.failureDomain field does not exist (InfraMachine contract gap)
  • KubeadmControlPlane cannot automatically distribute control plane nodes across zones
  • MachineDeployments cannot target a specific zone via spec.failureDomain

As a result, even with multiple zones configured, operators must place nodes manually, and control plane replicas are not distributed across zones for HA.

Proposal

Implement the minimal CAPI failure domain contract by mapping existing v1alpha2 zones to CAPI failure domains. This stays within a single Proxmox cluster — the multi-DC scenario in #370 is a separate, larger effort for v1alpha3.

API changes (additive, backwards-compatible)

| Field | Type | Description |
|---|---|---|
| ZoneConfigSpec.nodes | []string | Proxmox nodes belonging to this zone. VMs assigned to this failure domain are scheduled only on these nodes. |
| ZoneConfigSpec.controlPlane | *bool | Whether this zone is eligible for control plane machines. Defaults to true. |
| ProxmoxClusterStatus.failureDomains | []clusterv1.FailureDomain | CAPI v1beta2 contract field, populated from zoneConfig. |
| ProxmoxMachineSpec.failureDomain | string | CAPI InfrastructureMachine contract field. |
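To make the "defaults to true" rule for controlPlane concrete, here is a minimal Go sketch. The struct is a trimmed, hypothetical stand-in for the real v1alpha2 type (only the zone name and the two proposed fields are shown), and the helper name IsControlPlaneEligible is an assumption for illustration, not something taken from the branch.

```go
package main

import "fmt"

// ZoneConfigSpec is a trimmed, hypothetical stand-in for the v1alpha2 type.
// Only the zone name and the two proposed fields are modelled here.
type ZoneConfigSpec struct {
	Zone string `json:"zone"`
	// Nodes lists the Proxmox nodes that belong to this zone.
	Nodes []string `json:"nodes,omitempty"`
	// ControlPlane marks the zone as eligible for control plane machines.
	// A nil pointer means "not set" and is treated as true.
	ControlPlane *bool `json:"controlPlane,omitempty"`
}

// IsControlPlaneEligible applies the "defaults to true" rule.
// The helper name is an assumption made for this sketch.
func IsControlPlaneEligible(z ZoneConfigSpec) bool {
	return z.ControlPlane == nil || *z.ControlPlane
}

func main() {
	no := false
	fmt.Println(IsControlPlaneEligible(ZoneConfigSpec{Zone: "rack-a"}))
	fmt.Println(IsControlPlaneEligible(ZoneConfigSpec{Zone: "rack-b", ControlPlane: &no}))
}
```

Using a pointer rather than a plain bool is what lets an unset field be distinguished from an explicit false, which keeps existing zone definitions eligible without any changes.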

Controller changes

  • reconcileFailureDomains in cluster controller — populates status.failureDomains from zoneConfig (sorted by name to prevent reconcile loops)
  • Machine controller reads Machine.Spec.FailureDomain and scopes AllowedNodes + IPAM pool to the zone
  • Uses in-memory effectiveZone / effectiveAllowedNodes on MachineScope — no spec mutation from the controller
  • Retryable FailureDomainNotReady condition for zone lookup failures (not terminal, requeues with backoff)
  • Full conversion webhook support for v1alpha1 round-trip
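The sorting step in reconcileFailureDomains can be sketched roughly as follows. Zone and FailureDomain are simplified stand-ins rather than the real CAPMOX or CAPI types, and buildFailureDomains is a hypothetical name. The point is only that the derived list depends on the set of zones and not on their input order, so repeated reconciles produce an identical status.

```go
package main

import (
	"fmt"
	"sort"
)

// Zone and FailureDomain are simplified stand-ins, not the real API types.
type Zone struct {
	Name         string
	ControlPlane *bool // nil is treated as true
}

type FailureDomain struct {
	Name         string
	ControlPlane bool
}

// buildFailureDomains derives the status list from the configured zones.
// Sorting by name makes the result independent of input order, so two
// reconciles over the same zones never yield a differing status.
func buildFailureDomains(zones []Zone) []FailureDomain {
	out := make([]FailureDomain, 0, len(zones))
	for _, z := range zones {
		out = append(out, FailureDomain{
			Name:         z.Name,
			ControlPlane: z.ControlPlane == nil || *z.ControlPlane,
		})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	no := false
	zones := []Zone{{Name: "rack-b", ControlPlane: &no}, {Name: "rack-a"}}
	for _, fd := range buildFailureDomains(zones) {
		fmt.Printf("%s controlPlane=%t\n", fd.Name, fd.ControlPlane)
	}
}
```

A controller would then compare this list with the current status and patch only on a difference. That comparison is omitted here.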

Example

spec:
  zoneConfig:
    - zone: "rack-a"
      nodes: ["pve1", "pve2"]
      ipv4Config:
        addresses: ["10.0.1.10-10.0.1.30"]
        prefix: 24
        gateway: "10.0.1.1"
      dnsServers: ["8.8.8.8"]
    - zone: "rack-b"
      nodes: ["pve3", "pve4"]
      controlPlane: false
      ipv4Config:
        addresses: ["10.0.2.10-10.0.2.30"]
        prefix: 24
        gateway: "10.0.2.1"
      dnsServers: ["8.8.8.8"]

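KCP distributes control plane replicas automatically across zones with controlPlane: true. In this example only rack-a is eligible (rack-b sets controlPlane: false), so all 3 replicas of a KCP are placed on pve1/pve2. Worker MachineDeployments with failureDomain: "rack-b" are placed on pve3/pve4 with IPs from the rack-b pool.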
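A rough Go sketch of how a machine's failure domain might be resolved to its allowed nodes, using the zones from the example above. The function and error names are hypothetical. Two behaviours are assumptions made for the sketch and are not confirmed by the proposal: falling back to cluster-wide allowed nodes when no failure domain is set, and doing the same when a zone lists no nodes.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrZoneNotFound is a hypothetical sentinel error. A controller could map it
// to a retryable condition and requeue instead of failing the machine.
var ErrZoneNotFound = errors.New("failure domain not found in zoneConfig")

// Zone is a simplified stand-in for one zoneConfig entry.
type Zone struct {
	Name  string
	Nodes []string
}

// resolveAllowedNodes returns the nodes a machine may be scheduled on.
// Assumption: an empty failureDomain, or a zone without nodes, falls back
// to the cluster-wide allowed nodes.
func resolveAllowedNodes(failureDomain string, zones []Zone, clusterAllowed []string) ([]string, error) {
	if failureDomain == "" {
		return clusterAllowed, nil
	}
	for _, z := range zones {
		if z.Name != failureDomain {
			continue
		}
		if len(z.Nodes) == 0 {
			return clusterAllowed, nil
		}
		return z.Nodes, nil
	}
	return nil, fmt.Errorf("%w: %q", ErrZoneNotFound, failureDomain)
}

func main() {
	zones := []Zone{
		{Name: "rack-a", Nodes: []string{"pve1", "pve2"}},
		{Name: "rack-b", Nodes: []string{"pve3", "pve4"}},
	}
	all := []string{"pve1", "pve2", "pve3", "pve4"}

	nodes, err := resolveAllowedNodes("rack-b", zones, all)
	fmt.Println(nodes, err)

	_, err = resolveAllowedNodes("rack-z", zones, all)
	fmt.Println(errors.Is(err, ErrZoneNotFound))
}
```

Returning a value instead of writing back to the machine spec mirrors the proposal's choice to keep the effective zone and allowed nodes in memory on the scope.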

What this enables

  • KubeadmControlPlane round-robin distribution of CP replicas across zones
  • Per-zone worker MachineDeployments via spec.failureDomain
  • Per-zone IPAM pool selection (already implemented, now wired to the failure domain contract)
  • Worker-only zones via controlPlane: false
  • Full backwards compatibility — clusters without zoneConfig work exactly as before
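The distribution of control plane replicas can be pictured with a simplified model: each new replica goes to the eligible failure domain that currently hosts the fewest machines. The sketch below is not the actual KubeadmControlPlane code. pickFewest, the zone names, and the tie-break by name are all assumptions chosen to keep the example deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// pickFewest returns the eligible failure domain that currently hosts the
// fewest machines. Ties are broken by name (an assumption of this sketch).
// It returns "" when no failure domain is eligible.
func pickFewest(eligible []string, machinesPerDomain map[string]int) string {
	if len(eligible) == 0 {
		return ""
	}
	names := append([]string(nil), eligible...) // copy so the caller's slice is not reordered
	sort.Strings(names)
	best := names[0]
	for _, n := range names[1:] {
		if machinesPerDomain[n] < machinesPerDomain[best] {
			best = n
		}
	}
	return best
}

func main() {
	eligible := []string{"zone-2", "zone-1"} // hypothetical zones, both control plane eligible
	counts := map[string]int{}
	for i := 0; i < 3; i++ {
		fd := pickFewest(eligible, counts)
		counts[fd]++
		fmt.Println(fd)
	}
}
```

With two eligible zones and three replicas, this model places two replicas in one zone and one in the other.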

Relationship to #370

This is the single-cluster foundation for the multi-DC effort. The multi-DC work in #370 (targeted for v1alpha3) would extend this with per-zone credentials and API endpoints via new CRDs, similar to CAPV's VSphereDeploymentZone / VSphereFailureDomain pattern. This implementation establishes the contract surface that the multi-DC work builds on.

Implementation

We have a working implementation with unit tests and documentation, validated on a real Proxmox cluster (10 nodes, 2 zones) with 3 CP + 2 workers. KCP correctly distributes CP replicas round-robin across failure domains, workers are pinned to their target zone, and IPAM pools are selected per-zone.

Branch: https://github.com/ElmecOSS/cluster-api-provider-proxmox/tree/feat/failure-domains

Single commit on top of v0.8.0. Happy to open a PR against upstream if there's interest.

Environment

  • CAPMOX: v0.8.0 (v1alpha2)
  • CAPI: v1.11.7+ (v1beta2)
  • Proxmox VE: 9.1.4
