Description
Request early propagation of controlPlaneEndpoint to break Talos bootstrap deadlock
Proposed Enhancement
CAPC should propagate CloudStackCluster.spec.controlPlaneEndpoint.host to Cluster.status.controlPlaneEndpoint.host immediately after public IP allocation, without waiting for VMs to reach the "Running" state.
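A minimal sketch of the requested behavior, assuming simplified stand-in types. `CloudStackCluster`, `Cluster`, and `propagateEndpoint` below are hypothetical and carry only the two fields named in this issue; they are not the real CAPC or Cluster API types. The point is that the copy depends only on the public IP being recorded, not on any VM state.

```go
package main

import "fmt"

// CloudStackCluster is a hypothetical stand-in holding only
// CloudStackCluster.spec.controlPlaneEndpoint.host.
type CloudStackCluster struct {
	SpecEndpointHost string
}

// Cluster is a hypothetical stand-in holding only
// Cluster.status.controlPlaneEndpoint.host as named in this issue.
type Cluster struct {
	StatusEndpointHost string
}

// propagateEndpoint copies the endpoint host as soon as a public IP has been
// recorded on the infrastructure object. It deliberately takes no VM state
// as input, so it can run before any instance is "Running".
func propagateEndpoint(infra CloudStackCluster, cluster *Cluster) {
	if infra.SpecEndpointHost == "" {
		return // public IP not allocated yet; nothing to propagate
	}
	cluster.StatusEndpointHost = infra.SpecEndpointHost
}

func main() {
	infra := CloudStackCluster{SpecEndpointHost: "1.2.3.4"}
	cluster := &Cluster{}
	propagateEndpoint(infra, cluster)
	fmt.Println(cluster.StatusEndpointHost) // prints 1.2.3.4
}
```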
Why is this needed?
The current behavior creates a chicken-and-egg deadlock with the Talos bootstrap provider:
1. CAPC allocates public IP → CloudStackCluster.spec.controlPlaneEndpoint.host = "1.2.3.4" ✅
2. TalosConfigTemplate generates → certSANs: [] (reads Cluster.status → empty) ❌
3. VMs boot with broken configs → stuck "Starting" in CloudStack ❌
4. CAPC waits for "Running" → never populates Cluster.status → deadlock
Current logs show this exact scenario:
"CloudStackCluster.spec.controlPlaneEndpoint.host": "1.2.3.4" ✅
"Instance not ready, is Starting" ❌
Cluster.status.controlPlaneEndpoint.host: "" ❌
certSANs: [] ❌
How does current CAPC work?
Public IP allocated → VMs "Running" → THEN Cluster.status updated
How should it work?
Public IP allocated → Cluster.status updated → Talos certSANs populated → VMs boot → LB rules
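The two orderings can be compared with a toy model. This is a sketch under stated assumptions, not CAPC code: the `world` struct and `simulate` function are hypothetical, the bootstrap config is assumed to be rendered once from whatever the Cluster status holds at that moment, and a VM is assumed to reach "Running" only when certSANs is non-empty, as this issue describes.

```go
package main

import "fmt"

// world is a hypothetical, flattened view of the objects in this issue.
type world struct {
	specHost    string   // CloudStackCluster.spec.controlPlaneEndpoint.host
	statusHost  string   // Cluster.status.controlPlaneEndpoint.host
	configBuilt bool     // bootstrap config already rendered (assumed: once)
	certSANs    []string // SANs baked into the rendered config
	vmRunning   bool     // VM left "Starting" and reached "Running"
}

// simulate runs a fixed number of reconcile rounds. With propagateEarly set,
// the status is published right after IP allocation; otherwise it is
// published only after the VM is Running.
func simulate(propagateEarly bool, rounds int) world {
	var w world
	for i := 0; i < rounds; i++ {
		// Step 1: the public IP is allocated and recorded on the spec.
		if w.specHost == "" {
			w.specHost = "1.2.3.4"
		}
		// Proposed ordering: publish the endpoint right away.
		if propagateEarly && w.statusHost == "" {
			w.statusHost = w.specHost
		}
		// Step 2: the config is rendered from the current status value.
		if !w.configBuilt {
			if w.statusHost != "" {
				w.certSANs = []string{w.statusHost}
			}
			w.configBuilt = true
		}
		// Step 3: the VM only becomes Running with a usable config.
		if len(w.certSANs) > 0 {
			w.vmRunning = true
		}
		// Current ordering: publish the endpoint only once the VM runs.
		if !propagateEarly && w.vmRunning {
			w.statusHost = w.specHost
		}
	}
	return w
}

func main() {
	for _, early := range []bool{false, true} {
		w := simulate(early, 10)
		fmt.Printf("early=%v running=%v statusHost=%q certSANs=%v\n",
			early, w.vmRunning, w.statusHost, w.certSANs)
	}
}
```

In this model the late ordering never leaves its initial state no matter how many rounds run, while the early ordering converges in the first round.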
Precedents (Other Providers)
✅ CAPZ: Updates after LB IP allocation (before nodes ready)
✅ CAPM3: Updates after IPAM (before bare metal provisioned)
✅ CAPV: Updates after load balancer creation (before VMs ready)
❌ CAPC: Waits for VMs "Running" (breaks Talos bootstrap)
Impact
- Talos provider: Fully automatic bootstrap (no manual certSANs patches)
- Production templates: Dynamic IP allocation works end-to-end
- Existing clusters: No breaking changes (idempotent status update)
Risk Assessment
✅ CAPI InfraCluster contract compliant
✅ Idempotent - safe retry
✅ Matches other provider timing patterns
✅ No impact on kubeadm/other bootstrap providers
✅ Testable with existing e2e framework
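The idempotency claim can be illustrated with a compare-before-write guard. This is a hedged sketch: `fakeCluster`, `setStatusHost`, and `reconcileEndpoint` are hypothetical names, and the write counter stands in for an API patch call. It shows that repeated reconciles produce a single write and that a cluster whose status is already populated is left untouched.

```go
package main

import "fmt"

// fakeCluster is a hypothetical stand-in that counts status writes so that
// idempotency can be observed.
type fakeCluster struct {
	statusHost string
	writes     int
}

// setStatusHost models one status patch against the API server.
func (c *fakeCluster) setStatusHost(host string) {
	c.statusHost = host
	c.writes++
}

// reconcileEndpoint writes the host only when it is known and differs from
// what the cluster already reports, so retries are no-ops.
func reconcileEndpoint(specHost string, c *fakeCluster) {
	if specHost == "" || c.statusHost == specHost {
		return
	}
	c.setStatusHost(specHost)
}

func main() {
	// New cluster: three reconcile passes, one write.
	fresh := &fakeCluster{}
	for i := 0; i < 3; i++ {
		reconcileEndpoint("1.2.3.4", fresh)
	}

	// Existing cluster: status already populated, zero writes.
	existing := &fakeCluster{statusHost: "1.2.3.4"}
	reconcileEndpoint("1.2.3.4", existing)

	fmt.Println(fresh.writes, existing.writes) // prints 1 0
}
```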