Cluster API Provider Azure (CAPZ) is a Kubernetes-native declarative infrastructure provider for managing Azure clusters. It implements the Cluster API (CAPI) specification for both self-managed (IaaS) and AKS-managed Kubernetes clusters on Azure.
Key Concepts:
- CAPI: Cluster API - the upstream Kubernetes project defining cluster lifecycle APIs
- CAPZ: Cluster API Provider Azure - this repository
- ASO: Azure Service Operator - used for declarative Azure resource management
- Reconciler: Controller-runtime pattern for managing Kubernetes custom resources
- API Definitions (/api and /exp/api)
  - v1beta1: Stable API version for core resources
  - v1alpha1: Deprecated, use v1beta1
  - /exp: Experimental features (MachinePools, Managed clusters)
  - Key resources: AzureCluster, AzureMachine, AzureMachinePool, AzureManagedCluster, AzureManagedControlPlane
- Controllers (/controllers and /exp/controllers)
  - Each controller reconciles a specific custom resource type
  - Controllers use Azure SDK and ASO to manage Azure infrastructure
  - Reconciliation pattern: observe state → determine actions → apply changes → update status
  - Key controllers: AzureClusterReconciler, AzureMachineReconciler, AzureMachinePoolReconciler
- Azure Services Layer (/azure/services)
  - Service-specific Azure API clients organized by Azure resource type
  - Examples: virtualnetworks, subnets, loadbalancers, virtualmachines, vmss
  - Each service implements the Reconciler interface: Reconcile(), Delete()
  - Services handle Azure API calls, credential management, and error handling
- Scope Package (/azure/scope)
  - Provides context and configuration for controllers
  - Scopes encapsulate cluster/machine specs, credentials, and Azure clients
  - Key scopes: ClusterScope, MachineScope, MachinePoolScope, ManagedControlPlaneScope
- Feature Gates (/feature)
  - Controls experimental/optional functionality
  - Important gates: MachinePool, ASOAPI, EdgeZone, ClusterResourceSet
User creates K8s resource → Controller watches → Reconciler triggered →
Scope created → Azure service methods called → Azure API interactions →
Status updated → Requeue if needed
- Self-Managed (IaaS): CAPZ creates VMs, networks, load balancers via Azure APIs
- Managed (AKS): CAPZ creates/manages AKS clusters and agent pools via ASO
# Build the manager binary
make manager
# Run unit tests
make test
# Run unit tests and generate a coverage report
make test-cover
# Run linting
make lint
# Fix lint issues automatically
make lint-fix
# Generate code (deepcopy, CRDs, webhooks, mocks)
make generate
# Verify all generated files are up-to-date
make verify

# Run specific test by name
KUBEBUILDER_ASSETS="$(make setup-envtest 2>&1 | grep -o '/.*')" go test -v -run TestFunctionName ./path/to/package
# Run all tests in a package
KUBEBUILDER_ASSETS="$(make setup-envtest 2>&1 | grep -o '/.*')" go test -v ./controllers/

# Build controller image (defaults to dev tag)
make docker-build
# Build and push to registry
REGISTRY=myregistry.io make docker-build docker-push
# Build all architectures
make docker-build-all

# Create Kind cluster and start Tilt
make kind-create tilt-up
# Use AKS as management cluster instead
make aks-create tilt-up
# Delete Kind cluster
make kind-reset

tilt-settings.yaml is required with Azure credentials (see docs/book/src/developers/development.md for details).
# Run E2E tests (requires Azure credentials in env)
./scripts/ci-e2e.sh
# Run conformance tests
./scripts/ci-conformance.sh

Controllers follow this pattern:
- Fetch the resource being reconciled
- Check deletion timestamp; run finalizer logic if deleting
- Create/update scope with credentials and config
- Call service Reconcile() or Delete() methods
- Update resource status
- Return result with requeue if needed
- Transient errors: Return ctrl.Result{RequeueAfter: timeout} to retry
- Permanent errors: Log error, update status conditions, don't requeue
- Long-running operations: Use Azure async patterns with futures
- Define API in /api/v1beta1 or /exp/api/v1beta1
- Run make generate to create deepcopy methods
- Create controller in /controllers
- Create service in /azure/services/<resourcetype>
- Update scope if needed in /azure/scope
- Add webhooks in /api/v1beta1 for validation/defaulting
- Generate CRDs with make generate-manifests
- Add tests
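As a rough illustration of the first and sixth steps, the sketch below defines a hypothetical `AzureWidget` type with kubebuilder-style marker comments and plain `Default`/`Validate` methods. The type, its fields, and the `eastus` default are invented for this example. A real CAPZ type would embed the Kubernetes `TypeMeta` and `ObjectMeta` and wire defaulting and validation through webhooks; a plain `Name` field and ordinary methods are used here so the sketch compiles with the standard library only.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// AzureWidgetSpec is a hypothetical spec used only for illustration.
type AzureWidgetSpec struct {
	// ResourceGroup is the Azure resource group that holds the widget.
	// +kubebuilder:validation:MinLength=1
	ResourceGroup string `json:"resourceGroup"`

	// Location is filled in by defaulting when omitted.
	// +optional
	Location string `json:"location,omitempty"`
}

// AzureWidgetStatus reports the observed state.
type AzureWidgetStatus struct {
	// +optional
	Ready bool `json:"ready"`
}

// AzureWidget is the hypothetical resource.
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type AzureWidget struct {
	Name   string            `json:"name"`
	Spec   AzureWidgetSpec   `json:"spec"`
	Status AzureWidgetStatus `json:"status"`
}

// Default fills optional fields; a defaulting webhook would play this role.
func (w *AzureWidget) Default() {
	if w.Spec.Location == "" {
		w.Spec.Location = "eastus" // hypothetical default
	}
}

// Validate rejects invalid specs; a validating webhook would play this role.
func (w *AzureWidget) Validate() error {
	if w.Spec.ResourceGroup == "" {
		return errors.New("spec.resourceGroup must not be empty")
	}
	return nil
}

func main() {
	w := AzureWidget{Name: "demo", Spec: AzureWidgetSpec{ResourceGroup: "demo-rg"}}
	w.Default()
	if err := w.Validate(); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	out, err := json.MarshalIndent(w, "", "  ")
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println(string(out))
}
```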
CAPZ supports multiple authentication methods:
- Service Principal (client ID/secret)
- Managed Identity (system or user-assigned)
- Workload Identity (recommended for AKS)
Credentials are cached in azure.CredentialCache and scoped to cluster identity.
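The idea of caching credentials per cluster identity can be sketched as a mutex-guarded map. This is not the `azure.CredentialCache` implementation: the `credentialCache` type, the `cacheKey` fields, and the `build` callback are assumptions chosen to keep the example small.

```go
package main

import (
	"fmt"
	"sync"
)

// Credential is a placeholder for an Azure token credential.
type Credential struct{ ClientID string }

// cacheKey identifies a cluster identity. The field choice is an assumption.
type cacheKey struct {
	TenantID string
	ClientID string
}

// credentialCache builds each credential once and reuses it afterwards.
type credentialCache struct {
	mu    sync.Mutex
	creds map[cacheKey]*Credential
	build func(cacheKey) (*Credential, error)
}

func newCredentialCache(build func(cacheKey) (*Credential, error)) *credentialCache {
	return &credentialCache{creds: make(map[cacheKey]*Credential), build: build}
}

// Get returns the cached credential for key, building it on first use.
// Failed builds are not cached, so a later call can retry.
func (c *credentialCache) Get(key cacheKey) (*Credential, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cred, ok := c.creds[key]; ok {
		return cred, nil
	}
	cred, err := c.build(key)
	if err != nil {
		return nil, err
	}
	c.creds[key] = cred
	return cred, nil
}

func main() {
	builds := 0
	cache := newCredentialCache(func(k cacheKey) (*Credential, error) {
		builds++ // a real builder would contact the identity provider here
		return &Credential{ClientID: k.ClientID}, nil
	})
	key := cacheKey{TenantID: "tenant-a", ClientID: "client-1"}
	for i := 0; i < 3; i++ {
		if _, err := cache.Get(key); err != nil {
			fmt.Println("error:", err)
		}
	}
	fmt.Println("credential builds:", builds)
}
```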
Several code generators are used:
make generate-go # controller-gen for deepcopy, conversion-gen
make generate-manifests # CRDs, RBAC, webhooks
make generate-flavors # Cluster template flavors
make generate-addons # CNI and other addons

Always run after changing:
- API types (add fields, change validation)
- RBAC annotations
- Webhook logic
- Flavor templates
- Unit Tests: Test individual functions/methods with mocks
- Integration Tests: Test controller logic with fake Kubernetes client
- E2E Tests: Deploy real clusters in Azure, verify functionality
- Conformance Tests: Run upstream Kubernetes conformance suite
GoMock is used for Azure client mocks:
make generate-go # Regenerates mocks in azure/services/*/mock_*/

- main.go: Entry point, registers controllers and webhooks
- Makefile: All build/test/dev targets
- Tiltfile: Local development with Tilt
- go.mod: Go dependencies (uses Go 1.24+)
- config/: Kustomize configurations for CRDs, RBAC, webhooks, manager
- templates/: Cluster template flavors for different scenarios
- test/e2e/: E2E test suites and data files
The project uses golangci-lint with strict settings. If you get lint errors:
- Run make lint-fix first
- Check .golangci.yml for enabled linters
- Some issues require manual fixes (cognitive complexity, etc.)
- envtest issues: Ensure KUBEBUILDER_ASSETS is set correctly
- Race detector failures: May indicate real concurrency bugs
- Flaky E2E tests: Azure API throttling or transient infrastructure issues
If make verify fails:
- Run make generate
- Commit the generated files
- Do not manually edit generated files
Features under /exp require feature gates to be enabled:
- Set in environment: export EXP_MACHINE_POOL=true
- Documented in /feature/feature.go
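A feature gate is essentially a named boolean checked at runtime. The sketch below parses a comma-separated `Name=bool` string into a gate map, assuming the exported variable is eventually translated into a setting of that shape. The `parseGates` helper, the string format, and the all-false defaults are assumptions for illustration, not the contents of /feature/feature.go.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGates overlays a "Name=bool,Name=bool" spec on the known defaults.
// Unknown gate names are rejected so typos do not pass silently.
func parseGates(spec string, known map[string]bool) (map[string]bool, error) {
	gates := make(map[string]bool, len(known))
	for name, def := range known {
		gates[name] = def
	}
	if strings.TrimSpace(spec) == "" {
		return gates, nil
	}
	for _, pair := range strings.Split(spec, ",") {
		name, value, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if !ok {
			return nil, fmt.Errorf("missing '=' in %q", pair)
		}
		if _, exists := known[name]; !exists {
			return nil, fmt.Errorf("unknown feature gate %q", name)
		}
		enabled, err := strconv.ParseBool(value)
		if err != nil {
			return nil, fmt.Errorf("invalid value for %s: %w", name, err)
		}
		gates[name] = enabled
	}
	return gates, nil
}

func main() {
	// Hypothetical defaults; the real defaults are defined by the project.
	known := map[string]bool{"MachinePool": false, "EdgeZone": false}

	gates, err := parseGates("MachinePool=true", known)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if gates["MachinePool"] {
		fmt.Println("MachinePool controllers would be registered")
	}
	if !gates["EdgeZone"] {
		fmt.Println("EdgeZone stays disabled")
	}
}
```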
For managed clusters, CAPZ uses ASO to create Azure resources declaratively:
- ASO CRDs are vendored in config/aso/
- Controllers create ASO resources which ASO reconciles to Azure
- CAPZ watches ASO resources and updates CAPI status based on ASO status
- Request coalescing: /pkg/coalescing deduplicates rapid reconcile requests
- Credential caching: Avoids repeated Azure auth calls
- Concurrency: Configured via flags like --azurecluster-concurrency
- Timeouts: Service reconcile timeout, Azure API call timeout (configurable)
Primary documentation is in /docs/book/src/ (mdBook format):
- Getting started guides
- Developer documentation
- Troubleshooting guides
- API reference
Build docs: make -C docs/book serve (requires mdBook)