This document explains how the multigres-operator manages PostgreSQL pool pods directly, replacing the original StatefulSet-based approach. It covers the "why", "how", code structure, current feature set, and known gaps.
- Overview
- Why Direct Pod Management
- Resource Topology
- Pod Lifecycle
- Drain State Machine
- Rolling Updates
- Spec-Hash Drift Detection
- PVC Management
- Pod Disruption Budgets
- Status Aggregation
- Feature Summary
- Code Organization
- Naming Conventions
- Interaction with Multigres
- Known Gaps and Future Work
The operator manages individual Pods and PersistentVolumeClaims for each pool replica instead of delegating to Kubernetes StatefulSets. The shard controller (pkg/resource-handler/controller/shard/) owns Pod, PVC, PDB, headless Service, and ConfigMap resources directly, reconciling them on every loop.
This architectural change was motivated by the fact that multigres has its own identity and discovery system (etcd topology) that is completely independent of Kubernetes StatefulSet identity. The operator previously used ParallelPodManagement and did not leverage ordered deployment or scaling — the two main features that StatefulSets provide over Deployments.
Scheduling Note: With parallel pod creation, all pods for a pool are submitted in the same reconcile pass. For backup volumes using `WaitForFirstConsumer` + RWO storage (e.g., EBS gp2), if multiple replicas are created simultaneously, the scheduler may place them on different nodes, causing `Multi-Attach` errors on the shared backup PVC. See PVC Management § Shared Backup PVC for details and recommended solutions (S3 or RWX storage).
| Problem | Impact |
|---|---|
| No targeted decommissioning | StatefulSets can only scale from ordinal N-1 downward. Deleting a specific replica (e.g., one whose data is corrupt or whose zone is draining) required external hacks. |
| No independent zone/cell management | Each cell already needed its own StatefulSet. The operator gained no simplification from StatefulSet identity guarantees. |
| PVC lifecycle is opaque | The only mechanism was PersistentVolumeClaimRetentionPolicy, which is limited and version-dependent. |
| Rename/replace ambiguity | StatefulSet ordinal identity conflicted with multigres's own etcd-based identity. Replacing a pod meant fighting the controller. |
| GitOps drift | StatefulSets report replicas: N in status, causing constant "configured" drift when using kubectl apply. |
| Blocked rolling updates | StatefulSet rolling update strategy (OnDelete or RollingUpdate) doesn't coordinate with multigres replication — it can't remove a standby from synchronous_standby_names before killing the pod. |
| Benefit | Detail |
|---|---|
| Targeted decommissioning | Any specific pod can be drained and removed without affecting others. |
| Simpler PVC lifecycle | PVC creation/deletion is handled directly in the reconcile loop. Previously delegated to StatefulSet's PersistentVolumeClaimRetentionPolicy, which had version and behavior quirks. |
| Drain state machine | Scale-down coordinates drain and etcd cleanup within the shard controller to safely remove standbys from the sync standby list. |
| Rolling updates with primary awareness | Updates replicas first, primary last, with switchover coordination. |
| Parallel creation | New pods for a pool are all created in a single reconcile pass. This enables faster bootstrap by allowing multiple replicas to restore from backup simultaneously. Rolling updates and scale-down remain strictly one-at-a-time. |
| Pod Disruption Budgets | Per-pool PDBs with maxUnavailable: 1 protect against voluntary evictions. |
| Better observability | Status aggregation directly from pod readiness conditions; no intermediate StatefulSet .status layer. |
- Automatic pod recreation: The reconcile loop detects missing pods and recreates them.
- Stable network identity: Pods still get DNS-resolvable names via the headless service (hostname + subdomain).
- Ordered startup: Was never needed — `ParallelPodManagement` was already in use. Parallel creation restores this behavior for initial bootstrap. However, it introduces a scheduling trade-off for shared RWO volumes — see PVC Management § Shared Backup PVC.
- PVC auto-creation: Operator creates PVCs for each pod index, mirroring `volumeClaimTemplates` behavior.
[MultigresCluster] 🚀 (Root CR - user-editable)
│
├── 📍 Defines [TemplateDefaults] (Cluster-wide default templates)
│
├── 🌍 [GlobalTopoServer] (Child CR) ← 📄 Uses [CoreTemplate] OR inline [spec]
│
├── 🤖 MultiAdmin Resources ← 📄 Uses [CoreTemplate] OR inline [spec]
│
├── 💠 [Cell] (Child CR) ← 📄 Uses [CellTemplate] OR inline [spec]
│ │
│ ├── 🚪 MultiGateway Resources (Deployment + Service)
│ └── 📡 [LocalTopoServer] (Child CR, optional)
│
└── 🗃️ [TableGroup] (Child CR)
│
└── 📦 [Shard] (Child CR) ← 📄 Uses [ShardTemplate] OR inline [spec]
│
├── 🧠 MultiOrch (Deployment + Service, per-cell)
└── 🏊 Pools (per pool × per cell):
├── Pod-0 ← operator-managed
├── Pod-1 ← operator-managed
├── PVC-0 ← operator-managed (per-pod data)
├── PVC-1 ← operator-managed (per-pod data)
├── Backup PVC (shared across pods in cell)
├── Headless Service (DNS resolution)
└── PodDisruptionBudget (maxUnavailable: 1)
Key points:
- No StatefulSets in the resource tree for pool pods.
- Headless Service is still required for DNS resolution. Multigres's `FullyQualifiedHostname()` uses DNS reverse lookup, so pods need resolvable FQDNs.
- Shared Backup PVC is per-shard-per-cell for filesystem backups; replaced with EmptyDir for S3. Requires RWX storage or S3 for multi-replica cells — see PVC Management § Shared Backup PVC.
- PDB is per-pool-per-cell to limit voluntary disruption.
The `createMissingResources` function handles pod creation:

- For each desired index `0..replicasPerCell-1`, check if a PVC and Pod exist.
- Create missing PVCs first, then Pods that reference them.
- Parallel creation: All missing PVCs and pods are created in a single reconcile pass. This enables faster bootstrap by allowing standbys to restore from pgBackRest (S3/repo) in parallel without I/O impact on the primary. Multiorch dynamically manages synchronous replication membership via `FixReplicationAction`, so parallel creation is safe from a quorum perspective.
- Terminal pods (`Failed` or `Succeeded`) are cleaned up sequentially — one per reconcile — to avoid cascading deletions.
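The creation pass above can be sketched as follows. The function name matches the text, but the map-based bookkeeping is illustrative, not the operator's actual Kubernetes client code:

```go
package main

import "fmt"

// createMissingResources sketches the single-pass creation loop: for every
// desired index, a missing PVC is created before its Pod, and no index waits
// for an earlier replica to become Ready (parallel bootstrap).
func createMissingResources(replicasPerCell int, existingPVCs, existingPods map[int]bool) (created []string) {
	for i := 0; i < replicasPerCell; i++ {
		if !existingPVCs[i] {
			// PVC first, so the Pod spec can reference it by name.
			created = append(created, fmt.Sprintf("pvc-%d", i))
		}
		if !existingPods[i] {
			created = append(created, fmt.Sprintf("pod-%d", i))
		}
	}
	return created
}

func main() {
	// Pod 0 and its PVC already exist; indices 1 and 2 are created in one pass.
	got := createMissingResources(3, map[int]bool{0: true}, map[int]bool{0: true})
	fmt.Println(got) // [pvc-1 pod-1 pvc-2 pod-2]
}
```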
Scale-down uses the drain state machine. Extra pods beyond the desired count are identified, and the highest-index non-primary pod is selected for draining. Before initiating a new drain, the operator verifies that all non-draining, non-terminating pods in the pool are Ready. If the pool is already degraded, the scale-down is deferred and a ScaleDownBlocked warning event is emitted to prevent cascading failures.
When a pod is deleted externally (e.g., kubectl delete pod), it enters the drain state machine to ensure etcd cleanup before the pod is fully removed.
PVC lifecycle is managed through conditional owner references based on the `PVCDeletionPolicy`:

- When `WhenDeleted` is `Delete`: PVCs are created with an ownerRef pointing to the Shard CR, enabling Kubernetes garbage collection to cascade-delete them when the Shard is removed.
- When `WhenDeleted` is `Retain`: PVCs are created without ownerRefs, ensuring they persist after Shard deletion.
- The shard controller's `reconcilePVCOwnerRefs` function ensures existing PVCs stay in sync with the current policy — adding or removing ownerRefs as the policy changes mid-lifecycle.
- During scale-down, `cleanupDrainedPod` checks `WhenScaled` and deletes data PVCs directly if the policy is `Delete` (the default). For DRAINED pods (identified by the `multigres.com/role=DRAINED` label), PVCs are always deleted regardless of the `WhenScaled` policy because DRAINED pod data is known-bad.
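A minimal sketch of the two policy decisions described above, using plain strings in place of the operator's real policy type (helper names are illustrative):

```go
package main

import "fmt"

// wantOwnerRef reports whether a data PVC should carry an ownerRef to the
// Shard CR, given the WhenDeleted policy ("Delete" or "Retain").
// An ownerRef means Kubernetes GC cascade-deletes the PVC with the Shard.
func wantOwnerRef(whenDeleted string) bool {
	return whenDeleted == "Delete"
}

// shouldDeleteDataPVC mirrors the scale-down rule: DRAINED pods always lose
// their PVC (the data is known-bad); otherwise WhenScaled decides.
func shouldDeleteDataPVC(whenScaled string, drained bool) bool {
	return drained || whenScaled == "Delete"
}

func main() {
	fmt.Println(wantOwnerRef("Retain"))              // false: PVC persists after Shard deletion
	fmt.Println(shouldDeleteDataPVC("Retain", true)) // true: DRAINED data is discarded regardless
}
```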
The drain state machine coordinates pod removal within the shard controller. The controller handles the full drain lifecycle: selecting which pod to drain, initiating the drain (removing from `synchronous_standby_names`, unregistering from etcd), and cleaning up the pod and PVC afterward.
┌──────────────────────────────────┐
│ Scale-down detected │
└──────────────┬───────────────────┘
│
┌──────────────▼───────────────────┐
│ Pod selected for drain │
│ (non-ready > non-primary > │
│ highest index) │
└──────────────┬───────────────────┘
│
┌──────────────▼───────────────────┐
│ state=requested │
│ + drain-requested-at timestamp │
│ (set by resource-handler) │
│ │
│ ⚠ Cancellable: if the desired │
│ state reverts (e.g., scale-down │
│ reversed), the drain is cleared │
└──────────────┬───────────────────┘
│
┌──────────────▼───────────────────┐
│ IsPrimaryNotReady guard │
│ If the primary pod's containers │
│ are not ready, delay and │
│ requeue (do not send RPCs) │
└──────────────┬───────────────────┘
│
┌──────────────▼───────────────────┐
│ state=draining │
│ Remove pod from synchronous │
│ standby list on the primary │
│ (with ReloadConfig: true) │
└──────────────┬───────────────────┘
│
┌──────────────▼───────────────────┐
│ IsPrimaryNotReady guard │
│ (same check as above) │
└──────────────┬───────────────────┘
│
┌──────────────▼───────────────────┐
│ state=acknowledged │
│ Verify standby removal is │
│ reflected in pg_stat_replication │
└──────────────┬───────────────────┘
│
┌──────────────▼───────────────────┐
│ state=ready-for-deletion │
│ Unregister from etcd topology │
└──────────────┬───────────────────┘
│
┌──────────────▼───────────────────┐
│ Shard controller cleans up: │
│ 1. Delete data PVC (if policy) │
│ 2. Pod is garbage collected │
└──────────────────────────────────┘
When choosing which pod to remove during scale-down:

- Strongly disfavor the primary (score penalty −1000). The pod role is read from `shard.Status.PodRoles`, which the shard controller populates by reading etcd topology. The primary is never the preferred target, but it is not hard-excluded: if the primary is the only extra pod remaining (e.g. after all replicas have already been removed), it can still be selected to avoid deadlocking the scale-down. Rolling updates handle the primary separately via `handleRollingUpdates` (replicas first, primary last with switchover).
- Prefer non-ready pods. Pods that are already failing are better candidates.
- Among ready non-primary pods, select the highest index. This makes the order deterministic and predictable.
- If no suitable pod is found, defer. Requeue and try again.
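The rules above can be expressed as a simple scoring function. The `podInfo` struct and the exact non-ready bonus are hypothetical; only the −1000 primary penalty is taken from the text:

```go
package main

import "fmt"

type podInfo struct {
	Name    string
	Index   int
	Primary bool
	Ready   bool
}

// selectPodToDrain sketches the scoring: primaries are strongly penalized
// (-1000) but not hard-excluded, non-ready pods are preferred, and higher
// indices win among ready non-primaries.
func selectPodToDrain(pods []podInfo) string {
	best, bestScore := "", -1<<30
	for _, p := range pods {
		score := p.Index
		if p.Primary {
			score -= 1000
		}
		if !p.Ready {
			score += 100 // already-failing pods are better candidates
		}
		if score > bestScore {
			best, bestScore = p.Name, score
		}
	}
	return best
}

func main() {
	pods := []podInfo{
		{"pod-0", 0, true, true},
		{"pod-1", 1, false, true},
		{"pod-2", 2, false, true},
	}
	fmt.Println(selectPodToDrain(pods)) // pod-2: highest-index ready non-primary
}
```

If the primary is the only remaining candidate, its penalized score is still the best one, so it is selected rather than deadlocking the scale-down.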
- At most one pod per pool can be in the drain state at any time.
- Scale-down and rolling-update operations do not run concurrently — if a drain is in progress, rolling updates are deferred.
- If the topology store is temporarily unreachable, the drain annotation sits untouched and the shard controller retries on the next reconcile.
- Health gate: Scale-down drains are deferred when any non-draining pod is not Ready, preventing removal of pods from an already degraded pool.
- Primary readiness guard: The `IsPrimaryNotReady` check prevents sending `UpdateSynchronousStandbyList` RPCs to a primary whose containers are not ready. This avoids failed gRPC calls that could leave the drain in an inconsistent state.
- Config reload on standby removal: `UpdateSynchronousStandbyList` requests include `ReloadConfig: true`, ensuring PostgreSQL reloads `synchronous_standby_names` immediately after the standby is removed.
- Stale drain cancellation: If the desired state reverts while a drain is in the `requested` state (e.g., user reverses a scale-down), the drain annotations are cleared and the pod is kept. Once a drain enters the `draining` state, it cannot be cancelled because the RPC has already been sent.
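The legal transitions and the cancellation rule can be captured in a small table. The state names match the diagram above; the Go representation itself is a sketch:

```go
package main

import "fmt"

// next encodes the only legal forward path through the drain states:
// requested -> draining -> acknowledged -> ready-for-deletion.
var next = map[string]string{
	"requested":    "draining",
	"draining":     "acknowledged",
	"acknowledged": "ready-for-deletion",
}

// canCancel: a drain is cancellable only before any RPC has been sent,
// i.e. only in the "requested" state.
func canCancel(state string) bool { return state == "requested" }

func main() {
	state := "requested"
	for state != "ready-for-deletion" {
		fmt.Printf("%s (cancellable=%v)\n", state, canCancel(state))
		state = next[state]
	}
	fmt.Println(state)
}
```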
The drain state machine is not used during shard or cluster deletion. When a Shard is deleted, the shard controller's `handleDeletion` runs: it unregisters the database from the global topology (`DeleteDatabase`) and deletes pods directly. PVCs are garbage-collected via owner references (when `WhenDeleted` is `Delete`) or retained (when `Retain`).
This is safe for full cluster deletion because the topo server is also being destroyed. For individual shard deletion within a live cluster, residual pooler entries may remain in topo — see Known Gaps.
The drain state machine remains fully functional for rolling updates and scale-down, where the cluster is alive and the shard controller is actively reconciling.
When a database or cell is removed from the MultigresCluster spec, the resulting orphan resource is not deleted immediately. Instead, the MultigresCluster controller follows a 3-step PendingDeletion flow that routes through the drain state machine to prevent data loss:
1. Annotate: The orphan resource receives a `multigres.com/pending-deletion` annotation.
2. Wait for `ReadyForDeletion`: The child controller detects the annotation and takes appropriate cleanup action:
   - TableGroup: Propagates `PendingDeletion` to all child Shards and waits for all to reach `ReadyForDeletion` (i.e., all pods drained).
   - Cell: Sets `ReadyForDeletion` immediately since gateways are stateless.
3. Delete: Once the `ReadyForDeletion` condition is true, the MultigresCluster controller deletes the resource.
During deletion, both Cell and Shard controllers use a `TopologyRegistered` / `DatabaseRegistered` status condition to determine whether topology cleanup is needed. This prevents two failure modes:
- Never initialized: If the condition was never set (e.g., the topology server was unreachable since creation), cleanup is skipped immediately — there's nothing to clean up.
- Transient failure: If the condition is `True` but the topology is temporarily unreachable, the controller retries for up to `topoCleanupTimeout` (currently 2 minutes). After the timeout, the controller force-skips cleanup with a `CleanupSkipped` warning event and proceeds with deletion.
Future consideration: The 2-minute timeout is conservative for the current deployment model. As real-world usage patterns emerge, this value may need tuning. It could also be made configurable via the MultigresCluster spec if operators need per-cluster control.
When a pod's spec has drifted from the desired state (detected via spec-hash mismatch), the reconciler performs a rolling update:
- Identify drifted pods — `podNeedsUpdate` compares each pod's `multigres.com/spec-hash` annotation against the hash of the currently desired spec.
- Skip if drain in progress — Rolling updates are deferred if any pod is currently being drained (scale-down takes precedence).
- Update replicas first — Non-primary drifted pods are selected first.
- Primary last — When only the primary remains, a controlled switchover is needed before draining and recreating. (Note: the switchover is coordinated via the shard controller using the same drain annotation mechanism.)
- One at a time — Only one pod is drained per reconcile cycle. The reconciler returns early after initiating a drain, waiting for the pod to reach `ready-for-deletion` before proceeding to the next.
- RollingUpdate status condition — A `RollingUpdate` condition is set on the Shard to track progress (e.g., `"2/5 pods updated"`).
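The ordering rules above reduce to picking the next drifted pod per reconcile. This sketch, with a hypothetical `drifted` struct, shows the replicas-first, primary-last selection:

```go
package main

import "fmt"

type drifted struct {
	Name    string
	Primary bool
}

// nextToUpdate returns the next drifted pod to recreate and whether a
// switchover is required first. Replicas are always chosen before the
// primary; the primary is picked only once it is the sole drifted pod left.
func nextToUpdate(pods []drifted) (name string, switchover bool) {
	var primary string
	for _, p := range pods {
		if !p.Primary {
			return p.Name, false // replica first, no switchover needed
		}
		primary = p.Name
	}
	if primary != "" {
		return primary, true // primary last, switchover required
	}
	return "", false // nothing drifted
}

func main() {
	name, sw := nextToUpdate([]drifted{{"pod-0", true}, {"pod-2", false}})
	fmt.Println(name, sw) // pod-2 false
	name, sw = nextToUpdate([]drifted{{"pod-0", true}})
	fmt.Println(name, sw) // pod-0 true
}
```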
Since most pod spec fields are immutable after creation, the operator uses a hash-based approach to detect drift:
- At pod creation, `ComputeSpecHash` produces an FNV-1a hex string over all operator-managed fields:
  - Container images, commands, args, env vars, resources, volume mounts, security contexts
  - Pod affinity, node selector, volume definitions, termination grace period
  - The `multigres.com/postgres-config-hash` annotation value (SHA-256 of the referenced ConfigMap key data), when a `postgresConfigRef` is set. This ensures ConfigMap content changes trigger rolling updates even though the pod spec itself is unchanged.
- The hash is stored as annotation `multigres.com/spec-hash` on the pod.
- On each reconcile, `podNeedsUpdate` builds the desired pod and computes its hash. If it doesn't match the existing pod's annotation, the pod is flagged for update.
This avoids false positives from admission controllers (Istio, Linkerd, Vault) that inject sidecars or environment variables — only fields the operator explicitly sets are included in the hash.
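A minimal illustration of the approach, assuming a drastically reduced field set (the real `ComputeSpecHash` covers many more fields):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// computeSpecHash hashes only operator-managed inputs with FNV-1a, so
// sidecar or env-var injection by admission controllers never changes the
// result. The parameters here are illustrative stand-ins for the full set.
func computeSpecHash(image string, args []string, configHash string) string {
	h := fnv.New64a()
	fmt.Fprint(h, image)
	for _, a := range args {
		fmt.Fprint(h, a)
	}
	fmt.Fprint(h, configHash) // ConfigMap content changes also roll the hash
	return fmt.Sprintf("%x", h.Sum64())
}

// podNeedsUpdate compares the stored annotation against the desired hash.
func podNeedsUpdate(existingAnnotation, desiredHash string) bool {
	return existingAnnotation != desiredHash
}

func main() {
	old := computeSpecHash("postgres:16", []string{"--port=5432"}, "cfg-v1")
	cur := computeSpecHash("postgres:16", []string{"--port=5432"}, "cfg-v2")
	fmt.Println(podNeedsUpdate(old, cur)) // true: ConfigMap drift triggers a rolling update
}
```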
Each pool pod gets its own data PVC named `data-{base-name}-{index}`. These PVCs:
- Are created before the pod (if missing).
- Persist across pod deletions — a restarted pod reattaches to its existing PVC and finds its PGDATA intact.
- Are deleted during scale-down only if `PVCDeletionPolicy.WhenScaled` is `Delete`.
- During cluster/shard deletion, PVCs are garbage-collected by Kubernetes via conditional owner references when `PVCDeletionPolicy.WhenDeleted` is `Delete`. When the policy is `Retain`, PVCs have no ownerRef and persist after deletion.
One shared PVC per shard per cell for filesystem-based backups, mounted at `/backups` on all pods. For S3 backups, this is replaced with an `EmptyDir` volume.
⚠️ Multi-Attach Limitation with RWO Storage: When using standard block storage (EBS gp2/gp3) with `ReadWriteOnce` access mode and `WaitForFirstConsumer` binding, the sequential pod creation model causes `Multi-Attach` errors for multi-replica cells:

1. Pod-0 is created, scheduled to Node A. The CSI driver provisions the EBS volume in Node A's AZ and attaches it.
2. Pod-0 becomes Ready. The operator creates Pod-1 on the next reconcile.
3. Pod-1 hits the scheduler. The PV's `nodeAffinity` constrains it to the same AZ, but the scheduler does not detect the active RWO attachment on Node A. If a second node exists in the AZ, the scheduler may place Pod-1 there.
4. The `attachdetach-controller` attempts to attach the volume to Node B and fails with `Multi-Attach error`.

Why this didn't happen with StatefulSets (v0.2.6): The old architecture used `ParallelPodManagement`, which submitted all pods to the scheduler simultaneously. With `WaitForFirstConsumer`, the PVC was still unbound when all pods entered the scheduling queue. The first pod to be processed annotated the PVC with `volume.kubernetes.io/selected-node`, constraining subsequent pods to the same AZ. With one node per AZ (common in dev/test EKS clusters), all pods landed on the same node deterministically. This prevented Multi-Attach errors but caused silent HA loss — all replicas on a single node. The current parallel creation approach has the same scheduling characteristic.

Solutions:

- S3 (Recommended for production): No shared PVC needed — the operator uses `EmptyDir` for scratch space and all pods connect to S3 independently.
- RWX Storage: Use a StorageClass that supports `ReadWriteMany` (NFS, EFS, CephFS).
- Single replica per cell: With `replicasPerCell: 1`, only one pod mounts the PVC and the issue does not arise.
See the Backup Architecture document for the shared PVC design rationale.
The `PVCDeletionPolicy` type (`whenDeleted`, `whenScaled`) controls lifecycle. `WhenDeleted` defaults to `Retain` for data safety; `WhenScaled` defaults to `Delete` so pgBackRest is the source of truth for recovery. This is no longer tied to Kubernetes's `StatefulSetPersistentVolumeClaimRetentionPolicy` — the operator manages PVC deletion directly.
The operator supports in-place PVC volume expansion for data and backup PVCs. When a user increases `storage.size` on a pool, the operator patches the existing PVC's `spec.resources.requests.storage` to the new value. Kubernetes and the CSI driver handle the actual block device expansion.
Requirements:
- The `StorageClass` used by the PVC must have `allowVolumeExpansion: true`. Without this, the Kubernetes API server will reject the PVC patch and the operator will emit a `Warning` event.
- Volume expansion is grow-only. Decreasing `storage.size` is rejected at admission time — PVC shrinks are not supported by Kubernetes.
How it works:
- The operator detects that the desired `storage.size` exceeds the current PVC spec.
- The PVC spec is patched in-place (no pod restart needed for block device expansion).
- For CSI drivers that support online filesystem expansion (EBS CSI ≥ v1.5, GCE PD CSI), the filesystem grows live without pod restart.
- For CSI drivers that require pod restart for filesystem expansion, the operator detects the `FileSystemResizePending` PVC condition and initiates a drain on the affected pod via the existing drain state machine.
Note: The same mechanism applies to shared backup PVCs when using filesystem-based backups.
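The grow-only rule can be sketched as a pure decision function; integer Gi sizes stand in for the real `resource.Quantity` comparison, and the action strings are illustrative:

```go
package main

import "fmt"

// expansionAction decides what to do when desired storage differs from the
// current PVC request: only growth results in a patch, shrinks are rejected,
// and equal sizes are a no-op.
func expansionAction(currentGi, desiredGi int) string {
	switch {
	case desiredGi > currentGi:
		return "patch spec.resources.requests.storage" // CSI driver grows the volume
	case desiredGi < currentGi:
		return "reject: PVC shrink not supported by Kubernetes"
	default:
		return "no-op"
	}
}

func main() {
	fmt.Println(expansionAction(10, 20))
	fmt.Println(expansionAction(20, 10))
}
```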
For each pool-cell combination, the operator creates a PodDisruptionBudget with:
- `maxUnavailable: 1` — At most one pod can be voluntarily evicted at a time.
- Label selector matching pool pods in that cell.
This protects against node drains or Kubernetes upgrades taking down too many replicas simultaneously. The PDB is owned by the Shard CR for garbage collection.
The `updatePoolsStatus` function directly lists pods by label selector and aggregates:
| Status Field | Source |
|---|---|
| `shard.Status.PoolsReady` | `true` when readyPods == totalPods > 0 |
| `shard.Status.ReadyReplicas` | Count of pods with `PodReady=True` condition |
| `shard.Status.OrchReady` | MultiOrch deployment readiness |
| `shard.Status.Phase` | `Degraded` if any pod is crash-looping; `Healthy` when both pools and orch are ready; `Progressing` otherwise. See Phase Lifecycle. |
| `shard.Status.Cells` | Observed cells from pod labels |
| `shard.Status.PodRoles` | Pod → role mapping (populated by the shard controller from etcd) |
| `shard.Status.LastBackupTime` | Timestamp of last completed backup (from `GetBackups` RPC) |
| `shard.Status.LastBackupType` | Type of last backup (full/diff/incr) |
| `shard.Status.Conditions` | `Available`, `BackupHealthy`, `DatabaseRegistered`, `RollingUpdate` |
Terminating pods (with a non-zero DeletionTimestamp) and draining pods (with drain annotations) are excluded from ready counts.
Metrics are emitted per pool via `monitoring.SetShardPoolReplicas()`. A `PoolEmpty` warning event is emitted when a cell has zero ready replicas but the desired count is > 0. Backup age is emitted via `monitoring.SetLastBackupAge()`.
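The exclusion rules can be sketched as follows; `poolPod` is a stand-in for the operator's view of a pod, not a real API type:

```go
package main

import "fmt"

type poolPod struct {
	Ready       bool
	Terminating bool // non-zero DeletionTimestamp
	Draining    bool // drain annotations present
}

// aggregate sketches the ready-count aggregation: terminating and draining
// pods are excluded entirely, and PoolsReady requires every remaining pod
// to be Ready with at least one pod present.
func aggregate(pods []poolPod) (readyReplicas int, poolsReady bool) {
	total := 0
	for _, p := range pods {
		if p.Terminating || p.Draining {
			continue
		}
		total++
		if p.Ready {
			readyReplicas++
		}
	}
	return readyReplicas, total > 0 && readyReplicas == total
}

func main() {
	ready, ok := aggregate([]poolPod{{Ready: true}, {Ready: true}, {Draining: true}})
	fmt.Println(ready, ok) // 2 true: the draining pod is excluded from the count
}
```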
| Feature | Details |
|---|---|
| Direct pod creation and recreation | Pods created per-index, auto-recreated on failure |
| Parallel pod creation | All missing pods created in a single reconcile pass for fast bootstrap |
| Per-pod data PVCs | Explicitly created, named by index |
| Shared backup PVC | Per-shard-per-cell, skipped for S3 |
| Headless Service for DNS | Pod hostname + subdomain for FQDN resolution |
| PodDisruptionBudgets | maxUnavailable: 1 per pool per cell |
| Drain state machine | Annotation-based, coordinated within the shard controller |
| Pod selection for scale-down | Primary avoidance, prefer non-ready, highest index |
| Rolling updates | Spec-hash drift detection, replicas first, primary last |
| Owner-reference-based PVC cleanup | Conditional ownerRefs on PVCs enable Kubernetes GC cascade-delete per PVCDeletionPolicy |
| Shard deletion handling | Deletes pods and deployments directly (no drain); PVCs garbage-collected via ownerRefs per policy |
| Status aggregation from pods | Direct pod count, no StatefulSet intermediary |
| Zone/region scheduling | nodeSelector injection from CellTopologyLabels |
| Cell topology propagation | Labels carry cluster/db/tg/shard/pool/cell hierarchy |
| pgBackRest TLS certificates | Auto-generated or user-provided (cert-manager compatible) |
| S3 and filesystem backup support | Full backup configuration propagation |
| PVC deletion policy | Hierarchical merge, Retain/Delete per whenDeleted/whenScaled |
| Etcd topology cleanup | UnregisterMultiPooler called during drain flow; stale entries removed on pod termination |
| Topology registration & pruning | Cell and database registration centralized in MultigresCluster controller; stale entries pruned when topologyPruning.enabled (default) |
| Backup health reporting | Shard controller calls GetBackups RPC, sets BackupHealthy condition and LastBackupTime status |
| DRAINED pod handling | DRAINED pods (diverged data, pg_rewind failure) are kept alive for admin investigation. Stand-in replicas created at next index for availability. Admin discards via kubectl delete pod, triggering drain + PVC deletion |
| Scale-down health gate | Drains deferred when pool has non-ready pods to prevent cascading failures |
| Observability | Events, conditions, metrics, tracing spans |
These features are designed and documented in pod-management-design.md but require upstream multigres changes:
| Feature | Gap | Notes |
|---|---|---|
| WAL archiving failure detection | Gap 1 | Needs multipooler to expose pg_stat_archiver via etcd |
| Standby waiting-for-backup surfacing | Gap 4 | Needs multipooler to expose monitor_reason via etcd |
| Point-in-Time Recovery | Gap 9 | Separate product feature, not urgent |
| Feature | Reason |
|---|---|
| Scheduled base backups | Per design review meeting, kept peripheral to operator |
All pod management code lives in `pkg/resource-handler/controller/shard/`:
| File | Purpose |
|---|---|
| `shard_controller.go` | Main reconciler. Orchestrates pool reconciliation, sets up watches (including ConfigMap watches via `enqueueFromPostgresConfigMap` for `postgresConfigRef` changes), PVC owner-reference management. |
| `reconcile_pool_pods.go` | Core pod lifecycle: `reconcilePoolPods`, `createMissingResources`, `handleScaleDown`, `handleRollingUpdates`, `selectPodToDrain`, `cleanupDrainedPod`, `podNeedsUpdate`. |
| `reconcile_deletion.go` | Shard deletion: `handleDeletion`, topology unregistration, child resource cleanup. |
| `reconcile_multiorch.go` | MultiOrch deployment and service reconciliation. |
| `reconcile_shared_infra.go` | Shared infrastructure: pgBackRest TLS certs, pg_hba ConfigMap, postgres password Secret, shared backup PVC, PDB, headless service. |
| `reconcile_data_plane.go` | Data-plane reconciliation: pod role reporting, drain state machine execution, backup health evaluation. |
| `pool_pod.go` | Pod builder: `BuildPoolPod`, `BuildPoolPodName`, `ComputeSpecHash`. |
| `pool_pvc.go` | PVC builders: `BuildPoolDataPVC`, `BuildPoolDataPVCName`, `BuildSharedBackupPVC` (with conditional ownerRef support). |
| `pool_pdb.go` | PDB builder: `BuildPoolPodDisruptionBudget`. |
| `pool_service.go` | Headless service builder for DNS resolution. |
| `drain_helpers.go` | Drain utilities: `resolvePodRole` (reads PodRoles from shard status), `initiateDrain` (sets drain annotation). |
| `labels.go` | Label builder: `buildPoolLabelsWithCell` — creates the standard label set for pool resources. |
| `status.go` | Status aggregation: `updateStatus`, `updatePoolsStatus` (counts pods directly), `updateMultiOrchStatus`, `setConditions`. |
| `containers.go` | Container builders for pgctld and multipooler. Injects `POSTGRES_INITDB_ARGS` env var on pgctld when `shard.Spec.InitdbArgs` is set. Mounts the postgresql.conf ConfigMap and passes `--postgres-config-template` to pgctld. |
| `configmap.go` | ConfigMap builders: `BuildPgHbaConfigMap` (pg_hba template), `BuildPostgresConfigConfigMap` (postgresql.conf template with user overrides from `shard.Spec.PostgresConfig` appended). |
| `multiorch.go` | MultiOrch deployment and service builders. |
| `doc.go` | Package documentation. |
| File | Coverage |
|---|---|
| `pool_pod_test.go` | Pod builder, spec-hash computation, naming |
| `pool_pvc_test.go` | PVC builder, naming, storage class handling, ownerRef tests |
| `pool_service_test.go` | Headless service labels and selector |
| `shard_controller_test.go` | Unit tests for reconcile behavior (envtest-based) |
| `shard_controller_internal_test.go` | Internal function tests (`podNeedsUpdate`, etc.) |
| `reconcile_deletion_test.go` | Deletion handler envtest tests |
| `reconcile_deletion_internal_test.go` | Deletion handler internal tests |
| `integration_test.go` | Full integration tests: scale up/down, PDB creation, ConfigMap, services |
| `containers_test.go` | Container builder tests |
| `configmap_test.go` | pg_hba ConfigMap generation |
| `multiorch_test.go` | MultiOrch deployment/service builders |
| `secret_test.go` | Postgres password secret |
| `ports_test.go` | Port constant tests |
`{cluster}-{db}-{tg}-{shard}-pool-{pool}-{cell}-{hash}-{index}`

Generated by `BuildPoolPodName` using `PodConstraints` (60 chars max for the base name, then `-{index}` suffix appended). The hash is an 8-character FNV-1a hex suffix ensuring uniqueness even after truncation.

`data-{cluster}-{db}-{tg}-{shard}-pool-{pool}-{cell}-{hash}-{index}`

Same structure as pod names with a `data-` prefix.

`backup-data-{cluster}-{db}-{tg}-{shard}-{cell}-{hash}`

`{cluster}-{db}-{tg}-{shard}-pool-{pool}-{cell}-{hash}-pdb`

`{cluster}-{db}-{tg}-{shard}-pool-{pool}-{cell}-{hash}-headless`

Pods reference this via `spec.subdomain`, combined with `spec.hostname` (set to the pod name) to produce FQDNs like `pod-name.headless-svc.namespace.svc.cluster.local`.
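The truncation-plus-hash scheme can be sketched as follows. This is not the real `BuildPoolPodName`; the 32-bit FNV-1a choice here simply yields the 8 hex characters the text describes:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// buildPoolPodName joins the hierarchy segments, truncates them so the base
// name (segments + hash) fits in a 60-char budget, and appends an 8-char
// FNV-1a hex suffix of the full untruncated name so truncation never causes
// collisions. The -{index} suffix is appended last.
func buildPoolPodName(segments []string, index int) string {
	full := strings.Join(segments, "-")

	h := fnv.New32a()
	fmt.Fprint(h, full)
	suffix := fmt.Sprintf("%08x", h.Sum32()) // 8 hex chars

	const maxBase = 60 // budget for "{segments}-{hash}"
	if budget := maxBase - len(suffix) - 1; len(full) > budget {
		full = full[:budget]
	}
	return fmt.Sprintf("%s-%s-%d", full, suffix, index)
}

func main() {
	// Illustrative hierarchy values; real names come from the CR spec.
	fmt.Println(buildPoolPodName([]string{"demo", "postgres", "tg0", "s0", "pool-primary", "cell-a"}, 0))
}
```

With the base capped at 60 characters and a small index, the result stays within the 63-character Kubernetes object-name limit for DNS labels.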
The operator reads from etcd topology (via the shard controller) to:
- Determine pod roles (`PRIMARY`, `REPLICA`, `DRAINED`) for scale-down and rolling-update decisions.
- Clean up stale topology entries on permanent pod removal (`UnregisterMultiPooler`).
The operator writes to etcd topology (via the MultigresCluster controller) to:
- Register cells and databases during initial setup and on every reconcile.
- Prune stale cells and databases that no longer exist in the spec (when `topologyPruning.enabled`, the default).
The operator writes to etcd topology (via the shard controller) to:
- Unregister poolers during drain cleanup.
Through the shard controller:
- `UpdateSynchronousStandbyList` (REMOVE) — Removes a standby before pod deletion.
- `UnregisterMultiPooler` — Deletes the etcd entry for a permanently removed pod.
| Operation | Component | Notes |
|---|---|---|
| Primary election (bootstrap) | multiorch | BootstrapShardAction via consensus |
| Failover | multiorch | AppointLeaderAction on primary failure |
| Replication setup | multiorch | Detects new poolers, configures streaming replication |
| Sync standby list (ADD) | multiorch | FixReplicationAction |
| WAL archiving | PostgreSQL | archive_command → pgBackRest |
| Backup execution | multiadmin | Selects a replica, calls Backup RPC |
| Replica restore | multipooler | MonitorPostgres detects empty PGDATA, restores from backup |
| Operation | How |
|---|---|
| Pod creation and recreation | Direct pod management via reconcile loop |
| Scale-up (new replicas) | Create PVC + Pod, multigres handles the rest |
| Scale-down | Drain state machine with standby removal + etcd unregistration |
| Rolling update | Spec-hash detection, ordered recreation |
| DRAINED pod handling | Detects DRAINED role from etcd, keeps pod alive for investigation, creates stand-in replica |
| Backup health reporting | Calls GetBackups RPC, sets BackupHealthy condition |
| PVC lifecycle | Direct creation/deletion per policy |
| Certificate provisioning | pkg/cert for pgBackRest TLS |
| Status reporting | Aggregate from pod conditions, etcd roles, and backup metadata |
| Gap | Issue | Description |
|---|---|---|
| WAL archiving failure detection | #654 | Multipooler should report pg_stat_archiver status via etcd |
| Standby stuck waiting for backup | #652 | Multipooler should expose monitor_reason in etcd |
- Scheduled base backups: Designed as a controller-timer approach (like CloudNativePG) but deferred per team decision.
- Individual shard deletion drain cleanup: When a single shard is deleted within a live cluster (not full cluster teardown), pooler entries may be left in topo because `handleDeletion` skips the drain state machine and deletes pods directly. Addressing this would require running the drain flow during deletion to unregister individual poolers before pod removal. Not addressed now because the operator currently only supports single-shard, single-database clusters — shards are only ever deleted as part of full cluster teardown where the topo server is also destroyed.
These constraints are enforced via CEL validation and prevent unsupported operations:
| Constraint | Enforcement |
|---|---|
Single database (postgres) |
CEL on databases array |
Single shard (0-inf) |
CEL on shard name |
| Pools are append-only | CEL prevents pool removal or rename |
| Cells are append-only | CEL prevents cell removal (on MultigresCluster and Shard) |
| Zone/region immutability | CEL on Cell spec fields |