Pod Management Architecture

This document explains how the multigres-operator manages PostgreSQL pool pods directly, replacing the original StatefulSet-based approach. It covers the "why", "how", code structure, current feature set, and known gaps.

Overview
Why Direct Pod Management
Resource Topology
Pod Lifecycle
Drain State Machine
Rolling Updates
Spec-Hash Drift Detection
PVC Management
Pod Disruption Budgets
Status Aggregation
Feature Summary
Code Organization
Naming Conventions
Interaction with Multigres
Known Gaps and Future Work

1. Overview

The operator manages individual Pods and PersistentVolumeClaims for each pool replica instead of delegating to Kubernetes StatefulSets. The shard controller (pkg/resource-handler/controller/shard/) owns Pod, PVC, PDB, headless Service, and ConfigMap resources directly, reconciling them on every loop.

This architectural change was motivated by the fact that multigres has its own identity and discovery system (etcd topology) that is completely independent of Kubernetes StatefulSet identity. The operator previously used ParallelPodManagement and did not leverage ordered deployment or scaling — the two main features that StatefulSets provide over Deployments.

Scheduling Note: With parallel pod creation, all pods for a pool are submitted in the same reconcile pass. For backup volumes using WaitForFirstConsumer + RWO storage (e.g., EBS gp2), if multiple replicas are created simultaneously, the scheduler may place them on different nodes, causing Multi-Attach errors on the shared backup PVC. See PVC Management § Shared Backup PVC for details and recommended solutions (S3 or RWX storage).

2. Why Direct Pod Management

Problems with StatefulSets

Problem	Impact
No targeted decommissioning	StatefulSets can only scale from ordinal N-1 downward. Deleting a specific replica (e.g., one whose data is corrupt or whose zone is draining) required external hacks.
No independent zone/cell management	Each cell already needed its own StatefulSet. The operator gained no simplification from StatefulSet identity guarantees.
PVC lifecycle is opaque	The only mechanism was `PersistentVolumeClaimRetentionPolicy`, which is limited and version-dependent.
Rename/replace ambiguity	StatefulSet ordinal identity conflicted with multigres's own etcd-based identity. Replacing a pod meant fighting the controller.
GitOps drift	StatefulSets report `replicas: N` in status, causing constant "configured" drift when using `kubectl apply`.
Blocked rolling updates	StatefulSet rolling update strategy (`OnDelete` or `RollingUpdate`) doesn't coordinate with multigres replication — it can't remove a standby from `synchronous_standby_names` before killing the pod.

What We Gained

Benefit	Detail
Targeted decommissioning	Any specific pod can be drained and removed without affecting others.
Simpler PVC lifecycle	PVC creation/deletion is handled directly in the reconcile loop. Previously delegated to StatefulSet's `PersistentVolumeClaimRetentionPolicy`, which had version and behavior quirks.
Drain state machine	Scale-down coordinates drain and etcd cleanup within the shard controller to safely remove standbys from the sync standby list.
Rolling updates with primary awareness	Updates replicas first, primary last, with switchover coordination.
Parallel creation	New pods for a pool are all created in a single reconcile pass. This enables faster bootstrap by allowing multiple replicas to restore from backup simultaneously. Rolling updates and scale-down remain strictly one-at-a-time.
Pod Disruption Budgets	Per-pool PDBs with `maxUnavailable: 1` protect against voluntary evictions.
Better observability	Status aggregation directly from pod readiness conditions; no intermediate StatefulSet `.status` layer.

What We Didn't Lose

Automatic pod recreation: The reconcile loop detects missing pods and recreates them.
Stable network identity: Pods still get DNS-resolvable names via the headless service (hostname + subdomain).
Ordered startup: Was never needed — ParallelPodManagement was already in use. Parallel creation restores this behavior for initial bootstrap. However, it introduces a scheduling trade-off for shared RWO volumes — see PVC Management § Shared Backup PVC.
PVC auto-creation: Operator creates PVCs for each pod index, mirroring volumeClaimTemplates behavior.

3. Resource Topology

[MultigresCluster] 🚀 (Root CR - user-editable)
      │
      ├── 📍 Defines [TemplateDefaults] (Cluster-wide default templates)
      │
      ├── 🌍 [GlobalTopoServer] (Child CR) ← 📄 Uses [CoreTemplate] OR inline [spec]
      │
      ├── 🤖 MultiAdmin Resources ← 📄 Uses [CoreTemplate] OR inline [spec]
      │
      ├── 💠 [Cell] (Child CR) ← 📄 Uses [CellTemplate] OR inline [spec]
      │    │
      │    ├── 🚪 MultiGateway Resources (Deployment + Service)
      │    └── 📡 [LocalTopoServer] (Child CR, optional)
      │
      └── 🗃️ [TableGroup] (Child CR)
           │
           └── 📦 [Shard] (Child CR) ← 📄 Uses [ShardTemplate] OR inline [spec]
                │
                ├── 🧠 MultiOrch (Deployment + Service, per-cell)
                └── 🏊 Pools (per pool × per cell):
                     ├── Pod-0  ← operator-managed
                     ├── Pod-1  ← operator-managed
                     ├── PVC-0  ← operator-managed (per-pod data)
                     ├── PVC-1  ← operator-managed (per-pod data)
                     ├── Backup PVC (shared across pods in cell)
                     ├── Headless Service (DNS resolution)
                     └── PodDisruptionBudget (maxUnavailable: 1)

Key points:

No StatefulSets in the resource tree for pool pods.
Headless Service is still required for DNS resolution. Multigres's FullyQualifiedHostname() uses DNS reverse lookup, so pods need resolvable FQDNs.
Shared Backup PVC is per-shard-per-cell for filesystem backups; replaced with EmptyDir for S3. Requires RWX storage or S3 for multi-replica cells — see PVC Management § Shared Backup PVC.
PDB is per-pool-per-cell to limit voluntary disruption.

4. Pod Lifecycle

Creation (Scale-Up)

The createMissingResources function handles pod creation:

For each desired index 0..replicasPerCell-1, check if a PVC and Pod exist.
Create missing PVCs first, then Pods that reference them.
Parallel creation: All missing PVCs and pods are created in a single reconcile pass. This enables faster bootstrap by allowing standbys to restore from pgBackRest (S3/repo) in parallel without I/O impact on the primary. Multiorch dynamically manages synchronous replication membership via FixReplicationAction, so parallel creation is safe from a quorum perspective.
Terminal pods (Failed or Succeeded) are cleaned up sequentially — one per reconcile — to avoid cascading deletions.

Deletion (Scale-Down)

Scale-down uses the drain state machine. Extra pods beyond the desired count are identified, and the highest-index non-primary pod is selected for draining. Before initiating a new drain, the operator verifies that all non-draining, non-terminating pods in the pool are Ready. If the pool is already degraded, the scale-down is deferred and a ScaleDownBlocked warning event is emitted to prevent cascading failures.

External Deletion

When a pod is deleted externally (e.g., kubectl delete pod), it enters the drain state machine to ensure etcd cleanup before the pod is fully removed.

PVC Lifecycle via Owner References

PVC lifecycle is managed through conditional owner references based on the PVCDeletionPolicy:

When WhenDeleted is Delete: PVCs are created with an ownerRef pointing to the Shard CR, enabling Kubernetes garbage collection to cascade-delete them when the Shard is removed.
When WhenDeleted is Retain: PVCs are created without ownerRefs, ensuring they persist after Shard deletion.
The shard controller's reconcilePVCOwnerRefs function ensures existing PVCs stay in sync with the current policy — adding or removing ownerRefs as the policy changes mid-lifecycle.
During scale-down, cleanupDrainedPod checks WhenScaled and deletes data PVCs directly if the policy is Delete (the default). For DRAINED pods (identified by the multigres.com/role=DRAINED label), PVCs are always deleted regardless of the WhenScaled policy because DRAINED pod data is known-bad.

5. Drain State Machine

The drain state machine coordinates pod removal within the shard controller. The controller handles the full drain lifecycle: selecting which pod to drain, initiating the drain (removing from synchronous_standby_names, unregistering from etcd), and cleaning up the pod and PVC afterward.

State Flow

                    ┌──────────────────────────────────┐
                    │       Scale-down detected        │
                    └──────────────┬───────────────────┘
                                   │
                    ┌──────────────▼───────────────────┐
                    │     Pod selected for drain       │
                    │  (non-ready > non-primary >      │
                    │   highest index)                 │
                    └──────────────┬───────────────────┘
                                   │
                    ┌──────────────▼───────────────────┐
                    │  state=requested                 │
                    │  + drain-requested-at timestamp  │
                    │  (set by resource-handler)       │
                    │                                  │
                    │  ⚠ Cancellable: if the desired   │
                    │  state reverts (e.g., scale-down │
                    │  reversed), the drain is cleared │
                    └──────────────┬───────────────────┘
                                   │
                    ┌──────────────▼───────────────────┐
                    │  IsPrimaryNotReady guard         │
                    │  If the primary pod's containers │
                    │  are not ready, delay and        │
                    │  requeue (do not send RPCs)      │
                    └──────────────┬───────────────────┘
                                   │
                    ┌──────────────▼───────────────────┐
                    │  state=draining                  │
                    │  Remove pod from synchronous     │
                    │  standby list on the primary     │
                    │  (with ReloadConfig: true)       │
                    └──────────────┬───────────────────┘
                                   │
                    ┌──────────────▼───────────────────┐
                    │  IsPrimaryNotReady guard         │
                    │  (same check as above)           │
                    └──────────────┬───────────────────┘
                                   │
                    ┌──────────────▼───────────────────┐
                    │  state=acknowledged              │
                    │  Verify standby removal is       │
                    │  reflected in pg_stat_replication │
                    └──────────────┬───────────────────┘
                                   │
                    ┌──────────────▼───────────────────┐
                    │  state=ready-for-deletion        │
                    │  Unregister from etcd topology   │
                    └──────────────┬───────────────────┘
                                   │
                    ┌──────────────▼───────────────────┐
                    │  Shard controller cleans up:     │
                    │  1. Delete data PVC (if policy)  │
                    │  2. Pod is garbage collected     │
                    └──────────────────────────────────┘

Pod Selection Algorithm

When choosing which pod to remove during scale-down:

Strongly disfavor the primary (score penalty −1000). The pod role is read from shard.Status.PodRoles, which the shard controller populates by reading etcd topology. The primary is never the preferred target, but it is not hard-excluded: if the primary is the only extra pod remaining (e.g. after all replicas have already been removed), it can still be selected to avoid deadlocking the scale-down. Rolling updates handle the primary separately via handleRollingUpdates (replicas first, primary last with switchover).
Prefer non-ready pods. Pods that are already failing are better candidates.
Among ready non-primary pods, select the highest index. This makes order deterministic and predictable.
If no suitable pod is found, defer. Requeue and try again.

Safety Guarantees

At most one pod per pool can be in the drain state at any time.
Scale-down and rolling-update operations do not run concurrently — if a drain is in progress, rolling updates are deferred.
If the topology store is temporarily unreachable, the drain annotation sits untouched and the shard controller retries on the next reconcile.
Health gate: Scale-down drains are deferred when any non-draining pod is not Ready, preventing removal of pods from an already degraded pool.
Primary readiness guard: The IsPrimaryNotReady check prevents sending UpdateSynchronousStandbyList RPCs to a primary whose containers are not ready. This avoids failed gRPC calls that could leave the drain in an inconsistent state.
Config reload on standby removal: UpdateSynchronousStandbyList requests include ReloadConfig: true, ensuring PostgreSQL reloads synchronous_standby_names immediately after the standby is removed.
Stale drain cancellation: If the desired state reverts while a drain is in the requested state (e.g., user reverses a scale-down), the drain annotations are cleared and the pod is kept. Once a drain enters the draining state, it cannot be cancelled because the RPC has already been sent.

Shard/Cluster Deletion (No Drain)

The drain state machine is not used during shard or cluster deletion. When a Shard is deleted, the shard controller's handleDeletion runs: it unregisters the database from the global topology (DeleteDatabase) and deletes pods directly. PVCs are garbage-collected via owner references (when WhenDeleted is Delete) or retained (when Retain).

This is safe for full cluster deletion because the topo server is also being destroyed. For individual shard deletion within a live cluster, residual pooler entries may remain in topo — see Known Gaps.

The drain state machine remains fully functional for rolling updates and scale-down, where the cluster is alive and the shard controller is actively reconciling.

Graceful Orphan Deletion (TableGroups and Cells)

When a database or cell is removed from the MultigresCluster spec, the resulting orphan resource is not deleted immediately. Instead, the MultigresCluster controller follows a 3-step PendingDeletion flow that routes through the drain state machine to prevent data loss:

Annotate: The orphan resource receives a multigres.com/pending-deletion annotation.
Wait for ReadyForDeletion: The child controller detects the annotation and takes appropriate cleanup action:
- TableGroup: Propagates PendingDeletion to all child Shards and waits for all to reach ReadyForDeletion (i.e., all pods drained).
- Cell: Sets ReadyForDeletion immediately since gateways are stateless.
Delete: Once the ReadyForDeletion condition is true, the MultigresCluster controller deletes the resource.

Topology Cleanup Timeout

During deletion, both Cell and Shard controllers use a TopologyRegistered / DatabaseRegistered status condition to determine whether topology cleanup is needed. This prevents two failure modes:

Never initialized: If the condition was never set (e.g., the topology server was unreachable since creation), cleanup is skipped immediately — there's nothing to clean up.
Transient failure: If the condition is True but the topology is temporarily unreachable, the controller retries for up to topoCleanupTimeout (currently 2 minutes). After the timeout, the controller force-skips cleanup with a CleanupSkipped warning event and proceeds with deletion.

Future consideration: The 2-minute timeout is conservative for the current deployment model. As real-world usage patterns emerge, this value may need tuning. It could also be made configurable via the MultigresCluster spec if operators need per-cluster control.

6. Rolling Updates

When a pod's spec has drifted from the desired state (detected via spec-hash mismatch), the reconciler performs a rolling update:

Identify drifted pods — podNeedsUpdate compares each pod's multigres.com/spec-hash annotation against the hash of the currently desired spec.
Skip if drain in progress — Rolling updates are deferred if any pod is currently being drained (scale-down takes precedence).
Update replicas first — Non-primary drifted pods are selected first.
Primary last — When only the primary remains, a controlled switchover is needed before draining and recreating. (Note: the switchover is coordinated via the shard controller using the same drain annotation mechanism.)
One at a time — Only one pod is drained per reconcile cycle. The reconciler returns early after initiating a drain, waiting for the pod to reach ready-for-deletion before proceeding to the next.
RollingUpdate status condition — A RollingUpdate condition is set on the Shard to track progress (e.g., "2/5 pods updated").

7. Spec-Hash Drift Detection

Since most pod spec fields are immutable after creation, the operator uses a hash-based approach to detect drift:

At pod creation, ComputeSpecHash produces an FNV-1a hex string over all operator-managed fields:
- Container images, commands, args, env vars, resources, volume mounts, security contexts
- Pod affinity, node selector, volume definitions, termination grace period
- The multigres.com/postgres-config-hash annotation value (SHA-256 of the referenced ConfigMap key data), when a postgresConfigRef is set. This ensures ConfigMap content changes trigger rolling updates even though the pod spec itself is unchanged.
The hash is stored as annotation multigres.com/spec-hash on the pod.
On each reconcile, podNeedsUpdate builds the desired pod and computes its hash. If it doesn't match the existing pod's annotation, the pod is flagged for update.

This avoids false positives from admission controllers (Istio, Linkerd, Vault) that inject sidecars or environment variables — only fields the operator explicitly sets are included in the hash.

8. PVC Management

Data PVCs

Each pool pod gets its own data PVC named data-{base-name}-{index}. These PVCs:

Are created before the pod (if missing).
Persist across pod deletions — a restarted pod reattaches to its existing PVC and finds its PGDATA intact.
Are deleted during scale-down only if PVCDeletionPolicy.WhenScaled is Delete.
During cluster/shard deletion, PVCs are garbage-collected by Kubernetes via conditional owner references when PVCDeletionPolicy.WhenDeleted is Delete. When the policy is Retain, PVCs have no ownerRef and persist after deletion.

Shared Backup PVC

One shared PVC per shard per cell for filesystem-based backups, mounted at /backups on all pods. For S3 backups, this is replaced with an EmptyDir volume.

⚠️ Multi-Attach Limitation with RWO Storage:

When using standard block storage (EBS gp2/gp3) with ReadWriteOnce access mode and WaitForFirstConsumer binding, the sequential pod creation model causes Multi-Attach errors for multi-replica cells:

Pod-0 is created, scheduled to Node A. The CSI driver provisions the EBS volume in Node A's AZ and attaches it.

Pod-0 becomes Ready. The operator creates Pod-1 on the next reconcile.

Pod-1 hits the scheduler. The PV's nodeAffinity constrains it to the same AZ, but the scheduler does not detect the active RWO attachment on Node A. If a second node exists in the AZ, the scheduler may place Pod-1 there.

The attachdetach-controller attempts to attach the volume to Node B and fails with Multi-Attach error.

Why this didn't happen with StatefulSets (v0.2.6): The old architecture used ParallelPodManagement, which submitted all pods to the scheduler simultaneously. With WaitForFirstConsumer, the PVC was still unbound when all pods entered the scheduling queue. The first pod to be processed annotated the PVC with volume.kubernetes.io/selected-node, constraining subsequent pods to the same AZ. With one node per AZ (common in dev/test EKS clusters), all pods landed on the same node deterministically. This prevented Multi-Attach errors but caused silent HA loss — all replicas on a single node. The current parallel creation approach has the same scheduling characteristic.

Solutions:

S3 (Recommended for production): No shared PVC needed — the operator uses EmptyDir for scratch space and all pods connect to S3 independently.

RWX Storage: Use a StorageClass that supports ReadWriteMany (NFS, EFS, CephFS).

Single replica per cell: With replicasPerCell: 1, only one pod mounts the PVC and the issue does not arise.

See the Backup Architecture document for the shared PVC design rationale.

PVC Deletion Policy

The PVCDeletionPolicy type (whenDeleted, whenScaled) controls lifecycle. WhenDeleted defaults to Retain for data safety; WhenScaled defaults to Delete so pgbackrest is the source of truth for recovery. This is no longer tied to Kubernetes's StatefulSetPersistentVolumeClaimRetentionPolicy — the operator manages PVC deletion directly.

PVC Volume Expansion

The operator supports in-place PVC volume expansion for data and backup PVCs. When a user increases storage.size on a pool, the operator patches the existing PVC's spec.resources.requests.storage to the new value. Kubernetes and the CSI driver handle the actual block device expansion.

Requirements:

The StorageClass used by the PVC must have allowVolumeExpansion: true. Without this, the Kubernetes API server will reject the PVC patch and the operator will emit a Warning event.
Volume expansion is grow-only. Decreasing storage.size is rejected at admission time — PVC shrinks are not supported by Kubernetes.

How it works:

The operator detects that the desired storage.size exceeds the current PVC spec.
The PVC spec is patched in-place (no pod restart needed for block device expansion).
For CSI drivers that support online filesystem expansion (EBS CSI ≥ v1.5, GCE PD CSI), the filesystem grows live without pod restart.
For CSI drivers that require pod restart for filesystem expansion, the operator detects the FileSystemResizePending PVC condition and initiates a drain on the affected pod via the existing drain state machine.

Note: The same mechanism applies to shared backup PVCs when using filesystem-based backups.

9. Pod Disruption Budgets

For each pool-cell combination, the operator creates a PodDisruptionBudget with:

maxUnavailable: 1 — At most one pod can be voluntarily evicted at a time.
Label selector matching pool pods in that cell.

This protects against node drains or Kubernetes upgrades taking down too many replicas simultaneously. The PDB is owned by the Shard CR for garbage collection.

10. Status Aggregation

The updatePoolsStatus function directly lists pods by label selector and aggregates:

Status Field	Source
`shard.Status.PoolsReady`	`true` when `readyPods == totalPods > 0`
`shard.Status.ReadyReplicas`	Count of pods with `PodReady=True` condition
`shard.Status.OrchReady`	MultiOrch deployment readiness
`shard.Status.Phase`	`Degraded` if any pod is crash-looping; `Healthy` when both pools and orch are ready; `Progressing` otherwise. See Phase Lifecycle.
`shard.Status.Cells`	Observed cells from pod labels
`shard.Status.PodRoles`	Pod → role mapping (populated by the shard controller from etcd)
`shard.Status.LastBackupTime`	Timestamp of last completed backup (from `GetBackups` RPC)
`shard.Status.LastBackupType`	Type of last backup (full/diff/incr)
`shard.Status.Conditions`	`Available`, `BackupHealthy`, `DatabaseRegistered`, `RollingUpdate`

Terminating pods (with a non-zero DeletionTimestamp) and draining pods (with drain annotations) are excluded from ready counts.

Metrics are emitted per pool via monitoring.SetShardPoolReplicas(). A PoolEmpty warning event is emitted when a cell has zero ready replicas but the desired count is > 0. Backup age is emitted via monitoring.SetLastBackupAge().

11. Feature Summary

Implemented

Feature	Details
Direct pod creation and recreation	Pods created per-index, auto-recreated on failure
Parallel pod creation	All missing pods created in a single reconcile pass for fast bootstrap
Per-pod data PVCs	Explicitly created, named by index
Shared backup PVC	Per-shard-per-cell, skipped for S3
Headless Service for DNS	Pod hostname + subdomain for FQDN resolution
PodDisruptionBudgets	`maxUnavailable: 1` per pool per cell
Drain state machine	Annotation-based, coordinated within the shard controller
Pod selection for scale-down	Primary avoidance, prefer non-ready, highest index
Rolling updates	Spec-hash drift detection, replicas first, primary last
Owner-reference-based PVC cleanup	Conditional ownerRefs on PVCs enable Kubernetes GC cascade-delete per `PVCDeletionPolicy`
Shard deletion handling	Deletes pods and deployments directly (no drain); PVCs garbage-collected via ownerRefs per policy
Status aggregation from pods	Direct pod count, no StatefulSet intermediary
Zone/region scheduling	`nodeSelector` injection from `CellTopologyLabels`
Cell topology propagation	Labels carry cluster/db/tg/shard/pool/cell hierarchy
pgBackRest TLS certificates	Auto-generated or user-provided (cert-manager compatible)
S3 and filesystem backup support	Full backup configuration propagation
PVC deletion policy	Hierarchical merge, Retain/Delete per whenDeleted/whenScaled
Etcd topology cleanup	`UnregisterMultiPooler` called during drain flow; stale entries removed on pod termination
Topology registration & pruning	Cell and database registration centralized in MultigresCluster controller; stale entries pruned when `topologyPruning.enabled` (default)
Backup health reporting	Shard controller calls `GetBackups` RPC, sets `BackupHealthy` condition and `LastBackupTime` status
DRAINED pod handling	DRAINED pods (diverged data, pg_rewind failure) are kept alive for admin investigation. Stand-in replicas created at next index for availability. Admin discards via `kubectl delete pod`, triggering drain + PVC deletion
Scale-down health gate	Drains deferred when pool has non-ready pods to prevent cascading failures
Observability	Events, conditions, metrics, tracing spans

Not Yet Implemented (Blocked on Upstream Multigres)

These features are designed and documented in pod-management-design.md but require upstream multigres changes:

Feature	Gap	Notes
WAL archiving failure detection	Gap 1	Needs multipooler to expose `pg_stat_archiver` via etcd
Standby waiting-for-backup surfacing	Gap 4	Needs multipooler to expose `monitor_reason` via etcd
Point-in-Time Recovery	Gap 9	Separate product feature, not urgent

Not Implemented (Design Decision)

Feature	Reason
Scheduled base backups	Per design review meeting, kept peripheral to operator

12. Code Organization

All pod management code lives in pkg/resource-handler/controller/shard/:

File	Purpose
`shard_controller.go`	Main reconciler. Orchestrates pool reconciliation, sets up watches (including ConfigMap watches via `enqueueFromPostgresConfigMap` for `postgresConfigRef` changes), PVC owner-reference management.
`reconcile_pool_pods.go`	Core pod lifecycle: `reconcilePoolPods`, `createMissingResources`, `handleScaleDown`, `handleRollingUpdates`, `selectPodToDrain`, `cleanupDrainedPod`, `podNeedsUpdate`.
`reconcile_deletion.go`	Shard deletion: `handleDeletion`, topology unregistration, child resource cleanup.
`reconcile_multiorch.go`	MultiOrch deployment and service reconciliation.
`reconcile_shared_infra.go`	Shared infrastructure: pgBackRest TLS certs, pg_hba ConfigMap, postgres password Secret, shared backup PVC, PDB, headless service.
`reconcile_data_plane.go`	Data-plane reconciliation: pod role reporting, drain state machine execution, backup health evaluation.
`pool_pod.go`	Pod builder: `BuildPoolPod`, `BuildPoolPodName`, `ComputeSpecHash`.
`pool_pvc.go`	PVC builders: `BuildPoolDataPVC`, `BuildPoolDataPVCName`, `BuildSharedBackupPVC` (with conditional ownerRef support).
`pool_pdb.go`	PDB builder: `BuildPoolPodDisruptionBudget`.
`pool_service.go`	Headless service builder for DNS resolution.
`drain_helpers.go`	Drain utilities: `resolvePodRole` (reads `PodRoles` from shard status), `initiateDrain` (sets drain annotation).
`labels.go`	Label builder: `buildPoolLabelsWithCell` — creates the standard label set for pool resources.
`status.go`	Status aggregation: `updateStatus`, `updatePoolsStatus` (counts pods directly), `updateMultiOrchStatus`, `setConditions`.
`containers.go`	Container builders for pgctld and multipooler. Injects `POSTGRES_INITDB_ARGS` env var on pgctld when `shard.Spec.InitdbArgs` is set. Mounts the postgresql.conf ConfigMap and passes `--postgres-config-template` to pgctld.
`configmap.go`	ConfigMap builders: `BuildPgHbaConfigMap` (pg_hba template), `BuildPostgresConfigConfigMap` (postgresql.conf template with user overrides from `shard.Spec.PostgresConfig` appended).
`multiorch.go`	MultiOrch deployment and service builders.
`doc.go`	Package documentation.

Test Files

File	Coverage
`pool_pod_test.go`	Pod builder, spec-hash computation, naming
`pool_pvc_test.go`	PVC builder, naming, storage class handling, ownerRef tests
`pool_service_test.go`	Headless service labels and selector
`shard_controller_test.go`	Unit tests for reconcile behavior (envtest-based)
`shard_controller_internal_test.go`	Internal function tests (podNeedsUpdate, etc.)
`reconcile_deletion_test.go`	Deletion handler envtest tests
`reconcile_deletion_internal_test.go`	Deletion handler internal tests
`integration_test.go`	Full integration tests: scale up/down, PDB creation, ConfigMap, services
`containers_test.go`	Container builder tests
`configmap_test.go`	pg_hba ConfigMap generation
`multiorch_test.go`	MultiOrch deployment/service builders
`secret_test.go`	Postgres password secret
`ports_test.go`	Port constant tests

13. Naming Conventions

Pod Names

{cluster}-{db}-{tg}-{shard}-pool-{pool}-{cell}-{hash}-{index}

Generated by BuildPoolPodName using PodConstraints (60 chars max for the base name, then -{index} suffix appended). The hash is an 8-character FNV-1a hex suffix ensuring uniqueness even after truncation.

Data PVC Names

data-{cluster}-{db}-{tg}-{shard}-pool-{pool}-{cell}-{hash}-{index}

Same structure as pod names with a data- prefix.

Shared Backup PVC Names

backup-data-{cluster}-{db}-{tg}-{shard}-{cell}-{hash}

PDB Names

{cluster}-{db}-{tg}-{shard}-pool-{pool}-{cell}-{hash}-pdb

Headless Service Names

{cluster}-{db}-{tg}-{shard}-pool-{pool}-{cell}-{hash}-headless

Pods reference this via spec.subdomain, combined with spec.hostname (set to the pod name) to produce FQDNs like pod-name.headless-svc.namespace.svc.cluster.local.

14. Interaction with Multigres

Etcd Topology

The operator reads from etcd topology (via the shard controller) to:

Determine pod roles (PRIMARY, REPLICA, DRAINED) for scale-down and rolling-update decisions.
Clean up stale topology entries on permanent pod removal (UnregisterMultiPooler).

The operator writes to etcd topology (via the MultigresCluster controller) to:

Register cells and databases during initial setup and on every reconcile.
Prune stale cells and databases that no longer exist in the spec (when topologyPruning.enabled, the default).

The operator writes to etcd topology (via the shard controller) to:

Unregister poolers during drain cleanup.

gRPC Operations

Through the shard controller:

UpdateSynchronousStandbyList(REMOVE) — Removes a standby before pod deletion.
UnregisterMultiPooler — Deletes the etcd entry for a permanently removed pod.

What Multigres Handles Autonomously

Operation	Component	Notes
Primary election (bootstrap)	multiorch	`BootstrapShardAction` via consensus
Failover	multiorch	`AppointLeaderAction` on primary failure
Replication setup	multiorch	Detects new poolers, configures streaming replication
Sync standby list (ADD)	multiorch	`FixReplicationAction`
WAL archiving	PostgreSQL	`archive_command` → pgBackRest
Backup execution	multiadmin	Selects a replica, calls `Backup` RPC
Replica restore	multipooler	`MonitorPostgres` detects empty PGDATA, restores from backup

What the Operator Handles

Operation	How
Pod creation and recreation	Direct pod management via reconcile loop
Scale-up (new replicas)	Create PVC + Pod, multigres handles the rest
Scale-down	Drain state machine with standby removal + etcd unregistration
Rolling update	Spec-hash detection, ordered recreation
DRAINED pod handling	Detects DRAINED role from etcd, keeps pod alive for investigation, creates stand-in replica
Backup health reporting	Calls `GetBackups` RPC, sets `BackupHealthy` condition
PVC lifecycle	Direct creation/deletion per policy
Certificate provisioning	`pkg/cert` for pgBackRest TLS
Status reporting	Aggregate from pod conditions, etcd roles, and backup metadata

15. Known Gaps and Future Work

Upstream Dependencies (GitHub Issues Filed)

Gap	Issue	Description
WAL archiving failure detection	#654	Multipooler should report `pg_stat_archiver` status via etcd
Standby stuck waiting for backup	#652	Multipooler should expose `monitor_reason` in etcd

Operator-Side Future Work

Scheduled base backups: Designed as a controller-timer approach (like CloudNativePG) but deferred per team decision.
Individual shard deletion drain cleanup: When a single shard is deleted within a live cluster (not full cluster teardown), pooler entries may be left in topo because handleDeletion skips the drain state machine and deletes pods directly. Addressing this would require running the drain flow during deletion to unregister individual poolers before pod removal. Not addressed now because the operator currently only supports single-shard, single-database clusters — shards are only ever deleted as part of full cluster teardown where the topo server is also destroyed.

Design Constraints (v1alpha1)

These constraints are enforced via CEL validation and prevent unsupported operations:

Constraint	Enforcement
Single database (`postgres`)	CEL on databases array
Single shard (`0-inf`)	CEL on shard name
Pools are append-only	CEL prevents pool removal or rename
Cells are append-only	CEL prevents cell removal (on MultigresCluster and Shard)
Zone/region immutability	CEL on Cell spec fields

FilesExpand file tree

pod-management-architecture.md

Latest commit

History