The API design document defines an optional LocalTopoServer as a child of each Cell. When a Cell specifies a localTopoServer (via inline CellSpec, CellTemplate, or the override chain), the Cell controller should create a dedicated TopoServer child CR with its own etcd cluster, providing cell-local topology storage instead of relying on the global topology server.
The relevant design elements:
- The `MultigresCluster` resource tree shows `LocalTopoServer (Child CR, optional)` under each Cell.
- The `CellTemplate` spec includes an optional `localTopoServer` section with `etcd` or `external` options.
- The Cell child CR has three topology options: (1) use global (default, empty `topoServer`), (2) inline external, and (3) managed local etcd.
The following plumbing is in place and functional:
- API types: `LocalTopoServerSpec` (in `toposerver_types.go`) defines the configuration with `etcd` and `external` options, including CEL validation for mutual exclusion.
- CellTemplate: `CellTemplateSpec` has a `LocalTopoServer *LocalTopoServerSpec` field.
- MultigresCluster inline CellSpec: `CellInlineSpec` has a `LocalTopoServer *LocalTopoServerSpec` field.
- Cell child CR: `CellSpec` has a `TopoServer *LocalTopoServerSpec` field ready to receive the resolved config.
- Resolver: `pkg/resolver/cell.go` fully resolves `LocalTopoServerSpec` through the 4-level override chain (inline → template → cluster default → namespace default).
- Webhook defaulter: `pkg/webhook/handlers/defaulter.go` passes the resolved `localTopoSpec` into the Cell CR builder.
- Cluster controller: `builders_cell.go` accepts `localTopoSpec` as a parameter and sets it on the Cell CR's `spec.topoServer`.
- CEL feature gate: both `CellInlineSpec` and `CellTemplateSpec` have a CEL `XValidation` rule (`!has(self.localTopoServer)`) that rejects any attempt to set `localTopoServer`. This prevents silent misconfiguration until the feature is implemented. The rule must be removed and the CRDs regenerated when the feature ships.
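The resolver's 4-level override chain amounts to first-non-nil selection. A minimal sketch with simplified stand-in types (the actual types and signatures in `pkg/resolver/cell.go` may differ):

```go
package main

import "fmt"

// Simplified stand-ins for the real API types.
type EtcdSpec struct{ Replicas int }
type ExternalSpec struct{ Endpoints []string }
type LocalTopoServerSpec struct {
	Etcd     *EtcdSpec
	External *ExternalSpec
}

// resolveLocalTopo returns the first non-nil spec in override order:
// inline → template → cluster default → namespace default.
// A nil result means "use the global topo server".
func resolveLocalTopo(inline, tmpl, clusterDefault, nsDefault *LocalTopoServerSpec) *LocalTopoServerSpec {
	for _, s := range []*LocalTopoServerSpec{inline, tmpl, clusterDefault, nsDefault} {
		if s != nil {
			return s
		}
	}
	return nil
}

func main() {
	tmpl := &LocalTopoServerSpec{Etcd: &EtcdSpec{Replicas: 3}}
	fmt.Println(resolveLocalTopo(nil, tmpl, nil, nil).Etcd.Replicas) // template wins: 3
}
```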
The Cell controller (`pkg/resource-handler/controller/cell/cell_controller.go`) only reconciles the MultiGateway Deployment and MultiGateway Service. It does not inspect `spec.topoServer` or create a child TopoServer CR. The reason is that upstream multigres does not yet support separate per-cell topology servers in practice.
Specifically:
- Upstream architecture supports it conceptually: the `topoclient.Store` in multigres maintains a two-tier topology model, a global connection plus per-cell connections via `ConnForCell()`. The `Cell` protobuf (`clustermetadata.Cell`) has `ServerAddresses` and `Root` fields that can point to a separate etcd instance.
- But the `createclustermetadata` CLI always reuses the global server: when cells are registered by the upstream `multigres topo createclustermetadata` command, each cell's `ServerAddresses` is set to the global topo server address, and the cell's `Root` is derived as `{globalRoot}/{cellName}`. This means all cells share the same etcd cluster, just with different key prefixes.
- No upstream testing or validation of separate per-cell etcd: while `ConnForCell()` would technically connect to a different etcd address if a cell had one, this path has never been exercised in production or in the upstream test suite. There may be undiscovered issues with connection lifecycle, failure handling, or data consistency when cells use independent etcd clusters.
- The operator's topology registration is centralized: cell and database registration in the global topology server happens in the MultigresCluster controller (`reconcileTopology`), not in individual Cell or Shard controllers. The Cell CR includes a `topologyReconciliation.registerCell` flag that is set by the MultigresCluster controller when building Cell CRs, but the actual registration is performed centrally, and it currently always sets the cell's topo address to the global topo server. Supporting local topo servers would require the operator to register cells with different `ServerAddresses` pointing to the local etcd.
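The upstream `Root` derivation is a simple path join; a sketch (the helper name is ours, upstream does this inline):

```go
package main

import (
	"fmt"
	"path"
)

// cellRoot mirrors the upstream derivation {globalRoot}/{cellName} performed
// by createclustermetadata when registering a cell.
func cellRoot(globalRoot, cellName string) string {
	return path.Join(globalRoot, cellName)
}

func main() {
	fmt.Println(cellRoot("/multigres", "us-east-1a")) // /multigres/us-east-1a
}
```

Every cell thus shares the global etcd cluster, distinguished only by its key prefix.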
Once upstream multigres validates and supports per-cell topology servers, the operator implementation would involve:
Delete the `XValidation` markers from `CellInlineSpec` (in `multigrescluster_types.go`) and `CellTemplateSpec` (in `celltemplate_types.go`), then run `make generate manifests` to regenerate the CRDs.
Add a `reconcileLocalTopoServer` step in `cell_controller.go` before the MultiGateway reconciliation:

```go
// In Reconcile(), before reconciling MultiGateway:
if cell.Spec.TopoServer != nil && cell.Spec.TopoServer.Etcd != nil {
	if err := r.reconcileLocalTopoServer(ctx, cell); err != nil {
		return ctrl.Result{}, err
	}
}
```

This would create a TopoServer child CR (owned by the Cell) using the existing TopoServer controller infrastructure already used for the global topo server. The naming would follow the pattern `{cluster}-{cell}-local-topo`.
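The naming pattern reduces to a one-line helper; a sketch (the function name is hypothetical):

```go
package main

import "fmt"

// localTopoName applies the {cluster}-{cell}-local-topo pattern for the
// Cell-owned TopoServer child CR.
func localTopoName(cluster, cell string) string {
	return fmt.Sprintf("%s-%s-local-topo", cluster, cell)
}

func main() {
	fmt.Println(localTopoName("prod", "us-east-1a")) // prod-us-east-1a-local-topo
}
```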
Add `Owns(&multigresv1alpha1.TopoServer{})` to the controller builder so the Cell controller watches its child TopoServer.
When registering the cell in the global topology (via `createclustermetadata` or an equivalent controller-side call), set the cell's `ServerAddresses` to the local etcd service address instead of the global one:

```
Cell "us-east-1a":
  ServerAddresses: ["{cluster}-{cell}-local-topo-client.{ns}.svc.cluster.local:2379"]
  Root: "/multigres/{cell}"
```
This way, when multigres components call `ConnForCell("us-east-1a")`, the `topoclient.Store` would connect to the local etcd instead of the global one.
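Assuming the etcd client Service follows the `{cluster}-{cell}-local-topo-client` naming and listens on the standard etcd client port 2379, the registration values could be computed as follows (the function name is hypothetical):

```go
package main

import "fmt"

// localTopoRegistration computes the values the operator would register for a
// cell backed by a managed local etcd: the client Service FQDN on the standard
// etcd client port, plus the per-cell root.
func localTopoRegistration(cluster, cell, ns string) (serverAddr, root string) {
	serverAddr = fmt.Sprintf("%s-%s-local-topo-client.%s.svc.cluster.local:2379", cluster, cell, ns)
	root = "/multigres/" + cell
	return serverAddr, root
}

func main() {
	addr, root := localTopoRegistration("prod", "us-east-1a", "default")
	fmt.Println(addr)
	fmt.Println(root)
}
```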
The multigateway and multiorch binaries already connect to cell topology via `ConnForCell()`, which reads the cell's `ServerAddresses` from the global topology. If the operator registers the cell with a local address, pool discovery and orchestration will automatically use the local etcd. No changes to the data-plane flags are required.
The Cell status should reflect the local TopoServer's health when present. This means:
- Adding TopoServer readiness to the Cell's conditions.
- Blocking MultiGateway creation until the local TopoServer is available (since the gateway needs a working topo server to discover poolers).
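A sketch of how the Cell controller could surface this (the condition type and reason strings are hypothetical, not the operator's actual condition names):

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// localTopoCondition surfaces the child TopoServer's readiness as a Cell
// condition and reports whether MultiGateway reconciliation may proceed.
func localTopoCondition(topoReady bool) (Condition, bool) {
	if topoReady {
		return Condition{Type: "LocalTopoServerReady", Status: "True", Reason: "EtcdAvailable"}, true
	}
	return Condition{Type: "LocalTopoServerReady", Status: "False", Reason: "EtcdNotReady"}, false
}

func main() {
	cond, proceed := localTopoCondition(false)
	fmt.Println(cond.Status, proceed) // False false
}
```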
For `spec.topoServer.external`, no child CR creation is needed: the cell registration simply uses the provided external endpoints as the cell's `ServerAddresses`. This is simpler to implement and could be done first as a stepping stone.
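The resulting three-way address selection could be sketched as follows (function and parameter names are hypothetical, not operator code):

```go
package main

import "fmt"

// chooseServerAddresses picks a cell's ServerAddresses: inline external
// endpoints win, then a managed local etcd Service, then the global topo
// server (the current default behavior).
func chooseServerAddresses(external []string, localAddr, globalAddr string) []string {
	switch {
	case len(external) > 0:
		return external // inline external topo server
	case localAddr != "":
		return []string{localAddr} // managed local etcd
	default:
		return []string{globalAddr} // default: shared global topo server
	}
}

func main() {
	fmt.Println(chooseServerAddresses(nil, "", "global-topo:2379")) // [global-topo:2379]
}
```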
A Cell is a logical failure domain — a boundary that upstream multigres uses for consensus, sync replication, and pool discovery. When a pooler or gateway registers itself in the topology server, it carries an ID.Cell field that identifies which cell it belongs to.
Upstream multigres has no awareness of physical infrastructure topology. The Cell protobuf (clustermetadata.Cell) contains only Name, ServerAddresses, and Root — there are no zone, region, or Kubernetes scheduling fields. The mapping of "cell = availability zone" is a convention enforced entirely by the operator.
The durability policy controls how multiorch enforces synchronous replication acknowledgment during failover. It is configurable via spec.durabilityPolicy on the MultigresCluster (cluster-wide default) and per-database via the databases[].durabilityPolicy override. See the Durability Policy user guide for configuration details.
The `MULTI_CELL_AT_LEAST_2` policy groups standbys by `Id.Cell` and requires synchronous replication acknowledgment from N distinct cells. `BuildSyncReplicationConfig` in multiorch excludes standbys in the same cell as the primary when building the sync standby list for multi-cell policies.
This means cells are the mechanism by which multigres guarantees that a committed write survives the failure of an entire availability zone. If a shard has pools in zone-1 and zone-2 and uses MULTI_CELL_AT_LEAST_2, a write is only acknowledged after replication to a standby in the opposite zone.
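The cell-based exclusion and quorum counting can be sketched with simplified stand-in types (the actual upstream types and the `BuildSyncReplicationConfig` signature differ):

```go
package main

import "fmt"

// Standby is a stand-in for the upstream standby description, carrying only
// the topology cell it registered with.
type Standby struct {
	ID   string
	Cell string
}

// crossCellStandbys drops standbys that share the primary's cell, as the
// multi-cell policies do when building the sync standby list.
func crossCellStandbys(primaryCell string, standbys []Standby) []Standby {
	var out []Standby
	for _, s := range standbys {
		if s.Cell != primaryCell {
			out = append(out, s)
		}
	}
	return out
}

// distinctCells counts the cells represented among the candidates, which a
// MULTI_CELL_AT_LEAST_2-style policy compares against its required quorum.
func distinctCells(standbys []Standby) int {
	seen := map[string]bool{}
	for _, s := range standbys {
		seen[s.Cell] = true
	}
	return len(seen)
}

func main() {
	standbys := []Standby{{"s1", "zone-1"}, {"s2", "zone-2"}, {"s3", "zone-2"}}
	sync := crossCellStandbys("zone-1", standbys)
	fmt.Println(len(sync), distinctCells(sync)) // 2 1
}
```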
However, this guarantee is only real if the cell's pods actually run in the correct zone. If the Kubernetes scheduler places zone-1 pool pods on us-east-1b nodes, the cross-cell quorum is meaningless — both "cells" are on the same physical infrastructure.
The operator natively integrates Cell topology with Kubernetes node scheduling via the zone and region fields on the Cell and CellConfig definitions.
The `zone` and `region` fields are optional and mutually exclusive (validated via `!(has(self.zone) && has(self.region))`).
- `zone` set → injects `nodeSelector: { topology.kubernetes.io/zone: <zone> }`
- `region` set → injects `nodeSelector: { topology.kubernetes.io/region: <region> }`
- neither set → no `nodeSelector` is injected, and pods will schedule on any available node. This is essential for single-node environments like Kind or Minikube where topology labels do not exist.
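These three cases reduce to a small pure function; a sketch with a hypothetical helper name:

```go
package main

import "fmt"

// topologyNodeSelector implements the injection rule above: zone and region
// are mutually exclusive, and neither set means no constraint.
func topologyNodeSelector(zone, region string) map[string]string {
	switch {
	case zone != "":
		return map[string]string{"topology.kubernetes.io/zone": zone}
	case region != "":
		return map[string]string{"topology.kubernetes.io/region": region}
	default:
		return nil // schedule anywhere (Kind, Minikube, single-node)
	}
}

func main() {
	fmt.Println(topologyNodeSelector("us-east-1a", ""))
}
```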
The injected `nodeSelector` is strict and additive:
- Strict: pods will remain `Pending` if no matching nodes exist, preserving the failure domain guarantee. Silently scheduling to the wrong zone is strictly avoided.
- Additive: the injected `nodeSelector` does not overwrite user-provided `Affinity` rules (e.g. from `PoolSpec.Affinity` or `CellInlineSpec.MultiGateway.Affinity`). The Kubernetes scheduler applies both constraints, enabling users to layer intra-zone anti-affinity or topology spread constraints.
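The additive merge can be sketched against simplified stand-in types (the real code operates on the Kubernetes pod spec types):

```go
package main

import "fmt"

// PodSchedule is a simplified stand-in for the scheduling-related fields of a
// PodSpec.
type PodSchedule struct {
	NodeSelector map[string]string
	Affinity     string // placeholder for a *corev1.Affinity
}

// injectNodeSelector merges topology labels into NodeSelector without
// touching existing keys or the user-provided Affinity.
func injectNodeSelector(ps *PodSchedule, labels map[string]string) {
	if len(labels) == 0 {
		return
	}
	if ps.NodeSelector == nil {
		ps.NodeSelector = map[string]string{}
	}
	for k, v := range labels {
		if _, exists := ps.NodeSelector[k]; !exists {
			ps.NodeSelector[k] = v // never overwrite a user-set key
		}
	}
}

func main() {
	ps := &PodSchedule{NodeSelector: map[string]string{"disktype": "ssd"}, Affinity: "user-rules"}
	injectNodeSelector(ps, map[string]string{"topology.kubernetes.io/zone": "us-east-1a"})
	fmt.Println(len(ps.NodeSelector), ps.Affinity) // 2 user-rules
}
```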
The components receiving this injection are:
| Component | Injection Logic |
|---|---|
| MultiGateway Deployment | Determined directly from the Cell CR Spec.Zone / Spec.Region |
| Pool Pods | Looked up via CellTopologyLabels in ShardSpec |
| MultiOrch Deployment | Looked up via CellTopologyLabels in ShardSpec |
(Note: PVCs for backups don't need explicit injection, as Kubernetes binds dynamically provisioned PersistentVolumes based on the pod's scheduling decision through topology-aware volume provisioning).
To avoid independent API reads, the cluster controller builds a `CellTopologyLabels map[CellName]map[string]string` from the user's MultigresCluster cell configs. This map propagates down through the `TableGroupSpec` and finally the `ShardSpec`. The shard controller uses this map to inject the appropriate `nodeSelector` for any given cell without additional API roundtrips.
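A sketch of how such a map could be built (types and helper name are illustrative):

```go
package main

import "fmt"

// CellConfig is a minimal stand-in for a resolved cell configuration.
type CellConfig struct {
	Name   string
	Zone   string
	Region string
}

// buildCellTopologyLabels builds the per-cell label map that is threaded down
// through TableGroupSpec and ShardSpec. Cells with neither zone nor region
// get no entry, so no nodeSelector is injected for them.
func buildCellTopologyLabels(cells []CellConfig) map[string]map[string]string {
	out := make(map[string]map[string]string, len(cells))
	for _, c := range cells {
		switch {
		case c.Zone != "":
			out[c.Name] = map[string]string{"topology.kubernetes.io/zone": c.Zone}
		case c.Region != "":
			out[c.Name] = map[string]string{"topology.kubernetes.io/region": c.Region}
		}
	}
	return out
}

func main() {
	m := buildCellTopologyLabels([]CellConfig{{Name: "zone1", Zone: "us-east-1a"}, {Name: "any"}})
	fmt.Println(m["zone1"]["topology.kubernetes.io/zone"], len(m)) // us-east-1a 1
}
```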
The operator's validating webhook (`ValidateClusterLogic` in `resolver.go`) checks the live cluster for nodes matching the specified topology labels. Specifically, it lists all `corev1.Node` objects at admission time.
If a cell specifies a zone or region that does not exist on any current nodes, the webhook emits a Kubernetes Admission Warning:
```
cell 'X': no nodes currently match topology.kubernetes.io/zone=Y; pods will be Pending until matching nodes are available
```
We use warnings rather than rejections because cluster nodes are ephemeral (e.g., Karpenter or Cluster Autoscaler may spin up a new AZ from zero upon spotting Pending pods).
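The admission check reduces to a label scan over the node list; a sketch with stand-in types (the real webhook works with `corev1.Node` and the admission warnings API):

```go
package main

import "fmt"

// Node is a stand-in for corev1.Node, carrying only its labels.
type Node struct{ Labels map[string]string }

// topologyWarning returns an admission warning when no current node carries
// the requested topology label value, and "" when at least one does.
func topologyWarning(cell, labelKey, want string, nodes []Node) string {
	for _, n := range nodes {
		if n.Labels[labelKey] == want {
			return "" // a matching node exists; no warning
		}
	}
	return fmt.Sprintf("cell '%s': no nodes currently match %s=%s; pods will be Pending until matching nodes are available", cell, labelKey, want)
}

func main() {
	nodes := []Node{{Labels: map[string]string{"topology.kubernetes.io/zone": "us-east-1a"}}}
	fmt.Println(topologyWarning("zone2", "topology.kubernetes.io/zone", "us-east-1b", nodes))
}
```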
The existing Affinity field on StatelessSpec (multigateway) and PoolSpec (pools) is preserved and works additively with the auto-injected nodeSelector. Users can use it for:
- Pod anti-affinity — prevent multiple pool replicas from landing on the same node within a zone.
- Topology spread constraints — distribute pods evenly across nodes within a zone.
- Additional scheduling preferences — soft preferences for instance types, etc.