
Commit 57436b2

armru, leonardoce, and mnencia authored
fix(cluster): strip stale primary label during failover (cloudnative-pg#10409)
When the operator initiates a failover, the old primary's pod keeps its `cnpg.io/instanceRole=primary` label until `ReconcileMetadata` runs. But `ReconcileMetadata` is skipped during the entire failover window (the `CurrentPrimary != TargetPrimary` guard returns early), so the `-rw` service keeps routing to the old primary. If the old primary comes back (e.g. after a temporary network partition), replicas reconnect through the `-rw` service, satisfy the sync quorum, and writes committed on the stale primary are lost to `pg_rewind`.

Introduce a third value for the instance role label, `unhealthy`, and apply it to the old primary as soon as failover starts. Since neither the `-rw` nor the `-ro` service selector matches `unhealthy`, the pod is immediately isolated from all service traffic for the duration of the failover window. `ReconcileMetadata` restores the `replica` label once `CurrentPrimary == TargetPrimary`.

The label is applied best-effort at the point where failover is initiated, and re-applied on every pass of the reconcile loop while the failover is in progress, so transient API errors are retried automatically.

Note: stripping the label removes the pod from the service Endpoints, but does not drop TCP connections already established by a replica's walreceiver. This fix closes the reconnection window; established connections must still be terminated by the Postgres-level promotion on the new primary.

Closes cloudnative-pg#10403

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
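For readers outside the codebase, a minimal sketch of the routing mechanism this fix relies on. This is not the operator's actual service-building code: the helper name, port, and selector keys other than the role label are illustrative assumptions. The point it shows is that the `-rw` selector pins `primary` and the `-ro` selector pins `replica`, so a pod relabeled `unhealthy` matches neither and drops out of both Endpoints.

package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// clusterService is a hypothetical helper: it builds a Service whose selector
// pins a single value of the instance role label.
func clusterService(clusterName, suffix, roleValue string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: clusterName + suffix},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{
				"cnpg.io/cluster":      clusterName,
				"cnpg.io/instanceRole": roleValue, // "primary" for -rw, "replica" for -ro
			},
			Ports: []corev1.ServicePort{{Name: "postgres", Port: 5432}},
		},
	}
}

func main() {
	_ = clusterService("cluster-example", "-rw", "primary")
	_ = clusterService("cluster-example", "-ro", "replica")
	// A pod carrying cnpg.io/instanceRole=unhealthy matches neither selector,
	// so it is removed from the Endpoints of both Services.
}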
1 parent f774cb9 commit 57436b2

10 files changed

Lines changed: 215 additions & 11 deletions


contribute/technical-architecture.md

Lines changed: 5 additions & 3 deletions
@@ -214,9 +214,11 @@ Networking is purely label-driven. The Operator manages the

 2. **Replica:** Labeled as `replica`. The `-ro` Service selects these Pods.

-3. **Failover:** When a new primary is promoted via `pg_ctl promote`, the
-   Operator updates the labels. The Kubernetes API Server then automatically
-   updates the **Endpoints** for the respective Services.
+3. **Unhealthy:** Transient value applied to the old primary during a
+   failover or switchover, so that neither the `-rw` nor the `-ro`
+   Service selects it. Once the transition completes, the Operator
+   relabels the demoted instance as `replica` and the Kubernetes API
+   Server updates the Service **Endpoints** accordingly.

 ## Source Code Reference

docs/src/labels_annotations.md

Lines changed: 9 additions & 3 deletions
@@ -102,15 +102,21 @@ This label is available only on `VolumeSnapshot` resources.
   default users created by CloudNativePG (typically `postgres` and `app`).

 `role` - **deprecated**
-: Whether the instance running in a pod is a `primary` or a `replica`.
-  This label is deprecated, you should use `cnpg.io/instanceRole` instead.
+: Role of the instance running in a pod: `primary`, `replica`, or
+  `unhealthy`. The `unhealthy` value is transient: the operator sets
+  it on the old primary during a failover or switchover and clears it
+  automatically once the transition completes. This label is deprecated,
+  you should use `cnpg.io/instanceRole` instead.

 `cnpg.io/scheduled-backup`
 : When available, name of the `ScheduledBackup` resource that created a given
   `Backup` object.

 `cnpg.io/instanceRole`
-: Whether the instance running in a pod is a `primary` or a `replica`.
+: Role of the instance running in a pod: `primary`, `replica`, or
+  `unhealthy`. The `unhealthy` value is transient: the operator sets
+  it on the old primary during a failover or switchover and clears it
+  automatically once the transition completes.

 `app.kubernetes.io/managed-by`
 : Name of the manager. It will always be `cloudnative-pg`.

internal/controller/cluster_controller.go

Lines changed: 12 additions & 0 deletions
@@ -367,6 +367,18 @@ func (r *ClusterReconciler) reconcile(ctx context.Context, cluster *apiv1.Cluster

 	if cluster.Status.CurrentPrimary != "" &&
 		cluster.Status.CurrentPrimary != cluster.Status.TargetPrimary {
+		// Mark the old primary as unhealthy on every pass while failover is
+		// in progress. This retries each second until it succeeds,
+		// complementing the immediate best-effort attempt in replicas.go.
+		if err := r.markOldPrimaryAsUnhealthy(
+			ctx, cluster.Status.CurrentPrimary, resources.instances.Items,
+		); err != nil {
+			contextLogger.Warning(
+				"Failed to strip primary label from old primary, will retry",
+				"oldPrimary", cluster.Status.CurrentPrimary,
+				"error", err)
+		}
+
 		contextLogger.Info("There is a switchover or a failover "+
 			"in progress, waiting for the operation to complete",
 			"currentPrimary", cluster.Status.CurrentPrimary,

internal/controller/replicas.go

Lines changed: 54 additions & 0 deletions
@@ -140,6 +140,22 @@ func (r *ClusterReconciler) reconcileTargetPrimaryForNonReplicaCluster(
 		if err != nil {
 			return "", err
 		}
+
+		// Mark the old primary as unhealthy immediately when failover starts,
+		// removing it from both the -rw and -ro services. This prevents replicas
+		// from reconnecting to it (primary_conninfo uses <cluster>-rw) and
+		// satisfying the synchronous replication quorum on a stale primary.
+		// Best-effort: the failover must proceed even if this fails. The
+		// retryable call in the reconcile loop's failover guard will correct
+		// the label on subsequent passes.
+		if err := r.markOldPrimaryAsUnhealthy(
+			ctx,
+			cluster.Status.CurrentPrimary,
+			resources.instances.Items,
+		); err != nil {
+			contextLogger.Error(err, "Failed to strip primary label from old primary, continuing with failover",
+				"oldPrimary", cluster.Status.CurrentPrimary)
+		}
 	}

 	// Wait until all the WAL receivers are down. This is needed to avoid losing the WAL
@@ -180,6 +196,39 @@ func (r *ClusterReconciler) reconcileTargetPrimaryForNonReplicaCluster(
 	return mostAdvancedInstance.Pod.Name, r.setPrimaryInstance(ctx, cluster, mostAdvancedInstance.Pod.Name)
 }

+// markOldPrimaryAsUnhealthy labels the old primary pod as unhealthy when failover
+// starts, removing it from both the -rw and -ro service selectors until
+// ReconcileMetadata restores the correct label after promotion completes.
+func (r *ClusterReconciler) markOldPrimaryAsUnhealthy(
+	ctx context.Context,
+	oldPrimaryName string,
+	instances []corev1.Pod,
+) error {
+	contextLogger := log.FromContext(ctx)
+
+	idx := slices.IndexFunc(instances, func(pod corev1.Pod) bool {
+		return pod.Name == oldPrimaryName
+	})
+	if idx == -1 {
+		contextLogger.Warning(
+			"Old primary pod not found in managed instances, skipping label demotion",
+			"oldPrimary", oldPrimaryName)
+		return nil
+	}
+
+	oldPrimary := &instances[idx]
+	if role, _ := utils.GetInstanceRole(oldPrimary.Labels); role == specs.ClusterRoleLabelUnhealthy {
+		return nil
+	}
+
+	contextLogger.Info(
+		"Setting primary label to unhealthy in the old primary during failover",
+		"pod", oldPrimary.Name)
+	origPod := oldPrimary.DeepCopy()
+	utils.SetInstanceRole(&oldPrimary.ObjectMeta, specs.ClusterRoleLabelUnhealthy)
+	return r.Patch(ctx, oldPrimary, client.MergeFrom(origPod))
+}
+
 // isNodeUnschedulableOrBeingDrained checks if a node is currently being drained.
 // Copied from https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/7bacf2d36f397bd098b3388403e8759c480be7e5/cmd/hooks/prestop.go#L91
 //
@@ -334,6 +383,11 @@ func (r *ClusterReconciler) reconcileTargetPrimaryForReplicaCluster(
 		return "", err
 	}

+	// Unlike the non-replica path, we do not strip the old primary label here:
+	// a designated primary does not accept application writes via the -rw
+	// service, so the split-brain window #10403 guards against does not
+	// apply. The retryable call in the reconcile loop's failover guard still
+	// relabels the pod on its next pass.
 	return status.Items[0].Pod.Name, r.setPrimaryInstance(ctx, cluster, status.Items[0].Pod.Name)
 }
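A side note on the helper above: `client.MergeFrom(origPod)` followed by `r.Patch` sends only the difference between the two pod snapshots, i.e. a JSON merge patch carrying just the relabeled keys, rather than rewriting the whole object. The standalone sketch below is illustrative and not part of this commit; it prints the patch body controller-runtime computes for such a label change.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "cluster-example-1",
			Labels: map[string]string{"cnpg.io/instanceRole": "primary"},
		},
	}

	// Same pattern as markOldPrimaryAsUnhealthy: snapshot, mutate, diff.
	orig := pod.DeepCopy()
	pod.Labels["cnpg.io/instanceRole"] = "unhealthy"

	patch := client.MergeFrom(orig)
	data, err := patch.Data(pod)
	if err != nil {
		panic(err)
	}
	// Prints something like:
	// {"metadata":{"labels":{"cnpg.io/instanceRole":"unhealthy"}}}
	fmt.Println(string(data))
}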

internal/controller/replicas_test.go

Lines changed: 117 additions & 0 deletions
@@ -20,11 +20,18 @@ SPDX-License-Identifier: Apache-2.0
 package controller

 import (
+	"context"
+	"fmt"
+
 	corev1 "k8s.io/api/core/v1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"sigs.k8s.io/controller-runtime/pkg/client"
+	"sigs.k8s.io/controller-runtime/pkg/client/fake"
+	"sigs.k8s.io/controller-runtime/pkg/client/interceptor"

 	apiv1 "github.com/cloudnative-pg/cloudnative-pg/api/v1"
 	"github.com/cloudnative-pg/cloudnative-pg/pkg/postgres"
+	"github.com/cloudnative-pg/cloudnative-pg/pkg/specs"
 	"github.com/cloudnative-pg/cloudnative-pg/pkg/utils"

 	. "github.com/onsi/ginkgo/v2"
@@ -129,6 +136,116 @@ var _ = Describe("Sacrificial Pod detection", func() {
 	})
 })

+var _ = Describe("markOldPrimaryAsUnhealthy", func() {
+	var env *testingEnvironment
+
+	BeforeEach(func() {
+		env = buildTestEnvironment()
+	})
+
+	makePod := func(name, namespace, role string) corev1.Pod {
+		pod := corev1.Pod{
+			ObjectMeta: metav1.ObjectMeta{
+				Name:      name,
+				Namespace: namespace,
+				Labels:    map[string]string{},
+			},
+		}
+		if role != "" {
+			utils.SetInstanceRole(&pod.ObjectMeta, role)
+		}
+		return pod
+	}
+
+	It("changes the primary label from the old primary pod", func() {
+		ctx := context.Background()
+		namespace := newFakeNamespace(env.client)
+
+		primary := makePod("cluster-1", namespace, specs.ClusterRoleLabelPrimary)
+		replica1 := makePod("cluster-2", namespace, specs.ClusterRoleLabelReplica)
+		replica2 := makePod("cluster-3", namespace, specs.ClusterRoleLabelReplica)
+
+		for i, pod := range []corev1.Pod{primary, replica1, replica2} {
+			p := pod
+			Expect(env.client.Create(ctx, &p)).To(Succeed())
+			// refresh the local copy with server-assigned fields
+			if i == 0 {
+				primary = p
+			}
+		}
+
+		pods := []corev1.Pod{primary, replica1, replica2}
+
+		err := env.clusterReconciler.markOldPrimaryAsUnhealthy(ctx, "cluster-1", pods)
+		Expect(err).ToNot(HaveOccurred())
+
+		// Verify the old primary's label was changed to unhealthy on the API server
+		var updated corev1.Pod
+		Expect(env.client.Get(ctx, client.ObjectKeyFromObject(&primary), &updated)).To(Succeed())
+		Expect(updated.Labels[utils.ClusterInstanceRoleLabelName]).To(Equal(specs.ClusterRoleLabelUnhealthy))
+		//nolint:staticcheck
+		Expect(updated.Labels[utils.ClusterRoleLabelName]).To(Equal(specs.ClusterRoleLabelUnhealthy))

+		// Verify replica pods are unchanged
+		var replica1Updated corev1.Pod
+		Expect(env.client.Get(ctx, client.ObjectKeyFromObject(&replica1), &replica1Updated)).To(Succeed())
+		Expect(replica1Updated.Labels[utils.ClusterInstanceRoleLabelName]).To(Equal(specs.ClusterRoleLabelReplica))
+	})
+
+	It("does not error when the old primary is not in the pod list", func() {
+		ctx := context.Background()
+		namespace := newFakeNamespace(env.client)
+
+		replica := makePod("cluster-2", namespace, specs.ClusterRoleLabelReplica)
+		Expect(env.client.Create(ctx, &replica)).To(Succeed())
+
+		err := env.clusterReconciler.markOldPrimaryAsUnhealthy(ctx, "cluster-1", []corev1.Pod{replica})
+		Expect(err).ToNot(HaveOccurred())
+	})
+
+	It("is a no-op when the old primary already has the unhealthy label", func() {
+		ctx := context.Background()
+		namespace := newFakeNamespace(env.client)
+
+		pod := makePod("cluster-1", namespace, specs.ClusterRoleLabelUnhealthy)
+		Expect(env.client.Create(ctx, &pod)).To(Succeed())
+
+		err := env.clusterReconciler.markOldPrimaryAsUnhealthy(ctx, "cluster-1", []corev1.Pod{pod})
+		Expect(err).ToNot(HaveOccurred())
+
+		var updated corev1.Pod
+		Expect(env.client.Get(ctx, client.ObjectKeyFromObject(&pod), &updated)).To(Succeed())
+		Expect(updated.Labels[utils.ClusterInstanceRoleLabelName]).To(Equal(specs.ClusterRoleLabelUnhealthy))
+	})
+
+	It("surfaces the Patch error so callers can apply their best-effort or retry strategy", func() {
+		ctx := context.Background()
+		namespace := newFakeNamespace(env.client)
+
+		primary := makePod("cluster-1", namespace, specs.ClusterRoleLabelPrimary)
+
+		failingClient := fake.NewClientBuilder().
+			WithScheme(env.scheme).
+			WithObjects(&primary).
+			WithInterceptorFuncs(interceptor.Funcs{
+				Patch: func(_ context.Context, _ client.WithWatch, obj client.Object,
+					_ client.Patch, _ ...client.PatchOption,
+				) error {
+					Expect(obj).To(BeAssignableToTypeOf(&corev1.Pod{}))
+					Expect(obj.GetName()).To(Equal("cluster-1"))
+					Expect(obj.GetNamespace()).To(Equal(namespace))
+					return fmt.Errorf("simulated API server error")
+				},
+			}).
+			Build()
+
+		r := &ClusterReconciler{Client: failingClient, Scheme: env.scheme}
+
+		err := r.markOldPrimaryAsUnhealthy(ctx, "cluster-1", []corev1.Pod{primary})
+		Expect(err).To(MatchError(ContainSubstring("simulated API server error")))
+	})
+})
+
 var _ = Describe("Check pods not on primary node", func() {
 	item1 := postgres.PostgresqlStatus{
 		IsPrimary: false,

pkg/reconciler/instance/metadata.go

Lines changed: 7 additions & 2 deletions
@@ -196,15 +196,20 @@ func updateRoleLabels(
 		if !hasRole || podRole != specs.ClusterRoleLabelPrimary || !newHasRole ||
 			newPodRole != specs.ClusterRoleLabelPrimary {
 			contextLogger.Info("Setting primary label", "pod", instance.Name)
-			utils.SetInstanceRole(instance.ObjectMeta, specs.ClusterRoleLabelPrimary)
+			utils.SetInstanceRole(&instance.ObjectMeta, specs.ClusterRoleLabelPrimary)
 			return true
 		}

 	default:
+		// This intentionally overwrites the transient ClusterRoleLabelUnhealthy value
+		// that the failover path sets on the old primary. This function is only reached
+		// once CurrentPrimary == TargetPrimary (the failover guard in the reconcile loop
+		// returns early otherwise), so by the time we get here the old primary has been
+		// demoted and "replica" is the correct label.
 		if !hasRole || podRole != specs.ClusterRoleLabelReplica || !newHasRole ||
 			newPodRole != specs.ClusterRoleLabelReplica {
 			contextLogger.Info("Setting replica label", "pod", instance.Name)
-			utils.SetInstanceRole(instance.ObjectMeta, specs.ClusterRoleLabelReplica)
+			utils.SetInstanceRole(&instance.ObjectMeta, specs.ClusterRoleLabelReplica)
 			return true
 		}
 	}

pkg/reconciler/majorupgrade/reconciler.go

Lines changed: 1 addition & 1 deletion
@@ -223,7 +223,7 @@ func createMajorUpgradeJob(
 		cluster.GetFixedInheritedLabels(), configuration.Current)
 	utils.InheritLabels(&job.Spec.Template.ObjectMeta, cluster.Labels,
 		cluster.GetFixedInheritedLabels(), configuration.Current)
-	utils.SetInstanceRole(job.Spec.Template.ObjectMeta, specs.ClusterRoleLabelPrimary)
+	utils.SetInstanceRole(&job.Spec.Template.ObjectMeta, specs.ClusterRoleLabelPrimary)

 	contextLogger.Info("Creating new major upgrade Job",
 		"jobName", job.Name,

pkg/reconciler/persistentvolumeclaim/metadata.go

Lines changed: 6 additions & 1 deletion
@@ -82,6 +82,11 @@ func reconcileInstanceRoleLabel(
 		return nil
 	}
 	for _, instanceName := range cluster.Status.InstanceNames {
+		// PVCs inherit the role label from the instance name, independently of
+		// the pod's current label. The failover guard in the reconcile loop
+		// prevents this code from running while CurrentPrimary != TargetPrimary,
+		// so by the time we get here the old primary has already been demoted
+		// and "replica" is correct.
 		instanceRole := specs.ClusterRoleLabelReplica
 		if instanceName == cluster.Status.CurrentPrimary {
 			instanceRole = specs.ClusterRoleLabelPrimary
@@ -101,7 +106,7 @@
 				return true
 			},
 			update: func(pvc *corev1.PersistentVolumeClaim) {
-				utils.SetInstanceRole(pvc.ObjectMeta, instanceRole)
+				utils.SetInstanceRole(&pvc.ObjectMeta, instanceRole)
 			},
 		}

pkg/specs/pods.go

Lines changed: 3 additions & 0 deletions
@@ -70,6 +70,9 @@ const (
 	// ClusterRoleLabelReplica is written in labels to represent replica servers
 	ClusterRoleLabelReplica = "replica"

+	// ClusterRoleLabelUnhealthy is applied to the old primary when a failover starts.
+	ClusterRoleLabelUnhealthy = "unhealthy"
+
 	// PostgresContainerName is the name of the container executing PostgreSQL
 	// inside one Pod
 	PostgresContainerName = "postgres"

pkg/utils/labels_annotations.go

Lines changed: 1 addition & 1 deletion
@@ -553,7 +553,7 @@ func GetInstanceRole(labels map[string]string) (string, bool) {
 }

 // SetInstanceRole sets both ClusterRoleLabelName and ClusterInstanceRoleLabelName on the given ObjectMeta
-func SetInstanceRole(meta metav1.ObjectMeta, role string) {
+func SetInstanceRole(meta *metav1.ObjectMeta, role string) {
 	if meta.Labels == nil {
 		meta.Labels = map[string]string{}
 	}
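The signature change in this last file is what forces the `&` at every call site above: with a value parameter, `SetInstanceRole` can still write into an existing `Labels` map (maps are references), but the map it allocates when `Labels` is nil is assigned to a copy and silently dropped. A small standalone illustration of the difference, using hypothetical helper names rather than the repository's code:

package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setRoleByValue mirrors the old signature: writes to an existing map survive,
// but the map allocated for a nil Labels field never reaches the caller.
func setRoleByValue(meta metav1.ObjectMeta, role string) {
	if meta.Labels == nil {
		meta.Labels = map[string]string{} // assigned to the copy only
	}
	meta.Labels["cnpg.io/instanceRole"] = role
}

// setRoleByPointer mirrors the new signature: the allocation is visible to the caller.
func setRoleByPointer(meta *metav1.ObjectMeta, role string) {
	if meta.Labels == nil {
		meta.Labels = map[string]string{}
	}
	meta.Labels["cnpg.io/instanceRole"] = role
}

func main() {
	a := metav1.ObjectMeta{} // Labels is nil, as on a freshly built ObjectMeta
	setRoleByValue(a, "unhealthy")
	fmt.Println(a.Labels) // map[] — the label is lost

	b := metav1.ObjectMeta{}
	setRoleByPointer(&b, "unhealthy")
	fmt.Println(b.Labels) // map[cnpg.io/instanceRole:unhealthy]
}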
