When using quorum synchronisation, PGO should ensure that at least one node acking any transaction commit is on a different AZ to that of the primary instance #4149

@matto4096

Description

Overview

Given:

  • A Kubernetes cluster with N availability zones (AZs)
  • A Postgres Cluster using quorum synchronization and num_sync = M
  • The Postgres Cluster has X nodes, where X > (N * M)

Then:

  • PGO should ensure that at least one of the M nodes acking any transaction commit is in a different AZ from that of the primary instance, so that committed data is stored durably in at least 2 different AZs

Use Case

Many users spread their clusters across failure domains to prevent data loss and ensure availability during a failure in a single domain.

Consider a Postgres cluster with:

  • Six nodes (1 primary, 5 replicas), spread evenly across an AWS region with 3 availability zones.
  • Quorum synchronisation, with a synchronous_node_count of 1

The topology might look like this:
AZ 1: node0 (Primary), node1 (Replica)
AZ 2: node2 (Replica), node3 (Replica)
AZ 3: node4 (Replica), node5 (Replica)

synchronous_standby_names = 'ANY 1 (node1,node2,node3,node4,node5)'

The problem here is that the replica in AZ 1 (node1) can become the synchronous standby. If it does, the latest committed transactions are only guaranteed to exist in AZ 1, so an outage of AZ 1 can cause data loss.
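To make that failure window concrete, here is a minimal Go sketch (the instance-to-AZ mapping is just the example topology above, hard-coded for illustration; it is not anything PGO or Patroni exposes in this form). It checks, for each standby that can satisfy ANY 1 on its own, whether an acknowledged commit would survive the loss of the primary's AZ; only node1 fails the check.

```go
package main

import "fmt"

func main() {
	// Hypothetical instance-to-AZ mapping matching the topology above.
	az := map[string]string{
		"node0": "AZ1", "node1": "AZ1",
		"node2": "AZ2", "node3": "AZ2",
		"node4": "AZ3", "node5": "AZ3",
	}
	primary := "node0"

	// With ANY 1, an acknowledgement from any single listed standby satisfies
	// the quorum, so consider each standby in isolation.
	for _, standby := range []string{"node1", "node2", "node3", "node4", "node5"} {
		survives := az[standby] != az[primary]
		fmt.Printf("ack only from %s (%s): commit survives loss of %s? %v\n",
			standby, az[standby], az[primary], survives)
	}
}
```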

To solve this, PGO should ensure that a replica in the same AZ as the primary never becomes the synchronous replica, so that the latest data is present in at least 2 failure domains at all times.

At scheduling time this is easy: simply apply the Patroni nosync tag to any replica in the same AZ as the primary.

AZ 1: node0 (Primary), node1 (Replica - nosync)
AZ 2: node2 (Replica), node3 (Replica)
AZ 3: node4 (Replica), node5 (Replica)
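For illustration, a minimal Go sketch of this scheduling-time rule, using a hypothetical instance-to-AZ mapping and hypothetical helper names rather than PGO's actual code: it marks every replica that shares the primary's AZ as nosync, and shows the candidate list that Patroni would then be free to choose from when it builds synchronous_standby_names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// nosyncSet returns the replicas that share the primary's AZ; these are the
// instances that would receive the Patroni nosync tag.
func nosyncSet(primary string, az map[string]string) map[string]bool {
	tagged := map[string]bool{}
	for node, zone := range az {
		if node != primary && zone == az[primary] {
			tagged[node] = true
		}
	}
	return tagged
}

// quorumCandidates builds the standby list that would remain eligible for
// synchronous_standby_names once nosync replicas are excluded.
func quorumCandidates(primary string, az map[string]string, nosync map[string]bool) string {
	var names []string
	for node := range az {
		if node != primary && !nosync[node] {
			names = append(names, node)
		}
	}
	sort.Strings(names)
	return fmt.Sprintf("ANY 1 (%s)", strings.Join(names, ","))
}

func main() {
	az := map[string]string{
		"node0": "AZ1", "node1": "AZ1",
		"node2": "AZ2", "node3": "AZ2",
		"node4": "AZ3", "node5": "AZ3",
	}
	nosync := nosyncSet("node0", az)
	fmt.Println("nosync:", nosync)                     // map[node1:true]
	fmt.Println(quorumCandidates("node0", az, nosync)) // ANY 1 (node2,node3,node4,node5)
}
```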

However, the primary could lose the leader lock at any time, which would result in a topology like this:

AZ 1: node0 (Replica), node1 (Replica - nosync)
AZ 2: node2 (Primary), node3 (Replica)
AZ 3: node4 (Replica), node5 (Replica)

This is exactly the same problem we started with: the primary and the synchronous replica can end up in the same failure domain. In fact, it is quite likely that a replica in the same AZ will become the sync replica, because it will have lower latency than replicas in other AZs.

To solve this problem, it seems that PGO would need to dynamically apply and remove the nosync tag at various points in the lifecycle of a cluster:

  1. When a new leader is elected, and the new leader is in a different AZ to the previous leader
  2. When a new replica is added to the Postgres cluster and that replica is in the same AZ as the current leader

This should be sufficient to prevent replicas in the same AZ as the primary from ever becoming the synchronous replica, thus eliminating the risk of a single point of failure.
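A minimal Go sketch of what that dynamic part could look like, again with hypothetical type and helper names and a hard-coded topology rather than PGO's actual reconcile machinery: after a failover from node0 (AZ 1) to node2 (AZ 2), it computes which instances need the nosync tag added (node3) and which need it removed (node1).

```go
package main

import "fmt"

// Hypothetical representation of the cluster state, for illustration only.
type cluster struct {
	leader string
	az     map[string]string // instance -> availability zone
	nosync map[string]bool   // instances currently carrying the nosync tag
}

// desiredNosync returns the set of replicas that should carry nosync:
// exactly those sharing the current leader's AZ.
func desiredNosync(c cluster) map[string]bool {
	want := map[string]bool{}
	for node, zone := range c.az {
		if node != c.leader && zone == c.az[c.leader] {
			want[node] = true
		}
	}
	return want
}

// reconcileTags compares the desired set with the tags currently applied and
// returns which instances need nosync added or removed. This is the work that
// would be repeated on leader change and on replica creation.
func reconcileTags(c cluster) (add, remove []string) {
	want := desiredNosync(c)
	for node := range want {
		if !c.nosync[node] {
			add = append(add, node)
		}
	}
	for node := range c.nosync {
		if !want[node] {
			remove = append(remove, node)
		}
	}
	return add, remove
}

func main() {
	// Failover has moved the leader from node0 (AZ 1) to node2 (AZ 2); node1
	// still carries the nosync tag applied for the previous leader.
	c := cluster{
		leader: "node2",
		az: map[string]string{
			"node0": "AZ1", "node1": "AZ1",
			"node2": "AZ2", "node3": "AZ2",
			"node4": "AZ3", "node5": "AZ3",
		},
		nosync: map[string]bool{"node1": true},
	}
	add, remove := reconcileTags(c)
	fmt.Println("add nosync:", add)       // [node3]
	fmt.Println("remove nosync:", remove) // [node1]
}
```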
