When using quorum synchronisation, PGO should ensure that at least one node acking any transaction commit is on a different AZ to that of the primary instance #4149

@matto4096

Description

Overview

Given:

  • A Kubernetes cluster with N availability zones (AZs)
  • A Postgres Cluster using quorum synchronization and num_sync = M
  • The Postgres Cluster has X nodes, where X > (N * M)

Then:

  • PGO should ensure that at least one of the M nodes acking any transaction commit is in a different AZ from that of the primary instance, so that committed data is stored durably in at least 2 different AZs

Use Case

Many users spread their clusters across failure domains to prevent data loss and ensure availability during a failure in a single domain.

Consider a Postgres cluster with:

  • Six nodes (1 primary, 5 replicas), spread evenly across an AWS region with 3 availability zones.
  • Quorum synchronisation, with a synchronous_node_count of 1

The topology might look like this:
AZ 1: node0 (Primary), node1 (Replica)
AZ 2: node2 (Replica), node3 (Replica)
AZ 3: node4 (Replica), node5 (Replica)

synchronous_standby_names = 'ANY 1 (node1,node2,node3,node4,node5)'

The problem here is that the replica in AZ 1 (node1) can become the synchronous standby. If it does, the latest committed transactions are only guaranteed to exist in AZ 1, so an outage of AZ 1 can cause data loss.
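To make that failure window concrete, here is a minimal Go sketch (the instance-to-AZ mapping is just the example topology above, hard-coded for illustration; it is not anything PGO or Patroni exposes in this form). It checks, for each standby that can satisfy ANY 1 on its own, whether an acknowledged commit would survive the loss of the primary's AZ; only node1 fails the check.

```go
package main

import "fmt"

func main() {
	// Hypothetical instance-to-AZ mapping matching the topology above.
	az := map[string]string{
		"node0": "AZ1", "node1": "AZ1",
		"node2": "AZ2", "node3": "AZ2",
		"node4": "AZ3", "node5": "AZ3",
	}
	primary := "node0"

	// With ANY 1, an acknowledgement from any single listed standby satisfies
	// the quorum, so consider each standby in isolation.
	for _, standby := range []string{"node1", "node2", "node3", "node4", "node5"} {
		survives := az[standby] != az[primary]
		fmt.Printf("ack only from %s (%s): commit survives loss of %s? %v\n",
			standby, az[standby], az[primary], survives)
	}
}
```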

To solve this, PGO should ensure that a replica in the same AZ as the primary never becomes the synchronous replica, so that the latest data is present in at least 2 failure domains at all times.

At scheduling time this is easy: simply apply the Patroni nosync tag to any replica in the same AZ as the primary.

AZ 1: node0 (Primary), node1 (Replica - nosync)
AZ 2: node2 (Replica), node3 (Replica)
AZ 3: node4 (Replica), node5 (Replica)
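For illustration, a minimal Go sketch of this scheduling-time rule, using a hypothetical instance-to-AZ mapping and hypothetical helper names rather than PGO's actual code: it marks every replica that shares the primary's AZ as nosync, and shows the candidate list that Patroni would then be free to choose from when it builds synchronous_standby_names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// nosyncSet returns the replicas that share the primary's AZ; these are the
// instances that would receive the Patroni nosync tag.
func nosyncSet(primary string, az map[string]string) map[string]bool {
	tagged := map[string]bool{}
	for node, zone := range az {
		if node != primary && zone == az[primary] {
			tagged[node] = true
		}
	}
	return tagged
}

// quorumCandidates builds the standby list that would remain eligible for
// synchronous_standby_names once nosync replicas are excluded.
func quorumCandidates(primary string, az map[string]string, nosync map[string]bool) string {
	var names []string
	for node := range az {
		if node != primary && !nosync[node] {
			names = append(names, node)
		}
	}
	sort.Strings(names)
	return fmt.Sprintf("ANY 1 (%s)", strings.Join(names, ","))
}

func main() {
	az := map[string]string{
		"node0": "AZ1", "node1": "AZ1",
		"node2": "AZ2", "node3": "AZ2",
		"node4": "AZ3", "node5": "AZ3",
	}
	nosync := nosyncSet("node0", az)
	fmt.Println("nosync:", nosync)                     // map[node1:true]
	fmt.Println(quorumCandidates("node0", az, nosync)) // ANY 1 (node2,node3,node4,node5)
}
```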

However, the primary could lose the leader lock at any time, which would result in a topology like this:

AZ 1: node0 (Replica), node1 (Replica - nosync)
AZ 2: node2 (Primary), node3 (Replica)
AZ 3: node4 (Replica), node5 (Replica)

This is exactly the same problem we started with: the primary and the synchronous replica can end up in the same failure domain. In fact, it is quite likely that a replica in the same AZ will become the sync replica, because it will have lower latency than replicas in other AZs.

To solve this problem, it seems that PGO would need to dynamically apply and remove the nosync tag at various points in the lifecycle of a cluster:

  1. When a new leader is elected, and the new leader is in a different AZ to the previous leader
  2. When a new replica is added to the Postgres cluster and that replica is in the same AZ as the current leader

This should be sufficient to prevent replicas in the same AZ as the primary from ever becoming the synchronous replica, thus eliminating the risk of a single point of failure.
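A minimal Go sketch of what that dynamic part could look like, again with hypothetical type and helper names and a hard-coded topology rather than PGO's actual reconcile machinery: after a failover from node0 (AZ 1) to node2 (AZ 2), it computes which instances need the nosync tag added (node3) and which need it removed (node1).

```go
package main

import "fmt"

// Hypothetical representation of the cluster state, for illustration only.
type cluster struct {
	leader string
	az     map[string]string // instance -> availability zone
	nosync map[string]bool   // instances currently carrying the nosync tag
}

// desiredNosync returns the set of replicas that should carry nosync:
// exactly those sharing the current leader's AZ.
func desiredNosync(c cluster) map[string]bool {
	want := map[string]bool{}
	for node, zone := range c.az {
		if node != c.leader && zone == c.az[c.leader] {
			want[node] = true
		}
	}
	return want
}

// reconcileTags compares the desired set with the tags currently applied and
// returns which instances need nosync added or removed. This is the work that
// would be repeated on leader change and on replica creation.
func reconcileTags(c cluster) (add, remove []string) {
	want := desiredNosync(c)
	for node := range want {
		if !c.nosync[node] {
			add = append(add, node)
		}
	}
	for node := range c.nosync {
		if !want[node] {
			remove = append(remove, node)
		}
	}
	return add, remove
}

func main() {
	// Failover has moved the leader from node0 (AZ 1) to node2 (AZ 2); node1
	// still carries the nosync tag applied for the previous leader.
	c := cluster{
		leader: "node2",
		az: map[string]string{
			"node0": "AZ1", "node1": "AZ1",
			"node2": "AZ2", "node3": "AZ2",
			"node4": "AZ3", "node5": "AZ3",
		},
		nosync: map[string]bool{"node1": true},
	}
	add, remove := reconcileTags(c)
	fmt.Println("add nosync:", add)       // [node3]
	fmt.Println("remove nosync:", remove) // [node1]
}
```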
