[NEW] Primary replica role at the slot level #1372

**The problem/use-case that the feature addresses**

Currently, replicas provide availability, increased durability (from a domain failure perspective), and performance improvements when using Read from Replica (RFR). However, this performance scaling is limited to stale reads and does not extend to regular writes or reads. Many customers do not use RFR for various reasons. Additionally, when a primary node fails, all write traffic to that node fails, potentially requiring application-level logic to handle the failure.

**Description of the feature**

We propose redefining role assignments from the node level to the slot level. In this model, a node can be the primary for certain slots and a replica for others. This involves adjusting the codebase so that any primary/replica designations are applied to slots rather than nodes. Essentially, the node becomes a logical container of compute, memory, and services that manages atomic data entities (slots).
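To make the slot-level model concrete, here is a minimal sketch of a node that records a role per slot instead of a single node-wide role. The `Role` and `Node` types and their method names are hypothetical and only illustrate the idea; they do not correspond to existing structures in the codebase.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Role(Enum):
    PRIMARY = "primary"
    REPLICA = "replica"


@dataclass
class Node:
    """Hypothetical node: a container that hosts slots, each with its own role."""
    name: str
    slot_roles: Dict[int, Role] = field(default_factory=dict)

    def role_for(self, slot: int) -> Optional[Role]:
        # None means this node does not host the slot at all.
        return self.slot_roles.get(slot)

    def primary_slots(self) -> List[int]:
        # Slots for which this node accepts writes.
        return sorted(s for s, r in self.slot_roles.items() if r is Role.PRIMARY)


# Two nodes in one shard: each is primary for half the slots
# and replica for the other half.
node_a = Node("a", {0: Role.PRIMARY, 1: Role.PRIMARY, 2: Role.REPLICA, 3: Role.REPLICA})
node_b = Node("b", {0: Role.REPLICA, 1: Role.REPLICA, 2: Role.PRIMARY, 3: Role.PRIMARY})
```

In this sketch neither node is "the replica": `node_a.primary_slots()` and `node_b.primary_slots()` are both non-empty, so both nodes serve writes.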

With this approach, we can scale both write and read performance with the number of nodes in a shard, eliminating the concept of a replica node. If a node fails, only the slots for which it was the primary are directly impacted, improving fault granularity and isolation.
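The difference in failure impact can be illustrated with a small sketch. The slot-to-primary maps, the node names, and the round-robin spread below are assumptions chosen for illustration; the proposal does not prescribe how primaries are distributed across nodes.

```python
from typing import Dict, List


def slots_losing_primary(slot_primary: Dict[int, str], failed_node: str) -> List[int]:
    """Return the slots whose primary was on the failed node."""
    return sorted(slot for slot, primary in slot_primary.items() if primary == failed_node)


nodes = ["a", "b", "c"]

# Node-level roles today: node "a" is the primary for every slot in the shard.
node_level = {slot: "a" for slot in range(6)}

# Slot-level roles (hypothetical round-robin spread over the same three nodes).
slot_level = {slot: nodes[slot % len(nodes)] for slot in range(6)}

print(slots_losing_primary(node_level, "a"))  # [0, 1, 2, 3, 4, 5]
print(slots_losing_primary(slot_level, "a"))  # [0, 3]
```

With node-level roles, losing "a" interrupts writes to every slot in the shard; with the slot-level spread, only the slots that "a" led are affected.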

The recent introduction of dict-per-slot has already moved many processes to the slot level, which eases the transition to this model. As part of this feature, we will need to continue down this path for other flows in the system, such as bgsave.

This change would require client support. For clients that lack it, we can initially implement the feature in a degenerate form in which all slots in a shard have the same primary node, maintaining backward compatibility.
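One way to picture the degenerate form is as a switch in the primary-assignment step. This is a sketch under assumptions: `assign_primaries`, the `slot_level_enabled` flag, and the round-robin policy are all hypothetical, and the proposal does not say how the mode would be selected.

```python
from typing import Dict, List


def assign_primaries(slots: List[int], nodes: List[str],
                     slot_level_enabled: bool) -> Dict[int, str]:
    """Map each slot in a shard to the node that acts as its primary."""
    if not slot_level_enabled:
        # Degenerate form: one node is primary for every slot,
        # which looks like today's node-level primary to older clients.
        return {slot: nodes[0] for slot in slots}
    # Hypothetical policy: spread primaries round-robin across the shard's nodes.
    return {slot: nodes[i % len(nodes)] for i, slot in enumerate(slots)}


def is_degenerate(slot_primary: Dict[int, str]) -> bool:
    """True when all slots share a single primary node."""
    return len(set(slot_primary.values())) <= 1


legacy = assign_primaries([10, 11, 12, 13], ["a", "b"], slot_level_enabled=False)
spread = assign_primaries([10, 11, 12, 13], ["a", "b"], slot_level_enabled=True)
```

Here `legacy` maps every slot to "a", while `spread` alternates between "a" and "b".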

**Additional information**

An added benefit of this approach is the potential to reduce code complexity by unifying the code paths of replication and slot migration, which are currently two similar processes for maintaining data consistency between nodes.

For Cluster Mode Disabled (CMD), we can consider all data to reside in slot 0. In the long term, we might consider enabling slots (or logical grouping) for CMD, allowing customers to gain the benefits of this model without adopting Cluster Mode Enabled (CME).
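The slot 0 idea can be sketched as a single key-to-slot function shared by both modes. The CME branch follows the existing cluster rule (CRC16 of the key or its hash tag, modulo 16384); the function names and the `cluster_enabled` parameter are hypothetical, and `binascii.crc_hqx` with an initial value of 0 is used here as the CRC16 implementation.

```python
import binascii

CLUSTER_SLOTS = 16384


def hash_tag(key: bytes) -> bytes:
    """Return the part of the key that is hashed, honoring a non-empty {tag}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            return key[start + 1:end]
    return key


def slot_for_key(key: bytes, cluster_enabled: bool) -> int:
    if not cluster_enabled:
        # CMD: treat the whole dataset as living in slot 0.
        return 0
    # CME: CRC16 of the key (or its hash tag) modulo the slot count.
    return binascii.crc_hqx(hash_tag(key), 0) % CLUSTER_SLOTS
```

With this shape, the slot-level role logic would not need a separate CMD code path: every key resolves to a slot, and in CMD that slot is always 0.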