Proposal: Oxia Shard Replica Consistency Validation #849
mattisonchao started this conversation in Proposal
TL;DR
This proposal introduces a mechanism to validate strict data consistency across all replicas (Leader and Followers) of an Oxia shard. By introducing a new ConsistencyCheck WAL entry, we create a deterministic synchronisation point where all replicas capture an atomic snapshot of their LSM state and WAL chain CRC. This feature enables operators to detect silent data corruption, "split-brain" divergence due to non-deterministic application logic (e.g., memory counters), and storage anomalies.
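As a rough illustration of the idea, the sketch below models two replicas that fold every applied entry into a running chain CRC and, at a check point, also digest their current state. Every name here (`Put`, `Replica`, `Checkpoint`, `Check`) is hypothetical, a plain map stands in for the LSM store, and CRC-32C is an assumed checksum; the actual ConsistencyCheck entry format and Oxia's internal APIs are not defined in this proposal yet.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

// CRC-32C is an assumption for this sketch, not a choice stated in the proposal.
var table = crc32.MakeTable(crc32.Castagnoli)

// Put is a hypothetical write operation carried by a WAL entry.
type Put struct{ Key, Value string }

// Replica is a toy replica: a map stands in for the LSM store.
type Replica struct {
	state    map[string]string
	chainCRC uint32
}

func NewReplica() *Replica { return &Replica{state: map[string]string{}} }

// Apply folds the entry into the running chain CRC, then applies it to the state.
// The "key=value" encoding is a simplification and is ambiguous for keys containing '='.
func (r *Replica) Apply(p Put) {
	r.chainCRC = crc32.Update(r.chainCRC, table, []byte(p.Key+"="+p.Value))
	r.state[p.Key] = p.Value
}

// Checkpoint is what a replica would record on reaching a check entry.
type Checkpoint struct {
	ChainCRC uint32 // digest of the log applied so far
	StateCRC uint32 // digest of the current key/value state
}

// Check digests the state in sorted key order so the result does not
// depend on Go's randomized map iteration.
func (r *Replica) Check() Checkpoint {
	keys := make([]string, 0, len(r.state))
	for k := range r.state {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var s uint32
	for _, k := range keys {
		s = crc32.Update(s, table, []byte(k+"="+r.state[k]))
	}
	return Checkpoint{ChainCRC: r.chainCRC, StateCRC: s}
}

func main() {
	wal := []Put{{"a", "1"}, {"b", "2"}, {"a", "3"}}
	leader, follower := NewReplica(), NewReplica()
	for _, p := range wal {
		leader.Apply(p)
		follower.Apply(p)
	}
	fmt.Println("match:", leader.Check() == follower.Check())

	// Simulate silent corruption of the follower's store: the log digest
	// still agrees, but the state digest no longer does.
	follower.state["b"] = "9"
	fmt.Println("match after corruption:", leader.Check() == follower.Check())
}
```

Keeping the two digests separate is a design choice of this sketch: a mismatch in the chain CRC would point at the log itself, while a mismatch in the state digest alone would point at the apply path or the storage layer.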
Background knowledge
Oxia operates on the Replicated State Machine (RSM) model. In this model, high availability and consistency are achieved by replicating a write-ahead log (WAL) to multiple servers.
Motivation
While the RSM model guarantees consistency in theory, in practice, replicas can diverge due to software or hardware faults. We have observed issues in production where:
Currently, Oxia lacks a tool to "prove" that Replica A and Replica B contain the exact same data at a specific log index.
(todo...)