Observed behavior
In NATS 2.12.1, with a five-node cluster and replication factor 5, JetStream streams can end up in split-brain when a single node experiences a simulated power failure. Acknowledged records are visible to consumers connected to some nodes, but not to others. This problem is readily reproducible with as little as a single process crash plus some careful process pauses before and after the crash, which help ensure that specific nodes are the first to form a quorum. I suspect this behavior is also possible without process pauses, just significantly less frequently.
For example, take this test run, which killed node n3 at approximately 45 seconds, leaving n1, n2, n4, and n5 intact. Before killing n3 we paused n1 and n2, and when n3 was restarted, we resumed those nodes and paused n4 and n5.
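In case it helps, here is a rough sketch of that schedule expressed as plain process signals from a control host. The host names, pkill/systemctl invocations, and timings are illustrative only; the actual test drives this through Jepsen's nemesis, with lazyfs dropping n3's unsynced writes when it is killed.

```java
import java.util.concurrent.TimeUnit;

// Illustrative driver for the pause/kill/resume schedule described above.
public class PauseKillSchedule {
    // Run a shell command on a node over ssh (hypothetical setup).
    static void on(String host, String cmd) throws Exception {
        new ProcessBuilder("ssh", host, cmd).inheritIO().start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        on("n1", "pkill -STOP nats-server");     // pause n1 and n2 so they fall behind
        on("n2", "pkill -STOP nats-server");
        TimeUnit.SECONDS.sleep(30);              // let n3, n4, n5 accept acknowledged writes
        on("n3", "pkill -KILL nats-server");     // crash n3 (~45s); lazyfs also discards its unsynced data
        on("n3", "systemctl start nats-server"); // restart n3
        on("n1", "pkill -CONT nats-server");     // resume n1 and n2 ...
        on("n2", "pkill -CONT nats-server");
        on("n4", "pkill -STOP nats-server");     // ... and pause n4 and n5, so n1, n2, n3 form the next quorum
        on("n5", "pkill -STOP nats-server");
    }
}
```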
Note that there is a brief stall in operations (presumably for leader election) when we first pause n1 and n2. Upon killing and restarting n3 at ~45 seconds, operations fail to complete for a short time while n3 starts up. Then the cluster begins accepting writes again.
Unfortunately, this was unsafe: n1 and n2 were missing acknowledged writes because they had been paused, and n3 lost acknowledged writes, thanks to NATS' choice not to sync records to disk. Allowing n1, n2, and n3 to continue accepting writes after this point caused nodes n1 and n2 to lose roughly five seconds' worth of writes from immediately prior to the crash, even though those records were present on other nodes! Writes were lost both before and after acknowledged writes performed by the same process, which tells us the log has holes in it: a violation of NATS' linearizability claim.
This was meant as a specific example of #7564, which noted that JetStream will acknowledge publish calls even when nodes have not written those records to disk. This is generally unsafe in consensus systems because of scenarios like the one outlined above. However, this case actually led to replica divergence, which is even weirder--I'm filing a separate issue for it.
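Throughout, "acknowledged" means the server returned a positive publish acknowledgment to the client. A minimal jnats sketch of what such a replicated publish looks like--the URL, stream name, and subject are my own illustration, not the test's actual configuration:

```java
import io.nats.client.*;
import io.nats.client.api.*;

public class AckedPublish {
    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://n1:4222")) {
            // File-backed stream with replication factor 5 (names illustrative).
            nc.jetStreamManagement().addStream(StreamConfiguration.builder()
                    .name("jepsen")
                    .subjects("jepsen.queue")
                    .storageType(StorageType.File)
                    .replicas(5)
                    .build());

            JetStream js = nc.jetStream();
            // Synchronous publish: the returned PublishAck is the server's claim
            // that this record is durably stored in the stream.
            PublishAck ack = js.publish("jepsen.queue", "42".getBytes());
            System.out.println("acked at stream sequence " + ack.getSeqno());
        }
    }
}
```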
Sometimes records are missing on n3. In this particular case, they were missing from n1 and n4, which is particularly odd given the structure of the process pauses. Also, a suffix of records was missing on n5!
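To make the divergence concrete: the same stream, read through different nodes, returns different contents. A rough jnats sketch of the kind of per-node read-back one might use to compare what consumers on each node can see (again, the URL, stream, and subject names are illustrative; this is not the test's actual checker):

```java
import java.time.Duration;
import java.util.List;
import io.nats.client.*;

public class ReadBack {
    public static void main(String[] args) throws Exception {
        // Connect to one specific node and read everything visible there;
        // repeat against each node's URL to compare.
        try (Connection nc = Nats.connect("nats://n5:4222")) {
            JetStream js = nc.jetStream();
            JetStreamSubscription sub = js.subscribe("jepsen.queue",
                    PullSubscribeOptions.builder().stream("jepsen").build());
            List<Message> batch;
            do {
                batch = sub.fetch(100, Duration.ofSeconds(1));
                for (Message m : batch) {
                    System.out.println(m.metaData().streamSequence()
                            + " " + new String(m.getData()));
                }
            } while (!batch.isEmpty());
        }
    }
}
```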
The critical phenomenon here is the process crash. The pauses are used to control which nodes can become leaders, and to broaden the windows of concurrency--both of which raise our chances of observing data loss. You can get the same effect with network partitions, and I suspect variable network latencies might be sufficient to cause this too.
Expected behavior
NATS replicas should definitely not diverge, and should also not lose data when a single node crashes.
Server and client version
NATS 2.12.1, jnats 2.24.0
Host environment
This is a cluster of LXC nodes running under Jepsen.
Steps to reproduce
With the NATS Jepsen test at 1d295bbc93620522087400660e187c1a733450c6, try:
lein run test --nemesis pause-kill --time-limit 60 --version 2.12.1 --rate 1000 --final-time-limit 30 --sync-interval 10 --lazyfs
This works with any positive sync-interval, but the default of two minutes means we would have to wait two minutes before NATS does its initial sync--shortening the interval lets us run a shorter test.
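For reference, I believe the server-side knob behind this flag is the sync_interval setting in the jetstream configuration block; a sketch with illustrative values:

```
jetstream {
  store_dir: "/var/lib/nats"  # illustrative path
  # How often JetStream syncs stream data to disk. The server default is 2m;
  # it can also be set to 'always' to fsync on every write.
  sync_interval: "10s"
}
```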