
Data loss with membership changes and partitions #7545

@aphyr

Observed behavior

In nats-server 2.12.1, membership changes, process kills, and network partitions can cause the loss of some or all acknowledged records published to a JetStream stream, even with sync-interval always.

For example, take this test run, in which five out of 5236 acknowledged records published in the middle of the test were lost. The following plot shows published records arranged by the time a client sent the publish message for that record. Each record is a single point, and the color indicates whether the record was "ok" (eventually read), "lost" (acknowledged but never read) or "unknown" (the publish operation was not acknowledged, and the message was never read). Records are spread over the vertical axis to show approximate throughput over time. The data loss event is visible as a small red streak around 16 seconds:

[Image: timeline of published records, colored by ok / lost / unknown status]

This test involves five nodes: n1 through n5. Around 16 seconds into the test, we partitioned node n1 away from the other four, then killed nodes n1, n2, and n3, then removed node n4, in quick succession. While this was occurring, we published records 7-63, 8-63, 9-63, 11-63, and 12-63, all of which were acknowledged successfully by their respective nodes.

{:index 2022, :time 16109960074, :type :ok, :process 4, :f :publish, :value "4-63"}
{:index 2023, :time 16112167773, :type :invoke, :process 5, :f :publish, :value "5-63"}
{:index 2024, :time 16114235936, :type :info, :process :nemesis, :f :start-partition, :value [:isolated {"n1" #{"n2" "n5" "n4" "n3"}, "n2" #{"n1"}, "n5" #{"n1"}, "n4" #{"n1"}, "n3" #{"n1"}}]}
{:index 2025, :time 16114263329, :type :info, :process :nemesis, :f :kill, :value :majority}
{:index 2026, :time 16122492214, :type :invoke, :process 7, :f :publish, :value "7-63"}
{:index 2027, :time 16123329080, :type :ok, :process 7, :f :publish, :value "7-63"}
{:index 2028, :time 16124539204, :type :invoke, :process 8, :f :publish, :value "8-63"}
{:index 2029, :time 16125058438, :type :ok, :process 8, :f :publish, :value "8-63"}
{:index 2030, :time 16132755900, :type :invoke, :process 9, :f :publish, :value "9-63"}
{:index 2031, :time 16133716428, :type :ok, :process 9, :f :publish, :value "9-63"}
{:index 2032, :time 16134796130, :type :invoke, :process 10, :f :publish, :value "10-63"}
{:index 2033, :time 16145849910, :type :invoke, :process 11, :f :publish, :value "11-63"}
{:index 2034, :time 16146687908, :type :ok, :process 11, :f :publish, :value "11-63"}
{:index 2035, :time 16164675317, :type :invoke, :process 12, :f :publish, :value "12-63"}
{:index 2036, :time 16165734007, :type :ok, :process 12, :f :publish, :value "12-63"}
{:index 2037, :time 16179824530, :type :info, :process :nemesis, :f :kill, :value {"n1" :killed, "n3" :killed, "n2" :killed}}
{:index 2038, :time 16179871864, :type :info, :process :nemesis, :f :leave, :value "n4"}
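
For context, "acknowledged" above means the JetStream publish call returned a PublishAck to the client. The actual publish path lives in the Jepsen harness, but a minimal jnats sketch of the same idea looks like this (the subject name and server address are placeholders, not the harness's real configuration):

import io.nats.client.Connection;
import io.nats.client.JetStream;
import io.nats.client.Nats;
import io.nats.client.api.PublishAck;
import java.nio.charset.StandardCharsets;

public class PublishSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a single node; the test spreads clients across n1..n5.
        try (Connection nc = Nats.connect("nats://n1:4222")) {
            JetStream js = nc.jetStream();
            // A record such as "7-63" only counts as acknowledged if this
            // synchronous publish returns a PublishAck instead of throwing.
            // "jepsen.subject" is a placeholder subject name.
            PublishAck ack = js.publish("jepsen.subject",
                    "7-63".getBytes(StandardCharsets.UTF_8));
            System.out.println("acked at stream seq " + ack.getSeqno());
        }
    }
}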

After a few hundred seconds of additional faults, the cluster recovered, and was able to acknowledge more published messages. At the end of the test we rejoined all nodes, and, on each node, attempted to read every message from the stream using repeated calls to fetch. These reads observed stream values like:

{:index 15568, :time 319383763266, :type :invoke, :process 362, :f :fetch, :value nil}
{:index 15569, :time 319384992110, :type :ok, :process 362, :f :fetch, :value ["1-58" ... "4-63" "158-5" "158-6" ...]}

Note that value 4-63, acknowledged immediately prior to the partition at 16 seconds, is present in the final read. However, values like 7-63, which were also acknowledged as successful, are missing. Instead, the stream's values jump directly to 158-5, which was written much later in the test.
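For clarity on the read path: the final reads pull batches from the stream via a pull consumer, calling fetch repeatedly until nothing more comes back. Again, a rough jnats sketch of that pattern, with placeholder names rather than the harness's real ones:

import io.nats.client.Connection;
import io.nats.client.JetStream;
import io.nats.client.JetStreamSubscription;
import io.nats.client.Message;
import io.nats.client.Nats;
import io.nats.client.PullSubscribeOptions;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;

public class FetchSketch {
    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://n1:4222")) {
            JetStream js = nc.jetStream();
            // Ephemeral pull consumer on the (placeholder) test subject.
            JetStreamSubscription sub = js.subscribe("jepsen.subject",
                    PullSubscribeOptions.builder().build());
            // Keep fetching batches until a fetch comes back empty.
            while (true) {
                List<Message> msgs = sub.fetch(100, Duration.ofSeconds(1));
                if (msgs.isEmpty()) break;
                for (Message m : msgs) {
                    System.out.println(new String(m.getData(), StandardCharsets.UTF_8));
                    m.ack();
                }
            }
        }
    }
}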

The problem here is not necessarily that NATS lost data; we should expect that some combinations of faults can break the consensus mechanism permanently. The problem is that NATS lost data silently: it went on to accept more reads and writes after this point as if nothing had gone wrong!

I'm still trying to narrow down the conditions under which this can happen. I've got lots of cases of total data loss, but that may actually be OK, given that the test may have removed too many nodes from the cluster. If you've got any insight, I'd love to hear about it! :-)

Expected behavior

NATS should not violate consensus; once data loss has occurred, it should refuse further writes and reads.

Server and client version

This is with nats-server 2.12.1 and the jnats client at version 2.24.0.

Host environment

These nodes are running in LXC containers on Linux Mint (kernel 6.8.0-85-generic), AMD x64.

Steps to reproduce

You can reproduce these results with the Jepsen test harness for NATS, at commit 50fae2030ac3995703bccca6ae72bc35b9bce8b6, by running:

lein run test-all --nemesis membership,partition,kill --time-limit 300 --leave-db-running --version 2.12.1 --sync-interval always --rate 100 --test-count 50 --no-lazyfs
