Description
Observed behavior
In version 2.12.1, on a five-node cluster with replication factor 5 and sync-interval always, single-bit errors in a stream's snap files, combined with process crashes, can cause JetStream to completely delete all data for that stream. In this test run, nodes n3 and n5 had some of their $SYS/_js/ snapshot files corrupted.
[1010859] 2025/11/15 20:26:59.365457 [INF] 10.0.1.147:33042 - rid:91 - Router connection closed: Client Closed
[1010859] 2025/11/15 20:27:02.947432 [INF] Self is new JetStream cluster metadata leader
[1010859] 2025/11/15 20:27:14.996174 [WRN] Detected orphaned stream 'jepsen > jepsen-stream', will cleanup
Sure enough, n3 winds up with a completely empty data directory for the jepsen stream. n5 went through something similar when n3 became the metadata leader.
At the end of the test we ensure every node is started. Despite the corruption being limited to just two nodes, the cluster somehow never recovered quorum, causing the entire system to remain unavailable for the duration of the test. Here are the three nodes which were unaffected:
Here's n1:
[1042449] 2025/11/15 20:26:59.913104 [WRN] Catchup for stream 'jepsen > jepsen-stream' stalled
[1042449] 2025/11/15 20:27:02.947427 [INF] JetStream cluster new metadata leader: n3/jepsen-cluster
[1042449] 2025/11/15 20:27:04.913410 [WRN] Catchup for stream 'jepsen > jepsen-stream' stalled
[1042449] 2025/11/15 20:27:04.914097 [WRN] Error applying entries to 'jepsen > jepsen-stream': catchup failed, too many retries
[1042449] 2025/11/15 20:27:04.914111 [WRN] RAFT [fjFyEjc1 - S-R5F-e98gw6bs] Draining and replaying snapshot
Here's n2:
[1017273] 2025/11/15 20:27:02.947554 [INF] JetStream cluster new metadata leader: n3/jepsen-cluster
[1017273] 2025/11/15 20:27:04.204766 [INF] JetStream cluster new stream leader for 'jepsen > jepsen-stream'
[1017273] 2025/11/15 20:27:04.914839 [INF] Catchup for stream 'jepsen > jepsen-stream' complete
Here's n4:
[1016805] 2025/11/15 20:27:02.947289 [INF] JetStream cluster new metadata leader: n3/jepsen-cluster
[1016805] 2025/11/15 20:27:04.201304 [WRN] JetStream cluster stream 'jepsen > jepsen-stream' has NO quorum, stalled
For the final read phase, all attempts to subscribe to the Jepsen stream returned:
2025-11-15 14:27:28,171{GMT} INFO [jepsen worker 12] jepsen.nats.queue: Can't subscribe, [SUB-90007] No matching streams for subject.
Expected behavior
File corruption on a minority of nodes should not cause the cluster to lose quorum entirely.
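To make the expectation concrete, here is a minimal sketch of the majority arithmetic, assuming a standard Raft-style majority quorum. The function names are hypothetical and are not part of NATS.

```python
def majority(replicas: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return replicas // 2 + 1


def can_form_quorum(replicas: int, unavailable: int) -> bool:
    """True if the remaining nodes are still enough for a majority."""
    return replicas - unavailable >= majority(replicas)


replicas = 5   # replication factor used in this test
corrupted = 2  # n3 and n5 had corrupted snapshot files

print(majority(replicas))                    # 3
print(can_form_quorum(replicas, corrupted))  # True
```

Under that assumption, the three unaffected nodes (n1, n2, n4) are exactly a majority of five, so losing n3 and n5 alone should not prevent the stream from reaching quorum.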
Server and client version
This is with NATS 2.12.1 and jnats 2.24.0.
Host environment
These tests are running in LXC containers on version 2.12.1
Steps to reproduce
With the NATS Jepsen test suite, at version 5daa78754b8a75d972d0d1c951a36d1558527cdc, try:
lein run test-all --nemesis bitflip-file-chunks,kill --time-limit 600 --leave-db-running --version 2.12.1 --sync-interval always --rate 10000 --no-lazyfs --final-time-limit 300 --corrupt-file-target snap
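When checking node logs from a run for the same failure, a small script can pull out the relevant warnings. This is only a sketch: the line format is inferred from the log excerpts in this report, and the `SUSPICIOUS` substrings and the `find_suspicious` helper are hypothetical choices, not part of NATS or the Jepsen test suite.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Format inferred from the excerpts above: [pid] date time [LVL] message
LOG_LINE = re.compile(
    r"^\[(?P<pid>\d+)\] (?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"\[(?P<level>[A-Z]{3})\] (?P<msg>.*)$"
)

# Substrings taken from the warnings quoted in this report.
SUSPICIOUS = ("Detected orphaned stream", "has NO quorum", "catchup failed")


class Hit(NamedTuple):
    pid: int
    ts: str
    level: str
    msg: str


def find_suspicious(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield parsed log lines whose message contains a suspicious substring."""
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m and any(s in m["msg"] for s in SUSPICIOUS):
            yield Hit(int(m["pid"]), m["ts"], m["level"], m["msg"])


sample = [
    "[1010859] 2025/11/15 20:27:02.947432 [INF] Self is new JetStream cluster metadata leader",
    "[1010859] 2025/11/15 20:27:14.996174 [WRN] Detected orphaned stream 'jepsen > jepsen-stream', will cleanup",
    "[1042449] 2025/11/15 20:27:04.914097 [WRN] Error applying entries to 'jepsen > jepsen-stream': catchup failed, too many retries",
    "[1016805] 2025/11/15 20:27:04.201304 [WRN] JetStream cluster stream 'jepsen > jepsen-stream' has NO quorum, stalled",
]

hits = list(find_suspicious(sample))
for h in hits:
    print(h.pid, h.ts, h.msg)
```

In practice `lines` would come from each node's log file, for example `open(path)`, where the path depends on how the test stores logs.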