Corruption of $SYS snapshot files on a minority can cause JetStream nodes to permanently delete streams, taking down the cluster #7556

@aphyr

Observed behavior

In version 2.12.1, with a five-node cluster, replication factor 5, and sync-interval always, single-bit errors in a stream's snap files, combined with process crashes, can cause JetStream to completely delete all data about that stream. In this test run, nodes n3 and n5 had some of their $SYS/_js/ snapshot files corrupted.

[1010859] 2025/11/15 20:26:59.365457 [INF] 10.0.1.147:33042 - rid:91 - Router connection closed: Client Closed
[1010859] 2025/11/15 20:27:02.947432 [INF] Self is new JetStream cluster metadata leader
[1010859] 2025/11/15 20:27:14.996174 [WRN] Detected orphaned stream 'jepsen > jepsen-stream', will cleanup

Sure enough, n3 winds up with a completely empty data directory for the jepsen stream. n5 went through something similar when n3 became the metadata leader.
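For readers unfamiliar with this fault, here is a minimal sketch of what a single-bit error in a file looks like. It is not the Jepsen nemesis implementation (`bitflip-file-chunks` works on file chunks and is written in Clojure); the function name `flip_random_bit` and the scratch file are hypothetical, chosen only for illustration.

```python
import random
import tempfile
from pathlib import Path
from typing import Tuple


def flip_random_bit(path: Path, rng: random.Random) -> Tuple[int, int]:
    """Flip one randomly chosen bit of the file at `path`, in place.

    Hypothetical helper for illustration only.
    Returns (byte_offset, bit_index) so the corruption can be logged.
    """
    data = bytearray(path.read_bytes())
    if not data:
        raise ValueError(f"{path} is empty; nothing to corrupt")
    offset = rng.randrange(len(data))  # which byte to damage
    bit = rng.randrange(8)             # which bit inside that byte
    data[offset] ^= 1 << bit           # XOR toggles exactly one bit
    path.write_bytes(data)
    return offset, bit


if __name__ == "__main__":
    # Demonstrate on a scratch file, never on a real data directory.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "example.snap"
        target.write_bytes(b"\x00" * 64)
        offset, bit = flip_random_bit(target, random.Random())
        print(f"flipped bit {bit} of byte {offset}")
```

The file length is unchanged and only one byte differs, so size checks alone would not notice damage of this kind.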

At the end of the test, we ensure every node is started. Despite the corruption being limited to just two nodes, the cluster somehow never recovered quorum, causing the entire system to remain unavailable for the duration of the test. Here are logs from the three nodes whose files were not corrupted:

Here's n1:

[1042449] 2025/11/15 20:26:59.913104 [WRN] Catchup for stream 'jepsen > jepsen-stream' stalled
[1042449] 2025/11/15 20:27:02.947427 [INF] JetStream cluster new metadata leader: n3/jepsen-cluster
[1042449] 2025/11/15 20:27:04.913410 [WRN] Catchup for stream 'jepsen > jepsen-stream' stalled
[1042449] 2025/11/15 20:27:04.914097 [WRN] Error applying entries to 'jepsen > jepsen-stream': catchup failed, too many retries
[1042449] 2025/11/15 20:27:04.914111 [WRN] RAFT [fjFyEjc1 - S-R5F-e98gw6bs] Draining and replaying snapshot

Here's n2:

[1017273] 2025/11/15 20:27:02.947554 [INF] JetStream cluster new metadata leader: n3/jepsen-cluster
[1017273] 2025/11/15 20:27:04.204766 [INF] JetStream cluster new stream leader for 'jepsen > jepsen-stream'
[1017273] 2025/11/15 20:27:04.914839 [INF] Catchup for stream 'jepsen > jepsen-stream' complete

Here's n4:

[1016805] 2025/11/15 20:27:02.947289 [INF] JetStream cluster new metadata leader: n3/jepsen-cluster
[1016805] 2025/11/15 20:27:04.201304 [WRN] JetStream cluster stream 'jepsen > jepsen-stream' has NO quorum, stalled
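To line up the nodes' views by time, excerpts like the ones above can be split into fields. This is a rough sketch: the line layout is inferred only from the excerpts in this report, not from any NATS specification, and the names `LogLine`, `parse_line`, and `warnings_only` are hypothetical.

```python
import re
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional

# Assumed layout, inferred from the excerpts in this report:
#   [pid] YYYY/MM/DD HH:MM:SS.ffffff [LVL] message
LINE_RE = re.compile(
    r"^\[(?P<pid>\d+)\] "
    r"(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r"\[(?P<level>[A-Z]{3})\] "
    r"(?P<msg>.*)$"
)


class LogLine(NamedTuple):
    pid: int
    time: datetime
    level: str
    message: str


def parse_line(line: str) -> Optional[LogLine]:
    """Parse one server log line; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m["ts"], "%Y/%m/%d %H:%M:%S.%f")
    return LogLine(int(m["pid"]), ts, m["level"], m["msg"])


def warnings_only(lines: Iterable[str]) -> List[LogLine]:
    """Keep only the parsed lines logged at WRN level."""
    parsed = (parse_line(line) for line in lines)
    return [p for p in parsed if p is not None and p.level == "WRN"]


if __name__ == "__main__":
    # The first log excerpt from this report, copied verbatim.
    excerpt = [
        "[1010859] 2025/11/15 20:26:59.365457 [INF] 10.0.1.147:33042 - rid:91 - Router connection closed: Client Closed",
        "[1010859] 2025/11/15 20:27:02.947432 [INF] Self is new JetStream cluster metadata leader",
        "[1010859] 2025/11/15 20:27:14.996174 [WRN] Detected orphaned stream 'jepsen > jepsen-stream', will cleanup",
    ]
    for entry in warnings_only(excerpt):
        print(entry.time.isoformat(), entry.message)
```

Merging the parsed lines from every node and sorting by `time` gives a single timeline, which makes it easier to see that the orphan cleanup follows the metadata leader change by about twelve seconds.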

For the final read phase, all attempts to subscribe to the Jepsen stream returned:

2025-11-15 14:27:28,171{GMT}    INFO    [jepsen worker 12] jepsen.nats.queue: Can't subscribe, [SUB-90007] No matching streams for subject.

Expected behavior

File corruption on a minority of nodes should not cause the cluster to lose quorum entirely.
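The arithmetic behind this expectation, as a small sketch. It assumes the usual Raft majority rule (more than half of the replicas must be available); the helper names `quorum_size` and `has_quorum` are hypothetical and are not taken from the NATS code base.

```python
def quorum_size(replicas: int) -> int:
    """Smallest group that is a strict majority of `replicas`."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return replicas // 2 + 1


def has_quorum(replicas: int, healthy: int) -> bool:
    """True if `healthy` nodes are enough to form a majority."""
    return healthy >= quorum_size(replicas)


if __name__ == "__main__":
    replicas = 5    # replication factor used in this test
    corrupted = 2   # n3 and n5
    healthy = replicas - corrupted
    print(quorum_size(replicas))          # 3
    print(has_quorum(replicas, healthy))  # True
```

Under that assumption, the three uncorrupted nodes (n1, n2, n4) are exactly a majority of five, so losing n3 and n5 alone should not prevent the stream from reaching quorum.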

Server and client version

This is with NATS 2.12.1 and jnats 2.24.0.

Host environment

These tests are running in LXC containers on version 2.12.1

Steps to reproduce

With the NATS Jepsen test suite, at version 5daa78754b8a75d972d0d1c951a36d1558527cdc, try:

lein run test-all --nemesis bitflip-file-chunks,kill --time-limit 600 --leave-db-running --version 2.12.1 --sync-interval always --rate 10000 --no-lazyfs --final-time-limit 300 --corrupt-file-target snap

Labels: defect (suspected defect such as a bug or regression)
