Corruption of $SYS snapshot files on a minority can cause JetStream nodes to permanently delete streams, taking down the cluster #7556

### Observed behavior

In version 2.12.1, with a five-node cluster, a replication factor of 5, and `sync-interval always`, single-bit errors in the `snap` files for a stream, combined with process crashes, can cause JetStream to completely delete all data about that stream. In [this test run](https://s3.amazonaws.com/jepsen.io/analyses/nats-2.12.1/20251115T142345-snap-bitflip-quorum-break.zip), nodes `n3` and `n5` had some of their `$SYS/_js/` snapshot files corrupted. The logs from `n3` show it becoming the metadata leader and then flagging the stream as orphaned:

```
[1010859] 2025/11/15 20:26:59.365457 [INF] 10.0.1.147:33042 - rid:91 - Router connection closed: Client Closed
[1010859] 2025/11/15 20:27:02.947432 [INF] Self is new JetStream cluster metadata leader
[1010859] 2025/11/15 20:27:14.996174 [WRN] Detected orphaned stream 'jepsen > jepsen-stream', will cleanup
```

Sure enough, `n3` winds up with a completely empty data directory for the `jepsen` stream. `n5` went through something similar when `n3` became the metadata leader.

At the end of the test we ensure every node is started. Despite the corruption being limited to just two nodes, the cluster never recovered quorum, and the entire system remained unavailable for the rest of the test. Here are the logs from the three nodes whose files were not corrupted:

Here's n1:

```
[1042449] 2025/11/15 20:26:59.913104 [WRN] Catchup for stream 'jepsen > jepsen-stream' stalled
[1042449] 2025/11/15 20:27:02.947427 [INF] JetStream cluster new metadata leader: n3/jepsen-cluster
[1042449] 2025/11/15 20:27:04.913410 [WRN] Catchup for stream 'jepsen > jepsen-stream' stalled
[1042449] 2025/11/15 20:27:04.914097 [WRN] Error applying entries to 'jepsen > jepsen-stream': catchup failed, too many retries
[1042449] 2025/11/15 20:27:04.914111 [WRN] RAFT [fjFyEjc1 - S-R5F-e98gw6bs] Draining and replaying snapshot
```

Here's n2:

```
[1017273] 2025/11/15 20:27:02.947554 [INF] JetStream cluster new metadata leader: n3/jepsen-cluster
[1017273] 2025/11/15 20:27:04.204766 [INF] JetStream cluster new stream leader for 'jepsen > jepsen-stream'
[1017273] 2025/11/15 20:27:04.914839 [INF] Catchup for stream 'jepsen > jepsen-stream' complete
```

Here's n4:

```
[1016805] 2025/11/15 20:27:02.947289 [INF] JetStream cluster new metadata leader: n3/jepsen-cluster
[1016805] 2025/11/15 20:27:04.201304 [WRN] JetStream cluster stream 'jepsen > jepsen-stream' has NO quorum, stalled
```
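The per-node excerpts above are easier to follow when interleaved by timestamp. The following is a minimal sketch of how that could be done, assuming server log lines always follow the `[pid] date time [LEVEL] message` shape seen in this report. The names `parse_line` and `merged_timeline` are hypothetical helpers, not part of NATS or Jepsen.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the excerpts in this report:
# [1042449] 2025/11/15 20:26:59.913104 [WRN] Catchup for stream ... stalled
LINE_RE = re.compile(
    r"^\[(?P<pid>\d+)\] "
    r"(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r"\[(?P<level>[A-Z]{3})\] (?P<msg>.*)$"
)

def parse_line(line):
    """Return (timestamp, pid, level, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m["ts"], "%Y/%m/%d %H:%M:%S.%f")
    return ts, int(m["pid"]), m["level"], m["msg"]

def merged_timeline(logs_by_node):
    """Merge {node: [log lines]} into one list of events sorted by timestamp."""
    events = []
    for node, lines in logs_by_node.items():
        for line in lines:
            parsed = parse_line(line)
            if parsed is not None:
                ts, _pid, level, msg = parsed
                events.append((ts, node, level, msg))
    return sorted(events)

# A few lines copied from the excerpts in this report.
logs = {
    "n1": ["[1042449] 2025/11/15 20:27:02.947427 [INF] JetStream cluster new metadata leader: n3/jepsen-cluster"],
    "n3": [
        "[1010859] 2025/11/15 20:27:02.947432 [INF] Self is new JetStream cluster metadata leader",
        "[1010859] 2025/11/15 20:27:14.996174 [WRN] Detected orphaned stream 'jepsen > jepsen-stream', will cleanup",
    ],
    "n4": ["[1016805] 2025/11/15 20:27:04.201304 [WRN] JetStream cluster stream 'jepsen > jepsen-stream' has NO quorum, stalled"],
}
timeline = merged_timeline(logs)
for ts, node, level, msg in timeline:
    print(ts.time(), node, level, msg)
```

Read this way, the excerpts show the stream reporting no quorum on `n4` about ten seconds before `n3` logs the orphaned-stream cleanup.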

For the final read phase, all attempts to subscribe to the Jepsen stream returned:

```
2025-11-15 14:27:28,171{GMT}    INFO    [jepsen worker 12] jepsen.nats.queue: Can't subscribe, [SUB-90007] No matching streams for subject.
```

<img width="900" height="400" alt="Image" src="https://github.com/user-attachments/assets/641646ab-5229-4fad-89de-27a686cd2fa1" />

### Expected behavior

File corruption on a minority of nodes should not cause the cluster to lose quorum entirely.
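The arithmetic behind this expectation can be sketched as follows, assuming the usual Raft majority rule (a group of N replicas needs floor(N/2) + 1 healthy members). The function names are illustrative only and do not refer to any NATS API.

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority of a group with `replicas` members."""
    return replicas // 2 + 1

def has_quorum(replicas: int, healthy: int) -> bool:
    """True if `healthy` members are enough to form a majority."""
    return healthy >= quorum_size(replicas)

replicas = 5   # replication factor used in this report
corrupted = 2  # n3 and n5 had snapshot files corrupted
healthy = replicas - corrupted

print(quorum_size(replicas), has_quorum(replicas, healthy))  # prints "3 True"
```

With three uncorrupted replicas out of five, a majority should still be available, which is why the permanent loss of quorum is surprising.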

### Server and client version

This is with NATS 2.12.1 and jnats 2.24.0.

### Host environment

These tests are running in LXC containers, against NATS version 2.12.1.

### Steps to reproduce

With the NATS Jepsen test suite, at version 5daa78754b8a75d972d0d1c951a36d1558527cdc, try:

```
lein run test-all --nemesis bitflip-file-chunks,kill --time-limit 600 --leave-db-running --version 2.12.1 --sync-interval always --rate 10000 --no-lazyfs --final-time-limit 300 --corrupt-file-target snap
```
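For readers unfamiliar with this class of fault, here is a minimal illustration of a single-bit error in a file. It is not the implementation of the `bitflip-file-chunks` nemesis, whose internals are not described in this report. `flip_random_bit` is a hypothetical helper and should only be pointed at scratch files.

```python
import os
import random

def flip_random_bit(path, rng=random):
    """Flip one randomly chosen bit in the file at `path`.

    Returns (byte_offset, bit_index) so the change can be logged or undone.
    """
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError("cannot flip a bit in an empty file")
    offset = rng.randrange(size)  # which byte to damage
    bit = rng.randrange(8)        # which bit within that byte
    with open(path, "r+b") as f:
        f.seek(offset)
        byte = f.read(1)[0]
        f.seek(offset)
        f.write(bytes([byte ^ (1 << bit)]))  # XOR toggles exactly one bit
    return offset, bit
```

Calling the function a second time with the same offset and bit would restore the original byte, since XOR is its own inverse.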
