Single-bit errors in .blk files trigger partial loss of acknowledged writes #7549

### Observed behavior

In version 2.12.1, on a five-node cluster with replication factor 5, a JetStream stream using sync-interval `always` can lose some or all acknowledged records when the stream's `.blk` files suffer single-bit errors, even if those errors are restricted to a minority of nodes. For example, [this test](https://s3.amazonaws.com/jepsen.io/analyses/nats-2.12.1/0251113T155209.701-bitflip-loss.zip) lost 20,350 acknowledged records from the middle of the test, but retained 299,525 records before and after.

<img width="900" height="400" alt="Image" src="https://github.com/user-attachments/assets/1c3de62d-8fab-47b0-af9a-1a30ab1e801d" />

In [this run](https://s3.amazonaws.com/jepsen.io/analyses/nats-2.12.1/20251113T161729.579-bitflip-loss-2.zip), 159,427 acknowledged records were lost from the middle. By this I mean that each of those records was acknowledged, never read, *and* some process observed a later record written by the same process that wrote the lost one--this violates a claimed NATS ordering guarantee. In total, 287,249 acknowledged records were lost; about half were in a lost postfix of the log, rather than in the middle.

<img width="900" height="400" alt="Image" src="https://github.com/user-attachments/assets/14920f1e-c609-4419-b014-664c4b562857" />
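To make the "middle" versus "postfix" distinction concrete, here is a minimal Python sketch of that classification. It is not the Jepsen checker's actual code: the `Record` type, its `process` and `seq` fields, and the `classify_lost` function are hypothetical names, and it assumes each writer's records carry a per-process write order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    process: int  # hypothetical ID of the client process that wrote the record
    seq: int      # hypothetical position in that process's write order


def classify_lost(acked, read):
    """Split acknowledged-but-never-read records into (middle, tail).

    A lost record is "middle" if some later write by the same process
    was read; otherwise it belongs to the lost tail (postfix) of the log.
    """
    read = set(read)

    # Highest write position per process that any reader observed.
    last_seen = {}
    for r in read:
        last_seen[r.process] = max(last_seen.get(r.process, -1), r.seq)

    middle, tail = [], []
    for r in acked:
        if r in read:
            continue  # not lost
        if r.seq < last_seen.get(r.process, -1):
            middle.append(r)  # a later write by the same process survived
        else:
            tail.append(r)    # nothing later from this process was seen
    return middle, tail
```

Under these assumptions, a record in `middle` is the worrying kind: the stream kept later writes from the same process while dropping an earlier acknowledged one.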

A few log entries might be of interest:

```
[630995] 2025/11/13 22:19:41.162552 [WRN] Filestore [S-R5F-eI3z0amB] Stream state outdated, last block has additional entries, will rebuild
[630995] 2025/11/13 22:19:41.162561 [WRN] Filestore [S-R5F-eI3z0amB] Recovering stream state from index errored: prior state file
```

```
[641421] 2025/11/13 22:19:40.939994 [INF]   Starting restore for stream 'jepsen > jepsen-stream'
[641421] 2025/11/13 22:19:41.057012 [INF]   Restored 529,384 messages for stream 'jepsen > jepsen-stream' in 117ms
```

### Expected behavior

NATS has a checksum mechanism which sometimes--but not always!--detects file corruption and refuses to start the node. It should probably prevent silent data loss here too. It's also somewhat worrying that this occurs even though a majority of nodes have totally intact data files!
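As a rough illustration of why this seems avoidable, the Python sketch below shows a per-record checksum catching every single-bit flip, and a reader falling back to an intact replica. It does not reflect NATS's real on-disk format or hash function: the framing, the use of CRC32 from `zlib`, and the helper names (`frame`, `verify`, `flip_bit`, `first_intact`) are all assumptions made for the example.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC32 to the payload (illustrative format only)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify(framed: bytes) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    payload, stored = framed[:-4], framed[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of data with a single bit inverted."""
    out = bytearray(data)
    out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)


def first_intact(copies):
    """Return the payload of the first replica copy that verifies, else None."""
    for framed in copies:
        if verify(framed):
            return framed[:-4]
    return None
```

With five replicas and corruption on only a minority, `first_intact` would always find a copy that verifies, which is the behavior one would hope for from the cluster as a whole.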

### Server and client version

This is with NATS 2.12.1 and jnats 2.24.0.

### Host environment

_No response_

### Steps to reproduce

You can reproduce this with commit 3f7d648 of the [Jepsen NATS test](https://github.com/jepsen-io/nats) by running

```
lein run test-all --nemesis kill,bitflip-file-chunks --time-limit 120 --leave-db-running --version 2.12.1 --sync-interval always --rate 10000 --no-lazyfs --test-count 100
```
