Single-bit errors in .blk files trigger partial loss of acknowledged writes #7549

@aphyr

Observed behavior

In version 2.12.1, with a five-node cluster and replication factor 5, a JetStream stream using sync-interval always can lose some or all acknowledged records when the stream's .blk files experience single-bit errors, even if those errors are restricted to a minority of nodes. For example, this test lost 20,350 acknowledged records from the middle of the test, but retained 299,525 records before and after.

In another run, 159,427 acknowledged records were lost from the middle. By this I mean that each of those records was acknowledged, never read, and some process observed a record written later by the same writer--this violates a claimed NATS ordering guarantee. 287,249 acknowledged records were lost in total; about half were in a lost postfix of the log, rather than in the middle.
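To make the "middle" versus "postfix" distinction concrete, here is a minimal sketch of how lost records could be classified. It is not the Jepsen checker: the function name `classify_lost` and the input shapes (a per-writer list of acknowledged records in acknowledgement order, plus a set of every record any reader observed) are assumptions made for illustration.

```python
def classify_lost(acked, observed):
    """Split acknowledged-but-unread records into (lost_middle, lost_postfix).

    acked:    dict of writer id -> list of records, in the order they were acknowledged
              (hypothetical input shape, not the Jepsen history format)
    observed: set of records seen by any reader
    """
    lost_middle, lost_postfix = [], []
    for writer, records in acked.items():
        # Position of the last record from this writer that any reader saw.
        last_seen = max(
            (i for i, r in enumerate(records) if r in observed),
            default=-1,
        )
        for i, r in enumerate(records):
            if r in observed:
                continue
            if i < last_seen:
                # A later write by the same writer was observed: lost from the middle.
                lost_middle.append(r)
            else:
                # Nothing later from this writer was observed: lost postfix.
                lost_postfix.append(r)
    return lost_middle, lost_postfix


acked = {"p0": [1, 2, 3, 4], "p1": [10, 11]}
observed = {1, 4, 10}
middle, postfix = classify_lost(acked, observed)
```

In this toy history, records 2 and 3 count as lost from the middle because record 4 from the same writer was observed, while record 11 falls in the lost postfix.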

A few log entries might be of interest:

[630995] 2025/11/13 22:19:41.162552 [WRN] Filestore [S-R5F-eI3z0amB] Stream state outdated, last block has additional entries, will rebuild
[630995] 2025/11/13 22:19:41.162561 [WRN] Filestore [S-R5F-eI3z0amB] Recovering stream state from index errored: prior state file
[641421] 2025/11/13 22:19:40.939994 [INF]   Starting restore for stream 'jepsen > jepsen-stream'
[641421] 2025/11/13 22:19:41.057012 [INF]   Restored 529,384 messages for stream 'jepsen > jepsen-stream' in 117ms

Expected behavior

NATS has a checksum mechanism which sometimes--but not always!--detects file corruption and refuses to start the node. It should probably prevent silent data loss here too. It's also somewhat worrying that this occurs even though a majority of nodes have totally intact data files!
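As an illustration of why detection seems feasible here, the sketch below frames each record with a CRC-32 and shows that any single-bit flip in the framed bytes fails verification. This is not NATS's on-disk format or its actual checksum algorithm; the helpers `frame`, `verify`, `flip_bit`, and `has_intact_majority` are hypothetical names used only for this example.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 of the payload (illustrative framing only)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify(framed: bytes) -> bool:
    """Recompute the CRC-32 over the payload and compare it with the stored value."""
    payload, stored = framed[:-4], framed[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


def has_intact_majority(replicas) -> bool:
    """True if more than half of the replica copies pass verification."""
    intact = sum(1 for r in replicas if verify(r))
    return intact > len(replicas) // 2


record = frame(b"acknowledged write")
# Five replicas, two of which suffer a single-bit error (a minority).
replicas = [record, record, record, flip_bit(record, 3), flip_bit(record, 100)]
corrupt = [i for i, r in enumerate(replicas) if not verify(r)]
```

Under these assumptions the two damaged copies are identified and three intact copies remain, which is the situation in which one would hope the cluster could recover from the majority rather than lose data.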

Server and client version

This is with NATS 2.12.1 and jnats 2.24.0.

Host environment

No response

Steps to reproduce

You can reproduce this with commit 3f7d648 of the Jepsen NATS test by running:

lein run test-all --nemesis kill,bitflip-file-chunks --time-limit 120 --leave-db-running --version 2.12.1 --sync-interval always --rate 10000 --no-lazyfs --test-count 100
