prometheus.remote_write WAL truncation permanently disabled by single-byte checkpoint corruption #6166

@maxlemieux

Component(s)

prometheus.remote_write

What's wrong?

A one-byte non-zero write into the page-padded region of any checkpoint segment file causes every subsequent truncation cycle to fail. WAL segments accumulate forever; the data path keeps working, so the failure is invisible in the standard remote_write counters.

I found this edge case while trying to reproduce reported Alloy failures involving this WAL, which were suspected to stem from filesystem corruption.

Note: the only way to spot this issue seems to be in the logs; I didn't find any metrics that expose the condition.
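Since logs appear to be the only signal, a small check like the one below could be run periodically. This is a minimal sketch: the function name count_truncate_failures is hypothetical, the compose service name alloy is an assumption about the attached docker-compose file, and the matched string is the warning quoted in the Logs section.

```shell
#!/bin/sh
# count_truncate_failures
# Reads log text on stdin and prints how many lines contain the
# "could not truncate WAL" warning. Prints 0 when there are none.
# Hypothetical helper for illustration; not part of the attached tarball.
count_truncate_failures() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps the
  # function usable under "set -e" while still printing the count.
  grep -c 'could not truncate WAL' || true
}

# Example usage (service name "alloy" is an assumption):
#   docker compose logs alloy 2>&1 | count_truncate_failures
```

A count that keeps growing by one per truncate_frequency interval would match the behaviour described in this issue.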

Steps to reproduce

I've attached a test environment that uses a stub service to receive Prometheus Remote Write data, for ease of reproduction. I also reproduced this separately while sending to Grafana Cloud, to verify the results in real Prometheus storage. Note that the wal block uses short truncate_frequency and keepalive values, again for ease of reproduction.

The repro-truncation-disabled.sh script runs the reproduction. The attached tarball has a docker-compose file and Alloy configuration.

reproduction.tar.gz

mkdir -p ./data
docker compose up -d                           # start Alloy + stub
sleep 60                                       # warmup: at least one rotation
./repro-truncation-disabled.sh                 # write a non-zero byte into the page-padded region of a checkpoint segment

Finding: A single-byte corruption inside a checkpoint file permanently disables WAL truncation

Repro: ./repro-truncation-disabled.sh

One-line repro: with at least one checkpoint.NNN/ directory present, write any non-zero byte into the page-padded region (offset ≥ 32768) of a checkpoint segment file.

printf '\xff' | dd of=./data/.../wal/checkpoint.00000005/00000000 bs=1 count=1 seek=32768 conv=notrunc
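To confirm the write landed at the intended offset, the byte can be read back. The following is a minimal sketch: the helper name byte_at is hypothetical, and the demo in the comments works on a scratch file rather than a real WAL. It uses the octal escape \377 because POSIX printf does not guarantee \x escapes.

```shell
#!/bin/sh
# byte_at FILE OFFSET
# Print the single byte at OFFSET (0-based) in FILE as two hex digits.
# Hypothetical helper for illustration; not part of the attached tarball.
byte_at() {
  dd if="$1" bs=1 skip="$2" count=1 2>/dev/null | od -An -tx1 | tr -d ' \n'
}

# Scratch-file demo that does not touch a real WAL:
#   dd if=/dev/zero of=/tmp/scratch-segment bs=4096 count=9
#   printf '\377' | dd of=/tmp/scratch-segment bs=1 count=1 seek=32768 conv=notrunc
#   byte_at /tmp/scratch-segment 32768
```

Running byte_at against the checkpoint segment before and after the dd command shows whether the padded region changed from zero to non-zero.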

What happens

Each truncation cycle (every truncate_frequency) tries to create a new checkpoint by reading the existing one and merging in newer segments. The read fails immediately on the corrupt page, so the new checkpoint is never written and no segments are reaped. With the configuration below, the same error fires every 30 s for the lifetime of the process. Throughput on the data path is unaffected, because the watcher does not need to re-read this checkpoint to ship samples, but on-disk segments accumulate forever.
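Because nothing is reaped, the number of segment files in the WAL directory should grow without bound. A minimal sketch for watching that from the host follows. The function name count_wal_segments is hypothetical, and it assumes segments are the 8-digit files directly under the wal directory (matching the 00000000 naming seen in the checkpoint path above), with checkpoint.NNN/ directories excluded.

```shell
#!/bin/sh
# count_wal_segments WAL_DIR
# Print the number of top-level segment files (8-digit names) in WAL_DIR.
# Files inside checkpoint.NNN/ subdirectories are not counted.
# Hypothetical helper for illustration; not part of the attached tarball.
count_wal_segments() {
  find "$1" -maxdepth 1 -type f \
    -name '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]' | wc -l | tr -d ' '
}

# Example usage (substitute the real wal directory under ./data):
#   while sleep 30; do count_wal_segments "$WAL_DIR"; done
```

On a healthy instance the count should drop after a truncation cycle; with the corrupted checkpoint it would be expected only to rise.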

System information

macOS 26.4.1, Docker

Software version

Alloy v1.16.0

Configuration

// Minimal Alloy config used by the WAL-corruption repros.
// We only need:
//   - A scrape source that produces enough series to grow the WAL and trigger
//     checkpoints on the schedule below.
//   - A remote_write to the local stub.
// The truncation knobs are tuned aggressively so a checkpoint exists within
// ~30 seconds of startup (default truncate_frequency is 2h).

prometheus.scrape "self" {
  targets         = [{"__address__" = "127.0.0.1:12345", "job" = "alloy-self"}]
  forward_to      = [prometheus.remote_write.stub.receiver]
  scrape_interval = "1s"
  scrape_timeout  = "500ms"
}

prometheus.remote_write "stub" {
  endpoint {
    url = "http://stub:8080/api/v1/push"
  }

  wal {
    truncate_frequency = "30s"
    min_keepalive_time = "10s"
    max_keepalive_time = "1m"
  }
}

Logs

ts=...:46:45 level=info msg="Creating checkpoint" from_segment=6 to_segment=8 ...
ts=...:46:45 level=warn msg="could not truncate WAL"
             err="create checkpoint: read segments: corruption in segment .../checkpoint.00000005/00000000
                  at 32768: unexpected non-zero byte in padded page"
ts=...:47:15 level=info msg="Creating checkpoint" from_segment=6 to_segment=8 ...
ts=...:47:15 level=warn msg="could not truncate WAL"
             err="... unexpected non-zero byte in padded page"

Sample throughput (data path unaffected):
  samples_out=113239  watcher_cur_seg=12
