prometheus.remote_write WAL truncation permanently disabled by single-byte checkpoint corruption

### Component(s)

prometheus.remote_write

### What's wrong?

A one-byte non-zero write into the page-padded region of any checkpoint segment file causes every subsequent truncate cycle to fail. WAL segments accumulate forever; the data path keeps working, so the failure is invisible in the standard remote_write counters.

I found this edge case while trying to reproduce reported Alloy failures involving this WAL, which were suspected to be caused by filesystem corruption.

Note: the only way to spot this issue seems to be in the logs; I didn't find any metric that exposes the condition.
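Until a metric exists, one way to catch the condition is to count the warning in the log stream. This is a minimal sketch, assuming the logs arrive on stdin in the logfmt shape shown under Logs; `count_truncate_failures` is a hypothetical helper name, and the match string is copied from the log lines in this issue.

```sh
# Count "could not truncate WAL" warnings in a log stream read from stdin.
# Hypothetical helper; the message text is taken from the logs in this issue.
count_truncate_failures() {
  # grep -c prints the number of matching lines. It exits 1 when there are
  # zero matches, so mask the status to keep this usable under `set -e`.
  grep -cF 'msg="could not truncate WAL"' || true
}
```

For example, `docker compose logs alloy | count_truncate_failures`, where the service name `alloy` is an assumption about the compose file. A count that keeps rising by one per `truncate_frequency` interval would match the behaviour described here.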

### Steps to reproduce

I've attached a test environment that uses a stub service to receive Prometheus Remote Write data, for ease of reproduction. I also reproduced this separately while sending to Grafana Cloud, to verify the results in real Prometheus storage. Note that the `wal` block uses a short `truncate_frequency` and short keepalive times, again for ease of reproduction.

The `repro-truncation-disabled.sh` script runs the reproduction. The attached tarball has a docker-compose file and the Alloy configuration.

[reproduction.tar.gz](https://github.com/user-attachments/files/27262113/reproduction.tar.gz)

```sh
mkdir -p ./data
docker compose up -d                           # start Alloy + stub
sleep 60                                       # warmup: at least one rotation
./repro-truncation-disabled.sh                 # write a non-zero byte into the page-padded region of a checkpoint segment
```

### Finding: A single-byte corruption inside a checkpoint file permanently disables WAL truncation

**Repro:** `./repro-truncation-disabled.sh`

**One-line repro:** with at least one `checkpoint.NNN/` directory present, write any non-zero byte into the page-padded region (offset ≥ 32768) of a checkpoint segment file.

```sh
printf '\xff' | dd of=./data/.../wal/checkpoint.00000005/00000000 bs=1 count=1 seek=32768 conv=notrunc
```
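To confirm the write landed before waiting for the next truncation cycle, the byte can be read back. A minimal sketch, using only `dd`, `od`, and `tr`; the helper names `write_byte_ff` and `read_byte_hex` are hypothetical, and it is safest to try them on a scratch file first.

```sh
# Overwrite one byte at offset $2 in file $1 with 0xff, without truncating.
# Same dd invocation as above, parameterised. '\377' is the octal spelling of
# 0xff, which every POSIX printf accepts.
write_byte_ff() {
  printf '\377' | dd of="$1" bs=1 count=1 seek="$2" conv=notrunc 2>/dev/null
}

# Print the byte at offset $2 in file $1 as two hex digits, e.g. "00" or "ff".
read_byte_hex() {
  dd if="$1" bs=1 count=1 skip="$2" 2>/dev/null | od -An -tx1 | tr -d ' \t\n'
}
```

Reading offset 32768 before and after the write should show the value change to `ff` while the file size stays the same, since `conv=notrunc` leaves the rest of the file untouched.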

**What happens**

Each truncation cycle (every `truncate_frequency`) tries to create a new checkpoint by reading the existing one and merging in newer segments. The read fails immediately on the corrupt page, so the new checkpoint is never written and no segments are reaped. With the configuration below, the same error fires every 30 s for the lifetime of the process. Throughput on the data path is unaffected, because the watcher does not need to re-read this checkpoint to ship samples, but on-disk segments accumulate forever.
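Because the data path stays healthy, the most direct symptom outside the logs is the number of segment files in the WAL directory growing without bound. This is a rough sketch for watching it, assuming segments are regular files with all-digit names directly under the `wal` directory (as the `checkpoint.00000005/00000000` path above suggests). `count_wal_segments` is a hypothetical helper, and the directory argument is whatever `./data/.../wal` resolves to in your setup.

```sh
# Count WAL segment files directly under the directory given as $1.
# Hypothetical helper. Assumes segments are regular files with all-digit
# names; checkpoint.NNN directories and their contents are not counted.
count_wal_segments() {
  n=0
  for f in "$1"/*; do
    name=${f##*/}
    case $name in
      ''|*[!0-9]*) continue ;;   # skip names containing any non-digit
    esac
    [ -f "$f" ] && n=$((n + 1))
  done
  echo "$n"
}
```

Sampling this once per `truncate_frequency` interval should show the count levelling off on a healthy instance and only ever increasing on an affected one.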

### System information

macOS 26.4.1, Docker

### Software version

Alloy v1.16.0

### Configuration

```text
// Minimal Alloy config used by the WAL-corruption repros.
// We only need:
//   - A scrape source that produces enough series to grow the WAL and trigger
//     checkpoints on the schedule below.
//   - A remote_write to the local stub.
// The truncation knobs are tuned aggressively so a checkpoint exists within
// ~30 seconds of startup (default truncate_frequency is 2h).

prometheus.scrape "self" {
  targets         = [{"__address__" = "127.0.0.1:12345", "job" = "alloy-self"}]
  forward_to      = [prometheus.remote_write.stub.receiver]
  scrape_interval = "1s"
  scrape_timeout  = "500ms"
}

prometheus.remote_write "stub" {
  endpoint {
    url = "http://stub:8080/api/v1/push"
  }

  wal {
    truncate_frequency = "30s"
    min_keepalive_time = "10s"
    max_keepalive_time = "1m"
  }
}
```

### Logs

```text
ts=...:46:45 level=info msg="Creating checkpoint" from_segment=6 to_segment=8 ...
ts=...:46:45 level=warn msg="could not truncate WAL"
             err="create checkpoint: read segments: corruption in segment .../checkpoint.00000005/00000000
                  at 32768: unexpected non-zero byte in padded page"
ts=...:47:15 level=info msg="Creating checkpoint" from_segment=6 to_segment=8 ...
ts=...:47:15 level=warn msg="could not truncate WAL"
             err="... unexpected non-zero byte in padded page"

Sample throughput (data path unaffected):
  samples_out=113239  watcher_cur_seg=12
```

### Tip

<sub>React with 👍 if this issue is important to you.</sub>
