Component(s)
prometheus.remote_write
What's wrong?
A single non-zero byte written into the page-padded region of any checkpoint segment file causes every subsequent truncation cycle to fail. WAL segments then accumulate forever; the data path keeps working, so the failure is invisible in the standard remote_write counters.
This is an edge case discovered while trying to reproduce reported Alloy failures involving this WAL, where filesystem corruption was suspected to be a factor.
Note: the only way to spot this issue seems to be in the logs; I did not find any metric that exposes the condition.
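In the absence of a dedicated metric, two indirect signals seem workable: the recurring warning in the logs and unbounded growth of the WAL directory. A minimal sketch, assuming the compose service is named alloy and using the elided WAL path from the repro below:
docker compose logs alloy | grep -c 'could not truncate WAL'  # grows by one per failed cycle
du -sh ./data/.../wal                                         # grows without bound once truncation stops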
Steps to reproduce
I've attached a test environment that uses a stub service to receive Prometheus Remote Write data, for ease of reproduction. I also reproduced this separately while sending to Grafana Cloud, to verify the results against real Prometheus storage. Note that the wal settings use a short truncate_frequency and short keepalive times, again for ease of reproduction.
The repro-truncation-disabled.sh script runs the reproduction. The attached tarball contains the docker-compose file and the Alloy configuration.
reproduction.tar.gz
mkdir -p ./data
docker compose up -d # start Alloy + stub
sleep 60 # warmup: at least one rotation
./repro-truncation-disabled.sh # write a non zero byte into the page-padded region of a checkpoint segment
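To confirm the corrupting write took effect, you can block until the next truncation attempt fails (again assuming the compose service is named alloy):
docker compose logs -f alloy | grep -m1 'could not truncate WAL'  # returns on the first failed truncation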
Finding: A single-byte corruption inside a checkpoint file permanently disables WAL truncation
Repro: ./repro-truncation-disabled.sh
One-line repro: with at least one checkpoint.NNN/ directory present, write any non-zero byte into the page-padded region (offset ≥ 32768) of a checkpoint segment file.
printf '\xff' | dd of=./data/.../wal/checkpoint.00000005/00000000 bs=1 count=1 seek=32768 conv=notrunc
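For context on the offset: the WAL reader processes segments in 32 KiB pages and appears to treat any non-zero byte after the last record in a page as corruption (matching the error in the logs below), so a write at offset 32768 or beyond lands in padding for a checkpoint segment whose records fit in the first page. A quick way to inspect the region before corrupting it (prints zeros, or nothing if the file ends at the page boundary):
dd if=./data/.../wal/checkpoint.00000005/00000000 bs=1 count=16 skip=32768 2>/dev/null | hexdump -C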
What happens
Each truncation cycle (every truncate_frequency) tries to create a new checkpoint by reading the existing one and merging in newer segments. The read fails immediately on the corrupt page, so the new checkpoint is never written, and no segments are reaped. The same error fires every 30 s for the lifetime of the process. Throughput on the data path is unaffected — the watcher does not need to re-read this checkpoint to ship samples — but on-disk segments accumulate forever.
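The state is also visible directly in the WAL directory: the checkpoint number stays frozen while new segments pile up next to it (illustrative listing; the segment numbers match the logs below):
ls ./data/.../wal
# checkpoint.00000005/  00000006  00000007  00000008  ...  00000012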
System information
macOS 26.4.1, Docker
Software version
Alloy v1.16.0
Configuration
// Minimal Alloy config used by the WAL-corruption repros.
// We only need:
//   - A scrape source that produces enough series to grow the WAL and
//     trigger checkpoints on the schedule below.
//   - A remote_write to the local stub.
// The truncation knobs are tuned aggressively so a checkpoint exists within
// ~30 seconds of startup (default truncate_frequency is 2h).
prometheus.scrape "self" {
  targets         = [{"__address__" = "127.0.0.1:12345", "job" = "alloy-self"}]
  forward_to      = [prometheus.remote_write.stub.receiver]
  scrape_interval = "1s"
  scrape_timeout  = "500ms"
}

prometheus.remote_write "stub" {
  endpoint {
    url = "http://stub:8080/api/v1/push"
  }

  wal {
    truncate_frequency = "30s"
    min_keepalive_time = "10s"
    max_keepalive_time = "1m"
  }
}
Logs
ts=...:46:45 level=info msg="Creating checkpoint" from_segment=6 to_segment=8 ...
ts=...:46:45 level=warn msg="could not truncate WAL" err="create checkpoint: read segments: corruption in segment .../checkpoint.00000005/00000000 at 32768: unexpected non-zero byte in padded page"
ts=...:47:15 level=info msg="Creating checkpoint" from_segment=6 to_segment=8 ...
ts=...:47:15 level=warn msg="could not truncate WAL" err="... unexpected non-zero byte in padded page"
Sample throughput (data path unaffected):
samples_out=113239 watcher_cur_seg=12
Tip
React with 👍 if this issue is important to you.