Component(s)
prometheus.remote_write
What's wrong?
Deleting the current WAL segment causes a silent watcher stall. A restart cannot recover; the component crash-loops.
This is an edge case discovered while trying to reproduce reported Alloy failures related to this WAL; those failures were suspected to involve corrupted filesystems in some way.
Steps to reproduce
I've attached a test environment that uses a stub service to receive Prometheus Remote Write data, for ease of reproduction. I also reproduced this separately while sending to Grafana Cloud, to verify the results in real Prometheus storage. Note that the wal settings have short truncate_frequency and keepalive times, again for ease of reproduction.
The repro-silent-stall.sh script runs both reproductions, and pauses in between phase 1 and phase 2 for verification. The attached tarball has a docker-compose file and Alloy configuration.
reproduction.tgz
mkdir -p ./data
docker compose up -d # start Alloy + stub
sleep 60 # warmup: at least one rotation
./repro-silent-stall.sh # phase 1: deleting WAL segment
# phase 2: restart to try to fix it
Finding 1: Deleting the current WAL segment causes a silent watcher stall
Repro: ./repro-silent-stall.sh (phase 1)
One-line repro: with Alloy running, rm the highest-numbered segment file from the WAL directory.
rm ./data/prometheus.remote_write.stub/wal/$(ls data/prometheus.remote_write.stub/wal/ | grep '^[0-9]' | tail -1)
What happens
- The writer holds segment 7 open. rm unlinks the directory entry, but the open fd keeps the inode alive.
- On the next rotation tick, the writer creates segment 8, but segment 7 is no longer in the directory listing. The on-disk segment list is now [..., 6, 8] with a gap.
- The WAL watcher's segment-list discovery rejects the directory with segments are not sequential and refuses to advance past 7 (see the sketch after this list).
- The truncator hits the same check and refuses to checkpoint.
- The writer keeps appending samples to the (now orphaned) segment 8 inode, so wal_samples_appended_total keeps growing.
- samples_total flatlines at the value reached just after the rm (T+30 s onward in the trace under Logs below): the in-memory queue drains, then nothing more is ever sent to remote_write.
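To make the failure mode concrete, here is a minimal, self-contained Go sketch (not Alloy or Prometheus source) that imitates the sequence: the current segment is unlinked while a writer still holds it open, a newer segment appears on rotation, and a listing-based sequential check then rejects the directory. The file names, the listSegments helper, and the check are illustrative stand-ins for the real WAL code, and the demo assumes POSIX unlink semantics (writes to an unlinked file still succeed).

// wal_gap_demo.go: illustrative stand-in, not Alloy/Prometheus source.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strconv"
)

// listSegments mimics a listing-based discovery step: collect numeric file
// names in dir and fail if they are not consecutive.
func listSegments(dir string) ([]int, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var segs []int
	for _, e := range entries {
		if n, err := strconv.Atoi(e.Name()); err == nil {
			segs = append(segs, n)
		}
	}
	sort.Ints(segs)
	for i := 1; i < len(segs); i++ {
		if segs[i] != segs[i-1]+1 {
			return nil, fmt.Errorf("segments are not sequential: %v", segs)
		}
	}
	return segs, nil
}

func main() {
	dir, err := os.MkdirTemp("", "wal-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Segments 6 and 7 exist; 7 is the "current" segment and stays open.
	if err := os.WriteFile(filepath.Join(dir, "00000006"), nil, 0o644); err != nil {
		panic(err)
	}
	cur, err := os.Create(filepath.Join(dir, "00000007"))
	if err != nil {
		panic(err)
	}
	defer cur.Close()

	// Simulate `rm` of the current segment: the directory entry disappears,
	// but the open descriptor keeps the inode alive and writes still succeed.
	if err := os.Remove(cur.Name()); err != nil {
		panic(err)
	}
	if _, err := cur.Write([]byte("still writable after unlink\n")); err != nil {
		panic(err)
	}

	// Simulate the next rotation: segment 8 is created on disk.
	if err := os.WriteFile(filepath.Join(dir, "00000008"), nil, 0o644); err != nil {
		panic(err)
	}

	// Discovery now sees [6 8] and rejects the directory.
	if _, err := listSegments(dir); err != nil {
		fmt.Println("listing rejected:", err)
	}
}

The final line of output shows the same kind of non-sequential rejection that the watcher, the truncator, and (in Finding 2) the constructor all run into.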
Finding 2: A restart cannot recover; component crash-loops
Repro: ./repro-silent-stall.sh (phase 2; the script pauses after phase 1 so the stalled state can be verified before proceeding)
One-line repro: with the WAL in the post-Finding-1 state (numeric gap on disk), docker compose restart alloy.
What happens
The prometheus.remote_write constructor calls get segment range against the WAL dir before it can start. The same segments are not sequential check fires and the build fails. Alloy exits 1, Docker restarts it, the next attempt hits the identical state on disk, and the loop is permanent.
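Because the rejection is a pure function of the directory listing, a restart replays it identically; nothing on disk changes between attempts. For the verification pause between phases, a small illustrative checker like the one below (a stand-in for the real get segment range call, not Alloy source; the WAL path is taken from the repro layout) reports whether a directory is in the state the constructor will reject.

// check_wal_gap.go: illustrative checker, not Alloy source. It only inspects
// numeric file names, so it approximates the constructor-time segment check.
package main

import (
	"fmt"
	"os"
	"sort"
	"strconv"
)

func main() {
	// WAL directory from the repro layout; adjust if your data dir differs.
	dir := "./data/prometheus.remote_write.stub/wal"

	entries, err := os.ReadDir(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Only numeric names are segments; checkpoint.N directories are skipped.
	var segs []int
	for _, e := range entries {
		if n, err := strconv.Atoi(e.Name()); err == nil {
			segs = append(segs, n)
		}
	}
	sort.Ints(segs)

	for i := 1; i < len(segs); i++ {
		if segs[i] != segs[i-1]+1 {
			fmt.Printf("gap between segments %d and %d: a sequential-segments check would reject this directory\n",
				segs[i-1], segs[i])
			os.Exit(1)
		}
	}
	fmt.Printf("segments %v are sequential\n", segs)
}

Run with go run check_wal_gap.go; a non-zero exit means the directory is in the post-Finding-1 state, which is exactly the condition that keeps the restart loop permanent.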
System information
macOS 26.4.1 + Docker
Software version
Grafana Alloy v1.16.0
Configuration
// Minimal Alloy config used by the WAL-corruption repros.
// We only need:
// - A scrape source that produces enough series to grow the WAL and trigger
// checkpoints on the schedule below.
// - A remote_write to the local stub.
// The truncation knobs are tuned aggressively so a checkpoint exists within
// ~30 seconds of startup (default truncate_frequency is 2h).
prometheus.scrape "self" {
targets = [{"__address__" = "127.0.0.1:12345", "job" = "alloy-self"}]
forward_to = [prometheus.remote_write.stub.receiver]
scrape_interval = "1s"
scrape_timeout = "500ms"
}
prometheus.remote_write "stub" {
endpoint {
url = "http://stub:8080/api/v1/push"
}
wal {
truncate_frequency = "30s"
min_keepalive_time = "10s"
max_keepalive_time = "1m"
}
}
Logs
T+10 s samples_out=66429 wal_appended=67637 watcher_cur_seg=7
T+20 s samples_out=69751 wal_appended=70657 watcher_cur_seg=7
T+30 s samples_out=70959 wal_appended=73677 watcher_cur_seg=7
T+45 s samples_out=70959 wal_appended=78207 watcher_cur_seg=7
T+60 s samples_out=70959 wal_appended=82737 watcher_cur_seg=7
# trying to restart
Container state: restarting restarts=6
Startup error from alloy logs:
Error: /etc/alloy/config.alloy:16:1: Failed to build component:
building component: get segment range: segments are not sequential
Error: could not perform the initial load successfully
Tip
React with 👍 if this issue is important to you.