Silent watcher stall + crash-loop on missing WAL segment in prometheus.remote_write #6165

@maxlemieux

Description

Component(s)

prometheus.remote_write

What's wrong?

Deleting the current WAL segment causes a silent watcher stall. A restart cannot recover; the component crash-loops.

This is an edge case discovered while trying to reproduce reported Alloy failures involving this WAL, which were suspected to stem from some form of filesystem corruption.

Steps to reproduce

I've attached a test environment that uses a stub service to receive Prometheus Remote Write data, for ease of reproduction. I also reproduced this separately while sending to Grafana Cloud, to verify the results against real Prometheus storage. Note that the wal settings use a short truncate_frequency and short keepalive times, again for ease of reproduction.

The repro-silent-stall.sh script runs both reproductions and pauses between phase 1 and phase 2 for verification. The attached tarball contains a docker-compose file and the Alloy configuration.

reproduction.tgz

mkdir -p ./data
docker compose up -d                           # start Alloy + stub
sleep 60                                       # warmup: at least one rotation
./repro-silent-stall.sh                        # phase 1: deleting WAL segment
                                               # phase 2: restart to try to fix it

Finding 1: Deleting the current WAL segment causes a silent watcher stall

Repro: ./repro-silent-stall.sh (phase 1)

One-line repro: with Alloy running, rm the highest-numbered segment file from the WAL directory.

rm ./data/prometheus.remote_write.stub/wal/$(ls data/prometheus.remote_write.stub/wal/ | grep '^[0-9]' | tail -1)
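
Once the writer's next rotation creates segment 8, the directory listing has a numeric gap. As an illustration of the invariant being violated (my own sketch, not Alloy's actual code; check_sequential is a hypothetical helper), a shell check over a segment directory might look like:

```shell
# Hypothetical check: do the numeric segment names in a WAL directory
# increase by exactly 1? Mirrors the effect of the "segments are not
# sequential" rejection, not the real implementation.
check_sequential() {
  dir=$1
  prev=""
  for seg in $(ls "$dir" | grep '^[0-9][0-9]*$' | sort -n); do
    n=$(echo "$seg" | sed 's/^0*//')   # strip zero-padding
    n=${n:-0}
    if [ -n "$prev" ] && [ "$n" -ne $((prev + 1)) ]; then
      echo "gap: $prev -> $n"
      return 1
    fi
    prev=$n
  done
  echo "sequential"
}

# Recreate the post-rm, post-rotation state: 6 and 8 exist, 7 is missing.
tmp=$(mktemp -d)
touch "$tmp/00000006" "$tmp/00000008"
check_sequential "$tmp"   # prints "gap: 6 -> 8"
```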

What happens

  1. The writer holds segment 7 open. rm unlinks the directory entry but the open fd keeps the inode alive.
  2. On the next rotation tick, the writer creates segment 8 — but segment 7 is no longer in the directory listing. The on-disk segment list is now [..., 6, 8] with a gap.
  3. The WAL watcher's segment-list discovery rejects the directory with segments are not sequential and refuses to advance past 7.
  4. The truncator hits the same check and refuses to checkpoint.
  5. The writer keeps appending samples to the (now orphaned) segment 8 inode — wal_samples_appended_total grows.
  6. samples_total flatlines at the value reached just after the rm (T+30 s onward in the trace in the Logs section below): the in-memory queue drains, then nothing more is ever sent to remote_write.
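
The unlink behavior in step 1 is plain POSIX filesystem semantics and can be demonstrated without Alloy:

```shell
# Removing a file's directory entry does not invalidate an fd already open
# on it: the inode stays alive until the last open fd is closed.
tmp=$(mktemp -d)
exec 3>"$tmp/segment"        # "writer" holds the segment open on fd 3
rm "$tmp/segment"            # unlink the directory entry
echo "still writable" >&3    # the append still succeeds via the open fd
ls "$tmp"                    # prints nothing: the listing is already empty
exec 3>&-                    # closing the fd finally frees the inode
```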

Finding 2: A restart cannot recover; component crash-loops

Repro: ./repro-silent-stall.sh (phase 2; the script pauses after phase 1 so the stalled state can be verified before proceeding)

One-line repro: with the WAL in the post-Finding-1 state (numeric gap on disk), docker compose restart alloy.

What happens

The prometheus.remote_write constructor runs a "get segment range" scan of the WAL directory before the component can start. The same "segments are not sequential" check fires and the build fails. Alloy exits 1, Docker restarts it, the next attempt hits the identical state on disk, and the loop is permanent.
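
Because the rejecting state lives entirely on disk, restarts are deterministic: every attempt scans the same directory and fails the same check. For illustration only (not a verified recovery procedure), the kind of manual intervention that would restore sequential numbering is renaming the stranded segment down into the gap; whether a WAL "repaired" this way replays correctly is untested:

```shell
# Illustration only: recreate the broken on-disk shape, then close the gap.
# The empty files stand in for real segment data; this does NOT show that
# Alloy can safely replay a WAL repaired by renaming.
wal=$(mktemp -d)
touch "$wal/00000006" "$wal/00000008"   # post-Finding-1 state: 7 is missing
mv "$wal/00000008" "$wal/00000007"      # rename to restore sequential numbering
ls "$wal"                               # now lists 00000006 and 00000007
```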

System information

macOS 26.4.1 + Docker

Software version

Grafana Alloy v1.16.0

Configuration

// Minimal Alloy config used by the WAL-corruption repros.
// We only need:
//   - A scrape source that produces enough series to grow the WAL and trigger
//     checkpoints on the schedule below.
//   - A remote_write to the local stub.
// The truncation knobs are tuned aggressively so a checkpoint exists within
// ~30 seconds of startup (default truncate_frequency is 2h).

prometheus.scrape "self" {
  targets         = [{"__address__" = "127.0.0.1:12345", "job" = "alloy-self"}]
  forward_to      = [prometheus.remote_write.stub.receiver]
  scrape_interval = "1s"
  scrape_timeout  = "500ms"
}

prometheus.remote_write "stub" {
  endpoint {
    url = "http://stub:8080/api/v1/push"
  }

  wal {
    truncate_frequency = "30s"
    min_keepalive_time = "10s"
    max_keepalive_time = "1m"
  }
}

Logs

T+10 s  samples_out=66429   wal_appended=67637   watcher_cur_seg=7
T+20 s  samples_out=69751   wal_appended=70657   watcher_cur_seg=7
T+30 s  samples_out=70959   wal_appended=73677   watcher_cur_seg=7
T+45 s  samples_out=70959   wal_appended=78207   watcher_cur_seg=7
T+60 s  samples_out=70959   wal_appended=82737   watcher_cur_seg=7

# trying to restart
Container state: restarting restarts=6

Startup error from alloy logs:
  Error: /etc/alloy/config.alloy:16:1: Failed to build component:
         building component: get segment range: segments are not sequential
  Error: could not perform the initial load successfully

Tip

React with 👍 if this issue is important to you.
