Silent watcher stall + crash-loop on missing WAL segment in prometheus.remote_write #6165

@maxlemieux

Description

Component(s)

prometheus.remote_write

What's wrong?

Deleting the current WAL segment causes a silent watcher stall. A restart cannot recover; the component crash-loops.

This is an edge case discovered while trying to reproduce reported Alloy failures involving this WAL, which were suspected to stem from some form of filesystem corruption.

Steps to reproduce

I've attached a test environment that uses a stub service to receive Prometheus Remote Write data, for ease of reproduction. I also reproduced this separately while sending to Grafana Cloud, to verify the results against real Prometheus storage. Note that the wal settings use a short truncate_frequency and short keepalive times, again for ease of reproduction.

The repro-silent-stall.sh script runs both reproductions and pauses between phase 1 and phase 2 for verification. The attached tarball contains a docker-compose file and the Alloy configuration.

reproduction.tgz

mkdir -p ./data
docker compose up -d                           # start Alloy + stub
sleep 60                                       # warmup: at least one rotation
./repro-silent-stall.sh                        # phase 1: deleting WAL segment
                                               # phase 2: restart to try to fix it

Finding 1: Deleting the current WAL segment causes a silent watcher stall

Repro: ./repro-silent-stall.sh (phase 1)

One-line repro: with Alloy running, rm the highest-numbered segment file from the WAL directory.

rm ./data/prometheus.remote_write.stub/wal/$(ls data/prometheus.remote_write.stub/wal/ | grep '^[0-9]' | tail -1)
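
Once the writer's next rotation creates segment 8, the directory listing has a numeric gap. As an illustration of the invariant being violated (my own sketch, not Alloy's actual code; check_sequential is a hypothetical helper), a shell check over a segment directory might look like:

```shell
# Hypothetical check: do the numeric segment names in a WAL directory
# increase by exactly 1? Mirrors the effect of the "segments are not
# sequential" rejection, not the real implementation.
check_sequential() {
  dir=$1
  prev=""
  for seg in $(ls "$dir" | grep '^[0-9][0-9]*$' | sort -n); do
    n=$(echo "$seg" | sed 's/^0*//')   # strip zero-padding
    n=${n:-0}
    if [ -n "$prev" ] && [ "$n" -ne $((prev + 1)) ]; then
      echo "gap: $prev -> $n"
      return 1
    fi
    prev=$n
  done
  echo "sequential"
}

# Recreate the post-rm, post-rotation state: 6 and 8 exist, 7 is missing.
tmp=$(mktemp -d)
touch "$tmp/00000006" "$tmp/00000008"
check_sequential "$tmp"   # prints "gap: 6 -> 8"
```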

What happens

  1. The writer holds segment 7 open. rm unlinks the directory entry but the open fd keeps the inode alive.
  2. On the next rotation tick, the writer creates segment 8 — but segment 7 is no longer in the directory listing. The on-disk segment list is now [..., 6, 8] with a gap.
  3. The WAL watcher's segment-list discovery rejects the directory with segments are not sequential and refuses to advance past 7.
  4. The truncator hits the same check and refuses to checkpoint.
  5. The writer keeps appending samples to the (now orphaned) segment 8 inode — wal_samples_appended_total grows.
  6. samples_total flatlines at the value reached just after the rm (T+30 s onward in the trace in the Logs section below): the in-memory queue drains, then nothing more is ever sent to remote_write.
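
The unlink behavior in step 1 is plain POSIX filesystem semantics and can be demonstrated without Alloy:

```shell
# Removing a file's directory entry does not invalidate an fd already open
# on it: the inode stays alive until the last open fd is closed.
tmp=$(mktemp -d)
exec 3>"$tmp/segment"        # "writer" holds the segment open on fd 3
rm "$tmp/segment"            # unlink the directory entry
echo "still writable" >&3    # the append still succeeds via the open fd
ls "$tmp"                    # prints nothing: the listing is already empty
exec 3>&-                    # closing the fd finally frees the inode
```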

Finding 2: A restart cannot recover; component crash-loops

Repro: ./repro-silent-stall.sh (phase 2; the script pauses after phase 1 so the stalled state can be verified before proceeding)

One-line repro: with the WAL in the post-Finding-1 state (numeric gap on disk), docker compose restart alloy.

What happens

The prometheus.remote_write constructor runs a "get segment range" scan of the WAL directory before the component can start. The same "segments are not sequential" check fires and the build fails. Alloy exits 1, Docker restarts it, the next attempt hits the identical state on disk, and the loop is permanent.
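
Because the rejecting state lives entirely on disk, restarts are deterministic: every attempt scans the same directory and fails the same check. For illustration only (not a verified recovery procedure), the kind of manual intervention that would restore sequential numbering is renaming the stranded segment down into the gap; whether a WAL "repaired" this way replays correctly is untested:

```shell
# Illustration only: recreate the broken on-disk shape, then close the gap.
# The empty files stand in for real segment data; this does NOT show that
# Alloy can safely replay a WAL repaired by renaming.
wal=$(mktemp -d)
touch "$wal/00000006" "$wal/00000008"   # post-Finding-1 state: 7 is missing
mv "$wal/00000008" "$wal/00000007"      # rename to restore sequential numbering
ls "$wal"                               # now lists 00000006 and 00000007
```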

System information

macOS 26.4.1 + Docker

Software version

Grafana Alloy v1.16.0

Configuration

// Minimal Alloy config used by the WAL-corruption repros.
// We only need:
//   - A scrape source that produces enough series to grow the WAL and trigger
//     checkpoints on the schedule below.
//   - A remote_write to the local stub.
// The truncation knobs are tuned aggressively so a checkpoint exists within
// ~30 seconds of startup (default truncate_frequency is 2h).

prometheus.scrape "self" {
  targets         = [{"__address__" = "127.0.0.1:12345", "job" = "alloy-self"}]
  forward_to      = [prometheus.remote_write.stub.receiver]
  scrape_interval = "1s"
  scrape_timeout  = "500ms"
}

prometheus.remote_write "stub" {
  endpoint {
    url = "http://stub:8080/api/v1/push"
  }

  wal {
    truncate_frequency = "30s"
    min_keepalive_time = "10s"
    max_keepalive_time = "1m"
  }
}

Logs

T+10 s  samples_out=66429   wal_appended=67637   watcher_cur_seg=7
T+20 s  samples_out=69751   wal_appended=70657   watcher_cur_seg=7
T+30 s  samples_out=70959   wal_appended=73677   watcher_cur_seg=7
T+45 s  samples_out=70959   wal_appended=78207   watcher_cur_seg=7
T+60 s  samples_out=70959   wal_appended=82737   watcher_cur_seg=7

# trying to restart
Container state: restarting restarts=6

Startup error from alloy logs:
  Error: /etc/alloy/config.alloy:16:1: Failed to build component:
         building component: get segment range: segments are not sequential
  Error: could not perform the initial load successfully

Tip

React with 👍 if this issue is important to you.
