
prometheus_remote_write sink: metrics are lost during endpoint downtime #23990

@jpovixwm

Description

Problem

I run Victoria Metrics on my desktop PC and Vector on my home router. The router is on 24/7, but the PC goes to sleep after a period of inactivity. While the PC is suspended, I expect Vector to buffer metrics to disk and re-send them once the PC is back online.
However, when the PC wakes up, all I see is a gap in the Victoria Metrics data covering the time the PC was suspended, and as far as I can tell the gap is never filled.

To aid in debugging this issue, I've devised a minimal reproduction case:

  1. Run Victoria Metrics: docker run --rm -p 127.0.0.1:8429:8428 victoriametrics/victoria-metrics:v1.127.0 (I'm binding to 8429 on the host because 8428 is already occupied by my real Victoria Metrics instance)
  2. Run a reverse proxy in front of Victoria Metrics and configure it as the prometheus_remote_write endpoint in Vector: docker run --rm --net=host caddy caddy reverse-proxy --from :1234 --to :8429 (killing and restarting the proxy simulates Victoria Metrics downtime, while Victoria Metrics itself stays up and reachable at http://127.0.0.1:8429, so you can watch the ingested metrics in real time)
  3. Run vector with the configuration shown below.
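  4. Query static_dummy_gauge at http://127.0.0.1:8429/vmui/ and observe the incoming data (the static_metrics source prefixes metric names with its default namespace, static, hence static_dummy_gauge rather than dummy_gauge).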
  5. Kill the reverse proxy container to simulate downtime.
  6. Wait some time and restart the reverse proxy.
  7. The gap should eventually be filled with data, but this never happens.
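Step 7 can also be checked without eyeballing the graph. The following is a minimal Python sketch, assuming Victoria Metrics is reachable on host port 8429 as in step 1 and that its /api/v1/export endpoint returns one JSON object per series per line, with a "timestamps" array in milliseconds. The helper names fetch_timestamps and find_gaps, and the 1.5× tolerance on the 10 s interval, are hypothetical choices for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: host port from step 1 of the reproduction.
VM_URL = "http://127.0.0.1:8429"


def fetch_timestamps(metric: str = "static_dummy_gauge") -> list[int]:
    """Return sorted raw sample timestamps (ms) for `metric`.

    Assumes /api/v1/export streams one JSON object per series per line,
    each with parallel "values" and "timestamps" arrays.
    """
    query = urllib.parse.urlencode({"match[]": metric})
    timestamps: list[int] = []
    with urllib.request.urlopen(f"{VM_URL}/api/v1/export?{query}") as resp:
        for line in resp.read().decode().splitlines():
            if line.strip():
                timestamps.extend(json.loads(line)["timestamps"])
    return sorted(timestamps)


def find_gaps(timestamps_ms: list[int], interval_s: float = 10.0,
              tolerance: float = 1.5) -> list[tuple[int, int]]:
    """Return (before, after) pairs of consecutive samples that are more
    than `tolerance` times the expected interval apart."""
    limit = interval_s * 1000 * tolerance
    return [
        (a, b)
        for a, b in zip(timestamps_ms, timestamps_ms[1:])
        if b - a > limit
    ]
```

Calling find_gaps(fetch_timestamps()) some time after the proxy is restarted should return an empty list if the gap was backfilled; a remaining (before, after) pair marks the span that is still missing.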

Configuration

data_dir: ./vector-data

sources:
  static_metrics:
    type: static_metrics
    interval_secs: 10
    metrics:
      - kind: absolute
        name: dummy_gauge
        tags:
          source: static_metrics
        value:
          gauge:
            value: 0

transforms:
  dummy_gauge:
    type: lua
    inputs:
      - static_metrics
    version: "2"
    hooks:
      process: |-
        function (event, emit)
          event.metric.gauge.value = os.time() % 60
          emit(event)
        end

sinks:
  emit_victoria_metrics:
    type: prometheus_remote_write
    inputs:
      - dummy_gauge
    endpoint: "http://localhost:1234/api/v1/write"
    buffer:
      type: "disk"
      max_size: 268435488
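Because the Lua hook sets each gauge to os.time() % 60, a sample's value should equal the second-of-minute of its own timestamp. That makes it possible to tell whether a backfilled point kept its original timestamp. A minimal Python sketch, assuming the timestamp assigned by the static_metrics source is within about a second of the moment the event passes through the transform; expected_value and is_consistent are hypothetical helper names:

```python
def expected_value(timestamp_ms: int) -> int:
    """Second-of-minute (0-59) for a Unix timestamp in milliseconds,
    mirroring os.time() % 60 in the Lua hook."""
    return (timestamp_ms // 1000) % 60


def is_consistent(timestamp_ms: int, value: float, slack_s: int = 1) -> bool:
    """True if `value` matches the timestamp's second-of-minute, allowing
    `slack_s` seconds of drift either way, with wrap-around at 60.

    Assumption: the source timestamp and the transform's os.time() call
    happen within `slack_s` seconds of each other.
    """
    diff = abs(expected_value(timestamp_ms) - int(value)) % 60
    return min(diff, 60 - diff) <= slack_s
```

A backfilled sample for which is_consistent returns False would suggest it was re-stamped on delivery rather than sent with its original timestamp.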

Version

vector 0.50.0 (x86_64-pc-windows-msvc 9053198 2025-09-23 14:18:50.944442940)

Debug Output

https://gist.github.com/jpovixwm/4698e6eb48bc8173dc0fd1adb77c64d6

Additional Context

The debug log was obtained by following the reproduction steps provided above. The reverse proxy container was killed at 2025-10-11T14:51:07.438Z, and restarted at 2025-10-11T14:52:07.428Z.
Curiously, Vector did backfill some of the missing data (3 data points) after the reverse proxy was restarted.

Those 3 data points are the first 3 that were produced after the reverse proxy was killed. As far as I can tell, this does not depend on the length of the downtime: Vector only ever backfills three samples correctly.
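For scale: with interval_secs: 10 and a downtime window from 14:51:07.438Z to 14:52:07.428Z (59.99 s), the source should have produced 5 or 6 samples during the outage, so only about half were recovered. A minimal Python sketch of that arithmetic, plus a helper to count the samples actually stored in the window (expected_range and samples_in_window are hypothetical names; the timestamps are the ones from this report):

```python
import math
from datetime import datetime, timezone

# Downtime window reported above (UTC).
KILLED = datetime(2025, 10, 11, 14, 51, 7, 438000, tzinfo=timezone.utc)
RESTARTED = datetime(2025, 10, 11, 14, 52, 7, 428000, tzinfo=timezone.utc)


def expected_range(start: datetime, end: datetime,
                   interval_s: float = 10.0) -> tuple[int, int]:
    """Bounds on how many ticks of period `interval_s` fall in [start, end).

    The exact count depends on the phase of the source's timer, so it is
    either floor(d / i) or ceil(d / i) for a window of length d.
    """
    ratio = (end - start).total_seconds() / interval_s
    return math.floor(ratio), math.ceil(ratio)


def samples_in_window(timestamps_ms: list[int], start: datetime,
                      end: datetime) -> list[int]:
    """Stored sample timestamps (ms) that fall inside [start, end)."""
    lo, hi = start.timestamp() * 1000, end.timestamp() * 1000
    return [t for t in timestamps_ms if lo <= t < hi]
```

Here expected_range(KILLED, RESTARTED) gives (5, 6), whereas the report observes 3 recovered samples in that window.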

References

#21410

Labels: type: bug (A code related bug.)

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions