[exporter/prometheusremotewriteexporter] Retry transient WAL export errors and add configurable segment cache size#49383

Open
charanck9 wants to merge 1 commit into
open-telemetry:mainfrom
charanck9:fix/prw-wal-segment-cache
@charanck9 charanck9 commented Jun 30, 2026

…e_size and restart backoff

Description

  1. Transient errors caused data loss. When a WAL export failed, the error was treated the same regardless of cause. Transient backend failures (5xx, network errors)
    were handled like permanent ones instead of being retried against the buffered WAL data, so data that should have been redelivered once the backend recovered was
    effectively dropped.

  2. Tight retry loop on persistent WAL errors. When continuallyPopWALThenExport returned an error, run() immediately restarted the WAL with no delay. If the error was
    persistent, this spun in a hot loop, burning CPU and flooding logs.

  3. No way to cap WAL memory usage. The WAL kept buffer_size segments cached in memory, and that count could not be lowered. On large backlogs this drove high memory
    consumption with no tuning knob.

#49334
Fixes

  1. Classify and retry transient errors. exportThenFrontTruncateWAL now retries the export indefinitely (with a 5s wait, cancellable via context/stop) until the
    backend recovers, instead of dropping the data. Permanent errors (e.g. 4xx, detected via consumererror.IsPermanent) are skipped and truncated since retrying can't
    help. While retrying, no new data is read from the WAL, which also bounds memory growth.

  2. Backoff before WAL restart. Added a 5s backoff (cancellable via context/stopChan) before run() restarts the WAL after a processing error, preventing the tight
    retry loop.

  3. Configurable segment_cache_size. New option (default = buffer_size) controlling how many WAL segments are cached in memory. Lower values cut memory usage at the
    cost of extra disk reads during replay; set to 2 for a minimal footprint. Documented in the README.

Testing

Documentation

Authorship

  • I, a human, wrote this pull request description myself.

