[exporter/prometheusremotewriteexporter] Retry transient WAL export errors and add configurable segment cache size #49383
Open
charanck9 wants to merge 1 commit into
…e_size and restart backoff
Description
Transient errors caused data loss. When a WAL export failed, the error was treated the same regardless of cause. Transient backend failures (5xx, network errors)
were handled like permanent ones instead of being retried against the buffered WAL data, so data that should have been redelivered once the backend recovered was
effectively dropped.
Tight retry loop on persistent WAL errors. When continuallyPopWALThenExport returned an error, run() immediately restarted the WAL with no delay. If the error was
persistent, this spun in a hot loop, burning CPU and flooding logs.
No way to cap WAL memory usage. The WAL always kept buffer_size segments cached in memory, and there was no option to lower that number. On large backlogs this
drove high memory consumption.
#49334
Fixes
Classify and retry transient errors. exportThenFrontTruncateWAL now retries a failed export indefinitely (waiting 5s between attempts, cancellable via context/stop)
until the backend recovers, instead of dropping the data. Permanent errors (e.g. 4xx, detected via consumererror.IsPermanent) are not retried: the affected entries
are skipped and truncated, since retrying can't help. While an export is being retried, no new data is read from the WAL, which also bounds memory growth.
Backoff before WAL restart. Added a 5s backoff (cancellable via context/stopChan) before run() restarts the WAL after a processing error, preventing the tight
retry loop.
Configurable segment_cache_size. New option (default = buffer_size) controlling how many WAL segments are cached in memory. Lower values cut memory usage at the
cost of extra disk reads during replay; set to 2 for a minimal footprint. Documented in the README.
Testing
Documentation
Authorship