# Proposal: Picking a Prometheus WAL

* Author(s): Kyle Eckhart (@kgeckhart)
* Last updated: 2025-12-09
* Original issue: https://github.com/grafana/alloy/issues/5039

## Abstract

As of today, there are two write-ahead log (WAL) implementations for Prometheus metrics in Alloy:

[prometheus.remote_write](https://grafana.com/docs/alloy/latest/reference/components/prometheus/prometheus.remote_write/) (referred to as remote_write)
- Largely based on upstream Prometheus concepts from the [head WAL](https://github.com/prometheus/prometheus/blob/main/tsdb/head_wal.go)
- Perceived as inheriting concepts that are relevant for a database, e.g. caring about "active series", although that is an implementation tradeoff rather than an inherited database concept
- One of the most used components in Alloy

[prometheus.write.queue](https://grafana.com/docs/alloy/latest/reference/components/prometheus/prometheus.write.queue/) (referred to as write.queue)
- Built to solve issues with remote_write, such as:
  - Reducing memory overhead as the number of "active series" increases
  - Adding the ability to replay data on startup
- Custom implementation based on https://github.com/grafana/walqueue/ with no links to upstream Prometheus
- Currently an experimental component with low usage
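
For reference, here is a minimal sketch of how each component is wired in an Alloy configuration. This is a hedged illustration, not a recommendation: the endpoint URL is a placeholder, and the labeled `endpoint` block for write.queue reflects my reading of the current experimental docs and may change.

```alloy
// Widely used component; WAL derived from the upstream Prometheus head WAL.
prometheus.remote_write "default" {
  endpoint {
    url = "https://prometheus.example.net/api/v1/write" // placeholder
  }
}

// Experimental component backed by github.com/grafana/walqueue.
prometheus.write.queue "experimental" {
  endpoint "default" {
    url = "https://prometheus.example.net/api/v1/write" // placeholder
  }
}
```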

## Problem

As Alloy continues to converge with the OpenTelemetry (OTel) Collector, we need a clear and concise story around our unique capabilities. The OTel Collector also has two WAL implementations:

* [WAL for all signals](https://opentelemetry.io/docs/collector/resiliency/#persistent-storage-write-ahead-log---wal): sending_queue with the file_storage extension
* Optional WAL for the [prometheus remote write exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/prometheusremotewriteexporter#getting-started)

The intention was that we would try to upstream walqueue and it would become the basis for both Prometheus and OTel.
As of today, upstreaming conversations have not started.
Realistically, convincing OTel to accept a third WAL implementation is unlikely, and we will face similar challenges for the Prometheus ecosystem.
There are other non-trivial investments required for write.queue, including:
* Solving scaling issues encountered with implementing Prometheus Remote Write v2
* Investing enough to feel comfortable transitioning the component from experimental -> public preview -> GA
* Deciding on a migration strategy / determining how best to wrap remote_write to use write.queue under the hood

**Goals:**

* Primary: Reduce the number of WAL implementations the Alloy team is responsible for
* Primary: Avoid a migration for customers currently using the remote_write component
* Secondary: Gauge how the OTel WAL implementations compare to Alloy's
  * Having this data will allow us to write up the pros and cons of choosing a "Prometheus native" pipeline vs. an "OTel native" pipeline for metrics
  * It will also give us data we can use to benefit both ecosystems via improvements to the OTel Collector WAL implementations and/or the Prometheus remote write exporter WAL

## Proposal 0: Do Nothing

We would continue to provide support for multiple WAL implementations, increasing toil.

This is not a viable option and accomplishes none of the goals.

## Proposal 1: Improve remote_write to be the WAL we need
There are some big sticking points with remote_write:
1. Data can be removed from the WAL before it is sent
1. Resource utilization spikes dramatically on startup and also spikes during normal usage
1. Memory scales in a non-linear fashion as the number of series grows
1. Cannot replay data on restart

All of these issues are resolvable with some upstream work or with changes to unique Alloy capabilities such as the LabelStore. Many were prototyped during a Grafana Labs hackathon.

As part of this we would fully move to wrapping the upstream Prometheus components instead of maintaining our current soft fork.
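
To make the first two sticking points concrete: the only retention lever remote_write exposes today is time-based WAL truncation, which is decoupled from delivery state. The sketch below is hedged: the `wal` block and its `truncate_frequency`, `min_keep_time`, and `max_keep_time` attributes reflect my reading of the current component docs, and the values shown are illustrative defaults.

```alloy
prometheus.remote_write "default" {
  endpoint {
    url = "https://prometheus.example.net/api/v1/write" // placeholder
  }

  // Retention is purely time-based: data older than max_keep_time can be
  // truncated even if it was never delivered (sticking point 1), and
  // replaying/checkpointing the WAL on startup contributes to the resource
  // spikes (sticking point 2).
  wal {
    truncate_frequency = "2h"
    min_keep_time      = "5m"
    max_keep_time      = "8h"
  }
}
```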

**Pros**
* It is already upstream in Prometheus, and improvements we make can be upstreamed
* Customers already use it
* Moving to 100% upstream will require lower maintenance burden over time

**Cons**
* We can improve memory overhead, but we might not get as memory efficient as write.queue
  * Total cost of ownership should be a wash in the end, as write.queue is more CPU heavy.
* Since we are dependent on upstream decisions some of the more complex efforts, like replay, could take more time

## Proposal 2: Invest in write.queue as our WAL

As an Alloy WAL implementation, write.queue solves the main sticking points for remote_write with a scaling profile that’s more predictable as the number of series increases and can accomplish replay*.

* The caveat with the implementation and replay is that it is not "transactional": as soon as data is read to be distributed for delivery, it is removed from disk, so any data that fails to be delivered is lost. remote_write faces a similar challenge, and transactional behavior is viewed as a requirement for a replay option in remote_write. Dropping this requirement dramatically reduces the complexity of a replay solution.

There are some concerns, though, about the level of investment required to feel comfortable transitioning the component from experimental -> public preview, and about the work it would take to migrate our existing remote_write customers.
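
For comparison, here is a hedged sketch of what tuning the on-disk queue looks like in write.queue today. The `persistence` block and its `max_signals_to_batch` and `batch_interval` attributes are from my reading of the experimental documentation and may change before the component graduates.

```alloy
prometheus.write.queue "experimental" {
  // Controls how much data is buffered in memory before being flushed to the
  // on-disk queue; that on-disk queue is what makes replay* after a restart
  // possible, with the non-transactional caveat described above.
  persistence {
    max_signals_to_batch = 10000
    batch_interval       = "5s"
  }

  endpoint "default" {
    url = "https://prometheus.example.net/api/v1/write" // placeholder
  }
}
```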

**Pros**

* More consistent resource utilization + replay* OOTB

**Cons**

* Designed with Prometheus remote write v1 in mind causing friction when implementing Prometheus remote write v2
* Unless we can upstream it, it’s 100% on us to maintain
* We need to consider how to migrate existing customers off of remote_write

## Compatibility

Changes proposed in Proposal 1 will be backwards compatible. Proposal 2 would require either breaking changes or a decision on how best to wrap `prometheus.write.queue` so it can replace the existing `prometheus.remote_write` component.
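
At the pipeline level the two components are consumed identically: both export a receiver that upstream components forward to. That is what makes a wrapping strategy conceivable; the open question is the argument schema of `prometheus.remote_write`, not the wiring. A sketch, with a placeholder scrape target and the component labels from the earlier examples:

```alloy
prometheus.scrape "apps" {
  targets = [{"__address__" = "app.example.net:9090"}] // placeholder target

  // Swapping WAL implementations only changes which exported receiver is
  // referenced here; the rest of the pipeline is unchanged.
  forward_to = [prometheus.remote_write.default.receiver]
  // forward_to = [prometheus.write.queue.experimental.receiver]
}
```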

## Implementation

1. Data can be removed from the WAL before it is sent: https://github.com/prometheus/prometheus/issues/17616 is the upstream proposal
1. Resource utilization spikes incredibly high on startup and spikes during normal usage: https://github.com/prometheus/prometheus/issues/17617 is the upstream proposal
1. Memory scales in a non-linear fashion as the number of series grows
   * There's an upstream proposal, https://github.com/prometheus/prometheus/issues/17619, to experiment with eliminating some of the dual caching of metadata between the WAL and the queue_manager. If that is successful, we would run the same experiment for series, removing much of the complexity and resources the queue_manager needs to track which series are active in the WAL
   * The second part of this is eliminating the overhead of the LabelStore. A proposal that simplifies when and how the LabelStore is used for Prometheus will be created shortly.
1. Cannot replay data on restart: Conversations have started to agree on the requirements that will guide the first implementation, but there is nothing formal to link to yet.

At the same time we would formally deprecate the `prometheus.write.queue` component in the next release.