Description
Is your proposal related to a problem?
Thanos Sidecar uses extremely large amounts of RAM when processing Prometheus queries that select high-cardinality data sets (#7395), even when the data set is mostly filtered and discarded.
Part of the cause of this behaviour is that Sidecar now buffers entire Prometheus remote-read responses in memory and re-sorts them (#6706) to work around a Prometheus bug with external labels (prometheus/prometheus#12605).
If re-sorting were not required, Sidecar could stream responses for many selectors, drastically reducing its memory use for some classes of query.
Worse, since Sidecar typically runs in a separate container, it may have separate QoS memory allocations. Ideally these could be much lower than those of Prometheus itself, but at present Sidecar's memory use is almost unbounded, depending on the user query it processes.
Getting rid of the unnecessary buffering and sorting would not wholly resolve these issues because Sidecar is still using its own PromQL executor and may have to accumulate large data-sets for label matching and sorting, but it'll have a huge impact for cases where the query response can be streamed.
Describe the solution you'd like
If the administrator is able to assert that there can be no conflict between external_labels injected by Prometheus into the response and labels in the TSDB from scrape targets or rewrite rules, then no re-sorting is required (see #6706 (comment)).
I propose a Sidecar configuration option to allow the admin to assert this, and have Sidecar stream responses instead of buffering and re-sorting them.
For example, if the administrator has ensured that all targets have relabel rules that drop the external label if it's found to be present, and has similar checks on all recording rules, there's no need for Thanos to wastefully buffer and sort.
Sidecar could also check series ordering as it streams and abort the query with an error if out-of-order series are detected. It cannot recover and switch to buffering at that point, since the out-of-order series have already been processed, but at least it would detect the failure mode instead of silently producing incorrect results.
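A minimal sketch of such an ordering check, to illustrate the idea rather than to describe actual Sidecar code. Everything here is hypothetical: the `Label`/`Labels` types, `compareLabels`, `orderChecker`, and `streamSeries` are stand-ins, and the comparison is only assumed to mirror the label ordering that Prometheus and Thanos use.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Label and Labels are simplified stand-ins for the real label types.
// A Labels value is assumed to be sorted by label name.
type Label struct{ Name, Value string }
type Labels []Label

// compareLabels orders two label sets pair by pair (name, then value).
// If one set is a strict prefix of the other, the shorter one sorts first.
func compareLabels(a, b Labels) int {
	for i := 0; i < len(a) && i < len(b); i++ {
		if c := strings.Compare(a[i].Name, b[i].Name); c != 0 {
			return c
		}
		if c := strings.Compare(a[i].Value, b[i].Value); c != 0 {
			return c
		}
	}
	return len(a) - len(b)
}

// ErrOutOfOrder is returned when a series sorts before its predecessor.
var ErrOutOfOrder = errors.New("series received out of order")

// orderChecker remembers only the previous label set, so its memory use
// does not grow with the number of series in the response.
type orderChecker struct {
	prev Labels
	seen bool
}

// Check fails if cur sorts strictly before the previous series. Equal
// label sets are accepted, on the assumption that one series may arrive
// split across several frames. The caller must not mutate cur afterwards.
func (c *orderChecker) Check(cur Labels) error {
	if c.seen && compareLabels(c.prev, cur) > 0 {
		return fmt.Errorf("%w: %v arrived after %v", ErrOutOfOrder, cur, c.prev)
	}
	c.prev, c.seen = cur, true
	return nil
}

// streamSeries forwards each series as soon as it arrives and aborts on
// the first ordering violation. Series already sent cannot be recalled.
func streamSeries(in []Labels, send func(Labels) error) error {
	var chk orderChecker
	for _, s := range in {
		if err := chk.Check(s); err != nil {
			return err
		}
		if err := send(s); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Deliberately mis-ordered input: the second series sorts before the first.
	in := []Labels{
		{{Name: "a", Value: "1"}, {Name: "c", Value: "x"}},
		{{Name: "a", Value: "1"}, {Name: "b", Value: "1"}, {Name: "c", Value: "x"}},
	}
	err := streamSeries(in, func(l Labels) error {
		fmt.Println("sent:", l)
		return nil
	})
	fmt.Println("out of order detected:", errors.Is(err, ErrOutOfOrder))
}
```

The checker holds a single label set at a time, so it keeps the streaming path cheap. The trade-off is the one noted above: by the time the violation is detected, earlier series have already been forwarded, so the only safe reaction is to fail the query.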
Describe alternatives you've considered
See the conversation in prometheus/prometheus#12605 (comment) for approaches examined for possibly fixing this on the Prometheus end. Nothing clear and conclusive has emerged.
I've also looked at adding external sorting and on-disk buffering to Thanos Sidecar as an alternative. It's unclear how practical this is, as it looks like the sidecar uses a simple array/slice representation for series in at least some places, which may not be simple to abstract efficiently into out-of-line storage that can overflow to disk.