
Add sidecar flag to bypass Prometheus response buffering and re-sorting #8487

@ringerc

Description


Is your proposal related to a problem?

Thanos Sidecar uses extremely large amounts of RAM when processing Prometheus queries that select high-cardinality data sets (#7395), even when most of the selected data is subsequently filtered out and discarded.

Part of the cause of this behaviour is that Sidecar now buffers entire Prometheus remote-read responses in memory and re-sorts them (#6706) to work around a Prometheus bug with external labels (prometheus/prometheus#12605).
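To make the problem concrete, here is a minimal sketch (hypothetical label values, not Thanos code) using the upstream github.com/prometheus/prometheus/model/labels package: when Prometheus injects an external label into some series but not into series that already carry a label of the same name, the relative sort order of those series can flip, which is why Sidecar currently re-sorts the whole response.

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
)

func main() {
	// Two series as stored in the TSDB, in sorted order.
	a := labels.FromStrings("__name__", "up", "replica", "B")
	b := labels.FromStrings("__name__", "up", "zone", "1")
	fmt.Println(labels.Compare(a, b) < 0) // true: a sorts before b

	// Prometheus injects the external label replica="A" into remote-read
	// responses, but only into series that do not already have a "replica"
	// label: a keeps replica="B", while b gains replica="A".
	bWithExternal := labels.FromStrings("__name__", "up", "replica", "A", "zone", "1")
	fmt.Println(labels.Compare(a, bWithExternal) < 0) // false: the order flipped
}
```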

If re-sorting were not required, Sidecar could stream responses for many selectors, drastically reducing its memory use for some classes of query.

Worse, since Sidecar typically runs in a separate container, it may have its own QoS memory limits. Ideally these could be set much lower than those of Prometheus proper, but at present Sidecar may use almost unbounded memory depending on the user query it processes.

Getting rid of the unnecessary buffering and sorting would not wholly resolve these issues, because Sidecar still uses its own PromQL executor and may have to accumulate large data sets for label matching and sorting, but it would have a huge impact in cases where the query response can be streamed.

Describe the solution you'd like

If the administrator is able to assert that the external_labels Prometheus injects into the response cannot conflict with labels already present in the TSDB (from scrape targets or rewrite rules), then no re-sorting is required (see #6706 (comment)).

I propose a Sidecar configuration option to allow the admin to assert this, and have Sidecar stream responses instead of buffering and re-sorting them.

For example, if the administrator has ensured that all targets have relabel rules that drop the external label if it's found to be present, and has similar checks on all recording rules, there's no need for Thanos to wastefully buffer and sort.
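In Prometheus configuration that would typically be a relabel rule with action: labeldrop for the external label name on every scrape job, plus equivalent care in recording rules. As a hedged illustration of what such a rule does, the sketch below uses the upstream github.com/prometheus/prometheus/model/relabel package; the external label name "cluster" and the scraped series are hypothetical, and the two-value return of relabel.Process assumes a recent Prometheus release.

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/model/relabel"
)

func main() {
	// Drop any scraped "cluster" label so it can never conflict with the
	// external label value Prometheus injects on remote read. This mirrors a
	// metric_relabel_configs entry with `action: labeldrop` and `regex: cluster`.
	dropExternal := &relabel.Config{
		Action: relabel.LabelDrop,
		Regex:  relabel.MustNewRegexp("cluster"),
	}

	// A scraped series that (incorrectly) carries its own "cluster" label.
	scraped := labels.FromStrings("__name__", "up", "cluster", "us-1", "instance", "a")

	cleaned, keep := relabel.Process(scraped, dropExternal)
	fmt.Println(cleaned, keep) // {__name__="up", instance="a"} true
}
```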

Sidecar could also check series ordering as it consumes the response stream and abort the query with an error if out-of-order series are detected. It cannot recover and switch to buffering in that case, since out-of-order samples have already been processed, but at least it would detect the failure mode instead of silently producing incorrect results.
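A minimal sketch of such a check (the helper name is hypothetical, not existing Thanos code): compare each series' label set with its predecessor as the stream is consumed and fail fast on the first violation.

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
)

// checkSeriesOrder returns an error as soon as a series arrives that sorts
// before its predecessor, so the query can be aborted instead of silently
// returning a mis-merged result.
func checkSeriesOrder(prev, cur labels.Labels) error {
	if labels.Compare(prev, cur) > 0 {
		return fmt.Errorf("out-of-order series in remote-read stream: %s received after %s", cur, prev)
	}
	return nil
}

func main() {
	a := labels.FromStrings("__name__", "up", "replica", "A")
	b := labels.FromStrings("__name__", "up", "replica", "B")
	fmt.Println(checkSeriesOrder(a, b)) // <nil>
	fmt.Println(checkSeriesOrder(b, a)) // error: a sorts before b
}
```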

Describe alternatives you've considered

See the conversation in prometheus/prometheus#12605 (comment) for approaches that were examined for fixing this on the Prometheus side; nothing clear and conclusive emerged.

I've also looked at adding an external sort with on-disk buffering to Thanos Sidecar as an alternative. It's unclear how practical this is: the sidecar appears to use a simple array/slice representation for series in at least some places, which may not be easy to abstract efficiently into out-of-line storage that can spill to disk.

Additional context
