Conversation

@afhassan (Contributor)

This PR explores using a series-batching optimization to reduce peak RingBuffer memory usage for queries with long lookback windows and high cardinality.

Problem: Queries like sum(increase(metric[24h])) with a small step size create a relatively large ring buffer for each matrix selector. This buffer is allocated once per series, so memory can quickly bloat and cause OOMs for high-cardinality queries.
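
For a rough sense of scale (assuming a 30s scrape interval and about 16 bytes per sample, which are illustrative numbers and not anything measured in this PR): a 24h window holds roughly 24 * 3600 / 30 = 2880 samples per series, i.e. around 45 KiB of ring buffer per series, so a query touching a million series would need on the order of tens of GiB for the buffers alone.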

Solution: When a matrix selector's window overlap exceeds a certain number of steps (for example, 100), the optimizer switches from step batching to series batching (a sketch of the resulting loop follows this list):

  • Process all steps for a small batch of series (default: 1000)
  • Reuse a fixed pool of ring buffers across all series batches
  • This significantly reduces peak memory and avoids OOMs
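
A minimal sketch of that loop, assuming a fixed pool of ring buffers sized to the series batch. The names here (ringBuffer, evaluate, processAllSteps) are illustrative only, not the PR's actual identifiers:

// Sketch only: series batching with a reused, fixed-size pool of ring buffers.
package main

type ringBuffer struct {
	samples []float64
}

// Clear resets the buffer so it can be reused by the next series batch.
func (b *ringBuffer) Clear() { b.samples = b.samples[:0] }

// evaluate walks the series in fixed-size batches, evaluating every step for
// one batch before moving on, reusing the same ring buffers each time.
func evaluate(seriesIDs []int, seriesBatchSize int) {
	// One ring buffer per slot in the batch, allocated once up front.
	pool := make([]*ringBuffer, seriesBatchSize)
	for i := range pool {
		pool[i] = &ringBuffer{}
	}

	for start := 0; start < len(seriesIDs); start += seriesBatchSize {
		end := start + seriesBatchSize
		if end > len(seriesIDs) {
			end = len(seriesIDs)
		}
		for i, id := range seriesIDs[start:end] {
			buf := pool[i]
			buf.Clear() // drop state carried over from the previous batch
			processAllSteps(id, buf)
		}
	}
}

// processAllSteps stands in for evaluating all steps of the query for a
// single series using its ring buffer.
func processAllSteps(seriesID int, buf *ringBuffer) {}

func main() {
	evaluate(make([]int, 2500), 1000)
}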

Changes:

  • Added an EnableHighOverlapBatching option to the engine
  • Extended the SelectorBatchSize optimizer to detect high-overlap queries (a rough sketch of this detection follows the list)
    • If detected, adjust StepsBatch to cover all steps and set a series batch size instead
  • Implemented a BufferPool for ring buffer reuse across series batches
  • Added a Clear() method to all buffer types so state is properly reset before reuse
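
A rough sketch of the detection logic described above. The threshold, field names, and default values here are assumptions for illustration, not the PR's actual code:

// Sketch only: decide between step batching and series batching based on how
// many steps a matrix selector's window overlaps.
package main

import (
	"fmt"
	"time"
)

const highOverlapStepThreshold = 100 // example threshold from the description above

type batchingDecision struct {
	stepsBatch      int // number of steps evaluated per batch
	seriesBatchSize int // 0 means series batching stays disabled
}

func decideBatching(selectorRange, step time.Duration, totalSteps int) batchingDecision {
	overlapSteps := int(selectorRange / step)
	if overlapSteps <= highOverlapStepThreshold {
		// Low overlap: keep the usual step batching.
		return batchingDecision{stepsBatch: 10, seriesBatchSize: 0}
	}
	// High overlap: cover all steps at once and batch over series instead.
	return batchingDecision{stepsBatch: totalSteps, seriesBatchSize: 1000}
}

func main() {
	// e.g. sum(increase(metric[24h])) evaluated at a 1m step over one day.
	fmt.Printf("%+v\n", decideBatching(24*time.Hour, time.Minute, 1440))
}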

This PR is a work in progress. The main blocker is validating that series batching can work for functions and operators that need all series present for execution.

// If any aggregate is present in the plan, the batch size is set to the configured value.
// The two exceptions where this cannot be done is if the aggregate is quantile, or
// when a binary expression precedes the aggregate.
func (m SelectorBatchSize) Optimize(plan Node, _ *query.Options) (Node, annotations.Annotations) {

I'm not sure this will even reach into ring buffers; right now this only applies to vectors in direct aggregations, I think?
