
Conversation

@fionaliao (Contributor) commented Nov 11, 2025

What this PR does

Rough implementation for splitting instant queries for functions over range vectors and caching the partial results. Currently this only works for sum_over_time() (though it doesn't do Kahan summation properly) and does not work for subqueries.

Initially I tried getting something working for both range and instant queries. I got something to pass tests (see these commits), but the code got very complex, so I simplified to just instant queries for now.

This isn't in a state to be merged; the PR is just opened to show the current progress and allow for initial comments.

Overview:

  • Added query splitting optimizer (ref) which goes through the nodes. If it finds a sum_over_time() function with a range vector selector that can be split with at least one split being cacheable, it'll create a new SplittableFunctionCall node over the FunctionCall node.
  • Materializing SplittableFunctionCall creates a newly introduced FunctionOverRangeVectorSplit operator (ref). This should give the same results as FunctionOverRangeVector, but with the query splitting and caching logic.
    • The split operator has a reference to the inner range vector selector node (instead of inner materialized operator).
    • Most computation happens in the SeriesMetadata() functions at the moment. I don't think SeriesMetadata is the best place to do all the work; it was just the simplest thing to implement for now.
    • SeriesMetadata() will load all the cached results to see what gaps there are. Uncached splits are merged if they're contiguous, and a RangeVectorSelector operator is materialized for each split. When SeriesMetadata() is called for each uncached split, the samples are loaded and the intermediate result is calculated (this part could be deferred later). A rough sketch of this split/merge idea is shown after this list.
    • The FunctionOverRangeVectorSplit keeps track of a series -> split mapping. When NextSeries() is called, it just moves to the next series in the mapping, gets the results for each split and merges them.
    • The current code assumes series from the inner operator aren't sorted lexicographically. For range vector selectors I think we can assume they are sorted, but possibly not with subqueries.
    • Storing results back into the cache is done in FunctionOverRangeVectorSplit.Finalize().
  • The splits are calculated based on the query time range at the moment. To be more effective, they should be calculated based on the time ranges of the blocks loaded from storage (i.e. take into account the offset and @ modifiers).
  • Query splitting tests here. They all pass. Also ran Mimir locally in ingest storage mode and logs show splitting and caching is happening.
  • This implementation is not very memory efficient at the moment - all the intermediate results are held in memory until Finalize() is called for FunctionOverRangeVectorSplit. This includes the series metadata for each split.
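As a rough illustration of the splitting and merging described above, here is a minimal sketch. It is not the PR's actual code; the names (split, computeSplits, mergeContiguousUncached, isCached) and the exact alignment rules are assumptions:

```go
// Minimal sketch of the split/merge idea - hypothetical names, not the PR's code.
package querysplitting

import "time"

type split struct {
	start, end time.Time // [start, end) covered by this split
	cached     bool      // true if an intermediate result was found in the cache
}

// computeSplits divides a selector's time range into aligned, fixed-size splits,
// marks which ones already have cached intermediate results, and merges the
// contiguous uncached ones so only one selector is materialized per gap.
func computeSplits(start, end time.Time, splitDuration time.Duration, isCached func(start, end time.Time) bool) []split {
	var splits []split

	// Align the first boundary so overlapping queries produce identical splits
	// (and therefore identical cache keys).
	cur := start.Truncate(splitDuration)
	if cur.Before(start) {
		cur = cur.Add(splitDuration)
	}

	// Leading partial split: unaligned, so never cached.
	if start.Before(cur) {
		leadingEnd := cur
		if end.Before(leadingEnd) {
			leadingEnd = end
		}
		splits = append(splits, split{start: start, end: leadingEnd})
	}

	// Full, aligned splits: these are the cacheable ones.
	for !cur.Add(splitDuration).After(end) {
		s := split{start: cur, end: cur.Add(splitDuration)}
		s.cached = isCached(s.start, s.end)
		splits = append(splits, s)
		cur = cur.Add(splitDuration)
	}

	// Trailing partial split: unaligned, so never cached.
	if cur.Before(end) {
		splits = append(splits, split{start: cur, end: end})
	}

	return mergeContiguousUncached(splits)
}

// mergeContiguousUncached merges adjacent uncached splits into a single split.
func mergeContiguousUncached(splits []split) []split {
	var merged []split
	for _, s := range splits {
		if len(merged) > 0 && !s.cached && !merged[len(merged)-1].cached {
			merged[len(merged)-1].end = s.end
			continue
		}
		merged = append(merged, s)
	}
	return merged
}
```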

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]. If changelog entry is not needed, please add the changelog-not-needed label to the PR.
  • about-versioning.md updated with experimental features.

@fionaliao changed the title from "Add cache and sum over time splitting" to "wip: splitting and caching instant queries for functions over range operators" on Nov 11, 2025
@fionaliao (Contributor Author):
Updated to track memory consumption for series metadata in 82f5024 (tracker was previously erroring) and added a few more TODOs 😅

if opts.EnableNarrowBinarySelectors {
	planner.RegisterQueryPlanOptimizationPass(plan.NewNarrowSelectorsOptimizationPass(opts.Logger))
}

// TODO: figure out how query splitting interacts with other optimisation passes
if opts.InstantQuerySplitting.Enabled {
Contributor:

If it's possible, I'd recommend running this before CSE - CSE is the point at which the query plan can change from a tree to a DAG, so all optimisation passes after CSE need to be aware of this which adds extra complexity.

Contributor Author:

Moved in 3026f30

// CacheKey generates a unique cache key for a planning node for use with intermediate result caching.
// Currently only supports MatrixSelector nodes (range vector selectors).
// For other node types, this function panics.
func CacheKey(node Node) string {
Contributor:

I imagine this would become a method on the Node interface.

Contributor Author:

I've refactored to make a SplittableNode interface with a QuerySplittingCacheKey() method for now instead of a method on Node, since at the moment, only MatrixSelector supports query splitting. Also this PR is getting quite large so I want to avoid making changes across all nodes if possible. commit: d09f20e

If we extend to support subqueries, we can move the method to Node since all nodes will have to be able to return a cache key at that point.

When looking into the cache key more, one concern I have is what happens if new fields are added to the node that might affect the results being returned. As an example - right now we need to include the SkipHistogramBuckets field in the cache key, since if that's true then histograms will have empty buckets. It's possible newer optimisation passes will do something similar. There's no automatic way of updating the cache key; people have to remember to update the QuerySplittingCacheKey() method.

I was considering just serializing the node protobuf (plus its children's protobuf) and using that as the cache key, but that includes the ExpressionPosition field which can vary if multiple queries have overlapping time ranges/splits that can be cached.
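For reference, a minimal sketch of what this interface might look like - the method name matches the one described above, but everything else here is an assumption rather than the PR's actual code, and Node refers to the existing planning Node interface:

```go
// Hypothetical sketch only - the PR's actual definition may differ.
type SplittableNode interface {
	Node

	// QuerySplittingCacheKey returns a key identifying the data this node selects,
	// independent of where the expression appears in the query, so overlapping
	// queries map to the same cache entries. Any field that can change the returned
	// samples (e.g. SkipHistogramBuckets) must be included in the key.
	QuerySplittingCacheKey() string
}
```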

Contributor:

> I've refactored to make a SplittableNode interface with a QuerySplittingCacheKey() method for now instead of a method on Node, since at the moment, only MatrixSelector supports query splitting. Also this PR is getting quite large so I want to avoid making changes across all nodes if possible. commit: d09f20e
>
> If we extend to support subqueries, we can move the method to Node since all nodes will have to be able to return a cache key at that point.

Sounds good to me.

> When looking into the cache key more, one concern I have is what happens if new fields are added to the node that might affect the results being returned. As an example - right now we need to include the SkipHistogramBuckets field in the cache key, since if that's true then histograms will have empty buckets. It's possible newer optimisation passes will do something similar. There's no automatic way of updating the cache key; people have to remember to update the QuerySplittingCacheKey() method.
>
> I was considering just serializing the node protobuf (plus its children's protobuf) and using that as the cache key, but that includes the ExpressionPosition field which can vary if multiple queries have overlapping time ranges/splits that can be cached.

EquivalentToIgnoringHintsAndChildren is another method that has a similar problem - it's very easy to forget when adding a new field. Ditto for MergeHints. One thing I've long wondered but never got round to implementing is some kind of code generation for these methods, so new fields aren't missed - maybe this is the thing that would force us to do this?

We could likely also generate Child, ChildCount, SetChildren, ReplaceChildren and possibly ChildrenLabels as well.

Comment on lines 183 to 184
// hacky way to get the planning nodes to the operator
PlanningNodes []Node
Contributor:

Have a look at how remote execution handles this - it might provide some inspiration for what to do here.

I'm imagining the materializer would pass the nodes and the Materializer to the operator, and then it can decide (in Prepare) whether or not to materialize the nodes based on whether or not there are cache hits for that time range.

Contributor Author:

The PlanningNodes field wasn't actually being used; it was a remnant from a previous version that should be removed. I have updated the materialization process to be a bit more like the remote exec one though in b6a77ff - so there's a new query splitting materializer, and the materializer now holds the reference to the cache rather than it being passed as an operator parameter, since only the splittable function nodes need the cache.

@@ -118,6 +118,7 @@ func FunctionOverRangeVectorOperatorFactory(
	name string,
	f FunctionOverRangeVectorDefinition,
) FunctionOperatorFactory {
	f.Name = name
Contributor:

Why is this needed?

Contributor Author:

Do you mean why we need to store the function name? It forms part of the cache key - we need the function name alongside the inner node cache key (and time range) since the intermediate result is specifically for the function rather than just the inner node.

We could set the function name directly in the function definition rather than on each operator factory call though.

Contributor:

> We could set the function name directly in the function definition rather than on each operator factory call though.

This makes sense to me.

Another option could be using the Function value from operators/functions/functions.proto, as this will just be an integer rather than a string, which might save a little space in the key and be cheaper to hash.
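As a rough illustration of the kind of key being discussed (not the PR's actual code - the function name, parameter names and separator below are assumptions), the key could combine the Function enum value, the inner node's cache key and the split's time range:

```go
package querysplitting

import "fmt"

// Hypothetical illustration only - not the PR's code. fn is assumed to be the
// integer Function enum value from operators/functions/functions.proto,
// innerNodeKey the inner selector's QuerySplittingCacheKey(), and startMs/endMs
// the split's time range.
func intermediateResultCacheKey(fn int32, innerNodeKey string, startMs, endMs int64) string {
	return fmt.Sprintf("%d|%s|%d|%d", fn, innerNodeKey, startMs, endMs)
}
```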

}

func (m *FunctionOverRangeVectorSplit) Prepare(ctx context.Context, params *types.PrepareParams) error {
	return nil
Contributor:

We'll need to call Prepare on all the child operators here - so some of the logic in SeriesMetadata will need to move here.

Contributor Author:

Does this mean we need to materialize the child operators here? I.e. in Prepare(), we would need to query the cache to determine which splits are cached or uncached, and materialize range vector operators for the uncached portions?

This could make it tricky to work with the cached data if we use memcached:

  • If we load all cached results for all splits and keep them in memory from Prepare() onwards, it can take up a lot of memory (a rough illustration of this option follows below).
  • An alternative is to query the cache multiple times for the same keys (e.g. Prepare() checks the cache for which keys are present, SeriesMetadata() reads the cache data). This may fail if entries are evicted between the calls though. I don't think there's a way in memcached to avoid this.
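To make the first option concrete, here is a minimal sketch of fetching every cached split in one multi-get. It uses the gomemcache client purely for illustration (the PR uses Mimir's own cache client), and the function and parameter names are assumptions:

```go
package querysplitting

import "github.com/bradfitz/gomemcache/memcache"

// Illustration only - loadCachedSplits is a hypothetical helper, not the PR's
// code. It shows the first option above: fetch every cached split in one
// multi-get during Prepare(), so later calls don't depend on the same entries
// still being present in memcached.
func loadCachedSplits(client *memcache.Client, splitKeys []string) (map[string][]byte, error) {
	items, err := client.GetMulti(splitKeys) // single round trip; missing keys are simply absent from the result
	if err != nil {
		return nil, err
	}
	cached := make(map[string][]byte, len(items))
	for key, item := range items {
		cached[key] = item.Value
	}
	return cached, nil
}
```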

Contributor Author:

I've moved creating splits into Prepare() and also call Prepare() on the child operators there (38c1bdf), though this currently means all cached results are completely loaded within Prepare().

Contributor:

I think if we can figure out a way to stream the cache entries out of Memcached, then that mitigates the issue of loading everything upfront.


func (m *FunctionOverRangeVectorSplit) materializeOperatorForTimeRange(start int64, end int64) (types.RangeVectorOperator, error) {
	subRange := time.Duration(end-start) * time.Millisecond
	subNode := m.innerNode.CreateNodeForSubRange(subRange)
Contributor:

Rather than creating a new node like this, what if we passed this overriding range to ConvertNodeToOperator?

Then, when we're materializing the node, we use that when creating the operator (if the range is set).

Most node materializers would ignore this, but range vector selectors and subqueries could adjust the range on the created operator to match.

One possible wrinkle: for an expression like max_over_time(foo[12h]) - min_over_time(foo[12h]) as an instant query, CSE will identify the foo[12h] as common and introduce a Duplicate node, so we'd likely also need to handle this there. If we change the key for Materializer.operatorFactories to include the time range and overridden range, then we'll get different operators for each corresponding split range, so things would then work fine. With this in place, we'll benefit from CSE if both functions have identical uncached splits (which would be the common case), and if only one function has a given uncached split, then it'll still behave correctly.
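A minimal sketch of the factory-cache key being suggested here - the struct and field names are assumptions, not the actual Materializer code:

```go
package querysplitting

import "time"

// Hypothetical sketch only. Keying the materializer's operator cache on the node
// plus the split's time range and overridden selector range means that, after CSE,
// the two foo[12h] selectors in max_over_time(foo[12h]) - min_over_time(foo[12h])
// share a single operator per identical uncached split, while different splits
// still get distinct operators.
type operatorFactoryKey struct {
	node            any           // the planning node being materialized (the planning Node interface in the real code)
	startMs, endMs  int64         // the split's query time range (Unix milliseconds)
	overriddenRange time.Duration // overridden selector range; zero when no override is applied
}
```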

Contributor Author:

That makes sense, but does this mean that we need to run the query splitting optimisation pass after CSE? So disregard this comment: #13472 (comment)

If we didn't merge uncached splits then we could run query splitting before CSE and still get the deduplication done, but then there's the issue of having a lot more calls to ingesters/store-gateways if all splits are uncached.

Contributor:

The query splitting optimisation pass can still run before CSE - materialization happens after planning is finished (or even later in Prepare).

Then the uncached splits can still be merged, and any duplicate expressions with the same uncached range(s) can still be deduplicated.

Contributor Author:

Updated as per suggestion in 1730442 - TestQuerySplitting_WithCSE checks that storage is only called once per deduped split (Code needs to be refactored a bit though)

Includes a big refactor of moving the operator into the querysplitting package and pre-calculating ranges.

Also fixes split ranges so they will align with block boundaries.
@fionaliao force-pushed the intermediate-cache-new branch from 6d70d9c to c3f4923 on November 28, 2025 16:24
@fionaliao force-pushed the intermediate-cache-new branch from c3eaf9f to 86ffbe1 on November 28, 2025 18:40
@fionaliao (Contributor Author) commented Nov 28, 2025

Since the initial PR was put up, the major updates have been:

  • Getting query splitting to work in conjunction with CSE
  • Improved polymorphism - FunctionOverRangeVectorSplit is now generic. Each splittable function now defines its own SplittableOperatorFactory, including its combine and generate functions and also how to serialize/deserialize its results from the cache (a rough sketch of this shape is shown after this list). The cache entry proto now just has results with bytes as its type, and separate protos are defined for each result type.
  • Additional functions implemented: (count|min|max)_over_time and rate/increase. Testing for these does need to be more comprehensive though.
  • Adjusted the time ranges queried so that splits align better with each other and with block boundaries, accounting for the offset and @ modifiers. See these comments and also these ones.
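A minimal sketch of what such a per-function factory might look like - the interface name matches the one above, but the method names, signatures, type parameter and the RangeVectorStepData placeholder are assumptions rather than the PR's actual code:

```go
// Hypothetical sketch only - the actual interface in the PR may differ.
package querysplitting

// RangeVectorStepData stands in here for whatever per-split sample data the
// operator passes to the factory; it is a placeholder, not the real type.
type RangeVectorStepData struct{}

type SplittableOperatorFactory[R any] interface {
	// Generate computes the intermediate result for one split from that split's raw samples.
	Generate(samples RangeVectorStepData) (R, error)

	// Combine merges the intermediate results of all splits for one series into the final value.
	Combine(results []R) (float64, error)

	// Marshal and Unmarshal convert the intermediate result to and from the bytes
	// stored in the cache entry's results field, using a per-result-type proto.
	Marshal(result R) ([]byte, error)
	Unmarshal(data []byte) (R, error)
}
```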

Next steps

  • Add stats around cache entry size and the number of series per split, then run with some real workloads to see how well it works and how much memory it uses, and iterate on the cache entry format.
  • Testing histograms
  • Handling cases where query splitting is worse (e.g. results too big to fit in a cache entry). An initial version could just be to cache which queries are "problematic" and fall back to non-split execution for them.
  • Probably some renaming
  • Not caching/reading from the cache for ranges in the OOO window
  • Annotation handling (possibly in a later PR)
  • Subquery support (in later PRs)
  • Other function support (in later PRs)

This PR itself will be split up into more PRs before being set as ready for review.

@fionaliao force-pushed the intermediate-cache-new branch from ce0a143 to 90ad2bd on December 11, 2025 11:30
@fionaliao force-pushed the intermediate-cache-new branch from 90ad2bd to a61a9bd on December 11, 2025 11:44