Skip to content

Conversation

@dimitarvdimitrov
Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov commented Dec 8, 2025

Changes are based on #13781

In draft because, i'm still not sure if this is the best way to approach this

Summary

Collects a sample of label values during statistics generation to enable better regex selectivity estimation.

This PR:

  • Adds SampleValues method to Statistics
  • Collects 1% sample of label values per label name (capped at 1024 values or 64KB)
  • Uses deterministic sampling for reproducibility
  • Configurable max values, sample rate, max sampled size

Related to #13782

Copy the Statistics interface from index.Statistics into the lookupplan
package to decouple query planning from the Prometheus TSDB index package.
Add SampleValues method to the Statistics interface and implement sample
collection in BlockStatistics. This enables selectivity estimation of
regex matchers by running them against a representative sample of actual
label values.

Sample collection uses:
- 1% sampling probability
- Max 1024 samples per label
- Max 64KB total bytes per label
- Deterministic seeding for reproducibility
Tests verify:
- Sample values are collected with ~1% probability
- Sampling is deterministic (same input -> same output)
- Non-existent labels return nil
- Small datasets behave correctly
Add configurable parameters for sample values collection:
- SampleValuesProbability (default 1%)
- SampleValuesMaxCount (default 1024)
- SampleValuesMaxBytes (default 64KB)

Also refactor Stats() to take CostConfig instead of separate
threshold parameters.
Document that sampled values are held in memory and that setting
any of the sample values parameters to 0 disables sampling (falling
back to 10% selectivity estimation for regexes).
Move string sampling logic from Stats() into a dedicated StringsSampler
struct with MaybeSample() and Sampled() methods. This allows for more
focused unit testing of the sampling behavior.
@dimitarvdimitrov dimitarvdimitrov added the changelog-not-needed PRs that don't need a CHANGELOG.md entry label Dec 8, 2025
@github-actions
Copy link
Contributor

github-actions bot commented Dec 8, 2025

Base automatically changed from dimitar/lookupplan/define-local-statistics-interface to main December 8, 2025 21:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog-not-needed PRs that don't need a CHANGELOG.md entry

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant