@dimitarvdimitrov dimitarvdimitrov commented Dec 8, 2025

Summary

Allow regex selectivity estimation to use actual sample label values instead of hardcoded 10% heuristics. This improves index lookup planning accuracy for regex matchers.

Changes

  • EstimateSelectivity() now accepts a sampleValues []string parameter
  • When sample values are provided, selectivity is computed by testing the regex against them; otherwise it falls back to the existing 10% heuristic
  • Computed selectivity is cached atomically to avoid recomputation

related to grafana/mimir#13782


Note

EstimateSelectivity now accepts sample label values to empirically compute regex selectivity (with caching); tests updated accordingly.

  • Labels matching/selectivity:
    • Update labels.Matcher.EstimateSelectivity(totalLabelValues, sampleValues) to use sampleValues for complex regex selectivity; falls back to 0.1 when none.
    • Add estimateComplexRegexSelectivity() and matchesN() in model/labels/cost.go.
    • Clamp and invert logic preserved; existing static fast paths unchanged.
  • Regex matcher internals (model/labels/regexp.go):
    • Add cached selectivity field FastRegexMatcher.estimatedSelectivity (atomic float), initialized to -1.
    • Wire cache into selectivity estimation path.
  • Tests (model/labels/cost_test.go):
    • Adjust calls to new EstimateSelectivity(..., nil).
    • Add tests for sample-driven selectivity and caching behavior.

Written by Cursor Bugbot for commit f40a7ba.
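The "atomic float initialized to -1" cache mentioned in the note can be sketched roughly as follows. This assumes a float64 stored through its bit pattern in the standard library's `atomic.Uint64` (Go 1.19+); the PR may use a different atomic float type. `atomicFloat` and `cachedEstimator` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

// atomicFloat stores a float64 as its IEEE 754 bit pattern in a uint64.
type atomicFloat struct{ bits atomic.Uint64 }

func (f *atomicFloat) Store(v float64) { f.bits.Store(math.Float64bits(v)) }
func (f *atomicFloat) Load() float64   { return math.Float64frombits(f.bits.Load()) }

// cachedEstimator is a hypothetical stand-in for a matcher with a cached estimate.
type cachedEstimator struct {
	selectivity atomicFloat
}

// newCachedEstimator sets the cache to -1, which means "not computed yet".
// Valid selectivities lie in [0, 1], so -1 cannot collide with a real value.
func newCachedEstimator() *cachedEstimator {
	e := &cachedEstimator{}
	e.selectivity.Store(-1)
	return e
}

// estimate returns the cached value if present, otherwise computes and stores it.
// Two goroutines may both compute on a cold cache; the last store wins.
func (e *cachedEstimator) estimate(compute func() float64) float64 {
	if v := e.selectivity.Load(); v >= 0 {
		return v
	}
	v := compute()
	e.selectivity.Store(v)
	return v
}

func main() {
	e := newCachedEstimator()
	calls := 0
	compute := func() float64 { calls++; return 0.25 }

	first := e.estimate(compute)
	second := e.estimate(compute)
	fmt.Println(first, second, calls)
}
```

The second call is served from the cache, so `compute` runs only once.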

Add sampleValues parameter to EstimateSelectivity() to enable better regex
selectivity estimation based on actual label values instead of hardcoded
heuristics.

- Add estimatedSelectivity field to FastRegexMatcher for caching
- Add matchesN helper method to Matcher for counting matches
- Update EstimateSelectivity to use sample values when available
- Cache sample-based selectivity to avoid recomputation
// Cache the computed selectivity
m.re.estimatedSelectivity.Store(selectivity)
return selectivity
}

Bug: Cached selectivity shared incorrectly across different labels

The estimatedSelectivity is cached on FastRegexMatcher, which is shared via a global cache across all Matcher instances that use the same regex pattern, regardless of label name. When two matchers such as label_a=~"pattern" and label_b=~"pattern" are created, they share the same FastRegexMatcher. If EstimateSelectivity is called for label_a with its own sample values, the result is cached and then incorrectly returned for label_b, even though label_b may have a completely different value distribution. The cache granularity is wrong: selectivity depends on label-specific sample values but is cached at the regex-pattern level.
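The problem can be reproduced in a small self-contained sketch. Everything here is a simplified stand-in, not the actual Prometheus types: `sharedRegex` plays the role of the globally cached regex matcher, `matcher` the per-label matcher, and the code is single-goroutine only.

```go
package main

import (
	"fmt"
	"regexp"
)

// sharedRegex stands in for a regex matcher that is shared per pattern.
type sharedRegex struct {
	re     *regexp.Regexp
	cached float64 // -1 means "not computed yet"
}

// regexCache is keyed by pattern only, so the label name plays no part.
var regexCache = map[string]*sharedRegex{}

func getRegex(pattern string) *sharedRegex {
	if r, ok := regexCache[pattern]; ok {
		return r
	}
	r := &sharedRegex{re: regexp.MustCompile("^(?:" + pattern + ")$"), cached: -1}
	regexCache[pattern] = r
	return r
}

// matcher pairs a label name with a (possibly shared) regex.
type matcher struct {
	name string
	re   *sharedRegex
}

func newMatcher(name, pattern string) matcher {
	return matcher{name: name, re: getRegex(pattern)}
}

// estimateSelectivity caches on the shared regex, reproducing the bug.
func (m matcher) estimateSelectivity(samples []string) float64 {
	if m.re.cached >= 0 {
		return m.re.cached // may have been computed for a different label
	}
	if len(samples) == 0 {
		return 0.1
	}
	matched := 0
	for _, v := range samples {
		if m.re.re.MatchString(v) {
			matched++
		}
	}
	m.re.cached = float64(matched) / float64(len(samples))
	return m.re.cached
}

func main() {
	a := newMatcher("label_a", "foo.*")
	b := newMatcher("label_b", "foo.*")

	// label_a: every sample matches.
	fmt.Println(a.estimateSelectivity([]string{"foo1", "foo2"}))
	// label_b: only 1 of 4 samples matches, but label_a's cached value is returned.
	fmt.Println(b.estimateSelectivity([]string{"bar", "baz", "qux", "foo"}))
}
```

Both calls return the value computed for label_a, because the cache key does not include anything about the label whose values were sampled.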


Contributor Author


damn, you're good. and right. this is a problem

@dimitarvdimitrov dimitarvdimitrov marked this pull request as draft December 8, 2025 19:02
@dimitarvdimitrov
Contributor Author

in draft because cursor found a bug
