-
Notifications
You must be signed in to change notification settings - Fork 10
labels: improve regex selectivity estimation with sample values #1051
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
labels: improve regex selectivity estimation with sample values #1051
Conversation
Add sampleValues parameter to EstimateSelectivity() to enable better regex selectivity estimation based on actual label values instead of hardcoded heuristics. - Add estimatedSelectivity field to FastRegexMatcher for caching - Add matchesN helper method to Matcher for counting matches - Update EstimateSelectivity to use sample values when available - Cache sample-based selectivity to avoid recomputation
| // Cache the computed selectivity | ||
| m.re.estimatedSelectivity.Store(selectivity) | ||
| return selectivity | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Cached selectivity shared incorrectly across different labels
The estimatedSelectivity is cached on FastRegexMatcher, which is shared via global cache across all Matcher instances using the same regex pattern, regardless of label name. When two matchers like label_a=~"pattern" and label_b=~"pattern" are created, they share the same FastRegexMatcher. If EstimateSelectivity is called for label_a with its specific sample values, that selectivity is cached and incorrectly returned for label_b even though label_b may have completely different value distributions. The cache granularity is wrong - selectivity depends on label-specific sample values but is cached at the regex-pattern level.
Additional Locations (1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
damn, you're good. and right. this is a problem
|
in draft because cursor found a bug |
Summary
Allow regex selectivity estimation to use actual sample label values instead of hardcoded 10% heuristics. This improves index lookup planning accuracy for regex matchers.
Changes
EstimateSelectivity()now accepts asampleValues []stringparameterrelated to grafana/mimir#13782
Note
EstimateSelectivity now accepts sample label values to empirically compute regex selectivity (with caching); tests updated accordingly.
labels.Matcher.EstimateSelectivity(totalLabelValues, sampleValues)to usesampleValuesfor complex regex selectivity; falls back to 0.1 when none.estimateComplexRegexSelectivity()andmatchesN()inmodel/labels/cost.go.model/labels/regexp.go):FastRegexMatcher.estimatedSelectivity(atomic float), initialized to-1.model/labels/cost_test.go):EstimateSelectivity(..., nil).Written by Cursor Bugbot for commit f40a7ba. This will update automatically on new commits. Configure here.