CNDB-16363: Improve matched rows estimation accuracy for memory indexes #2335
Merged
When a memory index contains very few rows and is split into many shards, we can expect a lot of variance in the number of rows between the shards. Hence, if we took only one shard to estimate the number of matched rows and extrapolated that to all shards to compute the estimated matching rows for the whole index, we would risk making a huge estimation error.
This commit changes the algorithm to take as many shards as needed to collect enough returned or indexed rows. For very tiny datasets, it is likely to use all shards for estimation. For big datasets, one shard will likely be enough, speeding up estimation.
This change also allows removing one estimation method: we no longer need to manually choose between the estimation from the first shard and the estimation from all shards.
Additionally, the accuracy of estimating NOT_EQ rows has been improved by letting the planner know that the union generated by NOT_EQ is disjoint, so the result set cardinality is the sum of the cardinalities of the subplans.
The commit also contains a fix for a bug that caused some non-hybrid queries to be counted as hybrid by the query metrics.
Unused keyRange parameters have been removed from the methods for estimating row counts in the index classes.
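For readers who want a concrete picture of the two estimation ideas described above, here is a minimal sketch. It is not the code from this PR: the class `ShardSamplingSketch`, the `ShardStats` holder, both method names, and the threshold parameters are hypothetical and exist only for illustration. The first method keeps adding shards to the sample until enough matched or indexed rows have been seen and then extrapolates by the fraction of shards sampled; the second shows that the cardinality of a disjoint union is simply the sum of the subplan cardinalities.

```java
// Hypothetical sketch; none of these names or thresholds come from the patch.
final class ShardSamplingSketch
{
    /** Per-shard counters: rows indexed by the shard and rows matching the predicate. */
    static final class ShardStats
    {
        final long indexedRows;
        final long matchedRows;

        ShardStats(long indexedRows, long matchedRows)
        {
            this.indexedRows = indexedRows;
            this.matchedRows = matchedRows;
        }
    }

    /**
     * Samples shards in order until either enough matched rows or enough
     * indexed rows have been collected, then scales the matched count
     * up to the full number of shards.
     */
    static long estimateMatchingRows(ShardStats[] shards, long minMatchedRows, long minIndexedRows)
    {
        long indexed = 0;
        long matched = 0;
        int sampled = 0;

        for (ShardStats shard : shards)
        {
            indexed += shard.indexedRows;
            matched += shard.matchedRows;
            sampled++;
            // Stop as soon as the sample is large enough to be trusted.
            if (matched >= minMatchedRows || indexed >= minIndexedRows)
                break;
        }

        if (sampled == 0)
            return 0; // no shards, nothing to estimate

        // If every shard was sampled, this is the exact count.
        return Math.round((double) matched * shards.length / sampled);
    }

    /** Disjoint subplans never return the same row twice, so their counts add up. */
    static long disjointUnionRows(long... subplanRows)
    {
        long total = 0;
        for (long rows : subplanRows)
            total += rows;
        return total;
    }
}
```

With thresholds of 100 matched or 1000 indexed rows, four tiny shards holding (0, 0), (3, 3), (0, 0) and (1, 0) indexed and matched rows are all sampled and the estimate is 3, whereas extrapolating from the first shard alone would have given 0. Four large shards of (5000, 250) each stop after the first shard and extrapolate to 1000.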
Author
Adjusted imports. I think this broke SingleRestrictionEstimatedRowCountTest, but it wasn't immediately obvious how.
❌ Build ds-cassandra-pr-gate/PR-2335 rejected by Butler: 5 regressions found
Found 5 new test failures
Found 1 known test failure
Member
We can see if the later commit from CNDB-17275 will help.
djatnieks approved these changes on Apr 21, 2026