CNDB-16363: Improve matched rows estimation accuracy for memory indexes #2335

Merged
driftx merged 1 commit into main-5.0 from CNDB-16710
Apr 21, 2026

@driftx driftx commented Apr 20, 2026

When a memory index contains very few rows and is split into many shards, we can expect a lot of variance in the number of rows between the shards. Hence, if we took only one shard to estimate the number of matched rows and extrapolated that to all shards to compute the estimated matching rows for the whole index, we would risk making a huge estimation error.

This commit changes the algorithm to take as many shards as needed to collect enough returned or indexed rows. For very tiny datasets, it is likely to use all shards for estimation. For big datasets, one shard will likely be enough, speeding up estimation.
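The idea can be illustrated with a minimal sketch. It is not the code from this commit: the class `ShardSamplingSketch`, the `ShardStats` holder, and the two thresholds `minIndexedRows` and `minMatchedRows` are hypothetical names chosen for illustration, and the extrapolation assumes shards hold similar row counts on average.

```java
import java.util.List;

// Illustrative sketch only; names and thresholds are assumptions, not the actual SAI classes.
class ShardSamplingSketch {
    /** Hypothetical per-shard statistics: rows indexed in the shard and rows matching the predicate. */
    static final class ShardStats {
        final long indexedRows;
        final long matchedRows;

        ShardStats(long indexedRows, long matchedRows) {
            this.indexedRows = indexedRows;
            this.matchedRows = matchedRows;
        }
    }

    /**
     * Visits shards in order until enough indexed or matched rows have been seen,
     * then scales the matched count from the visited shards up to all shards.
     */
    static long estimateMatchedRows(List<ShardStats> shards, long minIndexedRows, long minMatchedRows) {
        long indexed = 0;
        long matched = 0;
        int visited = 0;
        for (ShardStats shard : shards) {
            indexed += shard.indexedRows;
            matched += shard.matchedRows;
            visited++;
            // Stop early once the sample is large enough to extrapolate from.
            if (indexed >= minIndexedRows || matched >= minMatchedRows)
                break;
        }
        if (visited == 0)
            return 0;
        // Extrapolate from the visited shards to the whole index.
        return Math.round((double) matched * shards.size() / visited);
    }
}
```

With a tiny, unevenly distributed index the loop never reaches the thresholds and ends up visiting every shard, so the estimate equals the exact count. With a large index the first shard alone satisfies the threshold and the loop stops after one iteration.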

This change also allows us to remove one estimation method: we no longer need to manually choose between the estimation from the first shard and from all shards.

Additionally, the accuracy of estimating NOT_EQ rows has been improved by letting the planner know that the union generated by NOT_EQ is disjoint, so the result set cardinality is the sum of the cardinalities of the subplans.
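To see why the disjointness hint matters, here is a small hedged sketch. The class and method names are hypothetical. The "independent" formula is a common generic way to estimate a union when overlap is unknown; it is shown only for contrast and is an assumption, not necessarily what the planner did before this change.

```java
// Illustrative sketch only; not the planner's actual code.
class DisjointUnionSketch {
    /** Generic union estimate assuming the subplans match rows independently (assumed baseline). */
    static double independentUnionRows(double totalRows, double... subplanRows) {
        if (totalRows <= 0)
            return 0;
        double missedByAll = 1.0;
        for (double rows : subplanRows)
            missedByAll *= 1.0 - Math.min(1.0, rows / totalRows);
        return totalRows * (1.0 - missedByAll);
    }

    /** Union estimate when the subplans are known to be disjoint: a plain sum, capped at the table size. */
    static double disjointUnionRows(double totalRows, double... subplanRows) {
        double sum = 0;
        for (double rows : subplanRows)
            sum += rows;
        return Math.min(totalRows, sum);
    }
}
```

For example, with 1000 rows in total and two subplans estimated at 400 and 500 rows, the disjoint estimate is 900 rows, while the independence-based estimate is about 700 rows because it subtracts an overlap that cannot exist between disjoint branches.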

The commit also contains a fix for a bug that caused some non-hybrid queries to be counted as hybrid by the query metrics.

Unused keyRange parameters have been removed from the methods for estimating row counts in the index classes.

@github-actions

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

Author

driftx commented Apr 20, 2026

Adjusted imports. I think this broke SingleRestrictionEstimatedRowCountTest but it wasn't immediately obvious how.


@cassci-bot

❌ Build ds-cassandra-pr-gate/PR-2335 rejected by Butler


5 regressions found
See build details here


Found 5 new test failures

Test Explanation Runs Upstream
o.a.c.auth.CassandraRoleManagerTest.testPasswordUpdateRateLimitingPerRole (compression) REGRESSION 🔴 0 / 30
o.a.c.index.sai.cql.EstimatedRowCountTest.testReturnedRowsEstimates[numShards=64, numPartitions=10,000] (compression) NEW 🔴 0 / 30
o.a.c.index.sai.cql.VectorCompaction100dTest.testPQRefine[dc false] () NEW 🔴 0 / 30
o.a.c.index.sai.cql.VectorSiftSmallTest.testSiftSmall[db false] () NEW 🔴 0 / 30
o.a.c.index.sai.plan.SingleRestrictionEstimatedRowCountTest.testMemtablesSAI (compression) REGRESSION 🔴 0 / 30

Found 1 known test failures

@djatnieks
Member

> Adjusted imports. I think this broke SingleRestrictionEstimatedRowCountTest but it wasn't immediately obvious how.

We can see if the later commit from CNDB-17275 will help.

@driftx driftx merged commit 7324aef into main-5.0 Apr 21, 2026
5 of 7 checks passed
@driftx driftx deleted the CNDB-16710 branch April 21, 2026 13:43