CNDB-16363: Improve matched rows estimation accuracy for memory indexes by driftx · Pull Request #2335 · datastax/cassandra

driftx · 2026-04-20T18:31:34Z

When a memory index contains very few rows and is split into many shards, we can expect a lot of variance in the number of rows between the shards. Hence, if we took only one shard to estimate the number of matched rows and extrapolated that to all shards to compute the estimated matching rows for the whole index, we would risk making a huge estimation error.

This commit changes the algorithm to take as many shards as needed to collect enough returned or indexed rows. For very tiny datasets, it is likely to use all shards for estimation. For big datasets, one shard will likely be enough, speeding up estimation.
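The idea above can be illustrated with a minimal sketch. This is not the code from the commit: the class name, the method `estimateMatchedRows`, the per-shard arrays, and the two threshold parameters are all hypothetical, and extrapolating linearly by the number of visited shards is an assumption made for illustration.

```java
public class ShardSamplingSketch {
    /**
     * Hypothetical sketch: visit shards in order until enough indexed or
     * matched rows have been seen, then extrapolate to all shards.
     *
     * indexedPerShard[i] = rows indexed in shard i (assumed input)
     * matchedPerShard[i] = rows in shard i matching the predicate (assumed input)
     */
    static long estimateMatchedRows(long[] indexedPerShard, long[] matchedPerShard,
                                    long minIndexedRows, long minMatchedRows) {
        long indexed = 0;
        long matched = 0;
        int visited = 0;
        for (int i = 0; i < indexedPerShard.length; i++) {
            indexed += indexedPerShard[i];
            matched += matchedPerShard[i];
            visited++;
            // Stop as soon as the sample is large enough to be trusted.
            if (indexed >= minIndexedRows || matched >= minMatchedRows)
                break;
        }
        if (visited == 0)
            return 0;
        // Assumed linear extrapolation from the visited shards to all shards.
        return Math.round((double) matched * indexedPerShard.length / visited);
    }

    public static void main(String[] args) {
        long[] tinyIndexed = {1, 0, 2, 1};
        long[] tinyMatched = {1, 0, 0, 1};
        // Thresholds are never reached, so every shard is visited.
        System.out.println(estimateMatchedRows(tinyIndexed, tinyMatched, 100, 100));
        // Thresholds of 1 stop after the first shard and overestimate.
        System.out.println(estimateMatchedRows(tinyIndexed, tinyMatched, 1, 1));
    }
}
```

With the tiny dataset in `main`, stopping after the first shard extrapolates one match to four, while visiting all shards yields the exact count of two. With large, evenly filled shards the thresholds are met after the first shard, so the loop does no extra work.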

This change also allows us to remove one estimation method. We no longer need to manually choose between the estimation from the first shard and from all shards.

Additionally, the accuracy of estimating NOT_EQ rows has been improved by letting the planner know that the union generated by NOT_EQ is disjoint, so the result set cardinality is the sum of the cardinalities of the subplans.
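To see why the disjointness hint matters, here is a small sketch comparing two ways of estimating a union's cardinality. It is not the planner's code: the class and method names are hypothetical, and the "independent" formula is the common textbook assumption for overlapping subplans, which may differ from what the planner actually used before this change.

```java
public class UnionCardinalitySketch {
    /** Union estimate assuming subplans match rows independently (textbook assumption). */
    static double independentUnion(double totalRows, double... subplanRows) {
        if (totalRows <= 0)
            return 0;
        double noneMatch = 1.0;
        for (double rows : subplanRows)
            noneMatch *= 1.0 - Math.min(1.0, rows / totalRows);
        return totalRows * (1.0 - noneMatch);
    }

    /** Union estimate for disjoint subplans: the plain sum, capped at the table size. */
    static double disjointUnion(double totalRows, double... subplanRows) {
        double sum = 0;
        for (double rows : subplanRows)
            sum += rows;
        return Math.min(totalRows, sum);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: x != v split into two subplans matching 40 and 50 of 100 rows.
        System.out.println(independentUnion(100, 40, 50));
        System.out.println(disjointUnion(100, 40, 50));
    }
}
```

For the hypothetical numbers in `main`, the independence formula gives about 70 rows, while the disjoint sum gives 90. Because the two halves of a NOT_EQ can never match the same row, the sum is the accurate figure.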

The commit also contains a fix for a bug that caused some non-hybrid queries to be counted as hybrid by the query metrics.

Unused keyRange parameters have been removed from the methods for estimating row counts in the index classes.

github-actions · 2026-04-20T18:31:52Z

driftx · 2026-04-20T18:33:29Z

Adjusted imports. I think this broke SingleRestrictionEstimatedRowCountTest but it wasn't immediately obvious how.

sonarqubecloud · 2026-04-20T19:36:51Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
87.6% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

cassci-bot · 2026-04-20T19:40:46Z

❌ Build ds-cassandra-pr-gate/PR-2335 rejected by Butler

5 regressions found
See build details here

Found 5 new test failures

Test	Explanation	Runs	Upstream
o.a.c.auth.CassandraRoleManagerTest.testPasswordUpdateRateLimitingPerRole (compression)	REGRESSION	🔴	0 / 30
o.a.c.index.sai.cql.EstimatedRowCountTest.testReturnedRowsEstimates[numShards=64, numPartitions=10,000] (compression)	NEW	🔴	0 / 30
o.a.c.index.sai.cql.VectorCompaction100dTest.testPQRefine[dc false] ()	NEW	🔴	0 / 30
o.a.c.index.sai.cql.VectorSiftSmallTest.testSiftSmall[db false] ()	NEW	🔴	0 / 30
o.a.c.index.sai.plan.SingleRestrictionEstimatedRowCountTest.testMemtablesSAI (compression)	REGRESSION	🔴	0 / 30

Found 1 known test failures

djatnieks · 2026-04-21T01:15:02Z

Adjusted imports. I think this broke SingleRestrictionEstimatedRowCountTest but it wasn't immediately obvious how.

We can see if the later commit from CNDB-17275 will help.

djatnieks approved these changes Apr 21, 2026

driftx merged commit 7324aef into main-5.0 Apr 21, 2026
5 of 7 checks passed

driftx deleted the CNDB-16710 branch April 21, 2026 13:43
