
CNDB-15669: Fully off-heap memtable#2308

Open
blambov wants to merge 1734 commits into main-5.0 from CNDB-15669


@blambov blambov commented Apr 7, 2026

What is the issue

https://github.com/riptano/cndb/issues/15669
https://github.com/riptano/cndb/issues/10302

What does this PR fix and why was it fixed

Implementation of the fully off-heap, tombstone-aware memtable.

The first commit is CNDB-10302 as reviewed in #2005, adding tombstone support. The second refactors some of the access interfaces to combine the cursor position into a single long for efficiency and extra flexibility, which the third commit uses to lift some restrictions on the kinds of ranges that the tries can support. The fifth commit extends the memtable trie all the way to individual cells, and the sixth makes it possible to store data in trie cells. When used with the `offheap_objects` allocation type, this memtable is fully off-heap, with ~100KiB of on-heap presence irrespective of data size.

Each commit should compile and pass tests, and comes with documentation in the included markdown files.

michaeljmarshall and others added 30 commits June 13, 2025 11:53
This implements one of the possible fixes for
riptano/cndb#12407.

This PR skips loading the PrimaryKey's token in the case where the
sstables/memtables do not overlap, which is particularly helpful as
datasets become more compacted (especially after major compaction).

It is implemented via a subtle change to the `PrimaryKeyWithSource`
class that only loads the token info when the token is needed. We avoid
the load by first checking whether the sstable ranges overlap or whether
the key is contained in the sstable range. If it is not contained, we
can short circuit the logic and avoid loading the primary key from disk.
This results in a significant optimization for SAI hybrid queries that
search-then-sort.
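
The lazy-loading idea can be sketched roughly as follows. This is a minimal illustration, not the actual `PrimaryKeyWithSource` code: the class `LazyTokenKey`, its fields, the `long` token representation, and the `LongSupplier` standing in for the disk read are all hypothetical names chosen for the example.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch, not the real PrimaryKeyWithSource: the token is a plain long
// and the "disk read" is whatever the supplied LongSupplier does.
final class LazyTokenKey
{
    private final long sourceMinToken;      // smallest token in the source sstable
    private final long sourceMaxToken;      // largest token in the source sstable
    private final LongSupplier tokenLoader; // stands in for the expensive load from disk
    private boolean loaded;
    private long token;

    LazyTokenKey(long sourceMinToken, long sourceMaxToken, LongSupplier tokenLoader)
    {
        this.sourceMinToken = sourceMinToken;
        this.sourceMaxToken = sourceMaxToken;
        this.tokenLoader = tokenLoader;
    }

    // Loads the token on first use only; later calls reuse the cached value.
    long token()
    {
        if (!loaded)
        {
            token = tokenLoader.getAsLong();
            loaded = true;
        }
        return token;
    }

    // Returns whether this key can fall inside [otherMin, otherMax].
    boolean mayBeContainedIn(long otherMin, long otherMax)
    {
        // Disjoint ranges: the answer is known without touching the token.
        if (sourceMaxToken < otherMin || sourceMinToken > otherMax)
            return false;
        long t = token();
        return t >= otherMin && t <= otherMax;
    }
}
```

With non-overlapping ranges the loader is never invoked, which is the case the description says becomes more common as datasets get more compacted.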

CNDB PR with passing tests: riptano/cndb#12444

- [x] Make sure there is a PR in the CNDB project updating the Converged
Cassandra version
- [x] Use `NoSpamLogger` for log lines that may appear frequently in the
logs
- [x] Verify test results on Butler
- [x] Test coverage for new/modified code is > 80%
- [x] Proper code formatting
- [x] Proper title for each commit starting with the project-issue
number, like CNDB-1234
- [x] Each commit has a meaningful description
- [x] Each commit is not very long and contains related changes
- [x] Renames, moves and reformatting are in distinct commits
HCD requires custom authenticators enabled via a cqlsh plugin.

Backports [CASSANDRA-16456](https://issues.apache.org/jira/browse/CASSANDRA-16456)
to add support for `cqlsh` plugins.
Adds `datastax_db_*-VERSION.zip` to the list of automatically loaded plugins (used by HCD only).

---------

Co-authored-by: Bhouse99 <bhouse99@protonmail.com>
Co-authored-by: Stefan Miklosovic <smiklosovic@apache.org>
Fix MAX_SEGMENT_SIZE < chunkSize in MmappedRegions::updateState

Opportunistically, fixes some leaks in a test.
…hreshold or expiration period. Default disabled. (#1724)
…t with different address (#1666)

- add `IEndpointSnitch#filterByAffinityForWrite` and rename
`filterByAffinity` to `filterByAffinityForReads`
- **Add failing test**
- **CNDB-14153: Fix SAI updates (non-null solution)**

### What is the issue
Fixes riptano/cndb#14153

### What does this PR fix and why was it fixed
This is meant as an alternative to
#1749. It fixes
riptano/cndb#14153 by never returning `null`
from the `UpsertTransformer`.

#1749 is a more memory efficient solution, but has additional
complexity, which is why I am proposing this as an alternative.
### What is the issue
Fixes riptano/cndb#14171

### What does this PR fix and why was it fixed
#1200 introduced a bug for SAI
indexes version AA that have clustering columns. As the tests show,
updates incorrectly removed rows from the index.

We need the update logic for later versions of SAI, so it is key to keep
the update feature, but AA does not support those features precisely
because it only indexes the partition key, so this is a safe update.
Fixes: riptano/cndb#14160

The loop is supposed to loop until the deadline, not after the deadline.

The test fails without the change.
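
The commit message does not show the loop itself, so the following is only a generic sketch of the pattern it describes: polling until a deadline. The class `DeadlineLoop` and the method `pollUntil` are hypothetical names, not code from this change.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration of a loop that runs until a deadline.
final class DeadlineLoop
{
    // Polls the condition until it holds or the deadline (a System.nanoTime() value) passes.
    static boolean pollUntil(long deadlineNanos, BooleanSupplier condition)
    {
        // Keep looping while "now" is still before the deadline. With the comparison
        // flipped, the loop would only run once the deadline had already passed.
        while (System.nanoTime() - deadlineNanos < 0)
        {
            if (condition.getAsBoolean())
                return true;
            try
            {
                Thread.sleep(1);
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```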
This commit changes the approximate calculation of the average length of
documents with queried terms to the calculation of the average length of
all documents in a segment.

The average length is computed from the number of rows/documents and the
total number of terms in the documents. This PR stops calculating these
numbers per query execution; instead they are calculated during flushing
and compaction and stored in metadata. The disk format is therefore
updated to the new version 8 and `ED`. As a result, the new average length
applies from version `ED` onward, while older versions keep the previous
way of calculating it. Tests are added.
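
As a rough illustration of the arithmetic, the average document length of a segment is the total term count divided by the document count, and both counters can simply be summed when segments are combined. The class `SegmentDocStats` below is a hypothetical stand-in for the stored metadata, not the actual on-disk structure.

```java
// Hypothetical stand-in for the per-segment counters stored in metadata.
final class SegmentDocStats
{
    final long documentCount;  // number of rows/documents in the segment
    final long totalTermCount; // total number of terms across those documents

    SegmentDocStats(long documentCount, long totalTermCount)
    {
        this.documentCount = documentCount;
        this.totalTermCount = totalTermCount;
    }

    // Counters are additive, so combining two segments is a plain sum.
    static SegmentDocStats merge(SegmentDocStats a, SegmentDocStats b)
    {
        return new SegmentDocStats(a.documentCount + b.documentCount,
                                   a.totalTermCount + b.totalTermCount);
    }

    // Average length over all documents in the segment; 0 for an empty segment.
    double averageDocumentLength()
    {
        return documentCount == 0 ? 0.0 : (double) totalTermCount / documentCount;
    }
}
```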

This commit reduces code duplication related to BM25 sorting in
TrieMemtableIndex.

In the affected code, explicit types replace `var` in declarations, since
`var` is prohibited in Apache Cassandra (see CASSANDRA-20389).
A few IDEA warnings are also fixed in the affected files: typos in comments,
and a simplification that removes an unnecessary string builder.
…ytesCounted (#1735)

Modify CQL counters calculation of rows data size to include cell
deletions even after the row has been purged. That way, the counters
will count the same size in bytes for filtered and unfiltered base
partition/row iterators.

This solves a bug where byte-based paging is wrongly considering
replicas as exhausted if their responses contain tombstones. This leads
to queries using byte-based paging returning fewer rows than expected.
It includes all aggregation queries, where byte-based paging is always
used internally. The problem is that replicas apply counters to
unfiltered iterators, whereas the coordinator controls paging by
applying counters to reconciled and filtered iterators.

Co-authored-by: Andrés de la Peña <a.penya.garcia@gmail.com>
Co-authored-by: shunsaker <shunsaker@users.noreply.github.com>
…ore we run tests. The patch can be tested on InvertedIndexSearcherTest, for example.
… unloaded after tenant unassignment (#1766)

When the schema is unloaded after tenant unassignment, a compaction task
might finish without the corresponding index files, making the index
non-queryable.

Replace `isValid` with `isDropped` and `isUnloaded`. If the index is
dropped, the compaction task or index build can proceed without the index,
the same behavior as before. If the index is unloaded, the compaction task
or index build is aborted to avoid completing without index files.
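
The two outcomes can be summarised as a small decision function. This is a sketch under assumptions: `BuildDecision` and `IndexBuildPolicy` are hypothetical names, and checking "unloaded" before "dropped" is a choice made for the example, since the description does not say what happens when both flags are set.

```java
// Hypothetical outcome of the check a compaction task or index build performs.
enum BuildDecision { BUILD_INDEX, PROCEED_WITHOUT_INDEX, ABORT }

final class IndexBuildPolicy
{
    static BuildDecision decide(boolean isDropped, boolean isUnloaded)
    {
        // Unloaded: abort, so the task never completes without index files.
        if (isUnloaded)
            return BuildDecision.ABORT;
        // Dropped: the index is gone for good, so the task can finish without it.
        if (isDropped)
            return BuildDecision.PROCEED_WITHOUT_INDEX;
        return BuildDecision.BUILD_INDEX;
    }
}
```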

---

#1754 was reverted because the [CNDB PR](riptano/cndb#14179) failed to
compile with the wrong hash. This re-merges it.
…ize (#1755)

The data inserted into the trie in `TrieMemoryIndex` is `encodedTerm`,
which is built from `term` and only on the "available" bytes (between
`term.position()` and `term.limit()`). But the check that decides whether
to use the (more efficient) recursive path uses `term.limit()` to assess
the size of `encodedTerm`. If `term.position()` is not 0, this is
incorrect, and can lead to using the less optimal path completely
unnecessarily. This has been shown to happen when investigating
riptano/cndb#14153: the non-recursive path was
taken even for boolean values (because the nodes were using
`offheap_buffers`; with `offheap_objects`, the buffers getting to
`TrieMemoryIndex` are 0-positioned).

See riptano/cndb#14184.
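
The difference between the two size measures is easy to reproduce with a plain `ByteBuffer`. The sketch below assumes a fix that measures only the available bytes; the class `TermSizeCheck` and the threshold `MAX_RECURSIVE_KEY_SIZE` are hypothetical and do not come from `TrieMemoryIndex`.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of sizing a term by limit() versus remaining().
final class TermSizeCheck
{
    // Assumed threshold below which the recursive path would be chosen.
    static final int MAX_RECURSIVE_KEY_SIZE = 128;

    // Problematic measure: limit() also counts the bytes before position().
    static boolean fitsByLimit(ByteBuffer term)
    {
        return term.limit() <= MAX_RECURSIVE_KEY_SIZE;
    }

    // Measure of the available bytes only: remaining() == limit() - position().
    static boolean fitsByRemaining(ByteBuffer term)
    {
        return term.remaining() <= MAX_RECURSIVE_KEY_SIZE;
    }
}
```

For a one-byte value that sits at offset 1000 of a larger slab, `fitsByLimit` rejects the recursive path while `fitsByRemaining` accepts it; for a 0-positioned buffer the two agree.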
Fixes: riptano/cndb#14167

We upgraded to jvector 4 too soon. We need to use jvector 2 for a
release cycle and then when we upgrade next, we can go to jvector 4. We
needed a two phase release.

CNDB test PR: riptano/cndb#14196

Co-authored-by: Michael Marshall <michael.marshall@datastax.com>
…1763)

- **CNDB-14210: Fix analyzed sai index on compound partition key
column**

### What is the issue
Fixes: riptano/cndb#14210

### What does this PR fix and why was it fixed
Fixes some queries that were broken by the March and May releases. In
#1434, we introduced some logic to help make the eq behavior better, and
it incorrectly handled compound partition keys. This fixes that.

The central fix is to use the `EQRestriction` any time we have a primary
key column. This is necessary to ensure we can write and read data. The
tests cover the relevant cases. I also fixed the error message returned
when attempting to use `:` on a clustering column index. Please review
the text of the error message.
A few indexes were created with the `execute` method, which doesn't check
whether an index is ready. Changing it to `createIndex` fixes the observed
flakiness.
Currently, returning null from an Upserter may corrupt
the trie state, which can lead to serious problems.
It is better to crash instead.
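
One way to fail fast on a null result is sketched below. The commit message does not say how the crash is triggered, so the `AssertionError` and the `StrictUpsert` helper, which uses a plain `BiFunction` in place of the real upserter type, are assumptions made for illustration.

```java
import java.util.function.BiFunction;

// Hypothetical wrapper that refuses to pass a null merge result on to the caller.
final class StrictUpsert
{
    static <T, U> T apply(BiFunction<T, U, T> upserter, T existing, U update)
    {
        T merged = upserter.apply(existing, update);
        // Crash here rather than let a null reach the data structure being updated.
        if (merged == null)
            throw new AssertionError("Upserter returned null for update " + update);
        return merged;
    }
}
```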
CNDB-13770 Separate timeout for aggregation queries

In CC, aggregate user queries use the same range read timeout. In DSE,
we had a separate timeout that defaulted to 120s. We would like to
retain that functionality.

This PR adds a separate 120s timeout for aggregate queries. The timeout
is configurable with the `aggregation_request_timeout_in_ms` Config parameter.
…CNDB for EtcdSStable (#1848)

### What is the issue
Fixes CNDB-14680

Trying to build CNDB using CC 5.04.0 results in compilation failures in
uses of `SSTableIntervalTree.buildIntervals` because that method expects
a collection of `SSTableReader` but CNDB uses `EtcdSSTable` instead.

### What does this PR fix and why was it fixed
CNDB uses `EtcdSSTable` in place of `SSTableReader`, and CNDB's uses of
`SSTableIntervalTree.buildIntervals` get compile errors because CC
expects `SSTableReader` params.

There were CC changes made in STAR-13 and STAR-791 that replace
`Interval<PartitionPosition, SSTableReader>` with `<S extends
CompactionSSTable> Interval<PartitionPosition, S>` so that
`EtcdSSTable`, which does implement `CompactionSSTable`, can be used in
place of `SSTableReader`.

These changes didn't get applied during the C* 5.0.4 rebase, likely
because a) C* 5.0 already contains much/most of the changes that were
made in STAR-13/STAR-791 for CC 4.0, and b) CC code itself does not
require these changes to compile or run - they are intended for CNDB,
and went unnoticed until now.
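
The generic signature can be illustrated with simplified stand-ins. None of the types below are the real Cassandra classes: `CompactionSSTableSketch`, `IntervalSketch`, `IntervalBuilderSketch` and `RemoteSSTableSketch` are hypothetical, and positions are reduced to `long` values to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical stand-in for the common sstable interface.
interface CompactionSSTableSketch
{
    long firstPosition();
    long lastPosition();
}

// Hypothetical stand-in for an interval carrying a payload.
final class IntervalSketch<K, V>
{
    final K min;
    final K max;
    final V data;

    IntervalSketch(K min, K max, V data)
    {
        this.min = min;
        this.max = max;
        this.data = data;
    }
}

final class IntervalBuilderSketch
{
    // The bounded type parameter accepts any implementation of the interface
    // and preserves the element type S in the result.
    static <S extends CompactionSSTableSketch> List<IntervalSketch<Long, S>> buildIntervals(Collection<S> sstables)
    {
        List<IntervalSketch<Long, S>> intervals = new ArrayList<>(sstables.size());
        for (S sstable : sstables)
            intervals.add(new IntervalSketch<Long, S>(sstable.firstPosition(), sstable.lastPosition(), sstable));
        return intervals;
    }
}

// Plays the role of an implementation that is not an SSTableReader.
final class RemoteSSTableSketch implements CompactionSSTableSketch
{
    private final long first;
    private final long last;

    RemoteSSTableSketch(long first, long last)
    {
        this.first = first;
        this.last = last;
    }

    public long firstPosition() { return first; }
    public long lastPosition() { return last; }
}
```

A method that hard-codes one concrete class in its parameter type would reject a `Collection<RemoteSSTableSketch>` at compile time, which mirrors the failure described above.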
…f the DSE-compatibility flag (#1725)

When the `use_dse_compatible_histogram_boundaries` flag is set, we
mistakenly use the DSE `DecayingEstimatedHistogramReservoir` bucket
boundaries for `EstimatedHistogram`. This may lead to unexpected and
unhandled histogram overflow for extremely large partitions.

This fix makes `EstimatedHistogram` use, by default, the same bucket
boundaries as in upstream Cassandra and DSE `EstimatedHistogram`.
For riptano/cndb#14123, we want to be able to
catch issues happening during the opening of just-flushed sstables to
use shallow sstables. This commit enables this by adding a method to
`StorageHandler` that is called when such an issue occurs and that allows
providing a "replacement" `SSTableReader` instance.
…bles (#1762)

### What is the issue
This test doesn't disable compaction and doesn't retain a reference to
the sstables, so it can run with an unexpected amount of sstables and
also race with the removal of the sstables backing the
ReducingKeyIterator, which causes a variety of memory safety issues.
This test fails approximately 1/30 times when multiplexed in CI.

### What does this PR fix and why was it fixed
This fixes several issues. First, it disables compaction after the
schema is created. Second, it fulfills the contract of
ReducingKeyIteratorTest by taking references to the sstables, which
should be superfluous with compactions disabled, but I prefer testing
the contract. Third, it verifies that nothing is changing the number of
sstables in the system (which would have caught the previous compaction
misconfiguration).
Prefer not-analyzed indexes over analyzed indexes for contains queries,
so they have a deterministic behaviour.
Also, emit a client warning when a not-analyzed index is selected over
an analyzed index.

Otherwise, different points in the codebase will make different,
pseudo-random decisions about what index should be used for a certain
contains expression, leading to erratic behaviour.
…s additional observer on top of UCS (#1783)

### What is the issue

UCS didn't clear pending compaction tasks in
`BackgroundCompactions#compactions` for parallel background compaction

### What does this PR fix and why was it fixed

Register both UCS and composite compaction observer for parallel
compaction task: both UCS and CNDB are notified
…ystem property (#1785)

Memtable shard lock (required for `put`) is non-fair. We suspect this
leads to elevated latencies in case of bursty load, as in #13565.

This change introduces the `cassandra.trie.memtable.shard.lock.fairness`
system property and the `LockFairness` property
of the `org.apache.cassandra.db:type=TrieMemtableConfig` JMX object to
configure it persistently or on-line.
The on-line change takes effect once a new memtable is created (i.e.
after a flush). If forcing a flush is not desired, one can watch the
`BytesFlushed` metric for the table.
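
A minimal sketch of configuring lock fairness from a system property follows. Only the property name comes from the commit message. The boolean value format, the use of `ReentrantLock`, and the `ShardLockFactory` class are assumptions for illustration; the real property may accept different values and the memtable may use a different lock type.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical factory that reads the fairness setting when a lock is created.
final class ShardLockFactory
{
    static final String FAIRNESS_PROPERTY = "cassandra.trie.memtable.shard.lock.fairness";

    static ReentrantLock newShardLock()
    {
        // Assumption: the property holds "true" or "false"; unset means non-fair.
        boolean fair = Boolean.parseBoolean(System.getProperty(FAIRNESS_PROPERTY, "false"));
        // A fair lock hands the lock to the longest-waiting thread.
        return new ReentrantLock(fair);
    }
}
```

Because the setting is read when the lock is created, a changed value only affects locks created afterwards, which matches the note that an on-line change takes effect with the next memtable.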
driftx and others added 12 commits February 17, 2026 11:11
### What is the issue
When reading SSTables containing dropped columns with tuple types (or
UDTs containing tuples), the column ordering is being corrupted during
bitmap deserialization.

### What does this PR fix and why was it fixed
Fixes dropped column handling with User-Defined Types (UDTs). Column
ordering depends on isComplex(), which depends on the type's
isMultiCell(). When a dropped column's type had a different
isMultiCell() value in the schema vs the SSTable, column ordering became
incorrect, causing bitmap decode errors and data corruption.
### What is the issue

Fixes #16793

CC5 doesn't understand CC4 system tables and generates a new host_id on
upgrade

### What does this PR fix and why was it fixed

Reads CC4's file-based node metadata stored in MessagePack format and
converts them to CC5 LocalInfo and PeerInfo objects on first boot.
…void JUnit test timeouts. (#2121)

### What is the issue
CNDB-14023, ForceRepairTest fails sometimes in CI with a JUnit test
timeout.

### What does this PR fix and why was it fixed
Adds a timeout while waiting for node to be marked down instead of
waiting indefinitely and raising a JUnit timeout.
### What is the issue
UCS settings files are not deleted when the table is dropped. Instead,
they are supposed to be cleared after the node restarts. The cleanup is
faulty, though, and it prevents the node from starting up.

Root Cause:
The cleanupControllerConfig() method in CompactionManager attempts to
verify if a table exists by calling getColumnFamilyStore(). When the
table is dropped, this method throws IllegalArgumentException, which was
not being caught. The existing catch block only handled
NullPointerException (for missing keyspace).

### What does this PR fix and why was it fixed
Extended the exception handler to catch both NullPointerException and
IllegalArgumentException, allowing orphaned controller-config.JSON files
to be properly identified and deleted during node restart.
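
The shape of the fix can be sketched with a multi-catch. The `Function` below is a hypothetical stand-in for the keyspace and table lookup, and `ControllerConfigCleanup.isOrphaned` is an illustrative helper, not the actual `cleanupControllerConfig()` code.

```java
import java.util.function.Function;

// Hypothetical helper deciding whether a controller config file has no live table.
final class ControllerConfigCleanup
{
    static boolean isOrphaned(Function<String, Object> tableLookup, String tableName)
    {
        try
        {
            tableLookup.apply(tableName);
            return false; // the table still exists, keep its config file
        }
        catch (NullPointerException | IllegalArgumentException e)
        {
            // Missing keyspace (NPE) or dropped table (IAE): the file is orphaned.
            return true;
        }
    }
}
```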

5.0 counterpart of #2145.
Adds storage-compatibility guards for pre-5.0 mode around auth/role and
system schema behavior, defers incompatible changes to NONE
### What is the issue

This file is missing causing the
loadCommitLogAndSSTablesWithDroppedColumnTestCC50 test to fail

### What does this PR fix and why was it fixed

Adds the missing file.
### What is the issue
CIDR authz is a 5.0 feature

### What does this PR fix and why was it fixed
Gates CIDR setup by storage compatibility mode
### What is the issue
TrieMemtableMetricsTest's byteman rule has the wrong target method

### What does this PR fix and why was it fixed
Correctly uses the 'apply' method
### What does this PR fix and why was it fixed

Adds a new artifact alongside the existing one.
…2261)

### What is the issue
CNDB-17010

### What does this PR fix and why was it fixed
CC4 stored the memtable column in system_schema.tables as
frozen<map<text, text>>, while CC5 uses text. During upgrades,
binary-serialized map data is misinterpreted as UTF-8 text, causing
memtable configurations to fall back to defaults.
### What is the issue
CVE-2024-12798, CVE-2024-12801, CVE-2025-11226, CVE-2026-1225

### What does this PR fix and why was it fixed
Upgrades logback to 1.5.25
### What is the issue
There are security advisories present for jackson 2.18.4

### What does this PR fix and why was it fixed
Upgrades jackson-core, jackson-databind, and jackson-annotations to
2.18.6

github-actions Bot commented Apr 7, 2026

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

CNDB-17333: Create separate physical artifacts for db-all Maven coordinate

Update build.xml to create separate physical artifacts for db-all Maven
coordinate (com.datastax.db:db-all) alongside existing dse-db-all artifacts.

### What is the issue
CNDB-17333: Support publishing both db-all and dse-db-all artifacts

### What does this PR fix and why was it fixed

Creates separate physical `db-all` artifacts alongside existing `dse-db-all` artifacts to support gradual migration to new Maven coordinates.

**Changes to build.xml:**

1. **artifacts target** - Creates separate db-all artifacts:
   - Copies `db-all` JARs and POMs from `dse-db` artifacts
   - Creates separate `db-all-{version}-bin.tar.gz` from `${dist.dir}`
   - Creates separate `db-all-{version}-src.tar.gz` from `${basedir}`
   - Generates SHA-256 and SHA-512 checksums for all db-all tarballs

2. **publish target** - Signs both artifact sets:
   - Signs `dse-db-all` tarballs (existing)
   - Signs `db-all` tarballs (new)

**Artifacts created (example for version 5.0.4.0):**
- `dse-db-all-5.0.4.0.jar` + POM + sources
- `dse-db-all-5.0.4.0-bin.tar.gz` + checksums
- `dse-db-all-5.0.4.0-src.tar.gz` + checksums
- `db-all-5.0.4.0.jar` + POM + sources (copied from dse-db)
- `db-all-5.0.4.0-bin.tar.gz` + checksums (separate physical file)
- `db-all-5.0.4.0-src.tar.gz` + checksums (separate physical file)

**Deployment:**
Works with jenkins-pipeline-lib PR #254 which deploys artifacts under both Maven coordinates:
- `com.datastax.dse:dse-db-all:{version}`
- `com.datastax.db:db-all:{version}`

**Related PR:** riptano/jenkins-pipeline-lib#254
@lesnik2u lesnik2u self-requested a review April 8, 2026 13:47
blambov and others added 10 commits April 9, 2026 10:28
Implements a row-level trie memtable that uses deletion-aware
tries to store deletions separately from live data, together
with the associated TrieBackedPartition and TriePartitionUpdate.

Refactors trie hierarchy to support multiple trie types:
- plain
- range, which stores range boundaries and is able to answer
  questions about the range that applies to every point in the
  trie
- deletion aware, which combines a data part and a deletion range
  trie

Every trie type supports suitable operations, including merging
and intersection that make sense for the type of trie. In particular,
deletion-aware tries apply range branches to delete data during
merges.

Adds a new method to UnfilteredRowIterator that is implemented
by the new trie-backed partitions to ask them to stop issuing
tombstones. This is done on filtering (i.e. conversion from
UnfilteredRowIterator to RowIterator) where tombstones have already
done their job and are no longer needed.

Adds JMH tests of tombstones that demonstrate tombstone-independent
performance on memtable queries.
in a combined `encodedState` returned by advancing methods.
This saves megamorphic calls to `incomingTransition` and can
be augmented by further information at no cost.
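
Packing a cursor position into one `long` can be sketched as below. The bit layout shown here (depth in the upper 32 bits, incoming transition in the lower 32) and the `CursorPositionSketch` class are assumptions for illustration; the commit message does not describe the actual encoding of `encodedState`.

```java
// Hypothetical packing of a trie cursor position into a single long.
final class CursorPositionSketch
{
    // Depth goes in the upper 32 bits, the incoming transition in the lower 32.
    static long encode(int depth, int incomingTransition)
    {
        return ((long) depth << 32) | (incomingTransition & 0xFFFFFFFFL);
    }

    // Arithmetic shift keeps the sign, so negative depths round-trip too.
    static int depth(long encodedState)
    {
        return (int) (encodedState >> 32);
    }

    // Narrowing to int keeps exactly the lower 32 bits.
    static int incomingTransition(long encodedState)
    {
        return (int) encodedState;
    }
}
```

Returning one primitive value from an advancing call lets the caller read both fields without a second virtual call per step.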
This functionality has two main applications:
- it allows reverse walks that present prefix content in the correct
  byte-comparable order (i.e. prefixes after children)
- it makes it possible to have full control over what is and isn't
  included in trie ranges (e.g. making it possible to have a branch
  set and nested ranges)
…and TrieMemtable to Stage3 version

Remove duplicate configuration object and add tests for stage 3
This change extends the coverage of the memtable trie to the
cell level, defining mappings of trie branches to and from the
legacy concepts of complex columns and rows.
This makes it possible to have a completely off-heap trie memtable,
where cell data is stored inside the trie structure if it is small
enough to fit, or placed in natively-allocated memory and referenced
by memory address.

sonarqubecloud Bot commented Apr 9, 2026

@cassci-bot

❌ Build ds-cassandra-pr-gate/PR-2308 rejected by Butler


653 regressions found
See build details here


Found 653 new test failures

Showing only first 15 new test failures

| Test | Explanation | Runs | Upstream |
|---|---|---|---|
| junit.framework.TestSuite.org.apache.cassandra.distributed.test.sai.datamodels.QueryRowDeletionsTest-_jdk11 | REGRESSION | 🔵🔴 | 0 / 30 |
| junit.framework.TestSuite.org.apache.cassandra.distributed.test.sai.datamodels.QueryTimeToLiveTest-_jdk11 | REGRESSION | 🔵🔴 | 0 / 30 |
| junit.framework.TestSuite.org.apache.cassandra.distributed.test.sai.datamodels.QueryWriteLifecycleTest-_jdk11 | REGRESSION | 🔵🔴 | 0 / 30 |
| o.a.c.cql3.validation.entities.SecondaryIndexOnMapEntriesTest.testShouldRecognizeAlteredOrDeletedMapEntries (compression) | REGRESSION | 🔵🔴 | 0 / 30 |
| o.a.c.cql3.validation.entities.SecondaryIndexOnStaticColumnTest.testIndexOnCollections (compression) | REGRESSION | 🔴🔴 | 0 / 30 |
| o.a.c.cql3.validation.entities.SecondaryIndexTest.testDeletions (compression) | REGRESSION | 🔴🔴 | 0 / 30 |
| o.a.c.cql3.validation.entities.SecondaryIndexTest.testUpdatesToMemtableData (compression) | REGRESSION | 🔵🔴 | 0 / 30 |
| o.a.c.cql3.validation.entities.StaticColumnsTest.testStaticColumns (compression) | REGRESSION | 🔴🔴 | 0 / 30 |
| o.a.c.cql3.validation.entities.UFJavaTest.testJavaSimpleCollections (compression) | REGRESSION | 🔴🔴 | 0 / 30 |
| o.a.c.cql3.validation.entities.UFJavaTest.testJavaTupleTypeCollection (compression) | REGRESSION | 🔴🔴 | 0 / 30 |
| o.a.c.cql3.validation.entities.UFJavaTest.testJavaUTCollections (compression) | REGRESSION | 🔴🔴 | 0 / 30 |
| o.a.c.cql3.validation.entities.UFJavaTest.testJavaUserType (compression) | REGRESSION | 🔴🔴 | 0 / 30 |
| o.a.c.cql3.validation.entities.UFJavaTest.testJavaUserTypeWithUse (compression) | REGRESSION | 🔴🔴 | 0 / 30 |
| o.a.c.cql3.validation.entities.UFTypesTest.testComplexNullValues (compression) | REGRESSION | 🔴🔴 | 0 / 30 |
| o.a.c.cql3.validation.miscellaneous.TombstonesTest.initializationError (compression) | NEW | 🔴🔴 | 0 / 30 |

Found 22 known test failures

