This implements one of the possible fixes for riptano/cndb#12407. This PR skips loading the PrimaryKey's token when the sstables/memtables do not overlap, which is particularly helpful as datasets become more compacted (especially after major compaction). It is implemented via a subtle change to the `PrimaryKeyWithSource` class that only loads the token info when the token is needed. We avoid that load by first checking whether the sstable ranges overlap or the key is contained in the sstable range. If it is not contained, we can short-circuit the logic and avoid loading the primary key from disk. This results in a significant optimization for SAI hybrid queries that search-then-sort. CNDB PR with passing tests: riptano/cndb#12444

- [x] Make sure there is a PR in the CNDB project updating the Converged Cassandra version
- [x] Use `NoSpamLogger` for log lines that may appear frequently in the logs
- [x] Verify test results on Butler
- [x] Test coverage for new/modified code is > 80%
- [x] Proper code formatting
- [x] Proper title for each commit starting with the project-issue number, like CNDB-1234
- [x] Each commit has a meaningful description
- [x] Each commit is not very long and contains related changes
- [x] Renames, moves and reformatting are in distinct commits
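The token-loading short-circuit described in this PR can be sketched as follows. This is a minimal illustration in plain Java, where `SourceKey` and `TokenRange` are hypothetical stand-ins for `PrimaryKeyWithSource` and an sstable's token bounds, not the real Cassandra classes:

```java
import java.util.function.LongSupplier;

// Lazily loads the token only when it is actually requested; the supplier
// stands in for the deferred disk read done by the real code.
final class SourceKey {
    private final LongSupplier tokenLoader;
    private long token;
    private boolean loaded;

    SourceKey(LongSupplier tokenLoader) { this.tokenLoader = tokenLoader; }

    long token() {
        if (!loaded) { token = tokenLoader.getAsLong(); loaded = true; }
        return token;
    }
}

// Token bounds of an sstable; if two ranges don't overlap, no key from one
// needs its token loaded to be compared against the other.
final class TokenRange {
    final long min, max;
    TokenRange(long min, long max) { this.min = min; this.max = max; }

    boolean overlaps(TokenRange other) { return min <= other.max && other.min <= max; }
}
```

With this shape, the overlap check runs first, and `token()` (the expensive part) is only ever invoked for keys that might actually match.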
HCD requires custom authenticators enabled via a cqlsh plugin. Backports [CASSANDRA-16456](https://issues.apache.org/jira/browse/CASSANDRA-16456) to add support for `cqlsh` plugins. Adds `datastax_db_*-VERSION.zip` to the list of automatically loaded plugins (used by HCD only). --------- Co-authored-by: Bhouse99 <bhouse99@protonmail.com> Co-authored-by: Stefan Miklosovic <smiklosovic@apache.org>
Fix MAX_SEGMENT_SIZE < chunkSize in MmappedRegions::updateState. Opportunistically, this also fixes some leaks in a test.
…hreshold or expiration period. Default disabled. (#1724)
… skip slowest replica if configured (#1734)
…t with different address (#1666) - add `IEndpointSnitch#filterByAffinityForWrite` and rename `filterByAffinity` to `filterByAffinityForReads`
- **Add failing test**
- **CNDB-14153: Fix SAI updates (non-null solution)**

### What is the issue
Fixes riptano/cndb#14153
### What does this PR fix and why was it fixed
This is meant as an alternative to #1749. It fixes riptano/cndb#14153 by never returning `null` from the `UpsertTransformer`. #1749 is a more memory-efficient solution, but has additional complexity, which is why I am proposing this as an alternative.
### What is the issue Fixes riptano/cndb#14171 ### What does this PR fix and why was it fixed #1200 introduced a bug for SAI indexes version AA that have clustering columns. As the tests show, updates incorrectly removed rows from the index. We need the update logic for later versions of SAI, so it is key to keep the update feature, but AA does not support those features precisely because it only indexes the partition key, so this is a safe update.
Fixes: riptano/cndb#14160 The loop is supposed to loop until the deadline, not after the deadline. The test fails without the change.
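The intended loop shape can be sketched as follows; this is a minimal illustration, and `waitUntil` with its polling style is an assumption, not the actual test code:

```java
import java.util.function.BooleanSupplier;

final class DeadlineLoop {
    /**
     * Polls `condition` until it holds or the deadline passes; returns whether
     * it ever held. The deadline is checked *before* each iteration, so the
     * loop runs until the deadline, never past it.
     */
    static boolean waitUntil(BooleanSupplier condition, long deadlineNanos) {
        while (System.nanoTime() < deadlineNanos) {
            if (condition.getAsBoolean())
                return true;
            Thread.onSpinWait();
        }
        return condition.getAsBoolean(); // one final check at the deadline
    }
}
```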
This commit changes the approximate calculation of the average length of documents with queried terms to the calculation of the average length of all documents in a segment. The average length uses the number of rows/documents and the total number of terms in the documents. This PR changes how these numbers are obtained: instead of being calculated per query execution, they are calculated during flushing and compacting and stored in metadata. Thus the disk format is updated to the new version 8, `ED`. As a result, the new average length is applied from version `ED`, while older versions use the previous way of calculating it. Tests are added. This commit also reduces code duplication related to BM25 sorting in TrieMemtableIndex. In the affected code, explicit types replace vars in declarations, since vars are prohibited in Apache Cassandra, see CASSANDRA-20389. A few IDEA warnings are also fixed in the affected files: typos in comments, and code simplification by removing an unnecessary string builder.
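The segment-level statistic described above can be sketched as follows; `SegmentDocStats` and its field names are hypothetical, not the actual SAI metadata layout:

```java
// Per-segment statistics accumulated once, at flush/compaction time, instead
// of being recomputed on every query execution.
final class SegmentDocStats {
    final long totalTermCount; // total number of terms across all documents
    final long documentCount;  // number of rows/documents in the segment

    SegmentDocStats(long totalTermCount, long documentCount) {
        this.totalTermCount = totalTermCount;
        this.documentCount = documentCount;
    }

    /** Average length of all documents in the segment, used by BM25 scoring. */
    double averageDocumentLength() {
        return documentCount == 0 ? 0.0 : (double) totalTermCount / documentCount;
    }
}
```

Persisting the two counters in segment metadata is what forces the disk-format version bump: older segments lack them and must fall back to the previous per-query calculation.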
…ytesCounted (#1735) Modify the CQL counters' calculation of row data size to include cell deletions even after the row has been purged. That way, the counters will count the same size in bytes for filtered and unfiltered base partition/row iterators. This solves a bug where byte-based paging wrongly considers replicas as exhausted if their responses contain tombstones, which leads to queries using byte-based paging returning fewer rows than expected. This includes all aggregation queries, where byte-based paging is always used internally. The problem is that replicas apply counters to unfiltered iterators, whereas the coordinator controls paging by applying counters to reconciled and filtered iterators. Co-authored-by: Andrés de la Peña <a.penya.garcia@gmail.com> Co-authored-by: shunsaker <shunsaker@users.noreply.github.com>
…ore we run tests. The patch can be tested on InvertedIndexSearcherTest, for example.
… unloaded after tenant unassignment (#1766) When the schema is unloaded after tenant unassignment, a compaction task might finish without the corresponding index files, making the index non-queryable. Replace `isValid` with `isDropped` and `isUnloaded`. If the index is dropped, the compaction task or index build can proceed without the index, same behavior as before. If the index is unloaded, the compaction task or index build will be aborted to avoid completing without index files. --- #1754 was reverted because the [CNDB PR](riptano/cndb#14179) failed to compile due to a wrong hash. This re-merges it.
…ize (#1755) The data inserted into the trie in `TrieMemoryIndex` is `encodedTerm`, which is built from `term` using only the "available" bytes (between `term.position()` and `term.limit()`). But the check that decides whether to use the (more efficient) recursive path uses `term.limit()` to assess the size of `encodedTerm`. If `term.position()` is not 0, this is incorrect, and can lead to taking the less optimal path completely unnecessarily. This has been shown to happen when investigating riptano/cndb#14153: the non-recursive path was taken even for boolean values (because the nodes were using `offheap_buffers`; with `offheap_objects`, the buffers getting to `TrieMemoryIndex` are 0-positioned). See riptano/cndb#14184.
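The pitfall can be illustrated in isolation: the number of available bytes in a `ByteBuffer` is `remaining()` (`limit() - position()`), not `limit()`. `TermSizeCheck` and the threshold constant below are made up for illustration:

```java
import java.nio.ByteBuffer;

final class TermSizeCheck {
    static final int MAX_RECURSIVE_SIZE = 128; // hypothetical threshold

    /** Buggy variant: uses limit(), which overestimates whenever position() != 0. */
    static boolean canUseRecursivePathBuggy(ByteBuffer term) {
        return term.limit() <= MAX_RECURSIVE_SIZE;
    }

    /** Fixed variant: uses remaining(), the actual size of the encoded bytes. */
    static boolean canUseRecursivePath(ByteBuffer term) {
        return term.remaining() <= MAX_RECURSIVE_SIZE;
    }
}
```

A buffer with a large `limit()` but only a few bytes left between `position()` and `limit()` fails the buggy check while passing the correct one, which is exactly how small (e.g. boolean) terms ended up on the non-recursive path.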
Fixes: riptano/cndb#14167 We upgraded to jvector 4 too soon. We need to use jvector 2 for a release cycle, and then on the next upgrade we can go to jvector 4; we needed a two-phase release. CNDB test PR: riptano/cndb#14196 Co-authored-by: Michael Marshall <michael.marshall@datastax.com>
…1763) - **CNDB-14210: Fix analyzed sai index on compound partition key column** ### What is the issue Fixes: riptano/cndb#14210 ### What does this PR fix and why was it fixed Fixes some queries that were broken by the March and May releases. In #1434, we introduced some logic to improve the eq behavior, and it incorrectly handled compound partition keys. This fixes that. The central fix is to use the `EQRestriction` any time we have a primary key column. This is necessary to ensure we can write and read data. The tests cover the relevant cases. I also fixed the error message returned when attempting to use `:` on a clustering column index. Please review the text of the error message.
A few indexes were created with the execute method, which doesn't check whether an index is ready. Changing them to createIndex fixes the observed flakiness.
…ion observer for composite compaction (#1767) CNDB: riptano/cndb#14203
Currently, returning null from an Upserter may corrupt the trie state, which can lead to serious problems. It is better to crash instead.
CNDB-13770 Separate timeout for aggregation queries In CC, aggregate user queries use the same range read timeout. In DSE, we had a separate timeout that defaulted to 120s. We would like to retain that functionality. This PR adds a separate 120s timeout for aggregate queries. The timeout is configurable with the `aggregation_request_timeout_in_ms` `Config` parameter.
…CNDB for EtcdSStable (#1848) ### What is the issue Fixes CNDB-14680 Trying to build CNDB using CC 5.0.4.0 results in compilation failures in uses of `SSTableIntervalTree.buildIntervals`, because that method expects a collection of `SSTableReader` but CNDB uses `EtcdSSTable` instead. ### What does this PR fix and why was it fixed CNDB uses `EtcdSSTable` in place of `SSTableReader`, so CNDB's uses of `SSTableIntervalTree.buildIntervals` get compile errors due to CC expecting `SSTableReader` params. There were CC changes made in STAR-13 and STAR-791 that replace `Interval<PartitionPosition, SSTableReader>` with `<S extends CompactionSSTable> Interval<PartitionPosition, S>`, so that `EtcdSSTable`, which does implement `CompactionSSTable`, can be used in place of `SSTableReader`. These changes didn't get applied during the C* 5.0.4 rebase, likely because a) C* 5.0 already contains much of the changes that were made in STAR-13/STAR-791 for CC 4.0, and b) CC code itself does not require these changes to compile or run - they are intended for CNDB, and went unnoticed until now.
…f the DSE-compatibility flag (#1725) When using the `use_dse_compatible_histogram_boundaries` flag, we mistakenly use the DSE `DecayingEstimatedHistogramReservoir` bucket boundaries for `EstimatedHistogram`. This may lead to unexpected and unhandled histogram overflow for extremely large partitions. This fix makes `EstimatedHistogram` use, by default, the same bucket boundaries as the upstream Cassandra and DSE `EstimatedHistogram`.
For riptano/cndb#14123, we want to be able to catch issues happening during the opening of just-flushed sstables in order to use shallow sstables. This commit enables this by adding a method to `StorageHandler` that is called on such an issue and allows providing a "replacement" `SSTableReader` instance.
…bles (#1762) ### What is the issue This test doesn't disable compaction and doesn't retain a reference to the sstables, so it can run with an unexpected number of sstables and can also race with the removal of the sstables backing the ReducingKeyIterator, which causes a variety of memory safety issues. This test fails approximately 1/30 times when multiplexed in CI. ### What does this PR fix and why was it fixed This fixes several issues. First, it disables compaction after the schema is created. Second, it fulfills the contract of ReducingKeyIteratorTest by taking references to the sstables, which should be superfluous with compaction disabled, but I prefer testing the contract. Third, it verifies that nothing is changing the number of sstables in the system (which would have caught the previous compaction misconfiguration).
Prefer not-analyzed indexes over analyzed indexes for contains queries, so they have a deterministic behaviour. Also, emit a client warning when a not-analyzed index is selected over an analyzed index. Otherwise, different points in the codebase will make different, pseudo-random decisions about what index should be used for a certain contains expression, leading to erratic behaviour.
…s additional observer on top of UCS (#1783) ### What is the issue UCS didn't clear pending compaction tasks in `BackgroundCompactions#compactions` for parallel background compaction. ### What does this PR fix and why was it fixed Register both the UCS and composite compaction observers for parallel compaction tasks, so that both UCS and CNDB are notified.
…ystem property (#1785) The memtable shard lock (required for `put`) is non-fair. We suspect this leads to elevated latencies under bursty load, as in #13565. This change introduces the `cassandra.trie.memtable.shard.lock.fairness` system property and a `LockFairness` property on the `org.apache.cassandra.db:type=TrieMemtableConfig` JMX object to configure it persistently or on-line. The on-line change takes effect once a new memtable is created (i.e. after a flush). If forcing a flush is not desired, one can watch the `BytesFlushed` metric for the table.
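A minimal sketch of how such a property could select lock fairness, using the property name from the description above; `ShardLockFactory` is an illustrative stand-in, not the actual implementation:

```java
import java.util.concurrent.locks.ReentrantLock;

final class ShardLockFactory {
    static final String FAIRNESS_PROP = "cassandra.trie.memtable.shard.lock.fairness";

    /**
     * Each new memtable reads the property when its shard locks are created,
     * which is why an on-line change only takes effect after the next flush.
     */
    static ReentrantLock newShardLock() {
        boolean fair = Boolean.getBoolean(FAIRNESS_PROP); // defaults to false (non-fair)
        return new ReentrantLock(fair);
    }
}
```

Fair locks hand ownership to the longest-waiting thread, trading peak throughput for bounded waiting, which is the relevant trade-off under bursty load.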
### What is the issue When reading SSTables containing dropped columns with tuple types (or UDTs containing tuples), the column ordering is being corrupted during bitmap deserialization. ### What does this PR fix and why was it fixed Fixes dropped column handling with User-Defined Types (UDTs). Column ordering depends on isComplex(), which depends on the type's isMultiCell(). When a dropped column's type had a different isMultiCell() value in the schema vs the SSTable, column ordering became incorrect, causing bitmap decode errors and data corruption.
### What is the issue Fixes #16793 CC5 doesn't understand CC4 system tables and generates a new host_id on upgrade ### What does this PR fix and why was it fixed Reads CC4's file-based node metadata stored in MessagePack format and converts them to CC5 LocalInfo and PeerInfo objects on first boot.
…void JUnit test timeouts. (#2121) ### What is the issue CNDB-14023, ForceRepairTest fails sometimes in CI with a JUnit test timeout. ### What does this PR fix and why was it fixed Adds a timeout while waiting for node to be marked down instead of waiting indefinitely and raising a JUnit timeout.
### What is the issue UCS settings files are not dropped after the table gets dropped. Instead, they are supposed to be cleared after a node restart. The cleanup is faulty, though, and it prevents the node from starting up. Root cause: the `cleanupControllerConfig()` method in `CompactionManager` attempts to verify whether a table exists by calling `getColumnFamilyStore()`. When the table has been dropped, this method throws `IllegalArgumentException`, which was not being caught; the existing catch block only handled `NullPointerException` (for a missing keyspace). ### What does this PR fix and why was it fixed Extended the exception handler to catch both `NullPointerException` and `IllegalArgumentException`, allowing orphaned controller-config JSON files to be properly identified and deleted during node restart. 5.0 counterpart of #2145.
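The widened catch can be sketched as follows; `isOrphaned` and the lookup function are illustrative stand-ins for the real `cleanupControllerConfig()` flow, not the actual code:

```java
import java.util.function.Function;

final class ControllerConfigCleanup {
    /**
     * Decides whether a table's controller config file is orphaned. `lookup`
     * stands in for getColumnFamilyStore(): it throws NullPointerException if
     * the keyspace is gone and IllegalArgumentException if the table is gone.
     */
    static boolean isOrphaned(Function<String, Object> lookup, String table) {
        try {
            lookup.apply(table);
            return false; // table still exists, keep its config
        } catch (NullPointerException | IllegalArgumentException e) {
            // The fix: catch *both* exceptions, not just NullPointerException,
            // so a dropped table is also treated as "config file is orphaned".
            return true;
        }
    }
}
```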
Adds storage-compatibility guards for pre-5.0 mode around auth/role and system schema behavior, and defers incompatible changes until the storage compatibility mode is NONE.
### What is the issue This file is missing, causing the loadCommitLogAndSSTablesWithDroppedColumnTestCC50 test to fail. ### What does this PR fix and why was it fixed Adds the missing file.
### What is the issue CIDR authz is a 5.0 feature ### What does this PR fix and why was it fixed Gates CIDR setup by storage compatibility mode
### What is the issue TrieMemtableMetricsTest's byteman rule has the wrong target method ### What does this PR fix and why was it fixed Correctly uses the 'apply' method
### What does this PR fix and why was it fixed Adds a new artifact alongside the existing one.
…2261) ### What is the issue CNDB-17010 ### What does this PR fix and why was it fixed CC4 stored the memtable column in system_schema.tables as frozen<map<text, text>>, while CC5 uses text. During upgrades, binary-serialized map data is misinterpreted as UTF-8 text, causing memtable configurations to fall back to defaults.
### What is the issue CVE-2024-12798, CVE-2024-12801, CVE-2025-11226, CVE-2026-1225 ### What does this PR fix and why was it fixed Upgrades logback to 1.5.25
### What is the issue There are security advisories present for jackson 2.18.4 ### What does this PR fix and why was it fixed Upgrades jackson-core, jackson-databind, and jackson-annotations to 2.18.6
CNDB-17333: Create separate physical artifacts for db-all Maven coordinate
Update build.xml to create separate physical artifacts for db-all Maven
coordinate (com.datastax.db:db-all) alongside existing dse-db-all artifacts.
### What is the issue
CNDB-17333: Support publishing both db-all and dse-db-all artifacts
### What does this PR fix and why was it fixed
Creates separate physical `db-all` artifacts alongside existing `dse-db-all` artifacts to support gradual migration to new Maven coordinates.
**Changes to build.xml:**
1. **artifacts target** - Creates separate db-all artifacts:
- Copies `db-all` JARs and POMs from `dse-db` artifacts
- Creates separate `db-all-{version}-bin.tar.gz` from `${dist.dir}`
- Creates separate `db-all-{version}-src.tar.gz` from `${basedir}`
- Generates SHA-256 and SHA-512 checksums for all db-all tarballs
2. **publish target** - Signs both artifact sets:
- Signs `dse-db-all` tarballs (existing)
- Signs `db-all` tarballs (new)
**Artifacts created (example for version 5.0.4.0):**
- `dse-db-all-5.0.4.0.jar` + POM + sources
- `dse-db-all-5.0.4.0-bin.tar.gz` + checksums
- `dse-db-all-5.0.4.0-src.tar.gz` + checksums
- `db-all-5.0.4.0.jar` + POM + sources (copied from dse-db)
- `db-all-5.0.4.0-bin.tar.gz` + checksums (separate physical file)
- `db-all-5.0.4.0-src.tar.gz` + checksums (separate physical file)
**Deployment:**
Works with jenkins-pipeline-lib PR #254 which deploys artifacts under both Maven coordinates:
- `com.datastax.dse:dse-db-all:{version}`
- `com.datastax.db:db-all:{version}`
**Related PR:** riptano/jenkins-pipeline-lib#254
Implements a row-level trie memtable that uses deletion-aware tries to store deletions separately from live data, together with the associated TrieBackedPartition and TriePartitionUpdate. Refactors the trie hierarchy to support multiple trie types:
- plain
- range, which stores range boundaries and is able to answer questions about the range that applies to every point in the trie
- deletion aware, which combines a data part and a deletion range trie

Every trie type supports suitable operations, including merging and intersection, that make sense for the type of trie. In particular, deletion-aware tries apply range branches to delete data during merges. Adds a new method to UnfilteredRowIterator that is implemented by the new trie-backed partitions to ask them to stop issuing tombstones. This is done on filtering (i.e. conversion from UnfilteredRowIterator to RowIterator), where tombstones have already done their job and are no longer needed. Adds JMH tests of tombstones that demonstrate tombstone-independent performance on memtable queries.
in a combined `encodedState` returned by advancing methods. This saves megamorphic calls to `incomingTransition` and can be augmented by further information at no cost.
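The packing idea can be sketched as follows: a cursor's depth and incoming transition are combined into one long, so callers get both from a single advancing-method return value instead of a separate megamorphic `incomingTransition()` call. The bit layout here is an assumption for illustration, not the actual cursor encoding:

```java
// Packs cursor state into a single long:
//   bits 0-7:  incoming transition byte
//   bits 8-39: depth
//   bits 40+:  free for further flags "at no cost"
final class EncodedState {
    static long encode(int depth, int transitionByte) {
        return ((long) depth << 8) | (transitionByte & 0xFFL);
    }

    static int depth(long state) {
        return (int) ((state >>> 8) & 0xFFFFFFFFL);
    }

    static int incomingTransition(long state) {
        return (int) (state & 0xFF);
    }
}
```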
This functionality has two main applications:
- it allows reverse walks that present prefix content in the correct byte-comparable order (i.e. prefixes after children)
- it makes it possible to have full control over what is and isn't included in a trie's ranges (e.g. making it possible to have a branch set and nested ranges)
…and TrieMemtable to the Stage3 version. Removes a duplicate configuration object and adds tests for stage 3.
This change extends the coverage of the memtable trie to the cell level, defining mappings of trie branches to and from the legacy concepts of complex columns and rows.
This makes it possible to have completely off-heap trie memtable, where cell data is stored inside the trie structure if it is small enough to fit, or placed in natively-allocated memory and referenced by memory address.
### What is the issue
https://github.com/riptano/cndb/issues/15669
https://github.com/riptano/cndb/issues/10302
### What does this PR fix and why was it fixed
Implementation of the fully off-heap, tombstone-aware memtable.
The first commit is CNDB-10302 as reviewed in #2005, adding tombstone support. The second refactors some of the access interfaces to combine the cursor position into a single long for efficiency and extra flexibility, which the third commit uses to lift some restrictions in the kinds of ranges that the tries can support. The fifth commit extends the memtable trie all the way to individual cells, and the sixth makes it possible to store data in trie cells. When used with the `offheap_objects` allocation type, this memtable is fully off-heap, with ~100KiB of on-heap presence irrespective of data size. Each commit should compile and pass tests, and comes with documentation in the included markdown files.