CNDB-16593 main-5.0 5.0.6 rebase – DO NOT MERGE#2257
Closed
michaelsembwever wants to merge 1751 commits intomain-5.0from
Conversation
### What is the issue
This is a follow-up test for riptano/cndb#13696.
### What does this PR fix and why was it fixed
The fix was already merged, but this confirms that without the fix, collections are affected by the NoSuchElementException bug. (Rebase of commit 88f08e2)
Relates to riptano/cndb#13766. Brings in several improvements. CNDB PR: riptano/cndb#13789 (Rebase of commit 14cbbc6)
### What is the issue
No separate issue was created, as this is just a new test.
### What does this PR fix and why was it fixed
We implicitly rely on this behaviour and expose it as a feature to our end users, so we want to confirm that the stop words are the expected stop words. (Rebase of commit 27ac572)
Fixes: riptano/cndb#13718 We had adopted this version of Lucene (datastax/lucene@5ea8bb4) in order to support our custom modifications of HNSW on top of Lucene. We now use https://github.com/datastax/jvector for vector search and no longer need a custom build. I propose we use 9.8.0, since that is the closest release to the one we have been using. CNDB test pr: riptano/cndb#13757 (Rebase of commit 3d5d7d6)
This introduces a `PrefetchingRebufferer` that, when enabled, prefetches (reads from the underlying reader) a configurable number of chunks in advance. It therefore only works on top of rebufferer factories that work with chunks, i.e. not uncompressed mmap. Prefetching must first be enabled globally by setting `-Dcassandra.read_prefetching_size_kb` to the desired value. With that, and assuming the disk access mode allows it (meaning it is not uncompressed mmap), prefetching is applied to reads that are "clearly" sequential. As of this patch, this mostly means the sstable "scanners" and `SortedStringTableCursor`, so compaction, SAI index building and tools that read sstables fully (scrub, verifier) will benefit from it.

Since rebufferers are synchronous, prefetching is done through a fixed thread pool. The number of threads in that pool can be set with `-Dcassandra.read_prefetching_threads` (it defaults to the number of "processors"). The `-Dcassandra.read_prefetching_window` property can also be set to define how often prefetching is re-triggered: by default, when less than half of the value of `-Dcassandra.read_prefetching_size_kb` remains prefetched, prefetching is triggered again. See riptano/cndb#13639 for additional motivation.

This patch is adapted from the similar DSE behavior, but with a fair bit of modification, since the DSE version relied on the asynchronous nature of rebufferers there (it also had a few behaviors I didn't fully understand and didn't dig into when they felt specific to DSE). One point worth mentioning is that the DSE version relied on the ability of the underlying channel to create batches when multiple chunks are prefetched. This could have advantages and we could try to add it back over time, but it is much harder to see how to do that in the non-async world of C*.

Typically, in DSE, since all the prefetches were initiated on the current thread (without blocking it), building the "batch" was relatively easy, but that doesn't translate to this patch. Not to mention that currently the underlying channels have no batching options, and those would have to be added (in a way that trickles down to the channel implementation within CNDB). So, as of this patch, chunks are prefetched in parallel but with no batching optimization. This does mean the `window` parameter is probably not as useful as in DSE, but I've kept it nonetheless, as it doesn't add much complexity. In the context of compaction in CNDB, though, I think this means it would be ideal if we could use a relatively large "chunk size" for the readers: we read the full file anyway, so there is no waste in using a large-ish chunk size, and it would provide batching for prefetching "for free" (in the sense that if we want to prefetch, say, 128kb in advance, we only need 2 chunks with a 64kb chunk size, but many more with a 4kb chunk size). --------- Co-authored-by: Branimir Lambov <branimir.lambov@datastax.com> (Rebase of commit c57974a)
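The window heuristic described above can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual `PrefetchingRebufferer` API: prefetching is re-triggered once fewer bytes than the window (by default half the configured prefetch size) remain prefetched, and the chunk count needed to top up the window shows why a larger chunk size means fewer prefetch requests.

```java
// Sketch of the prefetch window trigger (hypothetical names, illustrative only).
final class PrefetchWindow
{
    private final long prefetchSizeBytes;
    private final long windowBytes;

    PrefetchWindow(long prefetchSizeKb)
    {
        this.prefetchSizeBytes = prefetchSizeKb * 1024;
        this.windowBytes = prefetchSizeBytes / 2; // default: half the prefetch size
    }

    /** True when enough prefetched data has been consumed to re-trigger prefetching. */
    boolean shouldTrigger(long prefetchedBytesRemaining)
    {
        return prefetchedBytesRemaining < windowBytes;
    }

    /** Number of chunks needed to fill the prefetch budget back up (ceiling division). */
    long chunksToPrefetch(long prefetchedBytesRemaining, long chunkSizeBytes)
    {
        long missing = Math.max(0, prefetchSizeBytes - prefetchedBytesRemaining);
        return (missing + chunkSizeBytes - 1) / chunkSizeBytes;
    }
}
```

With a 128kb budget, a 64kb chunk size needs only 2 chunks to fill the window, while a 4kb chunk size needs 32, matching the trade-off discussed above.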
…intervals in Interval.AsymetricOrdering are Comparable, and the SSTableReader mock in BaseCompactionStrategyTest violated the Comparator general contract, raising "IllegalArgumentException: Comparison method violates its general contract!". The root cause was that the objects being compared in Interval.minOrdering were not the same object by '==', yet the i1.data.compareTo(i2.data) result was 0, violating the comparator contract. Copilot's analysis was that Mockito mocks do not have a meaningful compareTo implementation, so comparing them is neither stable nor meaningful. This fix changes BaseCompactionStrategyTest.mockSSTable so that getId returns unique values, and provides an explicit comparator that compares using getId. (Rebase of commit 390a5ce)
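The shape of the fix can be illustrated in isolation. This is a simplified sketch (the real test mocks SSTableReader with Mockito): each mock gets a unique id, and an explicit comparator over that id is consistent and transitive, so compare(a, b) == 0 only when the ids are equal.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified stand-in for the mocked SSTableReader (hypothetical class).
class StableMockOrdering
{
    static final class MockSSTable
    {
        final long id;
        MockSSTable(long id) { this.id = id; }
        long getId() { return id; }
    }

    // Explicit comparator analogous to the one added to BaseCompactionStrategyTest:
    // comparing by a unique id satisfies the Comparator general contract.
    static final Comparator<MockSSTable> BY_ID = Comparator.comparingLong(MockSSTable::getId);
}
```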
…DRA-20189, but due to rebase conflicts the implementation was overwritten by CC SAI code. (Rebase of commit ce68642)
…ASSANDRA-20100, but due to rebase conflicts the implementation was overwritten by CC SAI code. (Rebase of commit 655714f)
(Rebase of commit da9aab6)
(Rebase of commit d122c73)
…essions, changing the default value of StorageAttachedIndexOptions.prioritize_over_legacy_index to 'true'. The prioritize_over_legacy_index option was introduced in 5.0.4 by CASSANDRA-20334 with a default of 'false'. (Rebase of commit d5896b9)
…1707)
### What is the issue
Failing `EarlyOpenCachingTest` due to incorrect assignment of data file length.
### What does this PR fix and why was it fixed
Passes the correct lengths to the early open sstable construction. This lets `SimpleSSTableScanner` correctly finish iteration. This was already fixed in CC (as part of CNDB-9104). (Rebase of commit 6e759fa)
…ild heap (#1693)
### What is the issue
Fixes: riptano/cndb#13689
### What does this PR fix and why was it fixed
This PR utilizes the NodeQueue::pushMany method to decrease the time complexity of building the NodeQueue from `O(n log(n))` to `O(n)`. This is likely only significant for sufficiently large hybrid queries. For example, we have seen cases of the search producing 400k rows, which means we do 400k insertions into these NodeQueue objects. (Rebase of commit 0f7e07b)
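The complexity argument mirrors classic bottom-up heap construction: inserting n elements one at a time costs O(n log n), while building the heap from all n elements at once is O(n). A JDK illustration of the same idea (jvector's NodeQueue itself is not used here; `PriorityQueue`'s bulk constructor heapifies bottom-up):

```java
import java.util.List;
import java.util.PriorityQueue;

// Contrast one-by-one insertion with bulk heap construction.
class BulkHeapBuild
{
    static PriorityQueue<Integer> oneByOne(List<Integer> scores)
    {
        PriorityQueue<Integer> q = new PriorityQueue<>();
        for (int s : scores)
            q.offer(s);          // O(log n) per insertion -> O(n log n) total
        return q;
    }

    static PriorityQueue<Integer> bulk(List<Integer> scores)
    {
        // The collection constructor heapifies in O(n), analogous to pushMany.
        return new PriorityQueue<>(scores);
    }
}
```

Both heaps yield the same poll order; only the construction cost differs, which is what matters for the 400k-row searches mentioned above.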
Fixes riptano/cndb#13822
CNDB test pr: riptano/cndb#13826

SAI's DiskANN/JVector engine currently **always searches with pruning enabled**, trading recall for latency. That is fine for most workloads, but:
* **Threshold / bounded ANN queries** can lose matches when pruning exits early.
* Performance‑testing users need an easy way to turn pruning off to measure the recall/latency curve.

This patch introduces a per‑query **`use_pruning`** option so users and operators can choose the trade‑off that suits them.

```cql
WITH ann_options = {'use_pruning': true|false}
```

*When omitted, we fall back to the node-level default (see below).*
* The cluster‑wide default is controlled by the JVM system property `-Dcassandra.sai.jvector.use_pruning_default=<true|false>`, exposed as `V3OnDiskFormat.JVECTOR_USE_PRUNING_DEFAULT`.
* The property defaults to `true`, preserving existing behaviour.
* The value must be the literal `true` or `false` (case‑insensitive).
* Unknown ANN option keys continue to raise `InvalidRequestException`.
* `Orderer` computes the effective `usePruning` value, using the `use_pruning` value if it is not null or the JVM default otherwise, and passes it down to **all** `graph.search()` calls.
* Threshold / bounded ANN queries always pass `use_pruning = false` because correctness > latency for those paths (this is a net-new change, but it is very minor and might not have any impact on those queries, depending on the jvector implementation).
* We added one flag bit to the `ANNOptions` serialization; older nodes ignore unknown bits, so mixed‑version clusters are fine (though they do throw an exception for unknown settings).
* Parsing, validation and transport round‑trips (`ANNOptionsTest`).
* Distributed smoke test (`ANNOptionsDistributedTest`).
* Recall regression for pruning vs no‑pruning (`VectorSiftSmallTest.ensureDisablingPruningIncreasesRecall`).

(Rebase of commit 23be2e7)
(Rebase of commit 47f2c2a)
…1705)
### What is the issue
Fixes: riptano/cndb#13625
### What does this PR fix and why was it fixed
When `rerank_k` is non-positive, we will skip reranking where possible. A minor note: this change is not strictly backwards compatible because of the original implementation. However, it will fail quickly if the coordinator accepts the value but the writer does not, and since this is only a query-time parameter, I propose that we move forward and accept the net-new value that was previously disallowed. (Rebase of commit 717971d)
…ation (#1709) Fixes: riptano/cndb#13847 We already make use of the `Token#getLongValue` method in SAI, which only works for the Murmur partitioner, so I propose that we add a symmetric method that allows us to create tokens without converting a long to a byte buffer and back to a long. (Rebase of commit 68a92da)
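The redundancy being removed can be shown in miniature. This is not SAI's actual API, just an illustration: without a symmetric factory method, turning a long back into a token means encoding it into a `ByteBuffer` only to decode the same long straight back out.

```java
import java.nio.ByteBuffer;

// Illustrative only: the long -> ByteBuffer -> long round trip is pure overhead.
class TokenRoundTrip
{
    static long viaByteBuffer(long tokenValue)
    {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
        buf.putLong(tokenValue);
        buf.flip();
        return buf.getLong();  // decoded straight back to the same long
    }

    static long direct(long tokenValue)
    {
        return tokenValue;     // a symmetric method keeps the long as-is
    }
}
```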
The number of rows stored in an SSTable's SAI index was incorrect for analyzed indexes and for indexes on collections, as it was counting posting lists for each term. In non-analyzed, non-collection indexes there is one term per row, while in an analyzed index there can be many terms per row. This led to an incorrect row count. Fixes the counting of rows stored in the SSTable SAI index. Also includes minor fixes of code warnings in the affected files. (Rebase of commit 29ef34e)
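The bug can be demonstrated with a toy index. This is a hypothetical structure, not the on-disk SAI format: for an analyzed index a single row produces several terms, so summing posting-list sizes over-counts rows, while counting distinct row ids gives the correct figure.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model: term -> posting list of row ids.
class IndexRowCount
{
    static long countPostings(Map<String, List<Long>> index)
    {
        // Wrong for analyzed/collection indexes: counts each (term, row) pair.
        return index.values().stream().mapToLong(List::size).sum();
    }

    static long countRows(Map<String, List<Long>> index)
    {
        // Correct: each row is counted once, however many terms it produced.
        Set<Long> rows = new HashSet<>();
        index.values().forEach(rows::addAll);
        return rows.size();
    }
}
```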
The test for the KD-tree posting list was meant to run a query that performs skips and then verify that the skip metric has been bumped up. Unfortunately, there were several problems with it:
* the query didn't do skips, because the clause intersection got optimized out by the SAI optimizer;
* the test for the metric didn't inspect the metric value, but only checked that the metric was updated, so essentially it boiled down to checking that the index was used;
* the metric value incorrectly recorded queries with no skips, because the histogram was configured to not allow zeroes.

In addition, the usage of the index depends on the version of the optimizer, which made the test behave differently between the DC and EC index versions. Now that we disable the optimizer for this test, that difference is gone. Fixes riptano/cndb#13830 (Rebase of commit 17b4ad7)
The new request failure reason should be used when the on-disk format of an index in a replica doesn't support a requested feature. Also, rename org.apache.cassandra.index.sai.disk.format.Version.latest() to Version.current(), and add separate Version.CURRENT and Version.LATEST constants. (Rebase of commit 523435a)
(Rebase of commit dce2e0c)
(Rebase of commit 895ade3)
…clustering columns in descending order - port CASSANDRA-20295 (Rebase of commit f10abed)
The fix for CNDB-13591 was developed and only committed to the March 25 release branch:

commit e3dfead (HEAD -> cndb-13591-main, origin/cndb-13591-main)
Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
Date: Fri Apr 4 16:14:24 2025 +0200
CNDB-13591: Fix unit tests (missed commit in previous hotfix) (cherry picked from commit 7c51303)

commit b1f732e
Author: Enrico Olivelli <enrico.olivelli@datastax.com>
Date: Fri Apr 4 14:18:30 2025 +0200
CNDB-13591: Error while loading chunkcache: java.lang.IllegalArgumentException: Negative position (#1677) Co-authored-by: Branimir Lambov <branimir.lambov@datastax.com> (cherry picked from commit 09748bd)

commit 4ff4879
Author: Branimir Lambov <branimir.lambov@datastax.com>
Date: Fri Apr 4 14:42:26 2025 +0300
CNDB-13591: Go back to storing byte buffers in the cache (cherry picked from commit 7624a19)

---------
Co-authored-by: Branimir Lambov <branimir.lambov@datastax.com>
Co-authored-by: Jacek Lewandowski <lewandowski.jacek@gmail.com>
(Rebase of commit 8b21a2d)
…iew (#1700) Fixes: riptano/cndb#13693 The new design is to:
1. Make the SAI `View` reference-able so that it holds references to the underlying `SSTableIndex`es. When the `View` is released the final time, it releases the indexes. This moves all of the complexity of grabbing references to sstable update time and out of the query path, which seems like a generally good improvement.
2. Observe that `SSTableIndex` holds a reference to its associated sstable reader.

Note also that we need to create the memtable index on deletions in order to make the index-first solution work. Also note that sstable indexes are added before they are removed on flush, so this change in design is safe for flush. (Rebase of commit c42a2de)
(Rebase of commit 34ce5d9)
…#2176)
### What is the issue
After CNDB-16249, hints are written with the storage-compatible messaging version (e.g., VERSION_40), but dispatch was still using the peer's negotiated version (VERSION_DS_20). This mismatch forced hints to be decoded and re-encoded on dispatch, which fails when the tenant is unassigned because tables are unknown.
### What does this PR fix and why was it fixed
Cap the dispatch version to the storage compatibility mode's version so hints can use the encoded path (no deserialization) when the versions match. (Rebase of commit 7d8f28b)
…compatibility mode (#2179)
### What is the issue
Fix version mismatch error when dispatching encoded hints in storage compatibility mode.
### What does this PR fix and why was it fixed
Remove version checks in HintMessage.Encoded serialization that prevented dispatching hints written at the storage compatibility version (e.g., 12) when the peer connection uses a different messaging version (e.g., 110). UUID and VInt serialization are version-independent, and hint bytes are written verbatim, so the version check was unnecessarily restrictive. (Rebase of commit c36fbcf)
ULID-based SSTable ID generation can fail with an NPE when generating a new ID. The root cause is that the underlying ULID generator can return an empty Optional when the clock is moved backwards to before the previously generated ID, or in certain rare overflow conditions when timestamps collide. If this happens on our first time through the generation loop, we prematurely exit with a null newVal. Top of the error stack:

```
java.lang.NullPointerException
  at org.apache.cassandra.utils.TimeUUID.approximateFromULID(TimeUUID.java:58)
  at org.apache.cassandra.io.sstable.ULIDBasedSSTableId.<init>(ULIDBasedSSTableId.java:52)
  at org.apache.cassandra.io.sstable.ULIDBasedSSTableId$Builder.lambda$generator$0(ULIDBasedSSTableId.java:129)
```

This can cause a flush to fail. The fix is to continue looping until newVal gets a value. The loop can spin until the corrected time catches up to the time of the most recently used ULID generation ID; this should be a short duration in a healthy cluster without large time corrections from sync. Tests are added in ULIDBasedSSTableIdGeneratorTest. A package-protected constructor is introduced for ULIDBasedSSTableIdGeneratorTest.testGeneratorRetryOnEmptyOptional(). Cassandra applicability: upstream doesn't have ULIDBasedSSTableId (and won't, because of CASSANDRA-17048). (Rebase of commit e9afd3a)
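The retry described above can be sketched as follows. Names are hypothetical (the real code wraps a ULID library inside ULIDBasedSSTableId.Builder): instead of exiting the loop with a null value when the generator returns an empty Optional on the first pass, keep looping until a value is produced.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Sketch of the fix: never return until the source yields a value.
class RetryingIdGenerator
{
    static long nextId(Supplier<Optional<Long>> ulidSource)
    {
        Long newVal = null;
        while (newVal == null)                        // spin until the clock catches up
            newVal = ulidSource.get().orElse(null);   // empty Optional => retry, never NPE
        return newVal;
    }
}
```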
…ly (#2186)
### What is the issue
FSReadError extends IOError, so it was not caught by the catch block in filterCommitLogFiles(). This caused replayer failures when a remote commit log file was listed but no longer accessible.
### What does this PR fix and why was it fixed
Only NoSuchFileException is handled gracefully; other IOErrors (such as corruption) are re-thrown to fail the replay as intended. (Rebase of commit a51cdcc)
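The handling described above can be sketched like this. The method is hypothetical (the real logic lives in the commit log replayer's filtering): catch `IOError` (which also covers `FSReadError`), swallow it only when the root cause is a missing file, and rethrow anything else so corruption still fails the replay.

```java
import java.io.IOError;
import java.io.IOException;
import java.nio.file.NoSuchFileException;

// Sketch: graceful handling only for files that disappeared.
class CommitLogFilter
{
    static boolean isReadable(Runnable probe)
    {
        try
        {
            probe.run();
            return true;
        }
        catch (IOError e)
        {
            if (e.getCause() instanceof NoSuchFileException)
                return false;          // file no longer accessible: skip gracefully
            throw e;                   // corruption etc.: fail the replay as intended
        }
    }
}
```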
(Rebase of commit 3dccb8c)
…d check (#2213) The NoSpamLogger clock override was checking that accesses came from a single expected thread. This failed when internal threads accessed it during truncate in cleanupRepairTables(). Removed the thread check, since the purpose of the override is deterministic time, not thread enforcement. (Rebase of commit 94e3546)
PartitionAwarePrimaryKeyMap implemented an overcomplicated `ceiling` method calling `exactRowIdOrInvertedCeiling`. This commit simplifies PartitionAwarePrimaryKeyMap.ceiling to use the corresponding correct method from the reader directly. This can be seen as a follow-up to https://github.com/datastax/cassandra/pull/1096/files#diff-c5011580ab9b0d99d9e504570c4cccb152221d3dbe62c8a956e83fce9070b380

Rebase notes:
- commit 48b55fb was dropped because it only contains lint/refactor changes and should instead be upstreamed

(Rebase of commit e5c5d07)
The rationale behind this upgrade is to remediate CVE-2025-67735. BoringSSL has been upgraded as well, for consistency with 4.1.130.Final. CNDB PR: riptano/cndb#16481 (Rebase of commit 2e251a1)
The current version of `lz4-java` (1.8.0) has known vulnerabilities, namely [CVE-2025-12183](https://www.cve.org/CVERecord?id=CVE-2025-12183) and [CVE-2025-66566](https://www.cve.org/CVERecord?id=CVE-2025-66566). This patch updates the lz4-java library from 1.8.0 to 1.10.2. As of lz4-java 1.8.1, the artifact coordinates changed from `org.lz4` to `at.yawk.lz4`, so this patch also updates the artifact coordinates. (Rebase of commit af4abfb)
…d similar use the wrong endianness. The legacy_*_clust_be_index_summary tests are ignored (for now) for aa-cb versions.
CNDB-16350: Optimize ChronicleMap access, iteration to reduce serde cost
### What is the issue
Fixes: riptano/cndb#16350
### What does this PR fix and why was it fixed
ChronicleMap gives us several ways to use lower-level APIs to avoid deserializing keys/values and the associated allocation that comes with them. The first key thing to mention is that iteration is very expensive as these maps get big, so we want to avoid it if possible. The second is that if we use the typical map iteration methods, they deserialize the key and the value eagerly. Since the key is typically a high-dimensional vector, it is valuable to avoid such deserialization. This change:
* Removes unnecessary iteration, leveraging the fact that compaction is additive
* Replaces `forEach` with `forEachEntry`, which gives better semantics
* Updates the `maybeAddVector` method to avoid serializing the vector key twice by using the `searchContext`. The `ChronicleMap#put` method uses this pattern internally.

I added two sets of benchmarks; however, `VectorCompactionBench` doesn't seem to register the benefit of the ChronicleMap change, likely because ChronicleMap's cost isn't as expensive as graph construction. I am leaving `VectorCompactionBench` in place since it is still useful. Here are some of the benchmark results. They show between a 50x and 100x improvement, and the improvement seems to increase as we build larger graphs.

Benchmark results before the change:

```
Benchmark                                                  (dimension) (numVectors) Mode Cnt     Score     Error Units
V5VectorPostingsWriterBench.createGenericIdentityMapping           768       100000 avgt   5   271.569 ±   3.473 ms/op
V5VectorPostingsWriterBench.createGenericIdentityMapping           768      1000000 avgt   5  5452.393 ± 227.905 ms/op
V5VectorPostingsWriterBench.createGenericIdentityMapping          1536       100000 avgt   5  1392.607 ±  30.388 ms/op
V5VectorPostingsWriterBench.createGenericIdentityMapping          1536      1000000 avgt   5 11496.696 ± 345.886 ms/op
V5VectorPostingsWriterBench.describeForCompactionOneToMany         768       100000 avgt   5   242.049 ±  20.708 ms/op
V5VectorPostingsWriterBench.describeForCompactionOneToMany         768      1000000 avgt   5  2365.691 ±  84.173 ms/op
V5VectorPostingsWriterBench.describeForCompactionOneToMany        1536       100000 avgt   5   265.395 ±   4.167 ms/op
V5VectorPostingsWriterBench.describeForCompactionOneToMany        1536      1000000 avgt   5  3641.557 ± 130.649 ms/op
```

After the change:

```
Benchmark                                                  (dimension) (numVectors) Mode Cnt   Score   Error Units
V5VectorPostingsWriterBench.createGenericIdentityMapping           768       100000 avgt   5   5.721 ±  1.727 ms/op
V5VectorPostingsWriterBench.createGenericIdentityMapping           768      1000000 avgt   5 124.536 ± 22.464 ms/op
V5VectorPostingsWriterBench.createGenericIdentityMapping          1536       100000 avgt   5   5.662 ±  0.610 ms/op
V5VectorPostingsWriterBench.createGenericIdentityMapping          1536      1000000 avgt   5 122.671 ±  3.343 ms/op
V5VectorPostingsWriterBench.describeForCompactionOneToMany         768       100000 avgt   5   5.364 ±  1.194 ms/op
V5VectorPostingsWriterBench.describeForCompactionOneToMany         768      1000000 avgt   5 119.449 ±  4.809 ms/op
V5VectorPostingsWriterBench.describeForCompactionOneToMany        1536       100000 avgt   5   5.379 ±  0.552 ms/op
V5VectorPostingsWriterBench.describeForCompactionOneToMany        1536      1000000 avgt   5 121.293 ±  3.040 ms/op
```

Co-authored-by: Michael Marshall <michael.marshall@datastax.com>
…ows/tombstones (#2169) riptano/cndb#16180 Port into main-5.0 of commit 4fc5df6:

```
CNDB-15946: Revamp SAI metrics for fetched/returned keys/partitions/rows/tombstones (#2132)

The metrics partitionsRead, rowsFiltered, rowsPreFiltered and shadowedPrimaryKeyCount, which don't always work as expected, are replaced by these metrics:
* keysFetched: Number of partition/row keys that will be used to fetch rows from the base table.
* partitionsFetched: Number of live partitions fetched from the base table, before post-filtering and sorting. Note that currently ANN fetches one partition per index key, without any grouping of same-partition clusterings.
* partitionsReturned: Number of live partitions returned to the coordinator, after post-filtering and sorting.
* partitionTombstonesFetched: Number of deleted partitions found when reading the base table.
* rowsFetched: Number of live rows fetched from the base table, before post-filtering and sorting.
* rowsReturned: Number of live rows returned to the coordinator, after post-filtering and sorting.
* rowTombstonesFetched: Number of deleted rows or row ranges found when reading the base table.

StorageAttachedIndexSearcher is modified to use the command timestamp, rather than getting the current time every time a query timestamp is needed, which was possibly buggy and inefficient.
```

Snapshot SAI metrics when the query is done.
---------
Co-authored-by: Andrés de la Peña <adelapena@users.noreply.github.com>
Co-authored-by: Andrés de la Peña <a.penya.garcia@gmail.com>
…2230)
### What is the issue
CorruptedSSTablesCompactionsTest.testCorruptedSSTablesWithUnifiedCompactionStrategy-cdc is failing.
### What does this PR fix and why was it fixed
Adds VIntOutOfRangeException to the exceptions caught in SortedStringTableCursor.
### What is the issue
The pycode compliance test is failing due to trailing whitespace.
### What does this PR fix and why was it fixed
Removes the whitespace.
### What is the issue
When reading SSTables containing dropped columns with tuple types (or UDTs containing tuples), the column ordering is corrupted during bitmap deserialization.
### What does this PR fix and why was it fixed
Fixes dropped column handling with User-Defined Types (UDTs). Column ordering depends on isComplex(), which depends on the type's isMultiCell(). When a dropped column's type had a different isMultiCell() value in the schema vs the SSTable, column ordering became incorrect, causing bitmap decode errors and data corruption.
### What is the issue
Fixes #16793. CC5 doesn't understand CC4 system tables and generates a new host_id on upgrade.
### What does this PR fix and why was it fixed
Reads CC4's file-based node metadata stored in MessagePack format and converts it to CC5 LocalInfo and PeerInfo objects on first boot.
…void JUnit test timeouts. (#2121)
### What is the issue
CNDB-14023: ForceRepairTest fails sometimes in CI with a JUnit test timeout.
### What does this PR fix and why was it fixed
Adds a timeout while waiting for a node to be marked down, instead of waiting indefinitely and raising a JUnit timeout.
### What is the issue
UCS settings files are not dropped after the table gets dropped; instead, they are supposed to be cleared after a node restart. The cleanup is faulty, though, and it prevents the node from starting up.
Root cause: the cleanupControllerConfig() method in CompactionManager attempts to verify that a table exists by calling getColumnFamilyStore(). When the table has been dropped, this method throws IllegalArgumentException, which was not being caught. The existing catch block only handled NullPointerException (for a missing keyspace).
### What does this PR fix and why was it fixed
Extended the exception handler to catch both NullPointerException and IllegalArgumentException, allowing orphaned controller-config.JSON files to be properly identified and deleted during node restart. 5.0 counterpart of #2145.
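The extended handler has the following shape. This is a simplified sketch with hypothetical types (the real code is in CompactionManager.cleanupControllerConfig): a multi-catch treats either exception as "the table is gone", so the orphaned config file can be deleted.

```java
// Sketch: probe for table existence, treating NPE (missing keyspace) and
// IllegalArgumentException (dropped table) the same way.
class ControllerConfigCleanup
{
    interface TableLookup { void getColumnFamilyStore(String keyspace, String table); }

    static boolean tableExists(TableLookup lookup, String keyspace, String table)
    {
        try
        {
            lookup.getColumnFamilyStore(keyspace, table);
            return true;
        }
        catch (NullPointerException | IllegalArgumentException e)
        {
            return false; // missing keyspace (NPE) or dropped table (IAE): clean up the config
        }
    }
}
```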
Adds storage-compatibility guards for pre-5.0 mode around auth/role and system schema behavior, and defers incompatible changes to NONE.
### What is the issue
This file is missing, causing the loadCommitLogAndSSTablesWithDroppedColumnTestCC50 test to fail.
### What does this PR fix and why was it fixed
Adds the missing file.
❌ Build ds-cassandra-pr-gate/PR-2257 rejected by Butler
3 regressions found
Found 3 new test failures
Found 1 known test failure
The rebase on 5.0.7 has been done; working on fixing the last failures.



DO NOT MERGE
https://github.com/riptano/cndb/issues/16593
This will be force-pushed to `main-5.0` (and `main-5.0` will first be tagged `main-5.0.4`).
Pairs with: datastax/cassandra-dtest#94