[ENH] add MCMR property tests and fix dirty log detection #6835
Conversation
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving:
- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
Add MCMR Property Coverage and Harden Dirty-Log/Spanner Reliability Paths

This PR is a major enhancement focused on multi-region correctness and replication reliability. It introduces dedicated MCMR property tests and expands existing property tests to run across database modes, while updating CI/workflows to support reproducible cluster-based execution. The goal is to validate cross-region invariants more thoroughly and catch replication edge cases earlier. In parallel, the PR fixes dirty-log detection semantics in WAL manifest management and improves Spanner transaction robustness by adding retry behavior for abort/cancel scenarios and preserving gRPC status codes through error layers. It also adds regression tests for dirty-log scenarios, null-safe invariant handling in Python tests, structured WAL logging, and worker config updates.

This summary was automatically generated by @propel-code-bot
Review found minor, recommended fixes to improve reliability and test resource management without blocking the PR.
Status: Minor Suggestions | Risk: Medium
Issues Identified & Suggestions
- Handle and log reader iteration errors to avoid silent truncation: rust/log-service/src/lib.rs
- Close per-example HTTP clients to prevent leaked systems/connections: chromadb/test/property/test_add_mcmr.py
Review Details
📁 13 files reviewed | 💬 2 comments
rust/log-service/src/lib.rs
Outdated
```rust
async fn dump_collectible_state(server: &LogServer, db_name: &str) -> String {
    let mut dump = String::new();
    {
        let need_to_compact_repl = server.need_to_compact_repl.lock();
        dump.push_str("in-memory need_to_compact_repl:\n");
        for ((topology, collection_id), rollup) in need_to_compact_repl.iter() {
            dump.push_str(&format!(
                "  topology={topology} collection_id={collection_id} start={} limit={} reinsert_count={} initial_insertion_epoch_us={}\n",
                rollup.start_log_position.offset(),
                rollup.limit_log_position.offset(),
                rollup.reinsert_count,
                rollup.initial_insertion_epoch_us,
            ));
        }
    }

    let Some(topology_name) = DatabaseName::new(db_name)
        .and_then(|db| db.topology())
        .and_then(|topology| TopologyName::new(topology).ok())
```
[Reliability] In dump_collectible_state, the raw Spanner query is executed using topology.config.spanner.read_only_transaction(), but the transaction handle tx is never explicitly committed or closed; it is just dropped at end of scope. For read-only transactions in the Spanner client library this is typically fine, but the reader borrows from tx, so both must stay alive for the while let Ok(Some(row)) = reader.next().await loop. If reader.next() returns an Err variant, the loop exits silently and the remaining rows are discarded with no log entry. Consider changing the loop to:

```rust
while let Some(row) = reader.next().await.transpose() {
    match row { ... }
}
```

or at minimum logging the error variant so diagnostic dumps aren't silently truncated.
File: rust/log-service/src/lib.rs
Line: 5233

```python
    record_set: strategies.RecordSet,
    should_compact: bool,
) -> None:
    _test_add(collection, record_set, should_compact)
```
[Performance] test_add_small, test_add_medium, and test_add_large call _create_mcmr_clients() on every Hypothesis example. Each HttpClient() call creates a new System with a unique UUID identifier (see SharedSystemClient._get_identifier_from_settings), and these Systems accumulate in the static _identifier_to_system dict without cleanup. With max_examples=50 for test_add_small, that's 100 System instances with HTTP connections that are never released.
At minimum, add explicit cleanup at the end of _test_add() and test_add_large():

```python
client1.close()
client2.close()
```

This matches the cleanup pattern used elsewhere in the test suite (e.g., test_client.py calls clear_system_cache(), test_restart_persist.py calls system.stop()).
File: chromadb/test/property/test_add_mcmr.py
Line: 48
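One way to guarantee the suggested cleanup runs even when an example fails is a context manager around client creation. This is a hedged sketch: `mcmr_clients`, `make_clients`, and `FakeClient` are illustrative names, with `make_clients` standing in for `_create_mcmr_clients()`:

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Tuple


@contextmanager
def mcmr_clients(
    make_clients: Callable[[], Tuple[object, object]]
) -> Iterator[Tuple[object, object]]:
    """Yield a client pair and always close both, even on test failure."""
    client1, client2 = make_clients()
    try:
        yield client1, client2
    finally:
        # Release HTTP connections so Systems don't accumulate
        # across Hypothesis examples.
        for c in (client1, client2):
            close = getattr(c, "close", None)
            if close is not None:
                close()


# Demonstrate with a stand-in client type.
class FakeClient:
    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


with mcmr_clients(lambda: (FakeClient(), FakeClient())) as (c1, c2):
    pass  # test body would go here
assert c1.closed and c2.closed
```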
Solid enhancement overall, with one recommended testing reliability fix for asynchronous replication timing.
Status: Minor Suggestions | Risk: Medium
Issues Identified & Suggestions
- Add wait/replication check before cross-region invariant assertions when no compaction:
chromadb/test/property/test_add_mcmr.py
Review Details
📁 13 files reviewed | 💬 1 comments
```python
        coll,
        cast(strategies.RecordSet, normalized_record_set),
        n_results=n_results,
        embedding_function=collection.embedding_function,
        query_indices=list(
            range(i, min(i + batch_size, len(normalized_record_set["ids"])))
        ),
```
[Testing] After all batches are written (alternating between coll1 and coll2), the invariants are checked on both collections without waiting for compaction when should_compact=False. The invariants.count and invariants.ann_accuracy checks on coll2 for data written via coll1 (and vice versa) depend on cross-region replication having completed before the assertion. If replication is asynchronous and should_compact=False, there is no wait, meaning the test will assert on potentially un-replicated data. This will cause spurious failures or false-green results depending on timing.
The existing test_add.py tests use compaction waits specifically because the cluster needs time to propagate writes. In MCMR, cross-region propagation adds additional latency. Consider adding a minimum wait or compaction-version check even when should_compact=False.
File: chromadb/test/property/test_add_mcmr.py
Line: 205
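The minimum wait suggested above can be done generically by polling the remote side until it reports the expected state or a deadline passes. This is a hedged sketch; `wait_for_replication` and `get_count` are hypothetical names, with `get_count` standing in for something like `coll2.count()`:

```python
import time
from typing import Callable


def wait_for_replication(get_count: Callable[[], int], expected: int,
                         timeout_s: float = 30.0, poll_s: float = 0.05) -> None:
    """Poll until the remote region reports the expected count, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_count() == expected:
            return
        time.sleep(poll_s)
    raise TimeoutError(
        f"replication did not reach {expected} records within {timeout_s}s"
    )


# Simulate a region that becomes consistent after a few polls.
state = {"polls": 0}

def fake_count() -> int:
    state["polls"] += 1
    return 5 if state["polls"] >= 3 else 2

wait_for_replication(fake_count, expected=5, timeout_s=2.0, poll_s=0.01)
assert state["polls"] >= 3
```

Asserting on a polled condition rather than sleeping a fixed amount keeps the test fast when replication is quick and tolerant when it is slow.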
Review found one minor testing reliability suggestion: add runtime skip guards for MCMR-only tests to avoid CI connection failures in non-cluster environments.
Status: Minor Suggestions | Risk: Low
Issues Identified & Suggestions
- Add explicit MCMR runtime skip guard for non-cluster CI runs:
chromadb/test/property/test_add_mcmr.py
Review Details
📁 14 files reviewed | 💬 1 comments
```python
@given(
    collection=collection_st,
    record_set=strategies.recordsets(collection_st, min_size=1, max_size=500),
    should_compact=st.booleans(),
)
@settings(
    deadline=None,
    parent=override_hypothesis_profile(
        normal=hypothesis.settings(max_examples=50),
        fast=hypothesis.settings(max_examples=20),
    ),
)
def test_add_small(
    collection: strategies.Collection,
    record_set: strategies.RecordSet,
    should_compact: bool,
) -> None:
    _test_add(collection, record_set, should_compact)
```
[Testing] test_add_small and test_add_medium in this file take no client fixture and unconditionally call _create_mcmr_clients() which connects to localhost:8000 and localhost:8001. Unlike the tests in test_add.py, there is no guard checking NOT_CLUSTER_ONLY or skipping when running in a non-cluster environment. If these tests are collected in a CI environment without the MCMR ports forwarded (e.g., the standard Python property test matrix), they will fail with a connection error rather than being skipped.
The CI workflow excludes test_add_mcmr.py from the standard property test glob, but relies on the file-level exclusion rather than a runtime skip. Add an explicit skip guard at the top of the module or in _create_mcmr_clients() to fail fast with a clear message:

```python
def _create_mcmr_clients() -> Tuple[ClientAPI, ClientAPI]:
    if NOT_CLUSTER_ONLY:
        pytest.skip("MCMR tests require a running multi-region cluster (CHROMA_CLUSTER_TEST_ONLY=1)")
    ...
```
File: chromadb/test/property/test_add_mcmr.py
Line: 48
Strong MCMR and logging improvements, but key correctness and observability fixes are still needed before merge.
Status: Changes Suggested | Risk: Medium
Issues Identified & Suggestions
- Add final compaction wait to verify replication on small writes: chromadb/test/property/test_add_mcmr.py
- Use wall-clock insertion timestamp, not log offset, for first_log_ts: rust/log-service/src/lib.rs
Review Details
📁 15 files reviewed | 💬 2 comments
```python
)

normalized_record_set = invariants.wrap_all(record_set)
should_wait_for_compaction = not NOT_CLUSTER_ONLY and should_compact
```
[Testing] The database_name fixture parameterizes tests in test_add.py to run against both classic and MCMR databases, but _test_add in test_add_mcmr.py always creates its own isolated database via _create_isolated_database_mcmr and ignores any incoming database context. This is intentional by design, but the test_add_small and test_add_medium functions in this file accept no database_name parameter, so they will always run in the MCMR context regardless of CI configuration. The bigger concern: should_wait_for_compaction is computed as not NOT_CLUSTER_ONLY and should_compact, but the MCMR test never waits for compaction after all batches are written — it only waits mid-loop when records_since_compaction_wait >= MIN_RECORDS_BETWEEN_COMPACTION_WAITS. If the total record count is under MIN_RECORDS_BETWEEN_COMPACTION_WAITS (e.g., a single record from test_add_small with min_size=1) and should_compact=True, no compaction wait ever fires. The invariant checks then run without confirming the data was compacted and replicated, so the cross-region replication invariant is not actually verified.
Fix: after the loop, add a final wait if should_wait_for_compaction is true and compaction has not yet been awaited:

```python
if should_wait_for_compaction and records_since_compaction_wait > 0:
    current_version1 = wait_for_version_increase(
        client1, collection.name, current_version1
    )
    current_version2 = wait_for_version_increase(
        client2, collection.name, current_version2
    )
```
File: chromadb/test/property/test_add_mcmr.py
Line: 157
rust/log-service/src/lib.rs
Outdated
```rust
collection_id = %collection_id,
topology_name = ?topology_name,
first_log_offset = rollup.start_log_position.offset(),
first_log_ts = rollup.start_log_position.offset(),
```
[Maintainability] first_log_ts = rollup.start_log_position.offset() logs the offset as the timestamp, but first_log_ts semantically should represent a wall-clock timestamp in microseconds (matching the CollectionInfo.first_log_ts field). The rollup's start_log_position.offset() is a log offset, not a timestamp. The correct field would be rollup.initial_insertion_epoch_us for the wall-clock insertion time. This makes the log misleading for operators trying to correlate log service outputs with compactor scheduler logs.
Change to:

```rust
first_log_ts = rollup.initial_insertion_epoch_us,
```
File: rust/log-service/src/lib.rs
Line: 1555
Changes improve reliability and coverage, but two issues need follow-up: a dirty-log edge case at zero threshold and potential MCMR test flakiness from missing compaction waits.
Status: Changes Suggested | Risk: Medium
Issues Identified & Suggestions
- Zero threshold may mark uninitialized logs perpetually dirty; add guard or ban zero configs: rust/wal3/src/interfaces/repl/manifest_manager.rs
- MCMR property test may assert before compaction completes; force/document compaction wait: chromadb/test/property/test_add.py
Review Details
📁 15 files reviewed | 💬 2 comments
```python
    "TODO @jai, come back and debug why CI runners fail with async + sync"
)
_test_add(client, collection, record_set, True, always_compact=True)
_test_add(client, collection, record_set, True, always_compact=not MULTI_REGION_ENABLED)
```
[Testing] When MULTI_REGION_ENABLED is true, always_compact=False is passed to _test_add. The _test_add helper only forces a compaction wait when always_compact=True and no intermediate wait was triggered. With min_size=1, max_size=5, Hypothesis nearly always generates record sets smaller than MIN_RECORDS_BETWEEN_COMPACTION_WAITS=10, so no intermediate wait fires. This means test_add_miniscule will routinely check invariants before compaction completes in MCMR mode, producing flaky test results.
If there's a reason not to wait for compaction in MCMR, that should be documented. Otherwise consider passing always_compact=True unconditionally, or gating the MCMR database parameter on tests that can tolerate pre-compaction reads.
Line: 71
+1 ideally we also test compactions with mcmr
There was a problem hiding this comment.
We do. I went to great lengths in this PR to make them work. This is an outdated hallucination. The only time we will not compact with mcmr is when fewer than 10 records are inserted.
Action required: one critical test logic bug and two reliability/performance test-flow issues should be fixed before merge.
Status: Action Required | Risk: High
Issues Identified & Suggestions
- Fix undefined variable causing runtime NameError in add property test: chromadb/test/property/test_add.py
- Add retry/backoff for cross-region collection fetch race after create: chromadb/test/property/test_add_mcmr.py
- Move async skip before DB creation to prevent leaked resources: chromadb/test/property/test_add_mcmr.py
Review Details
📁 15 files reviewed | 💬 3 comments
```python
coll2 = client2.get_collection(
    name=collection.name,
    embedding_function=collection.embedding_function,
)
```
[Reliability] In _test_add (mcmr), coll2 is retrieved with client2.get_collection(...) immediately after client1.create_collection(...). In a multi-region setup with asynchronous replication, the collection may not yet exist on region 2 at the moment of the get_collection call, causing a NotFoundError. There is no retry or wait before this call. Either add a retry-with-backoff here, or create the collection independently on both clients (which avoids the race entirely for test isolation purposes).
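The retry-with-backoff option can be sketched generically. This is a hedged illustration; `get_with_retry` and `NotFoundError` are hypothetical names here, with `fetch` standing in for the `client2.get_collection(...)` call:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class NotFoundError(Exception):
    """Stand-in for the client's not-found error."""


def get_with_retry(fetch: Callable[[], T], attempts: int = 10,
                   base_delay_s: float = 0.1) -> T:
    """Retry `fetch` with exponential backoff while it raises NotFoundError."""
    for attempt in range(attempts):
        try:
            return fetch()
        except NotFoundError:
            if attempt == attempts - 1:
                raise  # give up: surface the real error
            time.sleep(base_delay_s * (2 ** attempt))
    raise AssertionError("unreachable")


# Simulate a collection that appears on region 2 after two failed fetches.
calls = {"n": 0}

def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise NotFoundError("collection not replicated yet")
    return "coll2"

assert get_with_retry(flaky_fetch, base_delay_s=0.001) == "coll2"
assert calls["n"] == 3
```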
File: chromadb/test/property/test_add_mcmr.py
Line: 154

```python
if (
    client1.get_settings().chroma_api_impl
    == "chromadb.api.async_fastapi.AsyncFastAPI"
):
    pytest.skip(
        "TODO @jai, come back and debug why CI runners fail with async + sync"
    )
```
[Performance] In test_add_large (mcmr), the async/sync skip check (lines 254-260) is placed after _create_mcmr_clients() and _create_isolated_database_mcmr() have already been called (lines 251-252). This means that on async runners, the database is created in Spanner before the test is skipped, leaking the resource. Move the pytest.skip check between _create_mcmr_clients() and _create_isolated_database_mcmr() — the check requires client1 to call get_settings(), but the database creation should not happen if the test will be skipped.
File: chromadb/test/property/test_add_mcmr.py
Line: 260
Review found one minor testing recommendation to improve Hypothesis stability in a property test.
Status: Minor Suggestions | Risk: Low
Issues Identified & Suggestions
- Add missing Hypothesis health-check suppression for shrinking reliability:
chromadb/test/property/test_add_mcmr.py
Review Details
📁 15 files reviewed | 💬 1 comments
```python
        fast=hypothesis.settings(max_examples=20),
    ),
)
def test_add_small(
```
[Testing] test_add_small and test_add_medium do not have the suppress_health_check=[..., hypothesis.HealthCheck.function_scoped_fixture, ...] setting. Because _test_add creates HTTP clients and an isolated database inside the test body (which Hypothesis treats as function-scoped state modified across shrinking attempts), Hypothesis may raise a HealthCheck.function_scoped_fixture error during shrinking. test_add_medium suppresses it but test_add_small does not. Add it to test_add_small's @settings.
File: chromadb/test/property/test_add_mcmr.py
Line: 43
Overall solid enhancement with one recommended test-coverage improvement for MCMR write-path validation.
Status: Minor Suggestions | Risk: Low
Issues Identified & Suggestions
- Single-batch cases may miss region2→region1 replication path; force both writes.
chromadb/test/property/test_add_mcmr.py
Review Details
📁 15 files reviewed | 💬 1 comments
```python
    create_batches(
        api=client1,
        ids=cast(List[str], record_set["ids"]),
        embeddings=cast(Embeddings, record_set["embeddings"]),
        metadatas=cast(Metadatas, record_set["metadatas"]),
        documents=cast(List[str], record_set["documents"]),
    )
)
for batch_index, batch in enumerate(batches):
    if batch_index % 2 == 0:
        coll1.add(*batch)
```
[Testing] The alternating write pattern (batch_index % 2 == 0 → coll1, else → coll2) means that if Hypothesis generates a record_set with only one batch (which is common for min_size=1), all writes go to coll1 and coll2 never receives any writes in that example. The cross-region invariant check on coll2 then only validates that an empty write on one side replicates zero records — it does not exercise the "data written to region 2 appears in region 1" path at all for those cases.
For single-batch record sets, consider always writing to both regions (split the batch), or writing entirely to coll2 for odd-indexed examples, to ensure both write paths are exercised.
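The suggested single-batch split can be sketched as a small assignment helper. This is a hedged illustration, not the test's actual code; `assign_batches` is a hypothetical name and batches are represented as id lists:

```python
from typing import List, Tuple


def assign_batches(batches: List[List[str]]) -> List[Tuple[int, List[str]]]:
    """Return (region, batch) pairs; a lone multi-record batch is split
    across both regions so both write paths are exercised."""
    if len(batches) == 1 and len(batches[0]) > 1:
        mid = len(batches[0]) // 2
        return [(1, batches[0][:mid]), (2, batches[0][mid:])]
    # Otherwise alternate regions as in the original pattern.
    return [(1 if i % 2 == 0 else 2, b) for i, b in enumerate(batches)]


# Single batch: split so region 2 receives writes too.
assert assign_batches([["a", "b", "c", "d"]]) == [(1, ["a", "b"]), (2, ["c", "d"])]
# Multiple batches: alternate as before.
assert assign_batches([["a"], ["b"], ["c"]]) == [(1, ["a"]), (2, ["b"]), (1, ["c"])]
```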
File: chromadb/test/property/test_add_mcmr.py
Line: 175
I'm fine with that as odds are we'll have two batches.
The PR is strong overall, but one important reliability concern was identified in retry classification.
Status: Changes Suggested | Risk: Medium
Issues Identified & Suggestions
- Reconsider retrying CANCELLED; treat as terminal unless explicitly safe.
rust/rust-sysdb/src/types.rs
Review Details
📁 18 files reviewed | 💬 1 comments
```rust
pub(crate) fn is_retryable_spanner_status(&self) -> bool {
    self.try_as().is_some_and(|status| {
        matches!(status.code(), tonic::Code::Aborted | tonic::Code::Cancelled)
```
[Reliability] is_retryable_spanner_status in rust/rust-sysdb/src/types.rs matches tonic::Code::Cancelled as retryable. Cancelled typically means the client or a deadline cancelled the RPC — retrying a cancelled request may not be appropriate and could cause unexpected duplicate side-effects if the original transaction partially committed before cancellation. Spanner's own guidance treats ABORTED as the standard retry signal for read-write transactions; CANCELLED usually indicates intentional cancellation by the caller. Consider whether Cancelled should be retried here or handled as a terminal error.
File: rust/rust-sysdb/src/types.rs
Line: 1376
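The suggested narrowing (treat only ABORTED as retryable) can be sketched in isolation. This is a hedged illustration: the `Code` enum below mimics `tonic::Code` rather than being the tonic type, and the function is a stand-alone stand-in for the method in question:

```rust
// Mimic of tonic::Code, limited to the variants discussed here.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Code {
    Aborted,
    Cancelled,
    DeadlineExceeded,
}

// Spanner's guidance treats ABORTED as the retry signal for read-write
// transactions; CANCELLED usually means the caller gave up, so retrying it
// risks duplicate side effects.
fn is_retryable_spanner_status(code: Code) -> bool {
    matches!(code, Code::Aborted)
}

fn main() {
    assert!(is_retryable_spanner_status(Code::Aborted));
    assert!(!is_retryable_spanner_status(Code::Cancelled));
    assert!(!is_retryable_spanner_status(Code::DeadlineExceeded));
    println!("ok");
}
```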
rescrv
left a comment
Gave a self review to cut most of the review code.
```rust
// Use zero threshold in integration tests to surface small repl-dirty changes
// immediately. The default threshold (100) hides small writes.
record_count_threshold: 0,
record_count_threshold: 1,
```
This is necessary for mcmr because 0 returns every collection.
```rust
    .with_max_times(4);
let db_name_for_log = req.name.clone().into_string();

// Use a read-write transaction to atomically check tenant, check database, and insert
```
This block is largely an indent.
```rust
if let Some(ref database_name) = filter.database_name {
    stmt.add_param("database_name", &database_name.as_ref());
}
let backoff = ConstantBuilder::default()
```
Another backoff loop getting indented.
```rust
let result = self
    .client
    .read_write_transaction::<(), SysDbError, _>(|tx| {
        let result = (|| {
```
Force-pushed from 4c35d83 to 18e0a5b
Introduce dedicated multi-cluster multi-region (MCMR) property tests that write records across two regions and verify cross-region replication invariants. Parameterize existing test_add tests by database name so they run against both classic and MCMR databases when multi-region is enabled.

Fix dirty log detection in the repl manifest manager:
- Treat intrinsic_cursor=0 as unset so newly initialized logs are not spuriously reported as dirty
- Initialize manifest_regions rows with intrinsic_cursor=0 instead of omitting the column, preventing GC from misinterpreting the starting offset as an active reader position
- Change the SQL threshold comparison from > to >= so logs at the exact threshold boundary are correctly included
- Add decode_intrinsic_cursor helper that filters out zero values

Retry Spanner-aborted flush compaction transactions:
- Add is_spanner_aborted() to SysDbError to detect gRPC Aborted status
- Wrap the flush_collection_compaction read-write transaction with backon retry (100ms delay, 4 attempts) on Spanner abort
- Propagate gRPC status codes through SysDbError and FlushCompactionError instead of mapping all Spanner errors to Internal

Improve dirty log test diagnostics by dumping in-memory and Spanner state on assertion failure. Add regression tests for single-add and three-add dirty log scenarios.

Co-authored-by: AI
This reverts commit 0eebd27.
Force-pushed from 18e0a5b to bf7c8e8
```shell
kubectl -n chroma port-forward svc/rust-log-service 50054:50051 &
kubectl -n chroma port-forward svc/query-service 50053:50051 &
kubectl -n chroma port-forward svc/rust-frontend-service 8000:8000 &
kubectl -n chroma2 port-forward svc/rust-frontend-service 8001:8000 &
```
curious what happens in single region case here. If chroma2 FE is not brought up then does this port forward fail?
If nothing tries to connect it will do nothing. When they try to connect it'll error.
```python
distance_function = distance_functions.l2

accuracy_threshold = 1e-6
assert collection.metadata is not None
```
curious what was the reason behind this change?
Because it was none. I made a change above to make sure it's always not none.
```python
parent=override_hypothesis_profile(
    normal=hypothesis.settings(max_examples=500),
    fast=hypothesis.settings(max_examples=200),
    normal=hypothesis.settings(max_examples=100),
```
are these to make ci faster?
Yes. It took 2 hours to run this test without cutting examples because we inflated things with mcmr and it's more than twice as expensive.
```python
    _test_add(client, collection, record_set, should_compact)


@multi_region_test
```
IIRC the point of this decorator was to enable selective tests to be mcmr. Are we now introducing a new way of doing this?
It wasn't doing the trick. Things were still going through the classic database. Adding a database_name as a fixture did it.
```python
)
def test_add_medium(
    client: ClientAPI,
    database_name: str,
```
nit: this param is not used anywhere below
It's a fixture. Not needed to be used to configure the test to use one database name or the other. It's implicit magic. I can document.
```python
)
def test_add_small(
    client: ClientAPI,
    database_name: str,
```
nit: this param is not used anywhere below
Reducing the number of examples might warrant a broader discussion with the team, because I remember we had increased it in the past and the increase caught quite a few failures.
Description of changes
Introduce dedicated multi-cluster multi-region (MCMR) property tests that write records across two regions and verify cross-region replication invariants. Parameterize existing test_add tests by database name so they run against both classic and MCMR databases when multi-region is enabled.
Fix dirty log detection in the repl manifest manager:
- Treat intrinsic_cursor=0 as unset so newly initialized logs are not spuriously reported as dirty
- Initialize manifest_regions rows with intrinsic_cursor=0 instead of omitting the column, preventing GC from misinterpreting the starting offset as an active reader position
- Change the SQL threshold comparison from > to >= so logs at the exact threshold boundary are correctly included
- Add decode_intrinsic_cursor helper that filters out zero values
Retry Spanner-aborted flush compaction transactions:
- Add is_spanner_aborted() to SysDbError to detect gRPC Aborted status
- Wrap the flush_collection_compaction read-write transaction with backon retry (100ms delay, 4 attempts) on Spanner abort
- Propagate gRPC status codes through SysDbError and FlushCompactionError instead of mapping all Spanner errors to Internal
Improve dirty log test diagnostics by dumping in-memory and Spanner state on assertion failure. Add regression tests for single-add and three-add dirty log scenarios.
Test plan
CI
Migration plan
N/A
Observability plan
N/A
Documentation Changes
N/A
Co-authored-by: AI