
[ENH] add MCMR property tests and fix dirty log detection#6835

Open
rescrv wants to merge 21 commits into main from rescrv/mcmr-tests

Conversation

@rescrv
Contributor

@rescrv rescrv commented Apr 6, 2026

Description of changes

Introduce dedicated multi-cluster multi-region (MCMR) property tests
that write records across two regions and verify cross-region
replication invariants. Parameterize existing test_add tests by
database name so they run against both classic and MCMR databases
when multi-region is enabled.

Fix dirty log detection in the repl manifest manager:

  • Treat intrinsic_cursor=0 as unset so newly initialized logs are
    not spuriously reported as dirty
  • Initialize manifest_regions rows with intrinsic_cursor=0 instead
    of omitting the column, preventing GC from misinterpreting the
    starting offset as an active reader position
  • Change the SQL threshold comparison from > to >= so logs at the
    exact threshold boundary are correctly included
  • Add decode_intrinsic_cursor helper that filters out zero values
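
The cursor-decoding semantics above can be sketched in Python (illustrative only; the actual helper lives in the Rust wal3 crate, and `is_dirty` here is a hypothetical name for the threshold check):

```python
def decode_intrinsic_cursor(raw):
    """Decode a stored intrinsic cursor, treating 0 as unset.

    Newly initialized logs store intrinsic_cursor=0; returning None for
    that sentinel keeps them from being reported as dirty and from being
    mistaken for an active reader position by GC.
    """
    if raw is None or raw == 0:
        return None
    return raw


def is_dirty(cursor, threshold):
    """A log is dirty once its cursor reaches the threshold (>=, not >)."""
    decoded = decode_intrinsic_cursor(cursor)
    return decoded is not None and decoded >= threshold
```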

Retry Spanner-aborted flush compaction transactions:

  • Add is_spanner_aborted() to SysDbError to detect gRPC Aborted
    status
  • Wrap the flush_collection_compaction read-write transaction with
    backon retry (100ms delay, 4 attempts) on Spanner abort
  • Propagate gRPC status codes through SysDbError and
    FlushCompactionError instead of mapping all Spanner errors to
    Internal
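
The retry policy described above (fixed 100ms delay, 4 attempts, retrying only on abort) can be illustrated with a generic Python sketch; `TransientAbort` and `run_txn` are stand-ins, not the actual Rust/backon code:

```python
import time


class TransientAbort(Exception):
    """Stand-in for a Spanner ABORTED gRPC status."""


def retry_on_abort(run_txn, attempts=4, delay_s=0.1):
    """Run run_txn, retrying on TransientAbort with a fixed delay.

    Mirrors the policy in the PR: only aborts are retried; any other
    error propagates immediately with its original status code.
    """
    for attempt in range(attempts):
        try:
            return run_txn()
        except TransientAbort:
            if attempt == attempts - 1:
                raise
            time.sleep(delay_s)
```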

Improve dirty log test diagnostics by dumping in-memory and Spanner
state on assertion failure. Add regression tests for single-add and
three-add dirty log scenarios.

Test plan

CI

Migration plan

N/A

Observability plan

N/A

Documentation Changes

N/A

Co-authored-by: AI

@github-actions

github-actions bot commented Apr 6, 2026

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

@propel-code-bot
Contributor

propel-code-bot bot commented Apr 6, 2026

Add MCMR Property Coverage and Harden Dirty-Log/Spanner Reliability Paths

This PR is a major enhancement focused on multi-region correctness and replication reliability. It introduces dedicated MCMR property tests and expands existing property tests to run across database modes, while updating CI/workflows to support reproducible cluster-based execution. The goal is to validate cross-region invariants more thoroughly and catch replication edge cases earlier.

In parallel, the PR fixes dirty-log detection semantics in WAL manifest management and improves Spanner transaction robustness by adding retry behavior for abort/cancel scenarios and preserving gRPC status codes through error layers. It also adds regression tests for dirty-log scenarios, null-safe invariant handling in Python tests, structured WAL logging, and worker config updates for record_count_threshold.

This summary was automatically generated by @propel-code-bot

Contributor

@propel-code-bot propel-code-bot bot left a comment


Review found minor, recommended fixes to improve reliability and test resource management without blocking the PR.

Status: Minor Suggestions | Risk: Medium

Issues Identified & Suggestions
  • Handle and log reader iteration errors to avoid silent truncation: rust/log-service/src/lib.rs
  • Close per-example HTTP clients to prevent leaked systems/connections: chromadb/test/property/test_add_mcmr.py
Review Details

📁 13 files reviewed | 💬 2 comments

👍 / 👎 individual comments to help improve reviews for you

Comment on lines +5215 to +5233
async fn dump_collectible_state(server: &LogServer, db_name: &str) -> String {
    let mut dump = String::new();
    {
        let need_to_compact_repl = server.need_to_compact_repl.lock();
        dump.push_str("in-memory need_to_compact_repl:\n");
        for ((topology, collection_id), rollup) in need_to_compact_repl.iter() {
            dump.push_str(&format!(
                " topology={topology} collection_id={collection_id} start={} limit={} reinsert_count={} initial_insertion_epoch_us={}\n",
                rollup.start_log_position.offset(),
                rollup.limit_log_position.offset(),
                rollup.reinsert_count,
                rollup.initial_insertion_epoch_us,
            ));
        }
    }

    let Some(topology_name) = DatabaseName::new(db_name)
        .and_then(|db| db.topology())
        .and_then(|topology| TopologyName::new(topology).ok())
Contributor


Recommended

[Reliability] In dump_collectible_state, the raw Spanner query is executed using topology.config.spanner.read_only_transaction() but the transaction handle tx is never explicitly committed or closed — it is just dropped at end of scope. For read-only transactions in the Spanner client library this is typically fine, but the reader borrows from tx and both must stay alive for the while let Ok(Some(row)) = reader.next().await loop. If reader.next() returns an Err variant, the loop exits silently and the remaining rows are discarded with no log entry. Consider changing the loop to:

while let Some(row) = reader.next().await.transpose() {
    match row { ... }
}

or at minimum logging the error variant so diagnostic dumps aren't silently truncated.


File: rust/log-service/src/lib.rs
Line: 5233

    record_set: strategies.RecordSet,
    should_compact: bool,
) -> None:
    _test_add(collection, record_set, should_compact)
Contributor


Recommended

[Performance] test_add_small, test_add_medium, and test_add_large call _create_mcmr_clients() on every Hypothesis example. Each HttpClient() call creates a new System with a unique UUID identifier (see SharedSystemClient._get_identifier_from_settings), and these Systems accumulate in the static _identifier_to_system dict without cleanup. With max_examples=50 for test_add_small, that's 100 System instances with HTTP connections that are never released.

At minimum, add explicit cleanup at the end of _test_add() and test_add_large():

client1.close()
client2.close()

This matches the cleanup pattern used elsewhere in the test suite (e.g., test_client.py calls clear_system_cache(), test_restart_persist.py calls system.stop()).


File: chromadb/test/property/test_add_mcmr.py
Line: 48


Contributor

@propel-code-bot propel-code-bot bot left a comment


Solid enhancement overall, with one recommended testing reliability fix for asynchronous replication timing.

Status: Minor Suggestions | Risk: Medium

Issues Identified & Suggestions
  • Add wait/replication check before cross-region invariant assertions when no compaction: chromadb/test/property/test_add_mcmr.py
Review Details

📁 13 files reviewed | 💬 1 comment


Comment on lines +199 to +205
            coll,
            cast(strategies.RecordSet, normalized_record_set),
            n_results=n_results,
            embedding_function=collection.embedding_function,
            query_indices=list(
                range(i, min(i + batch_size, len(normalized_record_set["ids"])))
            ),
Contributor


Recommended

[Testing] After all batches are written (alternating between coll1 and coll2), the invariants are checked on both collections without waiting for compaction when should_compact=False. The invariants.count and invariants.ann_accuracy checks on coll2 for data written via coll1 (and vice versa) depend on cross-region replication having completed before the assertion. If replication is asynchronous and should_compact=False, there is no wait, meaning the test will assert on potentially un-replicated data. This will cause spurious failures or false-green results depending on timing.

The existing test_add.py tests use compaction waits specifically because the cluster needs time to propagate writes. In MCMR, cross-region propagation adds additional latency. Consider adding a minimum wait or compaction-version check even when should_compact=False.


File: chromadb/test/property/test_add_mcmr.py
Line: 205
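
One way to implement the suggested wait is a small polling helper; `wait_for_replication`, its timeout values, and the `get_count` callback are assumptions for illustration, not part of the PR:

```python
import time


def wait_for_replication(get_count, expected_count, timeout_s=30.0, poll_s=0.5):
    """Poll get_count() until it reaches expected_count or the timeout expires.

    Returns True once the expected count is observed, False on timeout, so
    the caller can fail the test with a clear replication-lag message
    instead of a confusing invariant mismatch.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_count() >= expected_count:
            return True
        time.sleep(poll_s)
    return False
```

In the test, `get_count` would be something like `lambda: coll2.count()`, asserted before the `invariants.count` and `invariants.ann_accuracy` checks.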



Contributor

@propel-code-bot propel-code-bot bot left a comment


Review found one minor testing reliability suggestion: add runtime skip guards for MCMR-only tests to avoid CI connection failures in non-cluster environments.

Status: Minor Suggestions | Risk: Low

Issues Identified & Suggestions
  • Add explicit MCMR runtime skip guard for non-cluster CI runs: chromadb/test/property/test_add_mcmr.py
Review Details

📁 14 files reviewed | 💬 1 comment


Comment on lines +31 to +48
@given(
    collection=collection_st,
    record_set=strategies.recordsets(collection_st, min_size=1, max_size=500),
    should_compact=st.booleans(),
)
@settings(
    deadline=None,
    parent=override_hypothesis_profile(
        normal=hypothesis.settings(max_examples=50),
        fast=hypothesis.settings(max_examples=20),
    ),
)
def test_add_small(
    collection: strategies.Collection,
    record_set: strategies.RecordSet,
    should_compact: bool,
) -> None:
    _test_add(collection, record_set, should_compact)
Contributor


Recommended

[Testing] test_add_small and test_add_medium in this file take no client fixture and unconditionally call _create_mcmr_clients() which connects to localhost:8000 and localhost:8001. Unlike the tests in test_add.py, there is no guard checking NOT_CLUSTER_ONLY or skipping when running in a non-cluster environment. If these tests are collected in a CI environment without the MCMR ports forwarded (e.g., the standard Python property test matrix), they will fail with a connection error rather than being skipped.

The CI workflow excludes test_add_mcmr.py from the standard property test glob, but relies on the file-level exclusion rather than a runtime skip. Add an explicit skip guard at the top of the module or in _create_mcmr_clients() to fail fast with a clear message:

def _create_mcmr_clients() -> Tuple[ClientAPI, ClientAPI]:
    if NOT_CLUSTER_ONLY:
        pytest.skip("MCMR tests require a running multi-region cluster (CHROMA_CLUSTER_TEST_ONLY=1)")
    ...

File: chromadb/test/property/test_add_mcmr.py
Line: 48


Contributor

@propel-code-bot propel-code-bot bot left a comment


Strong MCMR and logging improvements, but key correctness and observability fixes are still needed before merge.

Status: Changes Suggested | Risk: Medium

Issues Identified & Suggestions
  • Add final compaction wait to verify replication on small writes: chromadb/test/property/test_add_mcmr.py
  • Use wall-clock insertion timestamp, not log offset, for first_log_ts: rust/log-service/src/lib.rs
Review Details

📁 15 files reviewed | 💬 2 comments


    )

    normalized_record_set = invariants.wrap_all(record_set)
    should_wait_for_compaction = not NOT_CLUSTER_ONLY and should_compact
Contributor


Important

[Testing] The database_name fixture parameterizes tests in test_add.py to run against both classic and MCMR databases, but _test_add in test_add_mcmr.py always creates its own isolated database via _create_isolated_database_mcmr and ignores any incoming database context. This is intentional by design, but the test_add_small and test_add_medium functions in this file accept no database_name parameter, so they will always run in the MCMR context regardless of CI configuration. The bigger concern: should_wait_for_compaction is computed as not NOT_CLUSTER_ONLY and should_compact, but the MCMR test never waits for compaction after all batches are written — it only waits mid-loop when records_since_compaction_wait >= MIN_RECORDS_BETWEEN_COMPACTION_WAITS. If the total record count is under MIN_RECORDS_BETWEEN_COMPACTION_WAITS (e.g., a single record from test_add_small with min_size=1) and should_compact=True, no compaction wait ever fires. The invariant checks then run without confirming the data was compacted and replicated, so the cross-region replication invariant is not actually verified.

Fix: after the loop, add a final wait if should_wait_for_compaction is true and compaction has not yet been awaited:

if should_wait_for_compaction and records_since_compaction_wait > 0:
    current_version1 = wait_for_version_increase(
        client1, collection.name, current_version1
    )
    current_version2 = wait_for_version_increase(
        client2, collection.name, current_version2
    )

File: chromadb/test/property/test_add_mcmr.py
Line: 157

        collection_id = %collection_id,
        topology_name = ?topology_name,
        first_log_offset = rollup.start_log_position.offset(),
        first_log_ts = rollup.start_log_position.offset(),
Contributor


Recommended

[Maintainability] first_log_ts = rollup.start_log_position.offset() logs the offset as the timestamp, but first_log_ts semantically should represent a wall-clock timestamp in microseconds (matching the CollectionInfo.first_log_ts field). The rollup's start_log_position.offset() is a log offset, not a timestamp. The correct field would be rollup.initial_insertion_epoch_us for the wall-clock insertion time. This makes the log misleading for operators trying to correlate log service outputs with compactor scheduler logs.

Change to:

first_log_ts = rollup.initial_insertion_epoch_us,

File: rust/log-service/src/lib.rs
Line: 1555



Contributor

@propel-code-bot propel-code-bot bot left a comment


Changes improve reliability and coverage, but two issues need follow-up: a dirty-log edge case at zero threshold and potential MCMR test flakiness from missing compaction waits.

Status: Changes Suggested | Risk: Medium

Issues Identified & Suggestions
  • Zero threshold may mark uninitialized logs perpetually dirty; add guard or ban zero configs: rust/wal3/src/interfaces/repl/manifest_manager.rs
  • MCMR property test may assert before compaction completes; force/document compaction wait: chromadb/test/property/test_add.py
Review Details

📁 15 files reviewed | 💬 2 comments


            "TODO @jai, come back and debug why CI runners fail with async + sync"
        )
    _test_add(client, collection, record_set, True, always_compact=True)
    _test_add(client, collection, record_set, True, always_compact=not MULTI_REGION_ENABLED)
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommended

[Testing] When MULTI_REGION_ENABLED is true, always_compact=False is passed to _test_add. The _test_add helper only forces a compaction wait when always_compact=True and no intermediate wait was triggered. With min_size=1, max_size=5, Hypothesis nearly always generates record sets smaller than MIN_RECORDS_BETWEEN_COMPACTION_WAITS=10, so no intermediate wait fires. This means test_add_miniscule will routinely check invariants before compaction completes in MCMR mode, producing flaky test results.

If there's a reason not to wait for compaction in MCMR, that should be documented. Otherwise consider passing always_compact=True unconditionally, or gating the MCMR database parameter on tests that can tolerate pre-compaction reads.


File: chromadb/test/property/test_add.py
Line: 71

Contributor


+1 ideally we also test compactions with mcmr

Contributor Author


We do. I went to great lengths in this PR to make them work. This is an outdated hallucination. The only time we will not compact with mcmr is when fewer than 10 records are inserted.

Contributor

@propel-code-bot propel-code-bot bot left a comment


Action required: one critical test logic bug and two reliability/performance test-flow issues should be fixed before merge.

Status: Action Required | Risk: High

Issues Identified & Suggestions
  • Fix undefined variable causing runtime NameError in add property test: chromadb/test/property/test_add.py
  • Add retry/backoff for cross-region collection fetch race after create: chromadb/test/property/test_add_mcmr.py
  • Move async skip before DB creation to prevent leaked resources: chromadb/test/property/test_add_mcmr.py
Review Details

📁 15 files reviewed | 💬 3 comments


Comment on lines +151 to +154
coll2 = client2.get_collection(
name=collection.name,
embedding_function=collection.embedding_function,
)
Contributor


Important

[Reliability] In _test_add (mcmr), coll2 is retrieved with client2.get_collection(...) immediately after client1.create_collection(...). In a multi-region setup with asynchronous replication, the collection may not yet exist on region 2 at the moment of the get_collection call, causing a NotFoundError. There is no retry or wait before this call. Either add a retry-with-backoff here, or create the collection independently on both clients (which avoids the race entirely for test isolation purposes).


File: chromadb/test/property/test_add_mcmr.py
Line: 154
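
A retry-with-backoff around the cross-region fetch might look like the following sketch; `NotFoundError` and `get_collection_with_retry` are placeholder names, not the client's actual API:

```python
import time


class NotFoundError(Exception):
    """Placeholder for the client's collection-not-found error."""


def get_collection_with_retry(get_collection, name, attempts=10, delay_s=0.5):
    """Fetch a collection, retrying while cross-region replication catches up.

    Retries only the not-found case; any other error propagates. The final
    attempt re-raises so the test still fails loudly if replication never
    delivers the collection.
    """
    for attempt in range(attempts):
        try:
            return get_collection(name)
        except NotFoundError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay_s)
```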

Comment on lines +254 to +260
if (
client1.get_settings().chroma_api_impl
== "chromadb.api.async_fastapi.AsyncFastAPI"
):
pytest.skip(
"TODO @jai, come back and debug why CI runners fail with async + sync"
)
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommended

[Performance] In test_add_large (mcmr), the async/sync skip check (lines 254-260) is placed after _create_mcmr_clients() and _create_isolated_database_mcmr() have already been called (lines 251-252). This means that on async runners, the database is created in Spanner before the test is skipped, leaking the resource. Move the pytest.skip check between _create_mcmr_clients() and _create_isolated_database_mcmr() — the check requires client1 to call get_settings(), but the database creation should not happen if the test will be skipped.


File: chromadb/test/property/test_add_mcmr.py
Line: 260
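
The suggested ordering can be sketched generically; `Skipped`, `setup_test_add_large`, and the callbacks are illustrative stand-ins for `pytest.skip`, `_create_mcmr_clients`, and `_create_isolated_database_mcmr`:

```python
class Skipped(Exception):
    """Stand-in for pytest.skip."""


def setup_test_add_large(create_clients, is_async, create_db):
    """Order setup so the skip fires before any database is created.

    The skip check needs the clients (to inspect settings), but must run
    before create_db so skipped examples leak no Spanner resources.
    """
    client1, client2 = create_clients()
    if is_async(client1):
        raise Skipped("async + sync combination not supported on CI")
    db = create_db(client1, client2)
    return client1, client2, db
```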

Contributor

@propel-code-bot propel-code-bot bot left a comment


Review found one minor testing recommendation to improve Hypothesis stability in a property test.

Status: Minor Suggestions | Risk: Low

Issues Identified & Suggestions
  • Add missing Hypothesis health-check suppression for shrinking reliability: chromadb/test/property/test_add_mcmr.py
Review Details

📁 15 files reviewed | 💬 1 comment


        fast=hypothesis.settings(max_examples=20),
    ),
)
def test_add_small(
Contributor


Recommended

[Testing] test_add_small and test_add_medium do not have the suppress_health_check=[..., hypothesis.HealthCheck.function_scoped_fixture, ...] setting. Because _test_add creates HTTP clients and an isolated database inside the test body (which Hypothesis treats as function-scoped state modified across shrinking attempts), Hypothesis may raise a HealthCheck.function_scoped_fixture error during shrinking. test_add_medium suppresses it but test_add_small does not. Add it to test_add_small's @settings.


File: chromadb/test/property/test_add_mcmr.py
Line: 43



Contributor

@propel-code-bot propel-code-bot bot left a comment


Overall solid enhancement with one recommended test-coverage improvement for MCMR write-path validation.

Status: Minor Suggestions | Risk: Low

Issues Identified & Suggestions
  • Single-batch cases may miss region2→region1 replication path; force both writes. chromadb/test/property/test_add_mcmr.py
Review Details

📁 15 files reviewed | 💬 1 comment


Comment on lines +165 to +175
create_batches(
api=client1,
ids=cast(List[str], record_set["ids"]),
embeddings=cast(Embeddings, record_set["embeddings"]),
metadatas=cast(Metadatas, record_set["metadatas"]),
documents=cast(List[str], record_set["documents"]),
)
)
for batch_index, batch in enumerate(batches):
if batch_index % 2 == 0:
coll1.add(*batch)
Contributor


Recommended

[Testing] The alternating write pattern (batch_index % 2 == 0 → coll1, else → coll2) means that if Hypothesis generates a record_set with only one batch (which is common for min_size=1), all writes go to coll1 and coll2 never receives any writes in that example. The cross-region invariant check on coll2 then only validates that an empty write on one side replicates zero records — it does not exercise the "data written to region 2 appears in region 1" path at all for those cases.

For single-batch record sets, consider always writing to both regions (split the batch), or writing entirely to coll2 for odd-indexed examples, to ensure both write paths are exercised.


File: chromadb/test/property/test_add_mcmr.py
Line: 175
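
The reviewer's suggestion can be sketched as a small batch-assignment helper. This is a hypothetical illustration, not code from the PR: `region1`/`region2` stand in for the two regional collections, and the single-batch split is the proposed fix.

```python
# Hypothetical sketch of the reviewer's suggestion: when Hypothesis
# generates a single batch, split it so both regions receive writes;
# otherwise keep the PR's alternating assignment.
from typing import List, Tuple


def assign_batches(batches: List[List[str]]) -> List[Tuple[str, List[str]]]:
    """Return (region, batch) pairs so both write paths are exercised."""
    if len(batches) == 1:
        batch = batches[0]
        mid = max(1, len(batch) // 2)
        # Split the lone batch so the region2 -> region1 replication
        # path is covered too (second half may be empty for 1 record).
        return [("region1", batch[:mid]), ("region2", batch[mid:])]
    return [
        ("region1" if i % 2 == 0 else "region2", batch)
        for i, batch in enumerate(batches)
    ]
```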

Contributor Author


I'm fine with that as odds are we'll have two batches.

Contributor

@propel-code-bot propel-code-bot bot left a comment


The PR is strong overall, but one important reliability concern was identified in retry classification.

Status: Changes Suggested | Risk: Medium

Issues Identified & Suggestions
  • Reconsider retrying CANCELLED; treat as terminal unless explicitly safe. rust/rust-sysdb/src/types.rs
Review Details

📁 18 files reviewed | 💬 1 comments


Comment on lines +1374 to +1376
pub(crate) fn is_retryable_spanner_status(&self) -> bool {
self.try_as().is_some_and(|status| {
matches!(status.code(), tonic::Code::Aborted | tonic::Code::Cancelled)
Contributor


Important

[Reliability] is_retryable_spanner_status in rust/rust-sysdb/src/types.rs matches tonic::Code::Cancelled as retryable. Cancelled typically means the client or a deadline cancelled the RPC — retrying a cancelled request may not be appropriate and could cause unexpected duplicate side-effects if the original transaction partially committed before cancellation. Spanner's own guidance treats ABORTED as the standard retry signal for read-write transactions; CANCELLED usually indicates intentional cancellation by the caller. Consider whether Cancelled should be retried here or handled as a terminal error.


File: rust/rust-sysdb/src/types.rs
Line: 1376
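
The classification the reviewer recommends can be modeled in a few lines. This is an illustrative Python analogue, not the PR's Rust code: the `Code` enum mirrors gRPC status codes, and only `ABORTED` — Spanner's standard retry signal for read-write transactions — is treated as retryable, while `CANCELLED` is terminal.

```python
# Illustrative model of the suggested retry classification: retry on
# ABORTED only; CANCELLED usually reflects deliberate caller/deadline
# cancellation and is treated as terminal. Names mirror gRPC codes.
from enum import Enum


class Code(Enum):
    OK = 0
    CANCELLED = 1
    ABORTED = 10
    INTERNAL = 13


def is_retryable_spanner_status(code: Code) -> bool:
    # Only ABORTED signals a safe transaction retry.
    return code is Code.ABORTED
```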

Contributor

@propel-code-bot propel-code-bot bot left a comment


Review found no issues; changes appear reliable, well-tested, and low risk.

Status: No Issues Found | Risk: Low

Review Details

📁 16 files reviewed | 💬 0 comments

Contributor Author

@rescrv rescrv left a comment


Gave a self review to cut most of the review code.

// Use zero threshold in integration tests to surface small repl-dirty changes
// immediately. The default threshold (100) hides small writes.
record_count_threshold: 0,
record_count_threshold: 1,
Contributor Author


This is necessary for mcmr because 0 returns every collection.

.with_max_times(4);
let db_name_for_log = req.name.clone().into_string();

// Use a read-write transaction to atomically check tenant, check database, and insert
Contributor Author


This block is largely an indent.

if let Some(ref database_name) = filter.database_name {
stmt.add_param("database_name", &database_name.as_ref());
}
let backoff = ConstantBuilder::default()
Contributor Author


Another backoff loop getting indented.
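
The retry wrapper described in the PR (backon `ConstantBuilder`, 100 ms delay, 4 attempts, retry only on Spanner abort) can be sketched as a rough Python analogue. The function names and error-classification callable here are placeholders for illustration, not the actual Rust API:

```python
# Rough Python analogue of the backon constant-backoff retry described
# in the PR: constant delay, up to `max_times` retries, retrying only
# when the error is classified as a Spanner abort.
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_abort(
    txn: Callable[[], T],
    is_spanner_aborted: Callable[[Exception], bool],
    delay_s: float = 0.1,
    max_times: int = 4,
) -> T:
    attempt = 0
    while True:
        try:
            return txn()
        except Exception as err:
            attempt += 1
            if attempt > max_times or not is_spanner_aborted(err):
                raise  # non-retryable error, or retries exhausted
            time.sleep(delay_s)
```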

let result = self
.client
.read_write_transaction::<(), SysDbError, _>(|tx| {
let result = (|| {
Contributor Author


Another backoff/retry.

@rescrv rescrv requested a review from tanujnay112 April 8, 2026 17:49
@rescrv rescrv changed the base branch from rescrv/partial_manifest_read to main April 9, 2026 17:42
@rescrv rescrv force-pushed the rescrv/mcmr-tests branch from 4c35d83 to 18e0a5b Compare April 9, 2026 17:42
Introduce dedicated multi-cluster multi-region (MCMR) property tests
that write records across two regions and verify cross-region
replication invariants.  Parameterize existing test_add tests by
database name so they run against both classic and MCMR databases
when multi-region is enabled.

Fix dirty log detection in the repl manifest manager:
- Treat intrinsic_cursor=0 as unset so newly initialized logs are
  not spuriously reported as dirty
- Initialize manifest_regions rows with intrinsic_cursor=0 instead
  of omitting the column, preventing GC from misinterpreting the
  starting offset as an active reader position
- Change the SQL threshold comparison from > to >= so logs at the
  exact threshold boundary are correctly included
- Add decode_intrinsic_cursor helper that filters out zero values

Retry Spanner-aborted flush compaction transactions:
- Add is_spanner_aborted() to SysDbError to detect gRPC Aborted
  status
- Wrap the flush_collection_compaction read-write transaction with
  backon retry (100ms delay, 4 attempts) on Spanner abort
- Propagate gRPC status codes through SysDbError and
  FlushCompactionError instead of mapping all Spanner errors to
  Internal

Improve dirty log test diagnostics by dumping in-memory and Spanner
state on assertion failure.  Add regression tests for single-add and
three-add dirty log scenarios.

Co-authored-by: AI
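
The dirty-log fix in the commit message above can be modeled as a toy sketch. The names are hypothetical (the real helper lives in the Rust repl manifest manager): a zero `intrinsic_cursor` decodes to "unset", and the threshold check uses `>=` so a log exactly at the boundary counts as dirty.

```python
# Toy model of the dirty-log fix: intrinsic_cursor == 0 is treated as
# "never set" so a freshly initialized log is not reported dirty, and
# the threshold comparison is >= so boundary logs are included.
from typing import Optional


def decode_intrinsic_cursor(raw: Optional[int]) -> Optional[int]:
    """Filter out zero/absent values: 0 means the cursor was never set."""
    if raw is None or raw == 0:
        return None
    return raw


def is_dirty(uncompacted_records: int, threshold: int) -> bool:
    # >= rather than > so a log exactly at the threshold is picked up.
    return uncompacted_records >= threshold
```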
@rescrv rescrv force-pushed the rescrv/mcmr-tests branch from 18e0a5b to bf7c8e8 Compare April 9, 2026 23:07
@rescrv rescrv requested review from sanketkedia and removed request for tanujnay112 April 9, 2026 23:34
kubectl -n chroma port-forward svc/rust-log-service 50054:50051 &
kubectl -n chroma port-forward svc/query-service 50053:50051 &
kubectl -n chroma port-forward svc/rust-frontend-service 8000:8000 &
kubectl -n chroma2 port-forward svc/rust-frontend-service 8001:8000 &
Contributor


curious what happens in single region case here. If chroma2 FE is not brought up then does this port forward fail?

Contributor Author


If nothing tries to connect it will do nothing. When they try to connect it'll error.

distance_function = distance_functions.l2

accuracy_threshold = 1e-6
assert collection.metadata is not None
Contributor


curious what was the reason behind this change?

Contributor Author


Because it was none. I made a change above to make sure it's always not none.

parent=override_hypothesis_profile(
normal=hypothesis.settings(max_examples=500),
fast=hypothesis.settings(max_examples=200),
normal=hypothesis.settings(max_examples=100),
Contributor


are these to make ci faster?

Contributor Author


Yes. It took 2 hours to run this test without cutting examples because we inflated things with mcmr and it's more than twice as expensive.

_test_add(client, collection, record_set, should_compact)


@multi_region_test
Contributor


IIRC the point of this decorator was to enable selective tests to be mcmr. Are we now introducing a new way of doing this?

Contributor Author


It wasn't doing the trick. Things were still going through the classic database. Adding a database_name as a fixture did it.

)
def test_add_medium(
client: ClientAPI,
database_name: str,
Contributor


nit: this param is not used anywhere below

Contributor Author


It's a fixture. It doesn't need to be used in the body: requesting it configures the test to use one database name or the other. It's implicit magic. I can document.

)
def test_add_small(
client: ClientAPI,
database_name: str,
Contributor


nit: this param is not used anywhere below

Contributor Author


Same as the nit above.

@sanketkedia
Contributor

Reducing the number of examples might warrant a broader discussion with the team because I remember we had increased it in the past and it caught quite a few failures because of the increase
