[ENH] add MCMR property tests and fix dirty log detection #6835
Conversation
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving:
- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
Add MCMR Property Coverage and Harden Dirty-Log/Spanner Reliability Paths

This PR is a major enhancement focused on multi-region correctness and replication reliability. It introduces dedicated MCMR property tests and expands existing property tests to run across database modes, while updating CI/workflows to support reproducible cluster-based execution. The goal is to validate cross-region invariants more thoroughly and catch replication edge cases earlier. In parallel, the PR fixes dirty-log detection semantics in WAL manifest management and improves Spanner transaction robustness by adding retry behavior for abort/cancel scenarios and preserving gRPC status codes through error layers. It also adds regression tests for dirty-log scenarios, null-safe invariant handling in Python tests, structured WAL logging, and worker config updates.

This summary was automatically generated by @propel-code-bot
Review found minor, recommended fixes to improve reliability and test resource management without blocking the PR.
Status: Minor Suggestions | Risk: Medium
Issues Identified & Suggestions
- Handle and log reader iteration errors to avoid silent truncation: rust/log-service/src/lib.rs
- Close per-example HTTP clients to prevent leaked systems/connections: chromadb/test/property/test_add_mcmr.py
Review Details
📁 13 files reviewed | 💬 2 comments
rust/log-service/src/lib.rs
Outdated
```rust
async fn dump_collectible_state(server: &LogServer, db_name: &str) -> String {
    let mut dump = String::new();
    {
        let need_to_compact_repl = server.need_to_compact_repl.lock();
        dump.push_str("in-memory need_to_compact_repl:\n");
        for ((topology, collection_id), rollup) in need_to_compact_repl.iter() {
            dump.push_str(&format!(
                "  topology={topology} collection_id={collection_id} start={} limit={} reinsert_count={} initial_insertion_epoch_us={}\n",
                rollup.start_log_position.offset(),
                rollup.limit_log_position.offset(),
                rollup.reinsert_count,
                rollup.initial_insertion_epoch_us,
            ));
        }
    }

    let Some(topology_name) = DatabaseName::new(db_name)
        .and_then(|db| db.topology())
        .and_then(|topology| TopologyName::new(topology).ok())
```
[Reliability] In dump_collectible_state, the raw Spanner query is executed using topology.config.spanner.read_only_transaction(), but the transaction handle tx is never explicitly committed or closed; it is just dropped at end of scope. For read-only transactions in the Spanner client library this is typically fine, but the reader borrows from tx, so both must stay alive for the while let Ok(Some(row)) = reader.next().await loop. If reader.next() returns an Err variant, the loop exits silently and the remaining rows are discarded with no log entry. Consider changing the loop to:

```rust
while let Some(row) = reader.next().await.transpose() {
    match row { ... }
}
```

or at minimum logging the error variant so diagnostic dumps aren't silently truncated.
File: rust/log-service/src/lib.rs
Line: 5233

```python
    record_set: strategies.RecordSet,
    should_compact: bool,
) -> None:
    _test_add(collection, record_set, should_compact)
```
[Performance] test_add_small, test_add_medium, and test_add_large call _create_mcmr_clients() on every Hypothesis example. Each HttpClient() call creates a new System with a unique UUID identifier (see SharedSystemClient._get_identifier_from_settings), and these Systems accumulate in the static _identifier_to_system dict without cleanup. With max_examples=50 for test_add_small, that's 100 System instances with HTTP connections that are never released.
At minimum, add explicit cleanup at the end of _test_add() and test_add_large():

```python
client1.close()
client2.close()
```

This matches the cleanup pattern used elsewhere in the test suite (e.g., test_client.py calls clear_system_cache(), test_restart_persist.py calls system.stop()).
File: chromadb/test/property/test_add_mcmr.py
Line: 48
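One way to guarantee the suggested cleanup runs even when an example fails is a context manager around client creation. This is a hedged sketch: `mcmr_clients`, `make_clients`, and `FakeClient` are illustrative names, with `make_clients` standing in for `_create_mcmr_clients()`:

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Tuple


@contextmanager
def mcmr_clients(
    make_clients: Callable[[], Tuple[object, object]]
) -> Iterator[Tuple[object, object]]:
    """Yield a client pair and always close both, even on test failure."""
    client1, client2 = make_clients()
    try:
        yield client1, client2
    finally:
        # Release HTTP connections so Systems don't accumulate
        # across Hypothesis examples.
        for c in (client1, client2):
            close = getattr(c, "close", None)
            if close is not None:
                close()


# Demonstrate with a stand-in client type.
class FakeClient:
    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


with mcmr_clients(lambda: (FakeClient(), FakeClient())) as (c1, c2):
    pass  # test body would go here
assert c1.closed and c2.closed
```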
Solid enhancement overall, with one recommended testing reliability fix for asynchronous replication timing.
Status: Minor Suggestions | Risk: Medium
Issues Identified & Suggestions
- Add wait/replication check before cross-region invariant assertions when no compaction:
chromadb/test/property/test_add_mcmr.py
Review Details
📁 13 files reviewed | 💬 1 comments
```python
        coll,
        cast(strategies.RecordSet, normalized_record_set),
        n_results=n_results,
        embedding_function=collection.embedding_function,
        query_indices=list(
            range(i, min(i + batch_size, len(normalized_record_set["ids"])))
        ),
```
[Testing] After all batches are written (alternating between coll1 and coll2), the invariants are checked on both collections without waiting for compaction when should_compact=False. The invariants.count and invariants.ann_accuracy checks on coll2 for data written via coll1 (and vice versa) depend on cross-region replication having completed before the assertion. If replication is asynchronous and should_compact=False, there is no wait, meaning the test will assert on potentially un-replicated data. This will cause spurious failures or false-green results depending on timing.
The existing test_add.py tests use compaction waits specifically because the cluster needs time to propagate writes. In MCMR, cross-region propagation adds additional latency. Consider adding a minimum wait or compaction-version check even when should_compact=False.
File: chromadb/test/property/test_add_mcmr.py
Line: 205
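The minimum wait suggested above can be done generically by polling the remote side until it reports the expected state or a deadline passes. This is a hedged sketch; `wait_for_replication` and `get_count` are hypothetical names, with `get_count` standing in for something like `coll2.count()`:

```python
import time
from typing import Callable


def wait_for_replication(get_count: Callable[[], int], expected: int,
                         timeout_s: float = 30.0, poll_s: float = 0.05) -> None:
    """Poll until the remote region reports the expected count, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_count() == expected:
            return
        time.sleep(poll_s)
    raise TimeoutError(
        f"replication did not reach {expected} records within {timeout_s}s"
    )


# Simulate a region that becomes consistent after a few polls.
state = {"polls": 0}

def fake_count() -> int:
    state["polls"] += 1
    return 5 if state["polls"] >= 3 else 2

wait_for_replication(fake_count, expected=5, timeout_s=2.0, poll_s=0.01)
assert state["polls"] >= 3
```

Asserting on a polled condition rather than sleeping a fixed amount keeps the test fast when replication is quick and tolerant when it is slow.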
Review found one minor testing reliability suggestion: add runtime skip guards for MCMR-only tests to avoid CI connection failures in non-cluster environments.
Status: Minor Suggestions | Risk: Low
Issues Identified & Suggestions
- Add explicit MCMR runtime skip guard for non-cluster CI runs:
chromadb/test/property/test_add_mcmr.py
Review Details
📁 14 files reviewed | 💬 1 comments
```python
@given(
    collection=collection_st,
    record_set=strategies.recordsets(collection_st, min_size=1, max_size=500),
    should_compact=st.booleans(),
)
@settings(
    deadline=None,
    parent=override_hypothesis_profile(
        normal=hypothesis.settings(max_examples=50),
        fast=hypothesis.settings(max_examples=20),
    ),
)
def test_add_small(
    collection: strategies.Collection,
    record_set: strategies.RecordSet,
    should_compact: bool,
) -> None:
    _test_add(collection, record_set, should_compact)
```
[Testing] test_add_small and test_add_medium in this file take no client fixture and unconditionally call _create_mcmr_clients() which connects to localhost:8000 and localhost:8001. Unlike the tests in test_add.py, there is no guard checking NOT_CLUSTER_ONLY or skipping when running in a non-cluster environment. If these tests are collected in a CI environment without the MCMR ports forwarded (e.g., the standard Python property test matrix), they will fail with a connection error rather than being skipped.
The CI workflow excludes test_add_mcmr.py from the standard property test glob, but relies on the file-level exclusion rather than a runtime skip. Add an explicit skip guard at the top of the module or in _create_mcmr_clients() to fail fast with a clear message:

```python
def _create_mcmr_clients() -> Tuple[ClientAPI, ClientAPI]:
    if NOT_CLUSTER_ONLY:
        pytest.skip("MCMR tests require a running multi-region cluster (CHROMA_CLUSTER_TEST_ONLY=1)")
    ...
```
File: chromadb/test/property/test_add_mcmr.py
Line: 48
Strong MCMR and logging improvements, but key correctness and observability fixes are still needed before merge.
Status: Changes Suggested | Risk: Medium
Issues Identified & Suggestions
- Add final compaction wait to verify replication on small writes: chromadb/test/property/test_add_mcmr.py
- Use wall-clock insertion timestamp, not log offset, for first_log_ts: rust/log-service/src/lib.rs
Review Details
📁 15 files reviewed | 💬 2 comments
```python
)

normalized_record_set = invariants.wrap_all(record_set)
should_wait_for_compaction = not NOT_CLUSTER_ONLY and should_compact
```
[Testing] The database_name fixture parameterizes tests in test_add.py to run against both classic and MCMR databases, but _test_add in test_add_mcmr.py always creates its own isolated database via _create_isolated_database_mcmr and ignores any incoming database context. This is intentional by design, but the test_add_small and test_add_medium functions in this file accept no database_name parameter, so they will always run in the MCMR context regardless of CI configuration. The bigger concern: should_wait_for_compaction is computed as not NOT_CLUSTER_ONLY and should_compact, but the MCMR test never waits for compaction after all batches are written — it only waits mid-loop when records_since_compaction_wait >= MIN_RECORDS_BETWEEN_COMPACTION_WAITS. If the total record count is under MIN_RECORDS_BETWEEN_COMPACTION_WAITS (e.g., a single record from test_add_small with min_size=1) and should_compact=True, no compaction wait ever fires. The invariant checks then run without confirming the data was compacted and replicated, so the cross-region replication invariant is not actually verified.
Fix: after the loop, add a final wait if should_wait_for_compaction is true and compaction has not yet been awaited:

```python
if should_wait_for_compaction and records_since_compaction_wait > 0:
    current_version1 = wait_for_version_increase(
        client1, collection.name, current_version1
    )
    current_version2 = wait_for_version_increase(
        client2, collection.name, current_version2
    )
```
File: chromadb/test/property/test_add_mcmr.py
Line: 157
rust/log-service/src/lib.rs
Outdated
```rust
collection_id = %collection_id,
topology_name = ?topology_name,
first_log_offset = rollup.start_log_position.offset(),
first_log_ts = rollup.start_log_position.offset(),
```
[Maintainability] first_log_ts = rollup.start_log_position.offset() logs the offset as the timestamp, but first_log_ts semantically should represent a wall-clock timestamp in microseconds (matching the CollectionInfo.first_log_ts field). The rollup's start_log_position.offset() is a log offset, not a timestamp. The correct field would be rollup.initial_insertion_epoch_us for the wall-clock insertion time. This makes the log misleading for operators trying to correlate log service outputs with compactor scheduler logs.
Change to:

```rust
first_log_ts = rollup.initial_insertion_epoch_us,
```
File: rust/log-service/src/lib.rs
Line: 1555
Changes improve reliability and coverage, but two issues need follow-up: a dirty-log edge case at zero threshold and potential MCMR test flakiness from missing compaction waits.
Status: Changes Suggested | Risk: Medium
Issues Identified & Suggestions
- Zero threshold may mark uninitialized logs perpetually dirty; add guard or ban zero configs: rust/wal3/src/interfaces/repl/manifest_manager.rs
- MCMR property test may assert before compaction completes; force/document compaction wait: chromadb/test/property/test_add.py
Review Details
📁 15 files reviewed | 💬 2 comments
```python
    "TODO @jai, come back and debug why CI runners fail with async + sync"
)
_test_add(client, collection, record_set, True, always_compact=True)
_test_add(client, collection, record_set, True, always_compact=not MULTI_REGION_ENABLED)
```
[Testing] When MULTI_REGION_ENABLED is true, always_compact=False is passed to _test_add. The _test_add helper only forces a compaction wait when always_compact=True and no intermediate wait was triggered. With min_size=1, max_size=5, Hypothesis nearly always generates record sets smaller than MIN_RECORDS_BETWEEN_COMPACTION_WAITS=10, so no intermediate wait fires. This means test_add_miniscule will routinely check invariants before compaction completes in MCMR mode, producing flaky test results.
If there's a reason not to wait for compaction in MCMR, that should be documented. Otherwise consider passing always_compact=True unconditionally, or gating the MCMR database parameter on tests that can tolerate pre-compaction reads.
Line: 71
+1 ideally we also test compactions with mcmr
There was a problem hiding this comment.
We do. I went to great lengths in this PR to make them work. This is an outdated hallucination. The only time we will not compact with mcmr is when fewer than 10 records are inserted.
Action required: one critical test logic bug and two reliability/performance test-flow issues should be fixed before merge.
Status: Action Required | Risk: High
Issues Identified & Suggestions
- Fix undefined variable causing runtime NameError in add property test: chromadb/test/property/test_add.py
- Add retry/backoff for cross-region collection fetch race after create: chromadb/test/property/test_add_mcmr.py
- Move async skip before DB creation to prevent leaked resources: chromadb/test/property/test_add_mcmr.py
Review Details
📁 15 files reviewed | 💬 3 comments
```python
coll2 = client2.get_collection(
    name=collection.name,
    embedding_function=collection.embedding_function,
)
```
[Reliability] In _test_add (mcmr), coll2 is retrieved with client2.get_collection(...) immediately after client1.create_collection(...). In a multi-region setup with asynchronous replication, the collection may not yet exist on region 2 at the moment of the get_collection call, causing a NotFoundError. There is no retry or wait before this call. Either add a retry-with-backoff here, or create the collection independently on both clients (which avoids the race entirely for test isolation purposes).
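The retry-with-backoff option can be sketched generically. This is a hedged illustration; `get_with_retry` and `NotFoundError` are hypothetical names here, with `fetch` standing in for the `client2.get_collection(...)` call:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class NotFoundError(Exception):
    """Stand-in for the client's not-found error."""


def get_with_retry(fetch: Callable[[], T], attempts: int = 10,
                   base_delay_s: float = 0.1) -> T:
    """Retry `fetch` with exponential backoff while it raises NotFoundError."""
    for attempt in range(attempts):
        try:
            return fetch()
        except NotFoundError:
            if attempt == attempts - 1:
                raise  # give up: surface the real error
            time.sleep(base_delay_s * (2 ** attempt))
    raise AssertionError("unreachable")


# Simulate a collection that appears on region 2 after two failed fetches.
calls = {"n": 0}

def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise NotFoundError("collection not replicated yet")
    return "coll2"

assert get_with_retry(flaky_fetch, base_delay_s=0.001) == "coll2"
assert calls["n"] == 3
```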
File: chromadb/test/property/test_add_mcmr.py
Line: 154

```python
if (
    client1.get_settings().chroma_api_impl
    == "chromadb.api.async_fastapi.AsyncFastAPI"
):
    pytest.skip(
        "TODO @jai, come back and debug why CI runners fail with async + sync"
    )
```
[Performance] In test_add_large (mcmr), the async/sync skip check (lines 254-260) is placed after _create_mcmr_clients() and _create_isolated_database_mcmr() have already been called (lines 251-252). This means that on async runners, the database is created in Spanner before the test is skipped, leaking the resource. Move the pytest.skip check between _create_mcmr_clients() and _create_isolated_database_mcmr() — the check requires client1 to call get_settings(), but the database creation should not happen if the test will be skipped.
File: chromadb/test/property/test_add_mcmr.py
Line: 260
Review found one minor testing recommendation to improve Hypothesis stability in a property test.
Status: Minor Suggestions | Risk: Low
Issues Identified & Suggestions
- Add missing Hypothesis health-check suppression for shrinking reliability:
chromadb/test/property/test_add_mcmr.py
Review Details
📁 15 files reviewed | 💬 1 comments
```python
        fast=hypothesis.settings(max_examples=20),
    ),
)
def test_add_small(
```
[Testing] test_add_small and test_add_medium do not have the suppress_health_check=[..., hypothesis.HealthCheck.function_scoped_fixture, ...] setting. Because _test_add creates HTTP clients and an isolated database inside the test body (which Hypothesis treats as function-scoped state modified across shrinking attempts), Hypothesis may raise a HealthCheck.function_scoped_fixture error during shrinking. test_add_medium suppresses it but test_add_small does not. Add it to test_add_small's @settings.
File: chromadb/test/property/test_add_mcmr.py
Line: 43
Overall solid enhancement with one recommended test-coverage improvement for MCMR write-path validation.
Status: Minor Suggestions | Risk: Low
Issues Identified & Suggestions
- Single-batch cases may miss region2→region1 replication path; force both writes.
chromadb/test/property/test_add_mcmr.py
Review Details
📁 15 files reviewed | 💬 1 comments
```python
    create_batches(
        api=client1,
        ids=cast(List[str], record_set["ids"]),
        embeddings=cast(Embeddings, record_set["embeddings"]),
        metadatas=cast(Metadatas, record_set["metadatas"]),
        documents=cast(List[str], record_set["documents"]),
    )
)
for batch_index, batch in enumerate(batches):
    if batch_index % 2 == 0:
        coll1.add(*batch)
```
[Testing] The alternating write pattern (batch_index % 2 == 0 → coll1, else → coll2) means that if Hypothesis generates a record_set with only one batch (which is common for min_size=1), all writes go to coll1 and coll2 never receives any writes in that example. The cross-region invariant check on coll2 then only validates that an empty write on one side replicates zero records — it does not exercise the "data written to region 2 appears in region 1" path at all for those cases.
For single-batch record sets, consider always writing to both regions (split the batch), or writing entirely to coll2 for odd-indexed examples, to ensure both write paths are exercised.
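The suggested single-batch split can be sketched as a small assignment helper. This is a hedged illustration, not the test's actual code; `assign_batches` is a hypothetical name and batches are represented as id lists:

```python
from typing import List, Tuple


def assign_batches(batches: List[List[str]]) -> List[Tuple[int, List[str]]]:
    """Return (region, batch) pairs; a lone multi-record batch is split
    across both regions so both write paths are exercised."""
    if len(batches) == 1 and len(batches[0]) > 1:
        mid = len(batches[0]) // 2
        return [(1, batches[0][:mid]), (2, batches[0][mid:])]
    # Otherwise alternate regions as in the original pattern.
    return [(1 if i % 2 == 0 else 2, b) for i, b in enumerate(batches)]


# Single batch: split so region 2 receives writes too.
assert assign_batches([["a", "b", "c", "d"]]) == [(1, ["a", "b"]), (2, ["c", "d"])]
# Multiple batches: alternate as before.
assert assign_batches([["a"], ["b"], ["c"]]) == [(1, ["a"]), (2, ["b"]), (1, ["c"])]
```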
File: chromadb/test/property/test_add_mcmr.py
Line: 175
I'm fine with that as odds are we'll have two batches.
The PR is strong overall, but one important reliability concern was identified in retry classification.
Status: Changes Suggested | Risk: Medium
Issues Identified & Suggestions
- Reconsider retrying CANCELLED; treat as terminal unless explicitly safe.
rust/rust-sysdb/src/types.rs
Review Details
📁 18 files reviewed | 💬 1 comments
```rust
pub(crate) fn is_retryable_spanner_status(&self) -> bool {
    self.try_as().is_some_and(|status| {
        matches!(status.code(), tonic::Code::Aborted | tonic::Code::Cancelled)
```
[Reliability] is_retryable_spanner_status in rust/rust-sysdb/src/types.rs matches tonic::Code::Cancelled as retryable. Cancelled typically means the client or a deadline cancelled the RPC — retrying a cancelled request may not be appropriate and could cause unexpected duplicate side-effects if the original transaction partially committed before cancellation. Spanner's own guidance treats ABORTED as the standard retry signal for read-write transactions; CANCELLED usually indicates intentional cancellation by the caller. Consider whether Cancelled should be retried here or handled as a terminal error.
File: rust/rust-sysdb/src/types.rs
Line: 1376
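The suggested narrowing (treat only ABORTED as retryable) can be sketched in isolation. This is a hedged illustration: the `Code` enum below mimics `tonic::Code` rather than being the tonic type, and the function is a stand-alone stand-in for the method in question:

```rust
// Mimic of tonic::Code, limited to the variants discussed here.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Code {
    Aborted,
    Cancelled,
    DeadlineExceeded,
}

// Spanner's guidance treats ABORTED as the retry signal for read-write
// transactions; CANCELLED usually means the caller gave up, so retrying it
// risks duplicate side effects.
fn is_retryable_spanner_status(code: Code) -> bool {
    matches!(code, Code::Aborted)
}

fn main() {
    assert!(is_retryable_spanner_status(Code::Aborted));
    assert!(!is_retryable_spanner_status(Code::Cancelled));
    assert!(!is_retryable_spanner_status(Code::DeadlineExceeded));
    println!("ok");
}
```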
rescrv
left a comment
Gave a self review to cut most of the review code.
```rust
// Use zero threshold in integration tests to surface small repl-dirty changes
// immediately. The default threshold (100) hides small writes.
record_count_threshold: 0,
record_count_threshold: 1,
```
This is necessary for mcmr because 0 returns every collection.
```rust
    .with_max_times(4);
let db_name_for_log = req.name.clone().into_string();

// Use a read-write transaction to atomically check tenant, check database, and insert
```
This block is largely an indent.
```rust
if let Some(ref database_name) = filter.database_name {
    stmt.add_param("database_name", &database_name.as_ref());
}
let backoff = ConstantBuilder::default()
```
Another backoff loop getting indented.
```rust
let result = self
    .client
    .read_write_transaction::<(), SysDbError, _>(|tx| {
        let result = (|| {
```
Force-pushed from 4c35d83 to 18e0a5b
Introduce dedicated multi-cluster multi-region (MCMR) property tests that write records across two regions and verify cross-region replication invariants. Parameterize existing test_add tests by database name so they run against both classic and MCMR databases when multi-region is enabled.

Fix dirty log detection in the repl manifest manager:
- Treat intrinsic_cursor=0 as unset so newly initialized logs are not spuriously reported as dirty
- Initialize manifest_regions rows with intrinsic_cursor=0 instead of omitting the column, preventing GC from misinterpreting the starting offset as an active reader position
- Change the SQL threshold comparison from > to >= so logs at the exact threshold boundary are correctly included
- Add decode_intrinsic_cursor helper that filters out zero values

Retry Spanner-aborted flush compaction transactions:
- Add is_spanner_aborted() to SysDbError to detect gRPC Aborted status
- Wrap the flush_collection_compaction read-write transaction with backon retry (100ms delay, 4 attempts) on Spanner abort
- Propagate gRPC status codes through SysDbError and FlushCompactionError instead of mapping all Spanner errors to Internal

Improve dirty log test diagnostics by dumping in-memory and Spanner state on assertion failure. Add regression tests for single-add and three-add dirty log scenarios.

Co-authored-by: AI
This reverts commit 0eebd27.
Force-pushed from 18e0a5b to bf7c8e8
```shell
kubectl -n chroma port-forward svc/rust-log-service 50054:50051 &
kubectl -n chroma port-forward svc/query-service 50053:50051 &
kubectl -n chroma port-forward svc/rust-frontend-service 8000:8000 &
kubectl -n chroma2 port-forward svc/rust-frontend-service 8001:8000 &
```
curious what happens in single region case here. If chroma2 FE is not brought up then does this port forward fail?
If nothing tries to connect it will do nothing. When they try to connect it'll error.
```python
distance_function = distance_functions.l2

accuracy_threshold = 1e-6
assert collection.metadata is not None
```
curious what was the reason behind this change?
Because it was none. I made a change above to make sure it's always not none.
```python
parent=override_hypothesis_profile(
    normal=hypothesis.settings(max_examples=500),
    fast=hypothesis.settings(max_examples=200),
    normal=hypothesis.settings(max_examples=100),
```
are these to make ci faster?
Yes. It took 2 hours to run this test without cutting examples because we inflated things with mcmr and it's more than twice as expensive.
```python
    _test_add(client, collection, record_set, should_compact)


@multi_region_test
```
IIRC the point of this decorator was to enable selective tests to be mcmr. Are we now introducing a new way of doing this?
It wasn't doing the trick. Things were still going through the classic database. Adding a database_name as a fixture did it.
```python
)
def test_add_medium(
    client: ClientAPI,
    database_name: str,
```
nit: this param is not used anywhere below
It's a fixture. Not needed to be used to configure the test to use one database name or the other. It's implicit magic. I can document.
```python
)
def test_add_small(
    client: ClientAPI,
    database_name: str,
```
nit: this param is not used anywhere below
Reducing the number of examples might warrant a broader discussion with the team, because I remember we had increased it in the past and the increase caught quite a few failures.
Description of changes
Introduce dedicated multi-cluster multi-region (MCMR) property tests that write records across two regions and verify cross-region replication invariants. Parameterize existing test_add tests by database name so they run against both classic and MCMR databases when multi-region is enabled.
Fix dirty log detection in the repl manifest manager:
- Treat intrinsic_cursor=0 as unset so newly initialized logs are not spuriously reported as dirty
- Initialize manifest_regions rows with intrinsic_cursor=0 instead of omitting the column, preventing GC from misinterpreting the starting offset as an active reader position
- Change the SQL threshold comparison from > to >= so logs at the exact threshold boundary are correctly included
- Add decode_intrinsic_cursor helper that filters out zero values
Retry Spanner-aborted flush compaction transactions:
- Add is_spanner_aborted() to SysDbError to detect gRPC Aborted status
- Wrap the flush_collection_compaction read-write transaction with backon retry (100ms delay, 4 attempts) on Spanner abort
- Propagate gRPC status codes through SysDbError and FlushCompactionError instead of mapping all Spanner errors to Internal
Improve dirty log test diagnostics by dumping in-memory and Spanner state on assertion failure. Add regression tests for single-add and three-add dirty log scenarios.
Test plan
CI
Migration plan
N/A
Observability plan
N/A
Documentation Changes
N/A
Co-authored-by: AI