refactor(storage): clean up RAG storage abstraction and startup migrations#359
refactor(storage): clean up RAG storage abstraction and startup migrations#359sqhyz55 wants to merge 21 commits into
Conversation
rogercloud
left a comment
There was a problem hiding this comment.
Review Summary
The storage abstraction refactoring is well-structured and clean. Moving cascade logic into lancedb_cascade_vector_plane.py, extracting migration logic from app.py, and delegating to contract methods consistently improves separation of concerns. All existing tests pass.
Test Coverage Regression
test_list_candidates.py was reduced from 10 tests to 5. The removed tests covered code paths that still exist in list_candidates.py:
test_state_filter—list_candidates.py:314:if state is not Nonefiltertest_limit_filter—list_candidates.py:327:candidates = candidates[:limit]truncationtest_sort_before_limit—list_candidates.py:321-324:order_bysorting before limittest_model_tag_filter_for_embeddings— partially covered bytest_embed_candidates_require_model_tag, but the actual passing ofmodel_tagtostore.list_version_candidates()is no longer verified
These tests should be re-added, adapted to the new mock pattern using Mock() for get_vector_index_store.
2c49b0a to
abfac9f
Compare
rogercloud
left a comment
There was a problem hiding this comment.
Re-review Summary
Three issues from the previous review are still not addressed, and one new bug was found.
Still Not Fixed (inline)
- Missing
test_state_filter—list_candidates.py:314-315hasif state is not None:filter logic with zero test coverage. - Missing
test_limit_filter—list_candidates.py:327-328hascandidates[:limit]truncation with zero test coverage. - Missing
test_sort_before_limit—list_candidates.py:321-324hasorder_bysort logic with zero test coverage.
Fixed
test_model_tag_filter_for_embeddings— ✅ Resolved.test_embed_candidates_from_store_rows:158-162now verifiesmodel_tagis passed tostore.list_version_candidates().
New Bug (inline)
- Migration service background task gating skips documents backfill —
_should_start_background_task()only checks for user_id field migration (viacheck_table_needs_migration). When all tables already haveuser_idcolumns, it returnsFalse, sostart_background_migrationsreturnsNoneand_check_and_backfill_documents_tablenever executes — even when the documents table has emptyfile_idvalues orNULLuser_idvalues that need backfilling.
Suggestion
Add the three missing test cases to test_list_candidates.py using the new mock pattern (Mock() + patch("get_vector_index_store")). For the migration service, either extend _should_start_background_task to check for documents backfill need as well, or run the documents backfill check independently of the user_id migration gate.
d53354c to
8bc2f18
Compare
rogercloud
left a comment
There was a problem hiding this comment.
Please address the resource leak in the new FTS-by-model helper before merging.
open_embeddings_table returns a live handle used only to resolve table_name before search_* reopens by name; release it in finally to avoid LanceDB handle leaks (PR xorbitsai#359 review).
rogercloud
left a comment
There was a problem hiding this comment.
Submitting validated review finding 1/5.
rogercloud
left a comment
There was a problem hiding this comment.
Submitting validated review finding 2/5.
rogercloud
left a comment
There was a problem hiding this comment.
Submitting validated review finding 3/5.
rogercloud
left a comment
There was a problem hiding this comment.
Submitting validated review finding 4/5.
rogercloud
left a comment
There was a problem hiding this comment.
Submitting validated review finding 5/5.
…-A1) Remove raw LanceDB document queries from kb_file_service; use list_document_records with DocumentRecord.collection for multi-table scans. Add web unit test asserting delegation without LanceDB connection helpers.
… store (PR-A2) Expose VectorIndexStore.search_fts / search_fts_by_model with tenant filters; implement in LanceDBVectorIndexStore via query_to_list. Refactor search_sparse and sync substring fallback to use iter_batches; update retrieval tests to mock the store contract instead of raw table.search FTS.
… flows (PR-A3) Move predicate planning/deletion mechanics to a dedicated LanceDB vector-plane module, and add VectorIndexStore cascade APIs for adapter-level ownership. Keep cascade_cleaner focused on domain predicate construction and orchestration.
…tore (PR-A4) Add list_version_candidates to the storage contract and implement it in the LanceDB adapter. Update list_candidates to consume contract-returned rows instead of raw connection/table APIs, and migrate tests to contract-level mocks.
Extract storage migration checks and backfill orchestration from app startup into RAGStorageMigrationService, keeping app.py backend-agnostic at the boundary. Add focused service tests for background task scheduling and auto-migrate gating.
Remove LanceDB-specific collection_metadata schema/DDL logic from collection_manager and route table initialization through MetadataStore.ensure_collection_metadata_table. Add a manager test to verify metadata DDL delegation behavior.
Avoid mypy union-attr false positives for LanceDB table handles in background document backfill checks by using Optional[Any].
…lures Propagate VersionManagementError for invalid step/model inputs in list_candidates, while wrapping only backend query failures as DatabaseOperationError. Improve migration service logging by separating missing-table debug cases from real backfill failures and add focused tests for both paths.
Clean up leftover branch edits by removing unused imports and normalizing line wrapping in service and test files for consistent formatting.
…ability Optimize sparse fallback and cascade predicate construction to avoid unnecessary LanceDB table opens, add explicit handle close in iter_batches to reduce FD pressure, and improve structured logging for FTS/cascade partial failures.
Only create startup migration background tasks when migration is actually needed, and align tests with the new contract and stricter access filtering behavior. This also restores sparse fallback table resolution compatibility and hardens warning-log assertions to avoid flaky failures.
… coverage Add contract-level tests for state filtering, limit truncation, sort-before-limit behavior, and embed model_tag forwarding so the storage-abstraction path keeps the same behavioral guarantees after refactoring.
…tsai#362) Add user_id/is_admin to async vector-search contract and pass them through search_vectors_by_model_async into the LanceDB adapter filter builder. Add regression tests to verify tenant scope forwarding for both convenience and adapter async APIs.
Add direct unit coverage for sync vector/fts/batch/count APIs, ingestion status store sync+async paths, and sparse async retrieval fallback/error behavior to close previously identified storage-layer test gaps.
Move DatabaseOperationError to module-level imports in contracts, replace cross-module private helper usage with a public merge-error classifier, and extract a shared Arrow-table conversion helper to remove duplicated async search code.
Restore startup behavior to always create the background migration task under LANCEDB_AUTO_MIGRATE so documents backfill checks still run even when user_id schema migration is not required.
Extend VectorIndexStore.list_document_records with an optional file_ids parameter so callers can retrieve document records for specific uploaded files without constructing backend-specific filter SQL. Co-authored-by: Cursor <cursoragent@cursor.com>
Rewrite aggregate_uploaded_file_statuses, _load_indexed_doc_refs, and reconcile_uploaded_files in kb_file_service to use the storage abstraction layer instead of direct LanceDB calls. This removes all raw LanceDB imports (get_connection_from_env, ensure_documents_table, query_to_list, build_lancedb_filter_expression, escape_lancedb_string, UserPermissions, list_embeddings_table_names, _safe_close_table) from the web service layer. The new list_document_records file_ids parameter replaces hand-built file_id IN SQL filters, and aggregate_document_counts replaces direct table scans for the indexed-artifact fallback check. Also fix auto_migrate variable lost during rebase in app.py startup.
open_embeddings_table returns a live handle used only to resolve table_name before search_* reopens by name; release it in finally to avoid LanceDB handle leaks (PR xorbitsai#359 review).
…migrations Restore per-embeddings-table cascade predicates, pin route collection in sparse fallback scans, forward user scope through async FTS, batch unbounded file-status document lookups, and always schedule migration compatibility checks while keeping backfill gated by LANCEDB_AUTO_MIGRATE.
9502e3d to
d4204a1
Compare
…y task When LANCEDB_AUTO_MIGRATE is false, startup still schedules a compatibility check task but must not run backfill_all; update the integration assertion accordingly.
No description provided.