Skip to content

refactor(storage): clean up RAG storage abstraction and startup migrations#359

Open
sqhyz55 wants to merge 21 commits into
xorbitsai:mainfrom
sqhyz55:feat/phase1a-storage-abstraction-cleanup
Open

refactor(storage): clean up RAG storage abstraction and startup migrations#359
sqhyz55 wants to merge 21 commits into
xorbitsai:mainfrom
sqhyz55:feat/phase1a-storage-abstraction-cleanup

Conversation

@sqhyz55
Copy link
Copy Markdown
Collaborator

@sqhyz55 sqhyz55 commented May 8, 2026

No description provided.

Copy link
Copy Markdown
Collaborator

@rogercloud rogercloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

The storage abstraction refactoring is well-structured and clean. Moving cascade logic into lancedb_cascade_vector_plane.py, extracting migration logic from app.py, and delegating to contract methods consistently improves separation of concerns. All existing tests pass.

Test Coverage Regression

test_list_candidates.py was reduced from 10 tests to 5. The removed tests covered code paths that still exist in list_candidates.py:

  • test_state_filterlist_candidates.py:314: if state is not None filter
  • test_limit_filterlist_candidates.py:327: candidates = candidates[:limit] truncation
  • test_sort_before_limitlist_candidates.py:321-324: order_by sorting before limit
  • test_model_tag_filter_for_embeddings — partially covered by test_embed_candidates_require_model_tag, but the actual passing of model_tag to store.list_version_candidates() is no longer verified

These tests should be re-added, adapted to the new mock pattern using Mock() for get_vector_index_store.

Copy link
Copy Markdown
Collaborator

@rogercloud rogercloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review Summary

Three issues from the previous review are still not addressed, and one new bug was found.

Still Not Fixed (inline)

  1. Missing test_state_filterlist_candidates.py:314-315 has if state is not None: filter logic with zero test coverage.
  2. Missing test_limit_filterlist_candidates.py:327-328 has candidates[:limit] truncation with zero test coverage.
  3. Missing test_sort_before_limitlist_candidates.py:321-324 has order_by sort logic with zero test coverage.

Fixed

  1. test_model_tag_filter_for_embeddings — ✅ Resolved. test_embed_candidates_from_store_rows:158-162 now verifies model_tag is passed to store.list_version_candidates().

New Bug (inline)

  1. Migration service background task gating skips documents backfill_should_start_background_task() only checks for user_id field migration (via check_table_needs_migration). When all tables already have user_id columns, it returns False, so start_background_migrations returns None and _check_and_backfill_documents_table never executes — even when the documents table has empty file_id values or NULL user_id values that need backfilling.

Suggestion

Add the three missing test cases to test_list_candidates.py using the new mock pattern (Mock() + patch("get_vector_index_store")). For the migration service, either extend _should_start_background_task to check for documents backfill need as well, or run the documents backfill check independently of the user_id migration gate.

Comment thread src/xagent/web/services/rag_storage_migration_service.py Outdated
@sqhyz55 sqhyz55 force-pushed the feat/phase1a-storage-abstraction-cleanup branch from d53354c to 8bc2f18 Compare May 11, 2026 11:28
@qinxuye qinxuye changed the title Feat/phase1a storage abstraction cleanup Refactor RAG storage abstraction and startup migrations May 13, 2026
@qinxuye qinxuye changed the title Refactor RAG storage abstraction and startup migrations refactor(storage): clean up RAG storage abstraction and startup migrations May 13, 2026
Copy link
Copy Markdown
Contributor

@qinxuye qinxuye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Collaborator

@rogercloud rogercloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address the resource leak in the new FTS-by-model helper before merging.

Comment thread src/xagent/core/tools/core/RAG_tools/storage/contracts.py
sqhyz55 added a commit to sqhyz55/xagent that referenced this pull request May 14, 2026
open_embeddings_table returns a live handle used only to resolve
table_name before search_* reopens by name; release it in finally
to avoid LanceDB handle leaks (PR xorbitsai#359 review).
Copy link
Copy Markdown
Collaborator

@rogercloud rogercloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitting validated review finding 1/5.

Comment thread src/xagent/core/tools/core/RAG_tools/retrieval/search_sparse.py Outdated
Copy link
Copy Markdown
Collaborator

@rogercloud rogercloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitting validated review finding 2/5.

Comment thread src/xagent/web/services/kb_file_service.py Outdated
Copy link
Copy Markdown
Collaborator

@rogercloud rogercloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitting validated review finding 3/5.

Comment thread src/xagent/core/tools/core/RAG_tools/storage/contracts.py
Copy link
Copy Markdown
Collaborator

@rogercloud rogercloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitting validated review finding 4/5.

Comment thread src/xagent/web/services/rag_storage_migration_service.py Outdated
Copy link
Copy Markdown
Collaborator

@rogercloud rogercloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitting validated review finding 5/5.

Comment thread src/xagent/core/tools/core/RAG_tools/version_management/cascade_cleaner.py Outdated
sqhyz55 added 13 commits May 19, 2026 20:04
…-A1)

Remove raw LanceDB document queries from kb_file_service; use
list_document_records with DocumentRecord.collection for multi-table scans.
Add web unit test asserting delegation without LanceDB connection helpers.
… store (PR-A2)

Expose VectorIndexStore.search_fts / search_fts_by_model with tenant filters;
implement in LanceDBVectorIndexStore via query_to_list. Refactor search_sparse
and sync substring fallback to use iter_batches; update retrieval tests to mock
the store contract instead of raw table.search FTS.
… flows (PR-A3)

Move predicate planning/deletion mechanics to a dedicated LanceDB vector-plane module,
and add VectorIndexStore cascade APIs for adapter-level ownership. Keep
cascade_cleaner focused on domain predicate construction and orchestration.
…tore (PR-A4)

Add list_version_candidates to the storage contract and implement it in the
LanceDB adapter. Update list_candidates to consume contract-returned rows
instead of raw connection/table APIs, and migrate tests to contract-level mocks.
Extract storage migration checks and backfill orchestration from app startup into
RAGStorageMigrationService, keeping app.py backend-agnostic at the boundary.
Add focused service tests for background task scheduling and auto-migrate gating.
Remove LanceDB-specific collection_metadata schema/DDL logic from collection_manager
and route table initialization through MetadataStore.ensure_collection_metadata_table.
Add a manager test to verify metadata DDL delegation behavior.
Avoid mypy union-attr false positives for LanceDB table handles in
background document backfill checks by using Optional[Any].
…lures

Propagate VersionManagementError for invalid step/model inputs in list_candidates,
while wrapping only backend query failures as DatabaseOperationError. Improve
migration service logging by separating missing-table debug cases from real
backfill failures and add focused tests for both paths.
Clean up leftover branch edits by removing unused imports and normalizing line
wrapping in service and test files for consistent formatting.
…ability

Optimize sparse fallback and cascade predicate construction to avoid unnecessary
LanceDB table opens, add explicit handle close in iter_batches to reduce FD
pressure, and improve structured logging for FTS/cascade partial failures.
Only create startup migration background tasks when migration is actually needed, and align tests with the new contract and stricter access filtering behavior. This also restores sparse fallback table resolution compatibility and hardens warning-log assertions to avoid flaky failures.
… coverage

Add contract-level tests for state filtering, limit truncation, sort-before-limit behavior, and embed model_tag forwarding so the storage-abstraction path keeps the same behavioral guarantees after refactoring.
…tsai#362)

Add user_id/is_admin to async vector-search contract and pass them through search_vectors_by_model_async into the LanceDB adapter filter builder. Add regression tests to verify tenant scope forwarding for both convenience and adapter async APIs.
sqhyz55 and others added 7 commits May 19, 2026 20:05
Add direct unit coverage for sync vector/fts/batch/count APIs, ingestion status store sync+async paths, and sparse async retrieval fallback/error behavior to close previously identified storage-layer test gaps.
Move DatabaseOperationError to module-level imports in contracts, replace cross-module private helper usage with a public merge-error classifier, and extract a shared Arrow-table conversion helper to remove duplicated async search code.
Restore startup behavior to always create the background migration task under LANCEDB_AUTO_MIGRATE so documents backfill checks still run even when user_id schema migration is not required.
Extend VectorIndexStore.list_document_records with an optional file_ids
parameter so callers can retrieve document records for specific uploaded
files without constructing backend-specific filter SQL.

Co-authored-by: Cursor <cursoragent@cursor.com>
Rewrite aggregate_uploaded_file_statuses, _load_indexed_doc_refs, and
reconcile_uploaded_files in kb_file_service to use the storage abstraction
layer instead of direct LanceDB calls. This removes all raw LanceDB
imports (get_connection_from_env, ensure_documents_table, query_to_list,
build_lancedb_filter_expression, escape_lancedb_string, UserPermissions,
list_embeddings_table_names, _safe_close_table) from the web service layer.

The new list_document_records file_ids parameter replaces hand-built
file_id IN SQL filters, and aggregate_document_counts replaces direct
table scans for the indexed-artifact fallback check.

Also fix auto_migrate variable lost during rebase in app.py startup.
open_embeddings_table returns a live handle used only to resolve
table_name before search_* reopens by name; release it in finally
to avoid LanceDB handle leaks (PR xorbitsai#359 review).
…migrations

Restore per-embeddings-table cascade predicates, pin route collection in
sparse fallback scans, forward user scope through async FTS, batch unbounded
file-status document lookups, and always schedule migration compatibility
checks while keeping backfill gated by LANCEDB_AUTO_MIGRATE.
@sqhyz55 sqhyz55 force-pushed the feat/phase1a-storage-abstraction-cleanup branch from 9502e3d to d4204a1 Compare May 19, 2026 12:06
…y task

When LANCEDB_AUTO_MIGRATE is false, startup still schedules a compatibility
check task but must not run backfill_all; update the integration assertion
accordingly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants