ref(kb): H03 move embedding lifecycle into handle#658
Conversation
Move embedding storage mechanics into KBCollectionHandle (epic xorbitsai#494 Phase 2, after xorbitsai#509): query-vector validation, chunk selection for embedding, idempotent vector writes, per-model grouping/table routing, and stale/rollback cleanup are now owned by the handle and store layers. - Add handle methods: validate_query_vector, read_chunks_needing_embedding, write_embeddings (+ _process_model_embeddings), delete_embedding_records, and rollback compensation (snapshot/restore/delete_created_embeddings). - Route legacy vector_manager helpers and the vector-storage facade through the coordinator-opened handle; keep historic import paths and signatures. - Add row-only multi-table delete_embedding_records store primitive sharing cleanup_filters semantics with stale cleanup. - Add EmbeddingRecordDetail/EmbeddingRecordSnapshot lossless row models. - Move merge-error classifier to storage/merge_errors.py, fixing an upward import from storage into vector_storage. Behavior-preserving: validation rules, chunk selection, chunk-hash staleness, idempotent merge, model-tag routing, dimension checks, result models, and non-recoverable merge handling are unchanged. Live coordinator rollback wiring is deferred to xorbitsai#514. Refs xorbitsai#510
There was a problem hiding this comment.
Code Review
This pull request refactors the RAG tools' vector storage data plane by moving embedding lifecycle operations—such as query-vector validation, chunk reads, vector writes, and rollback cleanup—from the legacy vector_manager.py into the coordinator-owned LanceDBCollectionHandle and storage layers. It also introduces new schemas for lossless embedding snapshots and restores, along with comprehensive test coverage. The review feedback highlights two key areas for improvement in collection_handle.py: resolving a data correctness issue where NaN page numbers are incorrectly coerced to 1 instead of None, and adding robust error handling when parsing integer environment variables to prevent potential application crashes.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new LanceDBCollectionHandle to centralize embedding lifecycle management, including query vector validation, chunk reading, and embedding writes. It refactors existing vector storage logic by moving storage mechanics into this handle and introduces robust error classification for merge-insert operations. I have kept the review comment regarding the brittle string-based check for spill errors, as it identifies a maintainability risk and provides an actionable suggestion.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Summary
Move embedding storage mechanics into
KBCollectionHandle(epic #494, Phase 2 "KBCollectionHandle Replacement", following #509). Query-vector validation, chunk selection for embedding, idempotent vector writes, per-model grouping/table routing, and stale/rollback cleanup are now owned by the handle and store layers. Legacy vector-storage helpers keep their import paths and signatures and delegate through the coordinator-opened handle.Closes #510.
Changes
kb/collection_handle.py): addvalidate_query_vector,read_chunks_needing_embedding,write_embeddings(+_process_model_embeddings),delete_embedding_records, and rollback compensation (snapshot_embeddings/restore_embeddings/delete_created_embeddings,_embedding_table_names).kb/vector_storage_compatibility.py): routeread_chunks_for_embedding/write_vectors_to_dbthrough a coordinator-opened handle;validate_query_vectordelegates to the pure handle-owned format validator. Preserves shim injection.delete_embedding_recordsprimitive (storage/contracts.py+storage/lancedb_stores.py), sharingcleanup_filterssemantics with stale cleanup.EmbeddingRecordDetail/EmbeddingRecordSnapshotrow models (withgroup_by_model_tag).storage/merge_errors.py, removing an upward import fromstorageintovector_storage.Behavior preserved
Validation rules, chunk selection, chunk-hash staleness detection, idempotent merge (3-key
collection/doc_id/chunk_id), model-tag table naming/routing, dimension validation, result-model fields, and non-recoverable merge-error handling are all unchanged, locked by a characterization suite written against the original code.Out of scope (deferred)
Tests
New: characterization, schema row models, store delete primitive, and handle (read/write/delete/rollback/mechanics) suites; existing vector_manager/compatibility tests trimmed to the surviving public surface. Focused suite: 92 passing.
ruff,mypy, and the full pre-commit hook set pass.