fix(kb): respect Form chunk_size in ingest (issue #199, part 1)#202
fix(kb): respect Form chunk_size in ingest (issue #199, part 1)#202sqhyz55 wants to merge 3 commits into
Conversation
62fa3d7 to
5f2b867
Compare
rogercloud
left a comment
There was a problem hiding this comment.
Review Summary
This PR correctly addresses the stale chunk rows bug (#199) by introducing replace_chunks with an insert-before-delete safety pattern. Tests pass (27 chunk doc + 4 replace_chunks).
Additional notes (pre-existing / future concerns, not blocking):
UnboundLocalErrorinfinally(lancedb_stores.py:1478): Ifconn.open_table("chunks")raises,tableis unbound andfinallyraisesUnboundLocalError, masking the root exception. This pattern is pre-existing across 50+ methods in this file — not introduced by this PR, but worth fixing separately.str(None)in scope value (lancedb_stores.py:1486):str(None)produces the literal string"None"rather thanIS NULL. Harmless for current allowed keys (all non-null strings), but worth noting for future column additions.- PR body mismatch: The description mentions a streaming upload buffer rename (
chunk_size→file_read_buffer_size) that is not in this diff —file_read_buffer_sizealready exists on main atkb.py:1182.
rogercloud
left a comment
There was a problem hiding this comment.
Additional review comments for issues not covered by the existing review threads.
…add ingestion lock - Reject empty replace_scope to prevent unbounded deletion (comment 1) - Use merge_insert instead of table.add for idempotent upsert (comment 2) - Validate chunk_id presence before write to fail fast (comment 3) - Fix docstring to match insert-before-delete order (comment 4) - Cascade-delete orphaned embeddings rows after chunk replacement (comment 5) - Allow replace_chunks with empty records to clean stale data (comment 6) - Add per-document threading lock in process_document to prevent concurrent chunk replacement races (comment 7) - Update tests for merge_insert and add ingestion lock tests
rogercloud
left a comment
There was a problem hiding this comment.
Thanks for the follow-up fixes. I think the current patch still needs one more short-term fix, and the durable version of this should be tracked separately.
Long-term, this should probably move to an explicit generation/pointer model rather than relying on delete/insert timing. Each re-chunk run should write chunks and embeddings under a new immutable generation_id. The currently searchable generation should be selected through a small active-generation pointer table. Retrieval should only read embeddings for the active generation. After the new generation's chunks and embeddings are fully written, publish it by atomically moving the active pointer. Old generations can then be cleaned up asynchronously.
This makes re-chunking crash-safe and concurrency-safe: incomplete generations are never searchable, stale embeddings cannot reappear after cleanup failures, and concurrent runs do not delete each other's rows. The only operation that needs strict atomicity is the final pointer update.
This is larger than this PR, so please open a follow-up issue to track the generation/pointer design. This PR can use scoped locking plus visible cleanup failures as the short-term fix.
… cleanup failures (PR xorbitsai#202) Replace source_path ingestion locks with cross-process generation_ingestion_lock keyed by (collection, doc_id, parse_hash, user scope), held through chunk replace and embedding writes. Raise DatabaseOperationError when embeddings_* cascade delete fails. Add regression tests for empty replace_scope, partial cleanup failures, and per-user lock isolation. Refs: xorbitsai#199, xorbitsai#438
When users re-chunk a document with different parameters, the old chunk rows (different config_hash) must be removed. Introduce `replace_chunks()` in the storage abstraction layer that safely inserts new rows first then deletes the old generation (crash-safe: worst case is brief duplicate data, never data loss). Also: - Add scope-key whitelist validation to prevent filter injection - Add concurrency note (last-write-wins) to chunk_document docstring - Add 4 unit tests for replace_chunks edge cases - Make separator-vs-default integration test deterministic (tmp_path) Refs: xorbitsai#199
…add ingestion lock - Reject empty replace_scope to prevent unbounded deletion (comment 1) - Use merge_insert instead of table.add for idempotent upsert (comment 2) - Validate chunk_id presence before write to fail fast (comment 3) - Fix docstring to match insert-before-delete order (comment 4) - Cascade-delete orphaned embeddings rows after chunk replacement (comment 5) - Allow replace_chunks with empty records to clean stale data (comment 6) - Add per-document threading lock in process_document to prevent concurrent chunk replacement races (comment 7) - Update tests for merge_insert and add ingestion lock tests
… cleanup failures (PR xorbitsai#202) Replace source_path ingestion locks with cross-process generation_ingestion_lock keyed by (collection, doc_id, parse_hash, user scope), held through chunk replace and embedding writes. Raise DatabaseOperationError when embeddings_* cascade delete fails. Add regression tests for empty replace_scope, partial cleanup failures, and per-user lock isolation. Refs: xorbitsai#199, xorbitsai#438
91aba1f to
1989f50
Compare
The streaming upload loop used a local variable named chunk_size for the 1MiB read buffer, shadowing the FastAPI Form field and forcing ~1M char chunks. Rename it to file_read_buffer_size.
Also delete chunk rows for the same collection/doc_id/parse_hash before merge_insert so re-chunking does not leave stale rows across config_hash (issue #199). Update chunk tests for single-generation semantics.
Refs: #199