fix(kb): respect Form chunk_size in ingest (issue #199, part 1) by sqhyz55 · Pull Request #202 · xorbitsai/xagent

sqhyz55 · 2026-03-23T02:22:54Z

The streaming upload loop used a local variable named chunk_size for the 1 MiB read buffer. That shadowed the FastAPI Form field of the same name and forced chunks of roughly 1M characters. Rename the buffer variable to file_read_buffer_size.
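The shadowing described above can be reproduced without FastAPI. The sketch below is only an illustration: `ingest_buggy`, `ingest_fixed`, and the in-memory stream are hypothetical stand-ins, not the code in kb.py.

```python
import io

ONE_MIB = 1024 * 1024


def ingest_buggy(stream, chunk_size=500):
    """chunk_size is meant to be the caller's text chunk size."""
    chunk_size = ONE_MIB  # read buffer reuses the name and shadows the parameter
    total = 0
    while data := stream.read(chunk_size):
        total += len(data)
    return total, chunk_size  # the chunker would now see 1048576, not 500


def ingest_fixed(stream, chunk_size=500):
    file_read_buffer_size = ONE_MIB  # distinct name, parameter left intact
    total = 0
    while data := stream.read(file_read_buffer_size):
        total += len(data)
    return total, chunk_size


buggy_total, buggy_chunk_size = ingest_buggy(io.BytesIO(b"x" * 10), chunk_size=500)
fixed_total, fixed_chunk_size = ingest_fixed(io.BytesIO(b"x" * 10), chunk_size=500)
```

Both functions read the same bytes; only the value handed on to chunking differs.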

Also delete chunk rows for the same collection/doc_id/parse_hash before merge_insert so re-chunking does not leave stale rows across config_hash (issue #199). Update chunk tests for single-generation semantics.

Refs: #199

rogercloud

Review Summary

This PR correctly addresses the stale chunk rows bug (#199) by introducing replace_chunks with an insert-before-delete safety pattern. Tests pass (27 chunk doc + 4 replace_chunks).

Additional notes (pre-existing / future concerns, not blocking):

- UnboundLocalError in finally (lancedb_stores.py:1478): If conn.open_table("chunks") raises, table is unbound and finally raises UnboundLocalError, masking the root exception. This pattern is pre-existing across 50+ methods in this file. It was not introduced by this PR, but is worth fixing separately.
- str(None) in scope value (lancedb_stores.py:1486): str(None) produces the literal string "None" rather than IS NULL. Harmless for current allowed keys (all non-null strings), but worth noting for future column additions.
- PR body mismatch: The description mentions a streaming upload buffer rename (chunk_size → file_read_buffer_size) that is not in this diff. file_read_buffer_size already exists on main at kb.py:1182.
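The first two notes can be shown in isolation. This is a minimal sketch under assumptions: `FakeConn`, the `table.close()` call in `finally`, the `user_id` column, and `scope_filter` are all hypothetical and are not taken from lancedb_stores.py.

```python
class FakeConn:
    """Hypothetical connection whose open_table always fails."""

    def open_table(self, name):
        raise RuntimeError(f"cannot open table {name!r}")


def risky(conn):
    try:
        table = conn.open_table("chunks")
        return table
    finally:
        table.close()  # 'table' was never bound, so this raises UnboundLocalError


def safer(conn):
    table = None  # bind before the try so finally can test it
    try:
        table = conn.open_table("chunks")
        return table
    finally:
        if table is not None:
            table.close()


def error_name(fn):
    """Return the class name of the exception that escapes fn, if any."""
    try:
        fn(FakeConn())
    except Exception as exc:
        return type(exc).__name__
    return None


def scope_filter(key, value):
    """Build one filter clause, mapping None to IS NULL instead of 'None'."""
    if value is None:
        return f"{key} IS NULL"
    escaped = str(value).replace("'", "''")
    return f"{key} = '{escaped}'"


risky_error = error_name(risky)
safer_error = error_name(safer)
naive_clause = f"user_id = '{str(None)}'"
null_clause = scope_filter("user_id", None)
```

With `risky`, the caller sees the UnboundLocalError and has to dig through the exception context to find the original failure; with `safer`, the RuntimeError propagates directly.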

rogercloud

Additional review comments for issues not covered by the existing review threads.

…add ingestion lock

- Reject empty replace_scope to prevent unbounded deletion (comment 1)
- Use merge_insert instead of table.add for idempotent upsert (comment 2)
- Validate chunk_id presence before write to fail fast (comment 3)
- Fix docstring to match insert-before-delete order (comment 4)
- Cascade-delete orphaned embeddings rows after chunk replacement (comment 5)
- Allow replace_chunks with empty records to clean stale data (comment 6)
- Add per-document threading lock in process_document to prevent concurrent chunk replacement races (comment 7)
- Update tests for merge_insert and add ingestion lock tests
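Of the items in this commit, the per-document lock (comment 7) is the easiest to picture in code. The following is a rough sketch, not the PR's implementation: `lock_for`, the `(collection, doc_id)` key, and the `work` callback are assumptions made for illustration.

```python
import threading

_registry_guard = threading.Lock()  # protects the lock registry itself
_doc_locks = {}                     # (collection, doc_id) -> threading.Lock


def lock_for(collection, doc_id):
    """Return the single lock shared by every caller for this document."""
    key = (collection, doc_id)
    with _registry_guard:
        return _doc_locks.setdefault(key, threading.Lock())


def process_document(collection, doc_id, work):
    """Run work() while holding the document's lock."""
    with lock_for(collection, doc_id):
        return work()


result = process_document("kb", "doc-1", lambda: "chunked")
```

A threading lock only serializes callers inside one process, which is presumably why a later commit in this PR replaces it with a cross-process lock.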

rogercloud

Thanks for the follow-up fixes. I think the current patch still needs one more short-term fix, and the durable version of this should be tracked separately.

Long-term, this should probably move to an explicit generation/pointer model rather than relying on delete/insert timing. Each re-chunk run should write chunks and embeddings under a new immutable generation_id. The currently searchable generation should be selected through a small active-generation pointer table. Retrieval should only read embeddings for the active generation. After the new generation's chunks and embeddings are fully written, publish it by atomically moving the active pointer. Old generations can then be cleaned up asynchronously.

This makes re-chunking crash-safe and concurrency-safe: incomplete generations are never searchable, stale embeddings cannot reappear after cleanup failures, and concurrent runs do not delete each other's rows. The only operation that needs strict atomicity is the final pointer update.
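The generation/pointer idea can be sketched with two in-memory dictionaries. Everything here is hypothetical: `GenerationStore` and its method names are illustration only, and a plain dict assignment stands in for the atomic pointer update that a real store would need.

```python
import uuid


class GenerationStore:
    def __init__(self):
        self.generations = {}  # generation_id -> {"doc_key": ..., "chunks": [...]}
        self.active = {}       # doc_key -> generation_id (the pointer table)

    def write_generation(self, doc_key, chunks):
        """Write a new immutable generation; it is not searchable yet."""
        generation_id = uuid.uuid4().hex
        self.generations[generation_id] = {"doc_key": doc_key, "chunks": list(chunks)}
        return generation_id

    def publish(self, doc_key, generation_id):
        """Move the active pointer and return the previous generation id."""
        previous = self.active.get(doc_key)
        self.active[doc_key] = generation_id
        return previous

    def search(self, doc_key):
        """Read only the chunks of the active generation."""
        generation = self.generations.get(self.active.get(doc_key))
        return generation["chunks"] if generation else []

    def cleanup(self, generation_id):
        """Drop an old generation; safe to run later or to retry."""
        self.generations.pop(generation_id, None)


store = GenerationStore()
first = store.write_generation("doc-1", ["a", "b"])
store.publish("doc-1", first)

second = store.write_generation("doc-1", ["c"])
visible_before_publish = store.search("doc-1")  # still the first generation
replaced = store.publish("doc-1", second)
visible_after_publish = store.search("doc-1")
store.cleanup(replaced)
```

Until `publish` runs, the second generation is invisible to `search`, so a crash during its write leaves readers on the first one.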

This is larger than this PR, so please open a follow-up issue to track the generation/pointer design. This PR can use scoped locking plus visible cleanup failures as the short-term fix.

… cleanup failures (PR xorbitsai#202)

Replace source_path ingestion locks with cross-process generation_ingestion_lock keyed by (collection, doc_id, parse_hash, user scope), held through chunk replace and embedding writes. Raise DatabaseOperationError when embeddings_* cascade delete fails. Add regression tests for empty replace_scope, partial cleanup failures, and per-user lock isolation.

Refs: xorbitsai#199, xorbitsai#438
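One way to make cleanup failures visible is to attempt every embeddings table and raise once at the end. This is a sketch under assumptions: the `DatabaseOperationError` class below is a local stand-in for the project's exception of the same name, and `cascade_delete_embeddings` with its mapping of table names to delete callables is hypothetical. Whether the PR continues after the first failure or stops is not stated in this thread.

```python
class DatabaseOperationError(Exception):
    """Local stand-in for the project's exception of the same name."""


def cascade_delete_embeddings(tables, chunk_ids):
    """Try each table's delete callable; raise if any of them failed."""
    failures = []
    for name, delete in tables.items():
        try:
            delete(chunk_ids)
        except Exception as exc:
            failures.append(f"{name}: {exc}")
    if failures:
        raise DatabaseOperationError(
            "embedding cleanup failed for " + "; ".join(failures)
        )


deleted = []


def delete_ok(chunk_ids):
    deleted.extend(chunk_ids)


def delete_broken(chunk_ids):
    raise RuntimeError("disk full")


try:
    cascade_delete_embeddings(
        {"embeddings_a": delete_ok, "embeddings_b": delete_broken}, ["c1"]
    )
    message = None
except DatabaseOperationError as exc:
    message = str(exc)
```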

When users re-chunk a document with different parameters, the old chunk rows (different config_hash) must be removed. Introduce `replace_chunks()` in the storage abstraction layer that safely inserts new rows first then deletes the old generation (crash-safe: worst case is brief duplicate data, never data loss).

Also:

- Add scope-key whitelist validation to prevent filter injection
- Add concurrency note (last-write-wins) to chunk_document docstring
- Add 4 unit tests for replace_chunks edge cases
- Make separator-vs-default integration test deterministic (tmp_path)

Refs: xorbitsai#199
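The insert-before-delete order can be sketched against a plain dictionary. This is not the PR's `replace_chunks()`: the signature, the whitelist contents, and the dict-based table (chunk_id mapped to a row dict) are assumptions for illustration, and the dict assignment only imitates an upsert.

```python
ALLOWED_SCOPE_KEYS = {"collection", "doc_id", "parse_hash"}  # assumed whitelist


def replace_chunks(table, records, replace_scope):
    """Upsert records, then delete older rows that match replace_scope."""
    if not replace_scope:
        raise ValueError("replace_scope must not be empty")
    unknown = set(replace_scope) - ALLOWED_SCOPE_KEYS
    if unknown:
        raise ValueError(f"unsupported scope keys: {sorted(unknown)}")
    if any(not record.get("chunk_id") for record in records):
        raise ValueError("every record needs a chunk_id")

    # Step 1: write the new generation first (idempotent by chunk_id).
    new_ids = set()
    for record in records:
        table[record["chunk_id"]] = dict(record)
        new_ids.add(record["chunk_id"])

    # Step 2: only then remove in-scope rows that are not part of it.
    stale = [
        chunk_id
        for chunk_id, row in table.items()
        if chunk_id not in new_ids
        and all(row.get(key) == value for key, value in replace_scope.items())
    ]
    for chunk_id in stale:
        del table[chunk_id]
    return len(stale)


scope = {"collection": "kb", "doc_id": "d1", "parse_hash": "p1"}
table = {
    "old-1": {"chunk_id": "old-1", **scope, "config_hash": "c1"},
    "other": {"chunk_id": "other", **scope, "doc_id": "d2", "config_hash": "c1"},
}
removed = replace_chunks(
    table, [{"chunk_id": "new-1", **scope, "config_hash": "c2"}], scope
)
```

If the process dies between the two steps, both generations exist for a while, which matches the "brief duplicate data, never data loss" trade-off described above.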

XprobeBot added the bug label (Something isn't working) Mar 23, 2026

sqhyz55 force-pushed the fix/kb-ingest-chunk-size-buffer-shadow branch from 62fa3d7 to 5f2b867 May 11, 2026 09:51

rogercloud requested changes May 12, 2026

rogercloud reviewed May 12, 2026

Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_document.py

Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_document.py Outdated

Comment thread src/xagent/core/tools/core/RAG_tools/storage/lancedb_stores.py

rogercloud requested changes May 15, 2026

Comment thread src/xagent/core/tools/core/RAG_tools/pipelines/document_ingestion.py Outdated

Comment thread src/xagent/core/tools/core/RAG_tools/storage/lancedb_stores.py Outdated

sqhyz55 mentioned this pull request May 19, 2026

feat(rag): generation/pointer model for crash-safe re-chunking and embedding lifecycle #438

Open

5 tasks

sqhyz55 added 3 commits May 19, 2026 15:15

sqhyz55 force-pushed the fix/kb-ingest-chunk-size-buffer-shadow branch from 91aba1f to 1989f50 May 19, 2026 07:21

sqhyz55 wants to merge 3 commits into xorbitsai:main from sqhyz55:fix/kb-ingest-chunk-size-buffer-shadow

3 participants
