Skip to content

fix(kb): respect Form chunk_size in ingest (issue #199, part 1)#202

Open
sqhyz55 wants to merge 3 commits into
xorbitsai:mainfrom
sqhyz55:fix/kb-ingest-chunk-size-buffer-shadow
Open

fix(kb): respect Form chunk_size in ingest (issue #199, part 1)#202
sqhyz55 wants to merge 3 commits into
xorbitsai:mainfrom
sqhyz55:fix/kb-ingest-chunk-size-buffer-shadow

Conversation

@sqhyz55
Copy link
Copy Markdown
Collaborator

@sqhyz55 sqhyz55 commented Mar 23, 2026

The streaming upload loop used a local variable named chunk_size for the 1MiB read buffer, shadowing the FastAPI Form field and forcing ~1M char chunks. Rename it to file_read_buffer_size.

Also delete chunk rows for the same collection/doc_id/parse_hash before merge_insert so re-chunking does not leave stale rows across config_hash (issue #199). Update chunk tests for single-generation semantics.

Refs: #199

@XprobeBot XprobeBot added the bug Something isn't working label Mar 23, 2026
@sqhyz55 sqhyz55 force-pushed the fix/kb-ingest-chunk-size-buffer-shadow branch from 62fa3d7 to 5f2b867 Compare May 11, 2026 09:51
Copy link
Copy Markdown
Collaborator

@rogercloud rogercloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

This PR correctly addresses the stale chunk rows bug (#199) by introducing replace_chunks with an insert-before-delete safety pattern. Tests pass (27 chunk doc + 4 replace_chunks).

Additional notes (pre-existing / future concerns, not blocking):

  • UnboundLocalError in finally (lancedb_stores.py:1478): If conn.open_table("chunks") raises, table is unbound and finally raises UnboundLocalError, masking the root exception. This pattern is pre-existing across 50+ methods in this file — not introduced by this PR, but worth fixing separately.
  • str(None) in scope value (lancedb_stores.py:1486): str(None) produces the literal string "None" rather than IS NULL. Harmless for current allowed keys (all non-null strings), but worth noting for future column additions.
  • PR body mismatch: The description mentions a streaming upload buffer rename (chunk_sizefile_read_buffer_size) that is not in this diff — file_read_buffer_size already exists on main at kb.py:1182.

Comment thread src/xagent/core/tools/core/RAG_tools/storage/lancedb_stores.py
Comment thread src/xagent/core/tools/core/RAG_tools/storage/lancedb_stores.py Outdated
Comment thread src/xagent/core/tools/core/RAG_tools/storage/lancedb_stores.py
Comment thread src/xagent/core/tools/core/RAG_tools/storage/contracts.py Outdated
Copy link
Copy Markdown
Collaborator

@rogercloud rogercloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional review comments for issues not covered by the existing review threads.

Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_document.py
Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_document.py Outdated
Comment thread src/xagent/core/tools/core/RAG_tools/storage/lancedb_stores.py
sqhyz55 added a commit to sqhyz55/xagent that referenced this pull request May 12, 2026
…add ingestion lock

- Reject empty replace_scope to prevent unbounded deletion (comment 1)
- Use merge_insert instead of table.add for idempotent upsert (comment 2)
- Validate chunk_id presence before write to fail fast (comment 3)
- Fix docstring to match insert-before-delete order (comment 4)
- Cascade-delete orphaned embeddings rows after chunk replacement (comment 5)
- Allow replace_chunks with empty records to clean stale data (comment 6)
- Add per-document threading lock in process_document to prevent
  concurrent chunk replacement races (comment 7)
- Update tests for merge_insert and add ingestion lock tests
Copy link
Copy Markdown
Collaborator

@rogercloud rogercloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the follow-up fixes. I think the current patch still needs one more short-term fix, and the durable version of this should be tracked separately.

Long-term, this should probably move to an explicit generation/pointer model rather than relying on delete/insert timing. Each re-chunk run should write chunks and embeddings under a new immutable generation_id. The currently searchable generation should be selected through a small active-generation pointer table. Retrieval should only read embeddings for the active generation. After the new generation's chunks and embeddings are fully written, publish it by atomically moving the active pointer. Old generations can then be cleaned up asynchronously.

This makes re-chunking crash-safe and concurrency-safe: incomplete generations are never searchable, stale embeddings cannot reappear after cleanup failures, and concurrent runs do not delete each other's rows. The only operation that needs strict atomicity is the final pointer update.

This is larger than this PR, so please open a follow-up issue to track the generation/pointer design. This PR can use scoped locking plus visible cleanup failures as the short-term fix.

Comment thread src/xagent/core/tools/core/RAG_tools/pipelines/document_ingestion.py Outdated
Comment thread src/xagent/core/tools/core/RAG_tools/storage/lancedb_stores.py Outdated
sqhyz55 added a commit to sqhyz55/xagent that referenced this pull request May 19, 2026
… cleanup failures (PR xorbitsai#202)

Replace source_path ingestion locks with cross-process generation_ingestion_lock
keyed by (collection, doc_id, parse_hash, user scope), held through chunk replace
and embedding writes. Raise DatabaseOperationError when embeddings_* cascade
delete fails. Add regression tests for empty replace_scope, partial cleanup
failures, and per-user lock isolation.

Refs: xorbitsai#199, xorbitsai#438
sqhyz55 added 3 commits May 19, 2026 15:15
When users re-chunk a document with different parameters, the old chunk
rows (different config_hash) must be removed. Introduce
`replace_chunks()` in the storage abstraction layer that safely inserts
new rows first then deletes the old generation (crash-safe: worst case
is brief duplicate data, never data loss).

Also:
- Add scope-key whitelist validation to prevent filter injection
- Add concurrency note (last-write-wins) to chunk_document docstring
- Add 4 unit tests for replace_chunks edge cases
- Make separator-vs-default integration test deterministic (tmp_path)

Refs: xorbitsai#199
…add ingestion lock

- Reject empty replace_scope to prevent unbounded deletion (comment 1)
- Use merge_insert instead of table.add for idempotent upsert (comment 2)
- Validate chunk_id presence before write to fail fast (comment 3)
- Fix docstring to match insert-before-delete order (comment 4)
- Cascade-delete orphaned embeddings rows after chunk replacement (comment 5)
- Allow replace_chunks with empty records to clean stale data (comment 6)
- Add per-document threading lock in process_document to prevent
  concurrent chunk replacement races (comment 7)
- Update tests for merge_insert and add ingestion lock tests
… cleanup failures (PR xorbitsai#202)

Replace source_path ingestion locks with cross-process generation_ingestion_lock
keyed by (collection, doc_id, parse_hash, user scope), held through chunk replace
and embedding writes. Raise DatabaseOperationError when embeddings_* cascade
delete fails. Add regression tests for empty replace_scope, partial cleanup
failures, and per-user lock isolation.

Refs: xorbitsai#199, xorbitsai#438
@sqhyz55 sqhyz55 force-pushed the fix/kb-ingest-chunk-size-buffer-shadow branch from 91aba1f to 1989f50 Compare May 19, 2026 07:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants