fix(core): prevent redundant re-embedding in incremental indexing with batch_size #35631
Open
Alon Naor (AlonNaor22) wants to merge 1 commit into langchain-ai:master from
Conversation
When using `index()`/`aindex()` with `cleanup="incremental"` and a `batch_size` smaller than the total number of documents sharing the same `source_id`, the per-batch incremental cleanup would prematurely delete records from later batches that had not been processed yet. This caused all subsequent batches to re-embed and re-add documents on every single run, wasting compute.

The fix moves the incremental cleanup from inside the batch loop to after all batches are processed, collecting source_ids during iteration. This ensures all documents are updated in the record manager before stale ones are identified and deleted.

Closes langchain-ai#32612

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
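The loop restructuring described above can be sketched with a toy in-memory record manager. All names here (`RecordManager`, `index_run`, the logical-clock `ticker`) are illustrative stand-ins, not the actual langchain-core code:

```python
# Toy model of the bug and fix: a record manager mapping doc key -> (source_id,
# last-updated time), and an indexing run that upserts in batches.
import itertools

ticker = itertools.count()  # logical clock standing in for wall-clock timestamps


class RecordManager:
    """Toy record manager: doc key -> (source_id, last-updated time)."""

    def __init__(self):
        self.records = {}

    def update(self, batch, ts):
        for key, src in batch:
            self.records[key] = (src, ts)

    def delete_stale(self, source_ids, before):
        """Delete (and return) records for the given sources older than `before`."""
        stale = [k for k, (src, ts) in self.records.items()
                 if src in source_ids and ts < before]
        for k in stale:
            del self.records[k]
        return stale


def index_run(docs, batch_size, rm, cleanup_after_all_batches):
    """One indexing run; returns how many records were deleted as 'stale'."""
    index_start = next(ticker)
    deleted = 0
    seen_sources = set()
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        rm.update(batch, next(ticker))
        seen_sources.update(src for _, src in batch)
        if not cleanup_after_all_batches:
            # Buggy behavior: later batches with the same source_id are not
            # yet updated, so their records look stale and get deleted here.
            deleted += len(rm.delete_stale(seen_sources, index_start))
    if cleanup_after_all_batches:
        # Fixed behavior: every batch has been recorded first, so only
        # genuinely removed documents remain stale.
        deleted += len(rm.delete_stale(seen_sources, index_start))
    return deleted


# Four unchanged docs sharing one source, indexed with batch_size=2.
docs = [("d1", "src"), ("d2", "src"), ("d3", "src"), ("d4", "src")]

rm_buggy, rm_fixed = RecordManager(), RecordManager()
for rm in (rm_buggy, rm_fixed):
    index_run(docs, 2, rm, cleanup_after_all_batches=True)  # seed run

buggy_deleted = index_run(docs, 2, rm_buggy, cleanup_after_all_batches=False)
fixed_deleted = index_run(docs, 2, rm_fixed, cleanup_after_all_batches=True)
print(buggy_deleted, fixed_deleted)  # → 2 0
```

On the second run, the per-batch variant deletes the still-valid records of the second batch (forcing a re-add), while the deferred variant deletes nothing.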
Merging this PR will not alter performance
Summary
Fixes *batch_size treated as non-existent* (#32612): when using `index()`/`aindex()` with `cleanup="incremental"` and a `batch_size` smaller than the total number of documents sharing the same `source_id`, the per-batch incremental cleanup prematurely deleted records from later batches that hadn't been processed yet.

Files changed
- `libs/core/langchain_core/indexing/api.py` — moved the incremental cleanup block from inside the batch loop to after the loop, for both `index()` and `aindex()`
- `libs/core/tests/unit_tests/indexing/test_indexing.py` — updated the existing test to expect correct behavior; added 3 new regression tests (sync, async, stale-docs-still-cleaned-up)

Test plan
- Updated `test_incremental_indexing_with_batch_size` to expect `{num_added: 0, num_deleted: 0, num_skipped: 4}` instead of the previous `{num_added: 2, num_deleted: 2, num_skipped: 2}`
- `test_incremental_indexing_with_batch_size_no_redundant_work` — 10 docs, `batch_size=3`, same `source_id`; verifies 3 consecutive runs all skip without redundant work
- `test_aincremental_indexing_with_batch_size_no_redundant_work` — async equivalent
- `test_incremental_indexing_stale_docs_still_cleaned_up` — verifies that when document content actually changes, old versions are still properly deleted
- `test_index_into_document_index` (unrelated to this change)
- `ruff check` and `ruff format` clean

🤖 Generated with Claude Code