
fix(core): prevent redundant re-embedding in incremental indexing with batch_size #35631

Open
Alon Naor (AlonNaor22) wants to merge 1 commit into langchain-ai:master from AlonNaor22:fix/incremental-indexing-batch-size-bug

Conversation

@AlonNaor22

Summary

  • Fixes #32612 ("Chunks after batch_size treated as non-existant")
  • When index()/aindex() ran with cleanup="incremental" and a batch_size smaller than the number of documents sharing the same source_id, the per-batch incremental cleanup deleted the records of documents in later batches before those batches had been processed
  • Those documents then looked new, so every later batch was re-embedded and re-added on every single run, wasting significant compute (embedding API calls / GPU time)
  • The fix moves the incremental cleanup from inside the batch loop to after all batches complete, collecting source IDs during iteration — ensuring all documents are updated in the record manager before stale ones are identified and deleted
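
To make the failure mode concrete, here is a minimal, self-contained model of the two cleanup placements. It is not the code in libs/core/langchain_core/indexing/api.py: the dict-based record store, the integer timestamps, and the names run_index and per_batch_cleanup are assumptions made purely for illustration.

```python
def run_index(records, docs, batch_size, now, per_batch_cleanup):
    """Toy model of one incremental indexing run (hypothetical, not langchain_core).

    records: dict mapping key -> (source_id, updated_at)
    docs:    list of (key, source_id) pairs
    now:     integer timestamp for this run
    """
    stats = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}
    seen_sources = set()

    def cleanup(source_ids):
        # A record is stale if it belongs to one of the given sources
        # and was not touched during this run.
        stale = [k for k, (src, ts) in records.items()
                 if src in source_ids and ts < now]
        for k in stale:
            del records[k]
        stats["num_deleted"] += len(stale)

    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        for key, source_id in batch:
            if key in records:
                stats["num_skipped"] += 1
            else:
                stats["num_added"] += 1  # a real run would embed here
            records[key] = (source_id, now)
        batch_sources = {source_id for _, source_id in batch}
        seen_sources |= batch_sources
        if per_batch_cleanup:
            # Old placement: later batches have not been touched yet,
            # so their records look stale and get deleted.
            cleanup(batch_sources)

    if not per_batch_cleanup:
        # New placement: every record has been touched before cleanup.
        cleanup(seen_sources)
    return stats


docs = [(f"doc{i}", "same-source") for i in range(4)]

old_records = {}
run_index(old_records, docs, batch_size=2, now=1, per_batch_cleanup=True)
old_second_run = run_index(old_records, docs, batch_size=2, now=2, per_batch_cleanup=True)

new_records = {}
run_index(new_records, docs, batch_size=2, now=1, per_batch_cleanup=False)
new_second_run = run_index(new_records, docs, batch_size=2, now=2, per_batch_cleanup=False)

print(old_second_run)  # {'num_added': 2, 'num_skipped': 2, 'num_deleted': 2}
print(new_second_run)  # {'num_added': 0, 'num_skipped': 4, 'num_deleted': 0}
```

In this model, the second run over unchanged documents reproduces the counts quoted in the test plan: the per-batch placement deletes and re-adds the second batch, while the post-loop placement skips all four documents.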

Files changed

  • libs/core/langchain_core/indexing/api.py — moved incremental cleanup block from inside the batch loop to after the loop, for both index() and aindex()
  • libs/core/tests/unit_tests/indexing/test_indexing.py — updated existing test to expect correct behavior; added 3 new regression tests (sync, async, stale-docs-still-cleaned-up)

Test plan

  • Updated test_incremental_indexing_with_batch_size to expect {num_added: 0, num_deleted: 0, num_skipped: 4} instead of the previous {num_added: 2, num_deleted: 2, num_skipped: 2}
  • Added test_incremental_indexing_with_batch_size_no_redundant_work — 10 docs, batch_size=3, same source_id, verify 3 consecutive runs all skip without redundant work
  • Added test_aincremental_indexing_with_batch_size_no_redundant_work — async equivalent
  • Added test_incremental_indexing_stale_docs_still_cleaned_up — verifies that when document content actually changes, old versions are still properly deleted
  • All 44 indexing tests pass (the 2 deselected tests are pre-existing failures in test_index_into_document_index, unrelated to this change)
  • ruff check and ruff format clean
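
The properties these tests check can be sketched against a small stand-in for the fixed behavior. This is an illustrative model only, not the tests in test_indexing.py: index_fixed, doc_key, the SHA-1 content key, and the "report.txt" source are all hypothetical names chosen for the example.

```python
import hashlib


def doc_key(content, source_id):
    # Hypothetical content-based key: changed content yields a new key.
    return hashlib.sha1(f"{source_id}\x00{content}".encode()).hexdigest()


def index_fixed(records, docs, batch_size, now):
    """Toy incremental run with cleanup performed once, after all batches.

    records: dict mapping key -> (source_id, updated_at)
    docs:    list of (content, source_id) pairs
    """
    stats = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}
    seen_sources = set()
    for start in range(0, len(docs), batch_size):
        for content, source_id in docs[start:start + batch_size]:
            key = doc_key(content, source_id)
            stats["num_skipped" if key in records else "num_added"] += 1
            records[key] = (source_id, now)
            seen_sources.add(source_id)
    # Only records from sources seen in this run, and not touched by it, are stale.
    stale = [k for k, (src, ts) in records.items()
             if src in seen_sources and ts < now]
    for k in stale:
        del records[k]
    stats["num_deleted"] = len(stale)
    return stats


records = {}
docs = [(f"chunk {i}", "report.txt") for i in range(10)]

first = index_fixed(records, docs, batch_size=3, now=1)
reruns = [index_fixed(records, docs, batch_size=3, now=t) for t in (2, 3, 4)]

# Edit one chunk: its old version should be deleted, the new one added.
docs[7] = ("chunk 7 (edited)", "report.txt")
changed = index_fixed(records, docs, batch_size=3, now=5)

print(first)    # {'num_added': 10, 'num_skipped': 0, 'num_deleted': 0}
print(reruns)   # three runs, each skipping all 10 documents
print(changed)  # {'num_added': 1, 'num_skipped': 9, 'num_deleted': 1}
```

The three unchanged reruns correspond to the no-redundant-work tests, and the edited chunk corresponds to the stale-docs-still-cleaned-up test: moving cleanup after the loop should not stop genuinely outdated records from being removed.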

🤖 Generated with Claude Code

…dant re-embedding

When using `index()`/`aindex()` with `cleanup="incremental"` and
`batch_size` smaller than the total number of documents sharing the
same `source_id`, the per-batch incremental cleanup would prematurely
delete records from later batches that hadn't been processed yet.
This caused all subsequent batches to re-embed and re-add documents
on every single run, wasting compute.

The fix moves the incremental cleanup from inside the batch loop to
after all batches are processed, collecting source_ids during
iteration. This ensures all documents are updated in the record
manager before stale ones are identified and deleted.

Closes langchain-ai#32612

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the core (`langchain-core` package issues & PRs), fix (For PRs that implement a fix), and external labels Mar 7, 2026

codspeed-hq bot commented Mar 7, 2026

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

✅ 13 untouched benchmarks
⏩ 23 skipped benchmarks (see footnote 1)


Comparing AlonNaor22:fix/incremental-indexing-batch-size-bug (356d491) with master (29134dc)


Footnotes

  1. 23 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@github-actions github-actions bot added the size: M (200-499 LOC) label Mar 9, 2026
