fix(core): prevent redundant re-embedding in incremental indexing with batch_size by AlonNaor22 · Pull Request #35631 · langchain-ai/langchain

Alon Naor (AlonNaor22) · 2026-03-07T10:22:44Z

Summary

Fixes "Chunks after batch_size treated as non-existant" (#32612).
When using index()/aindex() with cleanup="incremental" and a batch_size smaller than the number of documents sharing the same source_id, the per-batch incremental cleanup prematurely deleted records belonging to later batches that had not been processed yet.
As a result, every batch after the first re-embedded and re-added its documents on every run, wasting significant compute (embedding API calls / GPU time).
The fix moves the incremental cleanup from inside the batch loop to after all batches complete, collecting source IDs during iteration. All documents are therefore updated in the record manager before stale ones are identified and deleted.
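The difference between per-batch and deferred cleanup can be illustrated with a small standalone simulation. This is only a sketch and does not use `langchain_core`: `run_index`, `records`, `clock`, and `cleanup_per_batch` are hypothetical names, a plain dict stands in for the record manager, and a `(source_id, content)` tuple stands in for the document hash.

```python
from itertools import count


def run_index(records, docs, batch_size, clock, cleanup_per_batch):
    """Toy incremental indexing pass over (source_id, content) pairs.

    `records` maps a (source_id, content) key to the timestamp of the run
    that last saw it; it is a stand-in for the record manager.
    """
    start = next(clock)  # timestamp of this run
    stats = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}
    seen_sources = set()

    def cleanup(source_ids):
        # Delete records of these sources that were not refreshed in this run.
        stale = [k for k, ts in records.items() if k[0] in source_ids and ts < start]
        for key in stale:
            del records[key]
        stats["num_deleted"] += len(stale)

    for i in range(0, len(docs), batch_size):
        batch = docs[i : i + batch_size]
        for key in batch:
            if key in records:
                stats["num_skipped"] += 1
            else:
                stats["num_added"] += 1  # a real indexer would embed here
            records[key] = start
        batch_sources = {source_id for source_id, _ in batch}
        seen_sources |= batch_sources
        if cleanup_per_batch:
            # Old behavior: runs before later batches have been refreshed.
            cleanup(batch_sources)
    if not cleanup_per_batch:
        # Behavior described in this PR: clean up once, after all batches.
        cleanup(seen_sources)
    return stats


docs = [("a.txt", f"chunk {n}") for n in range(4)]

# Per-batch cleanup: the second run deletes and re-adds the second batch.
old_records, old_clock = {}, count(1)
run_index(old_records, docs, 2, old_clock, cleanup_per_batch=True)
old_second = run_index(old_records, docs, 2, old_clock, cleanup_per_batch=True)

# Deferred cleanup: the second run skips everything.
new_records, new_clock = {}, count(1)
run_index(new_records, docs, 2, new_clock, cleanup_per_batch=False)
new_second = run_index(new_records, docs, 2, new_clock, cleanup_per_batch=False)

# Changed content is still cleaned up: the old version of the last chunk goes away.
changed = docs[:3] + [("a.txt", "chunk 3 (edited)")]
new_third = run_index(new_records, changed, 2, new_clock, cleanup_per_batch=False)

print(old_second)
print(new_second)
print(new_third)
```

In this toy model, the second run with per-batch cleanup reports 2 added, 2 skipped, and 2 deleted, while the deferred variant reports 4 skipped and nothing added or deleted. The third run shows that deferring the cleanup does not disable it: the edited chunk is added once and its previous version is deleted.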

Files changed

libs/core/langchain_core/indexing/api.py — moved incremental cleanup block from inside the batch loop to after the loop, for both index() and aindex()
libs/core/tests/unit_tests/indexing/test_indexing.py — updated existing test to expect correct behavior; added 3 new regression tests (sync, async, stale-docs-still-cleaned-up)

Test plan

Updated test_incremental_indexing_with_batch_size to expect {num_added: 0, num_deleted: 0, num_skipped: 4} instead of the previous {num_added: 2, num_deleted: 2, num_skipped: 2}
Added test_incremental_indexing_with_batch_size_no_redundant_work — 10 docs, batch_size=3, same source_id, verify 3 consecutive runs all skip without redundant work
Added test_aincremental_indexing_with_batch_size_no_redundant_work — async equivalent
Added test_incremental_indexing_stale_docs_still_cleaned_up — verifies that when document content actually changes, old versions are still properly deleted
All 44 indexing tests pass (2 deselected are pre-existing failures in test_index_into_document_index unrelated to this change)
ruff check and ruff format clean

🤖 Generated with Claude Code

…dant re-embedding

When using `index()`/`aindex()` with `cleanup="incremental"` and `batch_size` smaller than the total number of documents sharing the same `source_id`, the per-batch incremental cleanup would prematurely delete records from later batches that hadn't been processed yet. This caused all subsequent batches to re-embed and re-add documents on every single run, wasting compute.

The fix moves the incremental cleanup from inside the batch loop to after all batches are processed, collecting source_ids during iteration. This ensures all documents are updated in the record manager before stale ones are identified and deleted.

Closes langchain-ai#32612

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codspeed-hq · 2026-03-07T10:27:24Z

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

✅ 13 untouched benchmarks
⏩ 23 skipped benchmarks¹

Comparing AlonNaor22:fix/incremental-indexing-batch-size-bug (356d491) with master (29134dc)

¹ 23 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

Alon Naor (AlonNaor22) requested a review from Eugene Yurtsev (eyurtsev) as a code owner March 7, 2026 10:22

github-actions bot added the core (`langchain-core` package issues & PRs), fix (for PRs that implement a fix), and external labels Mar 7, 2026

github-actions bot added the size: M 200-499 LOC label Mar 9, 2026

Alon Naor (AlonNaor22) wants to merge 1 commit into langchain-ai:master from AlonNaor22:fix/incremental-indexing-batch-size-bug
