fix: off-by-one error in RecursiveJsonSplitter.split_json #35649

Open
Gamble Tan (gambletan) wants to merge 2 commits into langchain-ai:master from gambletan:fix/json-splitter-off-by-one

Conversation

@gambletan

Contributor

Summary

Fixes #29153

This PR fixes two bugs in `RecursiveJsonSplitter._json_split()` that cause data loss at chunk boundaries:

  1. Off-by-one boundary check: Changed `size < remaining` to `size <= remaining` so that an item whose size exactly equals the remaining space is added to the current chunk instead of being pushed to a new one.

  2. Empty dict/leaf value loss: When a value did not fit in the current chunk and a new chunk was started, the code recursed into the value. For an empty dict `{}`, the `for key, value in data.items()` loop runs zero times, so the key-value pair was silently dropped from all chunks. The fix sets leaf values and empty dicts directly instead of recursing.
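
The way the two bugs combine can be illustrated with a toy model. The sketch below is a simplified stand-in, not the library's implementation: `toy_split`, `json_size`, and `set_nested` are hypothetical names, sizes are assumed to be measured as `len(json.dumps(...))`, the minimum-chunk-size logic is reduced to "start a new chunk if the current one is non-empty", and the `fixed` flag switches between the behavior described as buggy and the behavior described as fixed.

```python
import json


def json_size(obj):
    # Assumption: size is the length of the serialized JSON text.
    return len(json.dumps(obj))


def set_nested(target, path, value):
    # Create intermediate dicts along `path`, then assign the value.
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def toy_split(data, max_size, fixed, path=(), chunks=None):
    """Toy recursive splitter (hypothetical), illustrating both bugs."""
    chunks = [{}] if chunks is None else chunks
    for key, value in data.items():
        new_path = (*path, key)
        size = json_size({key: value})
        remaining = max_size - json_size(chunks[-1])
        # Bug 1: a strict `<` rejects items that fit exactly.
        fits = size <= remaining if fixed else size < remaining
        if fits:
            set_nested(chunks[-1], new_path, value)
            continue
        if chunks[-1]:
            chunks.append({})
        if isinstance(value, dict) and (value or not fixed):
            # Bug 2: recursing into {} iterates zero times, so the
            # key is never written to any chunk.
            toy_split(value, max_size, fixed, new_path, chunks)
        else:
            set_nested(chunks[-1], new_path, value)
    return chunks


data = {"a": {}, "b": {}, "c": {}}
print(toy_split(data, 20, fixed=False))  # [{'a': {}, 'b': {}}, {}]
print(toy_split(data, 20, fixed=True))   # [{'a': {}, 'b': {}}, {'c': {}}]
```

In this model, `{"a": {}}` serializes to 9 characters, so with `max_size=11` the first item exactly fills the 9 characters left in an empty chunk: the strict comparison rejects it, and the empty-dict recursion then drops it entirely.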

Reproduction

from langchain_text_splitters import RecursiveJsonSplitter

data = {
    "projects": {
        "GTMS": {f"GTMS-{i}": {} for i in range(1, 23)},
        "ITSAMPLE": {f"ITSAMPLE-{i}": {} for i in range(1, 4)},
    }
}

splitter = RecursiveJsonSplitter(max_chunk_size=300)
chunks = splitter.split_json(data)

# Before fix: GTMS-10 and ITSAMPLE-2 are lost at chunk boundaries
# After fix: all items are present in chunks
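
To turn the comments above into a check, one option is to compare the leaf paths of the input with the leaf paths found across all chunks. The helpers below, `leaf_paths` and `missing_paths`, are hypothetical and not part of `langchain_text_splitters`; the example uses hand-written chunks so that it runs without the library.

```python
def leaf_paths(obj, prefix=()):
    """Yield the key path to every scalar or empty-dict leaf."""
    if isinstance(obj, dict) and obj:
        for key, value in obj.items():
            yield from leaf_paths(value, (*prefix, key))
    else:
        yield prefix


def missing_paths(data, chunks):
    """Return leaf paths present in `data` but absent from every chunk."""
    seen = set()
    for chunk in chunks:
        seen.update(leaf_paths(chunk))
    return sorted(set(leaf_paths(data)) - seen)


# Hand-written example: the second chunk has lost the empty dict at p.b.
example = {"p": {"a": {}, "b": {}, "c": 1}}
lossy_chunks = [{"p": {"a": {}}}, {"p": {"c": 1}}]
print(missing_paths(example, lossy_chunks))  # [('p', 'b')]
```

Applied to the reproduction, `missing_paths(data, splitter.split_json(data))` should be empty after the fix if the PR behaves as described.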

Test plan

🤖 Generated with Claude Code

Two issues fixed:
1. Changed `size < remaining` to `size <= remaining` so items that fit
   exactly at the boundary are added to the current chunk instead of
   being pushed to a new one.
2. Added explicit handling for empty dicts and leaf values when they
   don't fit in the current chunk. Previously, recursing into an empty
   dict `{}` caused the for loop to have zero iterations, silently
   dropping the key-value pair from all chunks.

Fixes langchain-ai#29153

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions (bot) added the labels text-splitters (Related to the package `text-splitters`), external, and fix (For PRs that implement a fix) on Mar 8, 2026
@nidhishgajjar

This comment was marked as spam.

# Conflicts:
#	libs/text-splitters/langchain_text_splitters/json.py

@avinashkamat48 era (avinashkamat48) left a comment

This changes the splitter behavior for boundary-sized values and empty dict leaves, but I do not see a regression test in this PR. The previous bug sounds easy to reintroduce because the <= remaining condition and the non-empty-dict recursion are both subtle. Could you add tests for a value exactly equal to the remaining chunk size and for an empty dict / scalar leaf that previously got dropped?
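
For the boundary case, a test needs an input whose serialized size is exactly the remaining space. A minimal sketch of one way to build such an input follows; `value_of_size` is a hypothetical helper, and it assumes the splitter measures size as `len(json.dumps(...))`, which should be confirmed against the implementation before relying on it.

```python
import json


def value_of_size(key, target):
    """Return a string value so that {key: value} serializes to `target` chars.

    Assumption: size is measured as len(json.dumps(obj)).
    """
    overhead = len(json.dumps({key: ""}))
    if target < overhead:
        raise ValueError(f"target must be at least {overhead}")
    # ASCII padding adds exactly one serialized character per character.
    return "x" * (target - overhead)


value = value_of_size("k", 20)
print(len(json.dumps({"k": value})))  # 20
```

Under the same assumption, an empty chunk `{}` takes 2 characters, so a first item of size `max_chunk_size - 2` would sit exactly on the boundary.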

Labels

external
fix: For PRs that implement a fix
size: XS (< 50 LOC)
text-splitters: Related to the package `text-splitters`

Development

Successfully merging this pull request may close these issues.

Offset by 1 bug on RecursiveJsonSplitter::split_json() function

3 participants