Skip to content

fix: off-by-one error in RecursiveJsonSplitter.split_json#35649

Open
Ethan T. (gambletan) wants to merge 1 commit intolangchain-ai:masterfrom
gambletan:fix/json-splitter-off-by-one
Open

fix: off-by-one error in RecursiveJsonSplitter.split_json#35649
Ethan T. (gambletan) wants to merge 1 commit intolangchain-ai:masterfrom
gambletan:fix/json-splitter-off-by-one

Conversation

@gambletan
Copy link
Copy Markdown
Contributor

Summary

Fixes #29153

Two bugs in RecursiveJsonSplitter._json_split() that cause data loss at chunk boundaries:

  1. Off-by-one boundary check: Changed size < remaining to size <= remaining so items that fit exactly at the chunk size boundary are added to the current chunk instead of being pushed to a new one.

  2. Empty dict/leaf value loss: When a value doesn't fit in the current chunk and a new chunk is started, the code recursed into the value. For empty dicts {}, the for key, value in data.items() loop has zero iterations, silently dropping the key-value pair from all chunks. Added explicit handling to directly set leaf values and empty dicts instead of recursing.

Reproduction

from langchain_text_splitters import RecursiveJsonSplitter

data = {
    "projects": {
        "GTMS": {f"GTMS-{i}": {} for i in range(1, 23)},
        "ITSAMPLE": {f"ITSAMPLE-{i}": {} for i in range(1, 4)},
    }
}

splitter = RecursiveJsonSplitter(max_chunk_size=300)
chunks = splitter.split_json(data)

# Before fix: GTMS-10 and ITSAMPLE-2 are lost at chunk boundaries
# After fix: all items are present in chunks

Test plan

🤖 Generated with Claude Code

Two issues fixed:
1. Changed `size < remaining` to `size <= remaining` so items that fit
   exactly at the boundary are added to the current chunk instead of
   being pushed to a new one.
2. Added explicit handling for empty dicts and leaf values when they
   don't fit in the current chunk. Previously, recursing into an empty
   dict `{}` caused the for loop to have zero iterations, silently
   dropping the key-value pair from all chunks.

Fixes langchain-ai#29153

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added text-splitters Related to the package `text-splitters` external fix For PRs that implement a fix labels Mar 8, 2026
@org-membership-reviewer org-membership-reviewer bot added the size: XS < 50 LOC label Mar 9, 2026
@nidhishgajjar
Copy link
Copy Markdown

Orb Code Review (powered by GLM 5.1 on Orb Cloud)

PR #35649: fix: off-by-one error in RecursiveJsonSplitter.split_json

Findings

1. Off-by-one fix (<<=) ✅ Looks correct
Changing if size < remaining to if size <= remaining is the right call. When size == remaining, the item fits exactly within the remaining chunk capacity and should be included in the current chunk rather than unnecessarily starting a new one. The old behavior would produce an extra chunk and slightly suboptimal splitting.

2. Leaf value / empty dict handling ✅ Looks correct
This is the more impactful fix. In the original code, when an item exceeds the remaining chunk space:

  • The code creates a new chunk (chunks.append({}))
  • Then unconditionally recurses via self._json_split(value, new_path, chunks)

The problem: if value is an empty dict {}, _json_split enters the isinstance(data, dict) branch, iterates over zero items, and silently drops the empty dict — it's never added to any chunk. This is a data loss bug.

The fix correctly handles this by checking whether value is a non-empty dict before recursing, and otherwise adding it directly to the current chunk via _set_nested_dict. This also avoids an unnecessary function call for leaf values (strings, numbers, etc.).

Minor observations

  • Consider adding a test case that exercises empty dict values and values that exactly fill a chunk to prevent regression.

Summary: A clean, focused fix addressing two real issues: an off-by-one boundary condition and silent data loss for empty dict values during JSON splitting.

Assessment: approve

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

external fix For PRs that implement a fix size: XS < 50 LOC text-splitters Related to the package `text-splitters`

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Offset by 1 bug on RecursiveJsonSplitter::split_json() function

2 participants