fix(text-splitters): fix off-by-one data loss in RecursiveJsonSplitter #35410

Closed
HARSHIL GARG (harshil562) wants to merge 3 commits into langchain-ai:master from harshil562:fix/recursive-json-splitter-off-by-one-data-loss

@harshil562

Description

Fixes an off-by-one data loss bug in RecursiveJsonSplitter._json_split() where dictionary entries with empty dict values ({}) at chunk boundaries were silently dropped during splitting.

Closes #29153

Problem

When _json_split encounters a key-value pair that does not fit in the current chunk, it starts a new chunk and recursively calls itself on the value. However, when the value is an empty dict {}:

  1. The recursion enters the isinstance(data, dict) branch, because the check evaluates to True
  2. {}.items() produces an empty iterator, so the for-loop body never executes
  3. The key is never added to any chunk → silent data loss

This manifests as items disappearing at chunk boundaries. For example, with the input from the issue, GTMS-10 and ITSAMPLE-2 are completely dropped from the output because they happen to land exactly on chunk edges.

Root Cause

# Before (lines 109-110):
# Iterate
self._json_split(value, new_path, chunks)

When value = {}, _json_split({}, path, chunks) is a no-op — the for-loop over {}.items() never executes, and the key-value pair is lost.
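The no-op can be reproduced outside the library with a small stand-in. The sketch below is a simplified, hypothetical splitter (`naive_split` and `set_nested` are illustrative names, sizes are measured with `len(json.dumps(...))`, and there is no minimum chunk size), not the actual `RecursiveJsonSplitter` code. It only mirrors the pattern described above: recurse unconditionally when a pair does not fit.

```python
import json
from typing import Any, Optional


def set_nested(target: dict[str, Any], path: list[str], value: Any) -> None:
    """Store value under the nested key path, creating intermediate dicts."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def naive_split(
    data: Any,
    max_size: int,
    path: Optional[list[str]] = None,
    chunks: Optional[list[dict[str, Any]]] = None,
) -> list[dict[str, Any]]:
    """Toy splitter (hypothetical): always recurses when a pair does not fit."""
    path = path or []
    chunks = chunks if chunks is not None else [{}]
    if isinstance(data, dict):
        for key, value in data.items():
            new_path = [*path, key]
            used = len(json.dumps(chunks[-1]))
            needed = len(json.dumps({key: value}))
            if needed < max_size - used:
                set_nested(chunks[-1], new_path, value)
            else:
                chunks.append({})
                # For value == {}, this call iterates over nothing and returns,
                # so new_path is never written to any chunk.
                naive_split(value, max_size, new_path, chunks)
    else:
        # Leaf value reached through recursion: write it at its path.
        set_nested(chunks[-1], path, data)
    return chunks


data = {"a": "x" * 10, "b": {}, "c": "y"}
chunks = naive_split(data, max_size=26)
surviving_keys = {key for chunk in chunks for key in chunk}
print(chunks)
print(sorted(surviving_keys))  # "b" is absent
```

With `max_size=26`, `"b": {}` is the first pair that does not fit, so it lands on the chunk boundary and disappears, while the string leaf `"c"` is kept.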

Solution

Before recursing, check whether value is a non-empty dict. Only non-empty dicts benefit from recursive splitting. Empty dicts and all leaf values (strings, numbers, lists, None) are added directly to the current chunk:

if isinstance(value, dict) and value:
    # Non-empty dict: recurse to split further
    self._json_split(value, new_path, chunks)
else:
    # Leaf value or empty dict: add directly to
    # avoid data loss from recursing into an empty
    # iterator
    self._set_nested_dict(chunks[-1], new_path, value)
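To see the effect of the guard in isolation, here is a self-contained sketch that applies the same check to a simplified, hypothetical splitter. `guarded_split` and `set_nested` are illustrative names, not library functions, and the size accounting (`len(json.dumps(...))`, no minimum chunk size) is an assumption made to keep the example short.

```python
import json
from typing import Any, Optional


def set_nested(target: dict[str, Any], path: list[str], value: Any) -> None:
    """Store value under the nested key path, creating intermediate dicts."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def guarded_split(
    data: Any,
    max_size: int,
    path: Optional[list[str]] = None,
    chunks: Optional[list[dict[str, Any]]] = None,
) -> list[dict[str, Any]]:
    """Toy splitter (hypothetical): recurses only into non-empty dicts."""
    path = path or []
    chunks = chunks if chunks is not None else [{}]
    if isinstance(data, dict):
        for key, value in data.items():
            new_path = [*path, key]
            used = len(json.dumps(chunks[-1]))
            needed = len(json.dumps({key: value}))
            if needed < max_size - used:
                set_nested(chunks[-1], new_path, value)
            else:
                chunks.append({})
                if isinstance(value, dict) and value:
                    # Non-empty dict: recurse to split it further.
                    guarded_split(value, max_size, new_path, chunks)
                else:
                    # Leaf value or empty dict: write it directly.
                    set_nested(chunks[-1], new_path, value)
    else:
        set_nested(chunks[-1], path, data)
    return chunks


data = {"a": "x" * 10, "b": {}, "c": "y"}
fixed_chunks = guarded_split(data, max_size=26)
print(fixed_chunks)
```

On the same input where an unconditional recursion would drop `"b"`, the empty dict is now written into the new chunk and every top-level key survives.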

Tradeoffs Considered

  • Minimal change: The fix adds a single isinstance + truthiness check before the existing recursion. No API changes, no new parameters, no breaking changes.
  • Correctness over performance: While this also avoids an unnecessary recursive call for leaf values, the primary motivation is correctness (preventing data loss).
  • Empty list handling: Empty lists ([]) are not dicts, so they already took the else path in the original recursion. This fix is consistent with that behavior.

Tests Added

Three new test functions covering:

  1. test_split_json_no_data_loss_on_chunk_boundary: Reproduces the exact scenario from issue #29153 ("Offset by 1 bug on RecursiveJsonSplitter::split_json() function") with the same input data and max_chunk_size=216. Verifies that GTMS-10 and ITSAMPLE-2 are no longer dropped.
  2. test_split_json_empty_dict_values_preserved — Tests that all 20 empty dict values in a synthetic dataset survive splitting with a small chunk size.
  3. test_split_json_non_dict_leaf_values_preserved — Tests that 30 string leaf values at chunk boundaries are preserved with correct key-value mapping.
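The tests listed above all assert the same property: every key path in the input still appears somewhere in the output chunks. A rough sketch of such a check follows. `iter_leaves` and `missing_paths` are hypothetical helpers written for illustration, not code from this PR's test file; the check treats an empty dict as a leaf and compares paths only, not values.

```python
from typing import Any, Iterator


def iter_leaves(
    data: dict[str, Any], prefix: tuple[str, ...] = ()
) -> Iterator[tuple[str, ...]]:
    """Yield the key path of every leaf; an empty dict counts as a leaf."""
    for key, value in data.items():
        path = (*prefix, key)
        if isinstance(value, dict) and value:
            yield from iter_leaves(value, path)
        else:
            yield path


def missing_paths(
    original: dict[str, Any], chunks: list[dict[str, Any]]
) -> list[tuple[str, ...]]:
    """Return leaf paths of `original` that appear in none of the chunks."""
    kept = {path for chunk in chunks for path in iter_leaves(chunk)}
    return [path for path in iter_leaves(original) if path not in kept]


original = {"a": {"x": 1, "y": {}}, "b": {}}
lossy_chunks = [{"a": {"x": 1}}, {"b": {}}]           # "a.y" was dropped
complete_chunks = [{"a": {"x": 1, "y": {}}}, {"b": {}}]

print(missing_paths(original, lossy_chunks))
print(missing_paths(original, complete_chunks))
```

In a regression test, `chunks` would be whatever the splitter under test returns, and the assertion would be that `missing_paths` is empty.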

All existing tests continue to pass (125 passed, 4 skipped).

Verification

$ uv run pytest tests/unit_tests/test_text_splitters.py -v
125 passed, 4 skipped
$ uv run ruff check langchain_text_splitters/json.py tests/unit_tests/test_text_splitters.py
All checks passed!

Disclaimer: This contribution was developed with the assistance of AI agents for analysis and implementation.

When a key-value pair at a chunk boundary has an empty dict value,
_json_split recursed into the empty dict which produced no items,
silently dropping the key. This fix checks whether the value is a
non-empty dict before recursing; empty dicts and leaf values are
now added directly to the current chunk.

Fixes langchain-ai#29153
@github-actions github-actions Bot added the external, text-splitters, and fix labels and removed the external label Feb 23, 2026
Add explicit dict[str, Any] type annotations to satisfy mypy
var-annotated checks on complex nested dict literals.
@github-actions github-actions Bot added the size: S 50-199 LOC label Mar 9, 2026
@open-swe

open-swe Bot commented Jun 11, 2026

Contributor

Thanks for catching this! #35649 fixes the same data-loss issue (and additionally corrects the boundary comparison) and currently merges cleanly, whereas this branch has conflicts. Closing in favor of #35649 — your regression tests would be a welcome addition there.

@open-swe open-swe Bot closed this Jun 11, 2026