fix(text-splitters): fix off-by-one data loss in RecursiveJsonSplitter#35410
Closed
HARSHIL GARG (harshil562) wants to merge 3 commits into
Closed
Conversation
When a key-value pair at a chunk boundary has an empty dict value, _json_split recursed into the empty dict which produced no items, silently dropping the key. This fix checks whether the value is a non-empty dict before recursing; empty dicts and leaf values are now added directly to the current chunk. Fixes langchain-ai#29153
Add explicit dict[str, Any] type annotations to satisfy mypy var-annotated checks on complex nested dict literals.
This comment was marked as spam.
This comment was marked as spam.
This comment was marked as spam.
This comment was marked as spam.
Contributor
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Fixes an off-by-one data loss bug in
RecursiveJsonSplitter._json_split()where dictionary entries with empty dict values ({}) at chunk boundaries were silently dropped during splitting.Closes #29153
Problem
When
_json_splitencounters a key-value pair that does not fit in the current chunk, it starts a new chunk and recursively calls itself on the value. However, when the value is an empty dict{}:isinstance(data, dict)→True{}.items()produces an empty iterator, so the for-loop body never executesThis manifests as items disappearing at chunk boundaries. For example, with the input from the issue,
GTMS-10andITSAMPLE-2are completely dropped from the output because they happen to land exactly on chunk edges.Root Cause
When
value = {},_json_split({}, path, chunks)is a no-op — the for-loop over{}.items()never executes, and the key-value pair is lost.Solution
Before recursing, check whether
valueis a non-empty dict. Only non-empty dicts benefit from recursive splitting. Empty dicts and all leaf values (strings, numbers, lists,None) are added directly to the current chunk:Tradeoffs Considered
isinstance+ truthiness check before the existing recursion. No API changes, no new parameters, no breaking changes.[]) are not dicts, so they already took theelsepath in the original recursion. This fix is consistent with that behavior.Tests Added
Three new test functions covering:
test_split_json_no_data_loss_on_chunk_boundary— Reproduces the exact scenario from issue Offset by 1 bug on RecursiveJsonSplitter::split_json() function #29153 with the same input data andmax_chunk_size=216. Verifies thatGTMS-10andITSAMPLE-2are no longer dropped.test_split_json_empty_dict_values_preserved— Tests that all 20 empty dict values in a synthetic dataset survive splitting with a small chunk size.test_split_json_non_dict_leaf_values_preserved— Tests that 30 string leaf values at chunk boundaries are preserved with correct key-value mapping.All existing tests continue to pass (125 passed, 4 skipped).
Verification