Fix RecursionError in TokenTextSplitter & SentenceSplitter for units larger than chunk_size#21900
Open
Incheonkirin wants to merge 1 commit into
Open
Conversation
… for oversized units A single indivisible unit larger than chunk_size (e.g. a multi-token CJK or emoji character with a small chunk_size) made _split recurse on the same text forever (RecursionError), after which TokenTextSplitter._merge popped from an empty list (IndexError). _split now keeps such a unit as an oversized split instead of recursing, so SentenceSplitter reaches its existing 'Single token exceeded chunk size' ValueError and TokenTextSplitter keeps it as a chunk. _merge stops trimming overlap once cur_chunk is empty. Adds regression tests; normal inputs unchanged.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What I hit
I was chunking Korean text (insurance policy clauses) with a small
chunk_sizeand the splitters crashed instead of returning anything:The trigger is a single indivisible unit whose token count is already larger than
chunk_size. That's easy to hit with multi-token CJK characters and emoji —🚀is 3 tokens, and a single Korean syllable can be several.Why
_split(in bothtoken.pyandsentence.py) recurses on the same text once the split functions can no longer break it down — there's no base case, so it loops untilRecursionError. Once that's guarded,TokenTextSplitter._mergeruns its overlap-trim loop on the oversized split and pops from an emptycur_chunk, raisingIndexError.SentenceSplitter._mergeactually already handles this case — it raisesValueError("Single token exceeded chunk size")— but the recursion was crashing long before that line was ever reached.Fix
I tried to keep each splitter's existing intent rather than invent new behavior:
_splitnow keeps the indivisible unit as an oversized split instead of recursing.SentenceSplitterthen reaches its existingValueError, andTokenTextSplitterkeeps it as a chunk (matching its existing warn-and-keep handling).TokenTextSplitter._mergestops trimming overlap oncecur_chunkis empty.Normal inputs are unaffected. Added regression tests in
tests/text_splitter/test_splitter_oversized_unit.pycovering the oversized cases and a check that ordinary text still splits the same way.Note
There's also a separate, pre-existing malformed
logging.warningcall inTokenTextSplitter._merge(two positional args with no%placeholder) that raisesTypeErrorwhen it actually fires. That one already has open PRs (#21796, #21363), so I left it untouched here and just raised that logger's level in the new TokenTextSplitter test to keep this change focused on the recursion. Happy to rebase around whichever of those lands first.