Fix RecursionError in TokenTextSplitter & SentenceSplitter for units larger than chunk_size #21900

Open
Incheonkirin wants to merge 1 commit into run-llama:main from Incheonkirin:fix/splitter-oversized-unit-recursion

What I hit

I was chunking Korean text (insurance policy clauses) with a small chunk_size and the splitters crashed instead of returning anything:

from llama_index.core.node_parser import TokenTextSplitter, SentenceSplitter

TokenTextSplitter(chunk_size=1, chunk_overlap=0).split_text("🚀")        # RecursionError
SentenceSplitter(chunk_size=2, chunk_overlap=0).split_text("보험" * 50)   # RecursionError

The trigger is a single indivisible unit whose token count is already larger than chunk_size. That's easy to hit with multi-token CJK characters and emoji — 🚀 is 3 tokens, and a single Korean syllable can be several.
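
As a rough illustration of why one visible character can cost several tokens: byte-level BPE tokenizers start from UTF-8 bytes, so a character's byte length is an upper bound on how many tokens it can split into. The sketch below only counts bytes with the standard library; it is not the tokenizer the splitters use, and actual token counts depend on that tokenizer's learned merges.

```python
def utf8_len(ch: str) -> int:
    """Number of UTF-8 bytes used to encode a single character."""
    return len(ch.encode("utf-8"))


# ASCII letter, a Korean syllable, and an emoji.
byte_counts = {ch: utf8_len(ch) for ch in ("a", "보", "🚀")}
print(byte_counts)
```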

Why

_split (in both token.py and sentence.py) recurses on the same text once the split functions can no longer break it down — there's no base case, so it loops until RecursionError. Once that's guarded, TokenTextSplitter._merge runs its overlap-trim loop on the oversized split and pops from an empty cur_chunk, raising IndexError.
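
The failure mode can be reproduced with a toy splitter. This is a simplified sketch, not the code in token.py or sentence.py: `count_tokens`, `naive_split`, and the byte-based token count are all hypothetical stand-ins chosen so the example runs without any dependencies.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Hypothetical token counter: one token per UTF-8 byte."""
    return len(text.encode("utf-8"))


def naive_split(
    text: str, chunk_size: int, split_fn: Callable[[str], List[str]]
) -> List[str]:
    """Recursive split with no base case for indivisible units."""
    if count_tokens(text) <= chunk_size:
        return [text]
    out: List[str] = []
    for piece in split_fn(text):
        if count_tokens(piece) <= chunk_size:
            out.append(piece)
        else:
            # If split_fn could not break the text down, piece == text
            # and this call repeats with identical arguments.
            out.extend(naive_split(piece, chunk_size, split_fn))
    return out


try:
    # list("🚀") is ["🚀"]: a single character cannot be split further.
    naive_split("🚀", 1, list)
    outcome = "returned"
except RecursionError:
    outcome = "RecursionError"

print(outcome)
```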

SentenceSplitter._merge already handles this case by raising ValueError("Single token exceeded chunk size"), but the recursion crashed long before that line was ever reached.

Fix

I tried to keep each splitter's existing intent rather than invent new behavior:

  • _split now keeps the indivisible unit as an oversized split instead of recursing. SentenceSplitter then reaches its existing ValueError, and TokenTextSplitter keeps it as a chunk (matching its existing warn-and-keep handling).
  • TokenTextSplitter._merge stops trimming overlap once cur_chunk is empty.
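
The two changes above can be sketched on a toy model. This is an illustration under stated assumptions, not the patch itself: `guarded_split`, `trim_overlap`, and the byte-based `count_tokens` are hypothetical names, and the real `_merge` loop in llama_index may be structured differently.

```python
from typing import Callable, List, Tuple


def count_tokens(text: str) -> int:
    """Hypothetical token counter: one token per UTF-8 byte."""
    return len(text.encode("utf-8"))


def guarded_split(
    text: str, chunk_size: int, split_fn: Callable[[str], List[str]]
) -> List[str]:
    """Recursive split that keeps an indivisible unit as an oversized split."""
    if count_tokens(text) <= chunk_size:
        return [text]
    pieces = split_fn(text)
    if len(pieces) <= 1:
        # Base case: nothing left to break down, so keep the unit as-is
        # and let the merge step decide what to do with it.
        return [text]
    out: List[str] = []
    for piece in pieces:
        out.extend(guarded_split(piece, chunk_size, split_fn))
    return out


def trim_overlap(
    cur_chunk: List[str], cur_len: int, overlap: int
) -> Tuple[List[str], int]:
    """Drop leading splits until within the overlap budget.

    The `cur_chunk and` guard stops the loop once the chunk is empty,
    instead of popping from an empty list.
    """
    while cur_chunk and cur_len > overlap:
        first = cur_chunk.pop(0)
        cur_len -= count_tokens(first)
    return cur_chunk, cur_len


print(guarded_split("ab🚀", 1, list))
print(trim_overlap([], 4, 0))
```

In this toy model the oversized "🚀" survives as its own split, which is the point at which a merge step can either raise (the SentenceSplitter behavior described above) or warn and keep it (the TokenTextSplitter behavior).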

Normal inputs are unaffected. Added regression tests in tests/text_splitter/test_splitter_oversized_unit.py covering the oversized cases and a check that ordinary text still splits the same way.

Note

There's also a separate, pre-existing malformed logging.warning call in TokenTextSplitter._merge (two positional args with no % placeholder) that raises TypeError when it actually fires. That one already has open PRs (#21796, #21363), so I left it untouched here and just raised that logger's level in the new TokenTextSplitter test to keep this change focused on the recursion. Happy to rebase around whichever of those lands first.
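
For context on that TypeError: the logging module formats a record with `msg % args`, so extra positional arguments without a matching placeholder fail at format time. The sketch below builds records directly to show this. The message text is made up and is not the actual call in TokenTextSplitter._merge, and whether a live logging call surfaces the error depends on the handler in use.

```python
import logging

# Message with no % placeholder, plus one extra positional argument.
bad = logging.LogRecord(
    "demo", logging.WARNING, "demo.py", 1,
    "chunk is too large", ("extra detail",), None,
)
try:
    bad.getMessage()
    error = None
except TypeError as exc:
    error = str(exc)

# Same argument, but the message now has a matching %s placeholder.
good = logging.LogRecord(
    "demo", logging.WARNING, "demo.py", 1,
    "chunk is too large: %s", ("extra detail",), None,
)
formatted = good.getMessage()

print(error)
print(formatted)
```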

… for oversized units

A single indivisible unit larger than chunk_size (e.g. a multi-token CJK or
emoji character with a small chunk_size) made _split recurse on the same text
forever (RecursionError), after which TokenTextSplitter._merge popped from an
empty list (IndexError).

_split now keeps such a unit as an oversized split instead of recursing, so
SentenceSplitter reaches its existing 'Single token exceeded chunk size'
ValueError and TokenTextSplitter keeps it as a chunk. _merge stops trimming
overlap once cur_chunk is empty. Adds regression tests; normal inputs unchanged.

@dosubot added the size:S label (This PR changes 10-29 lines, ignoring generated files.) on Jun 6, 2026.