Fix RecursionError in TokenTextSplitter & SentenceSplitter for units larger than chunk_size by Incheonkirin · Pull Request #21900 · run-llama/llama_index

Incheonkirin · 2026-06-06T12:56:08Z

What I hit

I was chunking Korean text (insurance policy clauses) with a small chunk_size and the splitters crashed instead of returning anything:

from llama_index.core.node_parser import TokenTextSplitter, SentenceSplitter

TokenTextSplitter(chunk_size=1, chunk_overlap=0).split_text("🚀")        # RecursionError
SentenceSplitter(chunk_size=2, chunk_overlap=0).split_text("보험" * 50)   # RecursionError

The trigger is a single indivisible unit whose token count is already larger than chunk_size. That's easy to hit with multi-token CJK characters and emoji — 🚀 is 3 tokens, and a single Korean syllable can be several.
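A rough way to see why one character can cost several tokens is to look at its UTF-8 length. This is a minimal sketch, assuming a byte-level BPE tokenizer (the description does not say which tokenizer is in use); for such a tokenizer the byte count is an upper bound on the number of tokens a single character can need. `utf8_len` is a hypothetical helper, not part of llama_index.

```python
def utf8_len(ch: str) -> int:
    # Number of bytes the character occupies in UTF-8.
    return len(ch.encode("utf-8"))


for ch in ("a", "보", "🚀"):
    # Each of these is one Python character but a different number of bytes.
    print(repr(ch), len(ch), utf8_len(ch))
```

With a chunk_size of 1 or 2, any character that the tokenizer encodes as more tokens than that cannot fit in a chunk no matter how the text is split.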

Why

_split (in both token.py and sentence.py) recurses on the same text once the split functions can no longer break it down — there's no base case, so it loops until RecursionError. Once that's guarded, TokenTextSplitter._merge runs its overlap-trim loop on the oversized split and pops from an empty cur_chunk, raising IndexError.

SentenceSplitter._merge actually already handles this case — it raises ValueError("Single token exceeded chunk size") — but the recursion was crashing long before that line was ever reached.
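The missing base case can be reproduced with a toy model. This is a simplified sketch, not the code in token.py or sentence.py: `token_count` is a hypothetical stand-in tokenizer (one token per UTF-8 byte) and `naive_split` is an illustrative name.

```python
from typing import List


def token_count(text: str) -> int:
    # Hypothetical stand-in tokenizer: one "token" per UTF-8 byte.
    return len(text.encode("utf-8"))


def naive_split(text: str, chunk_size: int) -> List[str]:
    """Toy recursive splitter with no base case for indivisible units."""
    if token_count(text) <= chunk_size:
        return [text]
    out: List[str] = []
    for piece in list(text):  # finest-grained split: single characters
        if token_count(piece) <= chunk_size:
            out.append(piece)
        else:
            # If text is already a single character, piece == text and
            # this call makes no progress.
            out.extend(naive_split(piece, chunk_size))
    return out


def overflows(text: str, chunk_size: int) -> bool:
    # True when splitting runs into Python's recursion limit.
    try:
        naive_split(text, chunk_size)
    except RecursionError:
        return True
    return False
```

Under these assumptions, `overflows("🚀", 1)` is true because the only piece of "🚀" is "🚀" itself, while `overflows("ab", 1)` is false because each character fits on its own.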

Fix

I tried to keep each splitter's existing intent rather than invent new behavior:

_split now keeps the indivisible unit as an oversized split instead of recursing. SentenceSplitter then reaches its existing ValueError, and TokenTextSplitter keeps it as a chunk (matching its existing warn-and-keep handling).
TokenTextSplitter._merge stops trimming overlap once cur_chunk is empty.
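The two changes can be sketched on a toy model. This is an illustration of the idea under stated assumptions, not the actual diff: `token_count` is a hypothetical one-token-per-byte tokenizer, and `guarded_split` and `merge` are simplified stand-ins for `_split` and `TokenTextSplitter._merge`.

```python
from typing import List


def token_count(text: str) -> int:
    # Hypothetical stand-in tokenizer: one "token" per UTF-8 byte.
    return len(text.encode("utf-8"))


def guarded_split(text: str, chunk_size: int) -> List[str]:
    if token_count(text) <= chunk_size:
        return [text]
    out: List[str] = []
    for piece in list(text):  # finest-grained split: single characters
        if token_count(piece) <= chunk_size or piece == text:
            # Either the piece fits, or splitting made no progress:
            # keep it (possibly oversized) instead of recursing.
            out.append(piece)
        else:
            out.extend(guarded_split(piece, chunk_size))
    return out


def merge(splits: List[str], chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    chunks: List[str] = []
    cur_chunk: List[str] = []
    cur_len = 0
    for split in splits:
        split_len = token_count(split)
        if cur_len + split_len > chunk_size:
            if cur_chunk:
                chunks.append("".join(cur_chunk))
            # Trim from the front for overlap, but stop once nothing is
            # left; without the cur_chunk check, pop(0) raises IndexError
            # when the incoming split alone exceeds chunk_size.
            while cur_chunk and (
                cur_len > chunk_overlap or cur_len + split_len > chunk_size
            ):
                cur_len -= token_count(cur_chunk.pop(0))
        cur_chunk.append(split)
        cur_len += split_len
    if cur_chunk:
        chunks.append("".join(cur_chunk))
    return chunks
```

In this sketch `merge(guarded_split("a🚀b", 1), 1)` yields three chunks with the emoji kept whole as an oversized chunk, and inputs whose splits all fit behave as before.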

Normal inputs are unaffected. Added regression tests in tests/text_splitter/test_splitter_oversized_unit.py covering the oversized cases and a check that ordinary text still splits the same way.

Note

There's also a separate, pre-existing malformed logging.warning call in TokenTextSplitter._merge (two positional args with no % placeholder) that raises TypeError when it actually fires. That one already has open PRs (#21796, #21363), so I left it untouched here and just raised that logger's level in the new TokenTextSplitter test to keep this change focused on the recursion. Happy to rebase around whichever of those lands first.
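The logging problem can be shown with the standard library alone. This is a minimal sketch: the message strings and the logger name `demo.splitter` are illustrative assumptions, not copied from the library. Whether the error reaches the caller depends on the handlers installed, so the example calls `LogRecord.getMessage()` directly, which is where the %-formatting happens.

```python
import logging


def render(record: logging.LogRecord) -> str:
    # getMessage() applies record.args to record.msg with the % operator.
    try:
        return record.getMessage()
    except TypeError as exc:
        return f"TypeError: {exc}"


# Malformed: an extra positional argument, but no % placeholder in msg.
bad = logging.LogRecord(
    "demo", logging.WARNING, "demo.py", 0,
    "Got a split of size 4, ", ("larger than chunk size 1.",), None,
)

# Well-formed: placeholders match the arguments.
good = logging.LogRecord(
    "demo", logging.WARNING, "demo.py", 0,
    "Got a split of size %d, larger than chunk size %d.", (4, 1), None,
)

print(render(bad))
print(render(good))

# Raising the logger's level skips the call before any record is built,
# which is the kind of workaround the new test relies on.
quiet = logging.getLogger("demo.splitter")  # hypothetical logger name
quiet.setLevel(logging.ERROR)
quiet.warning("Got a split of size 4, ", "larger than chunk size 1.")
```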

… for oversized units A single indivisible unit larger than chunk_size (e.g. a multi-token CJK or emoji character with a small chunk_size) made _split recurse on the same text forever (RecursionError), after which TokenTextSplitter._merge popped from an empty list (IndexError). _split now keeps such a unit as an oversized split instead of recursing, so SentenceSplitter reaches its existing 'Single token exceeded chunk size' ValueError and TokenTextSplitter keeps it as a chunk. _merge stops trimming overlap once cur_chunk is empty. Adds regression tests; normal inputs unchanged.

dosubot (bot) added the size:S label (this PR changes 10-29 lines, ignoring generated files) on Jun 6, 2026

Incheonkirin wants to merge 1 commit into run-llama:main from Incheonkirin:fix/splitter-oversized-unit-recursion
