Fix SemanticDoubleMergingSplitter stopword cleanup by fengjikui · Pull Request #22168 · run-llama/llama_index

fengjikui · 2026-06-27T16:35:40Z

Summary

- replace sentence-level Punkt tokenization with word-level splitting after _clean_text_advanced lowercases and removes punctuation
- add a regression test showing stopwords are removed from SemanticDoubleMergingSplitterNodeParser cleaned text

Root cause

_clean_text_advanced stripped punctuation before calling the Punkt sentence tokenizer. With no sentence-boundary punctuation left, Punkt returned the whole text as one token, so individual stopwords were never compared or removed.

Fixes #22166.
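
The root cause described above can be reproduced without the library. The following is a minimal sketch, not the actual `_clean_text_advanced` code: `naive_sentence_split` is a hypothetical regex stand-in for the Punkt sentence tokenizer, `STOPWORDS` is a tiny illustrative set, and the plain whitespace split in `clean_fixed` is an assumption about what "word-level splitting" looks like in the patch.

```python
import re
import string

# Hypothetical, illustrative stopword set (the real code uses a full list).
STOPWORDS = {"this", "is", "a", "the", "and"}


def naive_sentence_split(text: str) -> list[str]:
    """Rough stand-in for Punkt: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def strip_punctuation(text: str) -> str:
    """Lowercase and drop ASCII punctuation, as the cleanup step is described."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def clean_buggy(text: str) -> str:
    # Sentence-level tokens after punctuation removal: no boundaries are left,
    # so the whole text is a single token that never equals any stopword.
    tokens = naive_sentence_split(strip_punctuation(text))
    return " ".join(t for t in tokens if t not in STOPWORDS)


def clean_fixed(text: str) -> str:
    # Word-level tokens: each word is compared against the stopword set.
    tokens = strip_punctuation(text).split()
    return " ".join(t for t in tokens if t not in STOPWORDS)


sample = "This is a test. The text is containing stopwords, and more!"
print(clean_buggy(sample))  # stopwords survive
print(clean_fixed(sample))  # stopwords removed
```

The only difference between the two functions is the tokenizer applied after punctuation is stripped, which is the behavior change this PR describes.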

Validation

- uv run --project llama-index-core python - <<'PY' ... reproduced the issue before the fix: the output still contained stopwords and the assertion failed
- uv run --project llama-index-core python - <<'PY' ... after the fix: the output was "test text containing stopwords like"
- uv run --project llama-index-core pytest llama-index-core/tests/node_parser/test_semantic_double_merging_splitter.py -q -> 4 passed, 6 skipped
- uv run --project llama-index-core ruff check llama-index-core/llama_index/core/node_parser/text/semantic_double_merging_splitter.py llama-index-core/tests/node_parser/test_semantic_double_merging_splitter.py
- uv run --project llama-index-core ruff format --check llama-index-core/llama_index/core/node_parser/text/semantic_double_merging_splitter.py llama-index-core/tests/node_parser/test_semantic_double_merging_splitter.py

AI assistance was used to identify and draft this small fix; I reviewed the code path and verified it locally.

fix semantic splitter stopword cleanup

07b0d0e

dosubot (bot) added the size:XS label (This PR changes 0-9 lines, ignoring generated files.) on Jun 27, 2026

fengjikui wants to merge 1 commit into run-llama:main from fengjikui:codex/llama-index-stopword-tokenization
