Fix SemanticDoubleMergingSplitter stopword cleanup #22168

Open
fengjikui wants to merge 1 commit into run-llama:main from fengjikui:codex/llama-index-stopword-tokenization
Conversation

@fengjikui

Summary

  • replace sentence-level Punkt tokenization with word-level splitting after _clean_text_advanced lowercases and removes punctuation
  • add a regression test showing stopwords are removed from SemanticDoubleMergingSplitterNodeParser cleaned text

Root cause

_clean_text_advanced stripped punctuation before calling the Punkt sentence tokenizer. With no sentence-boundary punctuation left, Punkt returned the whole text as one token, so individual stopwords were never compared or removed.

Fixes #22166.
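
The failure mode described above can be reproduced without NLTK. The following is a minimal sketch, not the code from this PR: `STOPWORDS` is a tiny hypothetical list, `naive_sentence_split` is a regex stand-in for a sentence tokenizer (it is not Punkt), and `clean_buggy` / `clean_fixed` are illustrative names rather than functions in llama-index.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list (assumption for illustration).
STOPWORDS = {"this", "is", "a", "the", "and"}


def strip_and_lower(text: str) -> str:
    """Lowercase the text and delete all ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def naive_sentence_split(text: str) -> list[str]:
    """Stand-in for a sentence tokenizer: splits only after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def clean_buggy(text: str) -> str:
    """Sentence-level tokenization after punctuation removal.

    With no sentence boundaries left, the whole text comes back as one
    token, which never matches a single stopword.
    """
    tokens = naive_sentence_split(strip_and_lower(text))
    return " ".join(t for t in tokens if t not in STOPWORDS)


def clean_fixed(text: str) -> str:
    """Word-level splitting, so each word is compared against the list."""
    words = strip_and_lower(text).split()
    return " ".join(w for w in words if w not in STOPWORDS)


if __name__ == "__main__":
    sample = "This is a test. The text is short."
    print(clean_buggy(sample))
    print(clean_fixed(sample))
```

In this sketch the buggy variant returns the lowercased text with every stopword still present, while the word-level variant drops them. The real tokenizer and stopword list differ, but the ordering problem (punctuation removed first, sentence tokenization second) is the same.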

Validation

  • uv run --project llama-index-core python - <<'PY' ... reproduced the issue before the fix: output still contained stopwords and the assertion failed
  • uv run --project llama-index-core python - <<'PY' ... after the fix: the output was "test text containing stopwords like", with the stopwords removed
  • uv run --project llama-index-core pytest llama-index-core/tests/node_parser/test_semantic_double_merging_splitter.py -q -> 4 passed, 6 skipped
  • uv run --project llama-index-core ruff check llama-index-core/llama_index/core/node_parser/text/semantic_double_merging_splitter.py llama-index-core/tests/node_parser/test_semantic_double_merging_splitter.py
  • uv run --project llama-index-core ruff format --check llama-index-core/llama_index/core/node_parser/text/semantic_double_merging_splitter.py llama-index-core/tests/node_parser/test_semantic_double_merging_splitter.py

AI assistance was used to identify and draft this small fix; I reviewed the code path and verified it locally.

dosubot (bot) added the size:XS label (This PR changes 0-9 lines, ignoring generated files.) on Jun 27, 2026
Development

Successfully merging this pull request may close these issues.

[Bug]: Stopword removal is ineffective in SemanticDoubleMergingSplitterNodeParser