fix(core): use word tokenization for stopword removal in SemanticDoubleMergingSplitterNodeParser by rautaditya2606 · Pull Request #22167 · run-llama/llama_index

rautaditya2606 · 2026-06-27T14:25:31Z

Description

Fix stopword removal in SemanticDoubleMergingSplitterNodeParser.

The internal _clean_text_advanced method called the sentence-level punkt_tokenizer.tokenize on text that was already punctuation-free. With no sentence boundaries left to split on, the entire string was returned as a single token, so the stopword filter compared the whole text against the stopword list and was ineffective. This PR switches to re.findall(r"\w+", text) to extract word-level tokens, matching existing word-tokenization conventions in the repository.
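To illustrate the failure mode described above, here is a minimal, self-contained sketch. It does not use the actual parser or NLTK: `sentence_level_tokens` is a hypothetical stand-in that, like a sentence tokenizer, only splits after sentence-ending punctuation, and `STOPWORDS` is a small made-up set. Only the `re.findall(r"\w+", text)` call is taken from the PR.

```python
import re
from typing import List

# Hypothetical stopword set for illustration only; the real parser uses its own list.
STOPWORDS = {"the", "is", "a", "of", "on"}


def sentence_level_tokens(text: str) -> List[str]:
    """Stand-in for a sentence tokenizer: splits only after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def word_level_tokens(text: str) -> List[str]:
    """Word-level tokenization as proposed in the PR."""
    return re.findall(r"\w+", text)


def remove_stopwords(tokens: List[str]) -> str:
    """Drop every token that exactly matches a stopword, then rejoin."""
    return " ".join(t for t in tokens if t not in STOPWORDS)


# Text as it would look after punctuation has already been stripped.
cleaned = "the cat is on the mat"

# Sentence-level splitting finds no boundary, so the whole string is one token
# and no stopword matches it.
before = remove_stopwords(sentence_level_tokens(cleaned))

# Word-level splitting yields individual words, so stopwords are filtered out.
after = remove_stopwords(word_level_tokens(cleaned))

print(repr(before))  # 'the cat is on the mat'
print(repr(after))   # 'cat mat'
```

The sketch only shows why the choice of token granularity matters for an exact-match stopword filter; the real method may apply further cleaning steps that are not shown in this PR description.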

Fixes #22166

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

Yes
No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

Yes
No (Changes are in llama-index-core)

Type of Change

Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

I added new unit tests to cover this change
I believe this change is already covered by existing unit tests

Validated locally by running:

uv run pytest llama-index-core/tests/node_parser/test_semantic_double_merging_splitter.py

Result: 4 passed, 6 skipped.

Suggested Checklist

I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added Google Colab support for the newly added notebooks
My changes generate no new warnings
I have added tests that prove my fix is effective
New and existing unit tests pass locally with my changes
I ran uv run ruff check and uv run black --check

fix(core): fix stopword filtering in semantic double merging splitter

41af100

dosubot (bot) added the size:XS label (this PR changes 0-9 lines, ignoring generated files) on Jun 27, 2026

rautaditya2606 wants to merge 1 commit into run-llama:main from rautaditya2606:fix/semantic-splitter-stopword-removal
