
fix(core): use word tokenization for stopword removal in SemanticDoubleMergingSplitterNodeParser#22167

Open
rautaditya2606 wants to merge 1 commit into
run-llama:mainfrom
rautaditya2606:fix/semantic-splitter-stopword-removal

@rautaditya2606

Description

Fix stopword removal in SemanticDoubleMergingSplitterNodeParser.

The internal _clean_text_advanced method called the sentence-level punkt_tokenizer.tokenize on punctuation-free text. With no sentence boundaries to split on, the tokenizer returned the entire string as a single token. That token never matches a stopword, so the stopword filter removed nothing. This PR switches to re.findall(r"\w+", text) to extract word-level tokens, matching existing word-tokenization conventions in the repository.
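A minimal sketch of the failure mode described above, not the parser's actual code. The sentence_tokenize helper is a hypothetical stand-in for a sentence-level tokenizer (it splits only after ., ! or ?), and STOPWORDS and remove_stopwords are illustrative names invented for this example.

```python
import re

# Hypothetical, tiny stopword set used only for this illustration.
STOPWORDS = {"the", "is", "a", "of", "on"}


def sentence_tokenize(text: str) -> list[str]:
    """Stand-in for a sentence-level tokenizer: splits only after ., ! or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def remove_stopwords(tokens: list[str]) -> str:
    """Drop tokens that exactly match a stopword and rejoin the rest."""
    return " ".join(t for t in tokens if t not in STOPWORDS)


# Text as it looks after punctuation has been stripped.
text = "the cat is on the mat"

# Sentence-level split: no terminators, so the whole string is one token.
before = remove_stopwords(sentence_tokenize(text))

# Word-level split, as proposed in this PR.
after = remove_stopwords(re.findall(r"\w+", text))

print(before)  # the cat is on the mat
print(after)   # cat mat
```

Because the sentence-level split yields one token equal to the full string, no exact stopword match is possible. The word-level split gives the filter individual words to compare.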

Fixes #22166

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No (Changes are in llama-index-core)

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

  • I added new unit tests to cover this change
  • I believe this change is already covered by existing unit tests

Validated locally by running:

uv run pytest llama-index-core/tests/node_parser/test_semantic_double_merging_splitter.py

Result: 4 passed, 6 skipped.

Suggested Checklist

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective
  • New and existing unit tests pass locally with my changes
  • I ran uv run ruff check and uv run black --check

dosubot (bot) added the size:XS label ("This PR changes 0-9 lines, ignoring generated files.") on Jun 27, 2026
Development

Successfully merging this pull request may close these issues.

[Bug]: Stopword removal is ineffective in SemanticDoubleMergingSplitterNodeParser
