[Bug]: Stopword removal is ineffective in SemanticDoubleMergingSplitterNodeParser #22166

@rautaditya2606

Bug Description

In the SemanticDoubleMergingSplitterNodeParser text splitter, the internal _clean_text_advanced method attempts to clean strings and remove stopwords before calculating sentence similarity (when using the SpaCy backend).

However, stopword removal currently has no effect because of a tokenization bug:

  1. Punctuation is stripped from the text first:
    text = text.translate(str.maketrans("", "", string.punctuation))
  2. The method then tokenizes the text using a sentence tokenizer:
    tokens = globals_helper.punkt_tokenizer.tokenize(text)
  3. Since all punctuation has already been stripped, the sentence tokenizer treats the entire text block as a single sentence. Thus, tokens contains only one element—the entire string.
  4. Consequently, the check w not in self.language_config.stopwords is performed against the entire text block as a single string (rather than word-by-word), so individual stopwords are never filtered out.
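The sequence above can be sketched without the library. This is only an illustration under stated assumptions: `naive_sentence_tokenize` is a hypothetical stand-in for a sentence tokenizer (it splits after `.`, `!` or `?`), and `STOPWORDS` is a tiny hand-picked set rather than the list the library loads. It is not the code of `_clean_text_advanced`, only the same order of operations.

```python
import re
import string

# Small illustrative stopword set (assumption; the real list is much larger).
STOPWORDS = {"this", "is", "a", "some", "the", "and"}


def naive_sentence_tokenize(text: str) -> list:
    """Hypothetical stand-in for a sentence tokenizer: split after ., ! or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def clean_like_reported(text: str) -> tuple:
    """Mirror the order of steps 1-4 and return (tokens, cleaned_text)."""
    # Step 1: strip punctuation, which also removes every sentence boundary.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Step 2: sentence-tokenize text that no longer has any boundaries.
    tokens = naive_sentence_tokenize(text)
    # Step 4: the membership test sees one long string, never a single word.
    kept = [w for w in tokens if w not in STOPWORDS]
    return tokens, " ".join(kept)


tokens, cleaned = clean_like_reported("this is a test. the end, and a start.")
print(len(tokens))  # 1
print(cleaned)      # this is a test the end and a start
```

Because the only token is the whole string, it can never match an entry in the stopword set, so the text passes through with just its punctuation removed.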

Version

0.14.23

Steps to Reproduce

Run the following script to observe that stopwords (like this, is, a, some, the, and) are not removed at all by _clean_text_advanced:

from llama_index.core.node_parser.text.semantic_double_merging_splitter import SemanticDoubleMergingSplitterNodeParser
from llama_index.core.utils import globals_helper

# Instantiate the splitter
splitter = SemanticDoubleMergingSplitterNodeParser()

# Manually load the default English stopwords (simulating the behavior of load_model())
splitter.language_config.stopwords = set(globals_helper.stopwords)

input_text = "this is a test text containing some stopwords like the and a"
cleaned_text = splitter._clean_text_advanced(input_text)

print("Original text:", input_text)
print("Cleaned text :", cleaned_text)

# Expected: "test text containing stopwords like" (stopwords removed)
# Actual  : "this is a test text containing some stopwords like the and a" (no change)
assert cleaned_text != input_text, "Stopwords were not removed!"
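
For comparison, here is a minimal sketch of word-level filtering that produces the expected string from the script above. It is not the maintainers' fix and does not touch the library: `clean_word_level` is a hypothetical helper, plain whitespace splitting is swapped in for whatever word tokenizer a real patch might use, and `STOPWORDS` is a small illustrative subset.

```python
import string

# Illustrative subset of English stopwords (assumption, not the library's list).
STOPWORDS = {"this", "is", "a", "some", "the", "and"}


def clean_word_level(text: str, stopwords: set) -> str:
    """Lowercase, strip punctuation, then filter stopwords word by word."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace so each token is a single word, not a sentence.
    words = text.split()
    return " ".join(w for w in words if w not in stopwords)


input_text = "this is a test text containing some stopwords like the and a"
result = clean_word_level(input_text, STOPWORDS)
print(result)  # test text containing stopwords like
```

The only difference from the reported behavior is the granularity of the tokens: once each token is a single word, the membership check against the stopword set can actually match.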

Relevant Logs/Tracebacks

    Labels

    bug (Something isn't working), triage (Issue needs to be triaged/prioritized)
