Bug Description
In the SemanticDoubleMergingSplitterNodeParser text splitter, the internal _clean_text_advanced method attempts to clean strings and remove stopwords before calculating sentence similarity (when using the SpaCy backend).
However, stopword removal currently has no effect because of a tokenization bug:
- Punctuation is stripped from the text first:
text = text.translate(str.maketrans("", "", string.punctuation))
- The method then tokenizes the text using a sentence tokenizer:
tokens = globals_helper.punkt_tokenizer.tokenize(text)
- Since all punctuation has already been stripped, the sentence tokenizer treats the entire text block as a single sentence, so tokens contains only one element: the entire string.
- Consequently, the check w not in self.language_config.stopwords is performed against the entire text block as a single string (rather than word-by-word), so individual stopwords are never filtered out.
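The failure mode can be sketched without any third-party dependency. This is only an illustration, not the library's code: STOPWORDS is a hypothetical, abbreviated set, clean_whole_string stands in for the sentence tokenizer by hard-coding a single-element token list, and clean_word_by_word shows one possible fix (whitespace splitting) rather than a confirmed patch.

```python
import string

# Hypothetical, abbreviated stopword set for illustration only;
# the real splitter reads its list from language_config.stopwords.
STOPWORDS = {"this", "is", "a", "some", "the", "and"}


def clean_whole_string(text: str) -> str:
    """Mimic the reported behavior: the only "token" is the full string."""
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Stand-in for a sentence tokenizer that finds exactly one sentence.
    tokens = [text]
    return " ".join(w for w in tokens if w not in STOPWORDS)


def clean_word_by_word(text: str) -> str:
    """One possible fix: split on whitespace so each word is checked."""
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return " ".join(w for w in tokens if w not in STOPWORDS)


sample = "this is a test text containing some stopwords like the and a"
print(clean_whole_string(sample))  # prints the sample unchanged
print(clean_word_by_word(sample))  # prints: test text containing stopwords like
```

Whitespace splitting is enough here because punctuation is removed beforehand; a dedicated word tokenizer would be an alternative if punctuation handling ever changes.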
Version
0.14.23
Steps to Reproduce
Run the following script to observe that stopwords (like this, is, a, some, the, and) are not removed at all by _clean_text_advanced:
from llama_index.core.node_parser.text.semantic_double_merging_splitter import SemanticDoubleMergingSplitterNodeParser
from llama_index.core.utils import globals_helper
# Instantiate the splitter
splitter = SemanticDoubleMergingSplitterNodeParser()
# Manually load the default English stopwords (simulating the behavior of load_model())
splitter.language_config.stopwords = set(globals_helper.stopwords)
input_text = "this is a test text containing some stopwords like the and a"
cleaned_text = splitter._clean_text_advanced(input_text)
print("Original text:", input_text)
print("Cleaned text :", cleaned_text)
# Expected: "test text containing stopwords like" (stopwords removed)
# Actual : "this is a test text containing some stopwords like the and a" (no change)
assert cleaned_text != input_text, "Stopwords were not removed!"
Relevant Logs/Tracebacks