feat: add optional embed_model to SemanticDoubleMergingSplitterNodeParser #20748

Open
MkDev11 wants to merge 2 commits into run-llama:main from MkDev11:feature/15041-embedding-double-merging-splitter

MkDev11 commented Feb 19, 2026

Description

Adds optional embedding-model support to SemanticDoubleMergingSplitterNodeParser so users can chunk text in any language their embedding model supports (e.g. via Hugging Face / sentence-transformers) without depending on spaCy. When embed_model is set, similarity is computed with BaseEmbedding.get_text_embedding_batch and similarity() instead of spaCy. When it is unset, the existing spaCy + LanguageConfig behavior is unchanged. This adds no new dependencies to llama-index-core; users who want Hugging Face embeddings supply their own embedding package (e.g. llama-index-embeddings-huggingface).
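
The dispatch described above can be illustrated with a minimal, standard-library-only sketch. It is not the code in this PR: `pair_similarity`, `toy_embed_batch`, and the `fallback` parameter are hypothetical names, the toy letter-frequency embedding merely stands in for a real model's batch embedding call, and cosine similarity is assumed as the similarity measure.

```python
import math
from typing import Callable, List, Optional, Sequence

Vector = List[float]
BatchEmbedFn = Callable[[Sequence[str]], List[Vector]]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_embed_batch(texts: Sequence[str]) -> List[Vector]:
    """Hypothetical stand-in for a batch embedding call: a-z letter counts."""
    vectors: List[Vector] = []
    for text in texts:
        counts = [0.0] * 26
        for ch in text.lower():
            if "a" <= ch <= "z":
                counts[ord(ch) - ord("a")] += 1.0
        vectors.append(counts)
    return vectors


def pair_similarity(
    text_a: str,
    text_b: str,
    embed_batch: Optional[BatchEmbedFn] = None,
    fallback: Optional[Callable[[str, str], float]] = None,
) -> float:
    """Use the embedding path when embed_batch is given, else the fallback path."""
    if embed_batch is not None:
        # One batched call embeds both texts, then the vectors are compared.
        vec_a, vec_b = embed_batch([text_a, text_b])
        return cosine_similarity(vec_a, vec_b)
    if fallback is None:
        raise ValueError("Provide either embed_batch or a fallback similarity.")
    # Assumption: the fallback plays the role of the existing spaCy-based path.
    return fallback(text_a, text_b)
```

With the toy embedding, `pair_similarity("abc", "cab", toy_embed_batch)` is approximately 1.0 because both strings have the same letter counts, while two strings with no letters in common score 0.0.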

Closes #15041

New Package?

  • No

Version Bump?

  • No

Type of Change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

  • I added new unit tests to cover this change

Tests added in tests/node_parser/test_semantic_double_merging_splitter.py: test_embed_model_path_returns_nodes, test_embed_model_similarity_in_range, and test_embed_model_single_sentence_document. They use MockEmbedding, so spaCy is not required.

Run: cd llama-index-core && python3 -m pytest tests/node_parser/test_semantic_double_merging_splitter.py -v
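
For readers unfamiliar with this style of test, here is a rough sketch of the kind of checks the test names suggest. These are not the PR's actual tests: `mock_embed_batch` and `adjacent_similarities` are hypothetical helpers, and a SHA-256-derived vector is assumed as a deterministic stand-in for a mock embedding so the sketch runs with the standard library alone.

```python
import hashlib
import math
from typing import List, Sequence


def mock_embed_batch(texts: Sequence[str], dim: int = 8) -> List[List[float]]:
    """Hypothetical deterministic mock: derive a vector from a SHA-256 digest."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Map each byte (0..255) to a non-zero value in [-1, 1].
        vectors.append([(b - 127.5) / 127.5 for b in digest[:dim]])
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def adjacent_similarities(sentences: Sequence[str]) -> List[float]:
    """Similarity between each sentence and the next; empty for fewer than two."""
    if len(sentences) < 2:
        return []
    vectors = mock_embed_batch(sentences)
    return [cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]


def test_similarity_in_range() -> None:
    # No language-specific model is involved, so any script works as input.
    sims = adjacent_similarities(["Hola mundo.", "Bonjour le monde.", "こんにちは世界。"])
    assert len(sims) == 2
    assert all(-1.0 - 1e-9 <= s <= 1.0 + 1e-9 for s in sims)


def test_single_sentence_has_no_pairs() -> None:
    # A single-sentence document has nothing to compare against.
    assert adjacent_similarities(["Only one sentence."]) == []
```

Because the mock is deterministic, the checks are reproducible and do not need a downloaded model.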

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran uv run make format; uv run make lint to appease the lint gods

dosubot bot added the size:L label (This PR changes 100-499 lines, ignoring generated files.) Feb 19, 2026
MkDev11 commented Feb 19, 2026

@AstraBert can you please review the PR and let me know your feedback?

Development

Successfully merging this pull request may close these issues.

[Feature Request]: Double double-pass merging semantic chunker with Transformers and other languages
