Submission checklist
Package (Required)
Related Issues / PRs
This bug was introduced by the changes in PR #34587.
Reproduction Steps / Example Code (Python)
from langchain_text_splitters import HTMLSemanticPreservingSplitter
html_content = """
<h1>Section 1</h1>
<p>This is some long text that <strong>should</strong> be split into multiple chunks due to the
small chunk size.</p>
"""
splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=[("h1", "Header 1")],
    max_chunk_size=50,
    chunk_overlap=5,
)
documents = splitter.split_text(html_content)
# Actual output: the text inside the <strong> tag is moved to the beginning of the first chunk:
# [Document(metadata={'Header 1': 'Section 1'}, page_content='should This is some long text that be split into'),
# Document(metadata={'Header 1': 'Section 1'}, page_content='into multiple chunks due to the small chunk size.')]
# Expected:
# [Document(metadata={'Header 1': 'Section 1'}, page_content='This is some long text that should be split into'),
# Document(metadata={'Header 1': 'Section 1'}, page_content='into multiple chunks due to the small chunk size.')]
print(documents)
Error Message and Stack Trace (if applicable)
Description
When using HTMLSemanticPreservingSplitter, inline formatting tags cause text to be reordered before splitting.
In the example below, the word inside the <strong> tag is moved to the beginning of the first chunk:
<p>This is some long text that <strong>should</strong> be split into multiple chunks...</p>
The first chunk then reads "should This is some long text that be split into ..." instead of "This is some long text that should be split into ..."
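
For anyone who wants to check chunk order without relying on the splitter itself, here is a minimal sketch that uses only the standard library. It is not how HTMLSemanticPreservingSplitter works internally. `InOrderTextExtractor`, `in_order_text`, and `chunk_preserves_order` are hypothetical helper names made up for this illustration. The idea is to compute a reference string with all text nodes in document order, then verify that each chunk is a contiguous substring of it.

```python
from html.parser import HTMLParser


class InOrderTextExtractor(HTMLParser):
    """Collects text nodes in document order (hypothetical helper, not part of LangChain)."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        # Text inside inline tags such as <strong> arrives here in source order.
        self.parts.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace (including newlines) into single spaces.
        return " ".join("".join(self.parts).split())


def in_order_text(html: str) -> str:
    """Return the whitespace-normalized text of `html` in reading order."""
    parser = InOrderTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def chunk_preserves_order(chunk: str, reference: str) -> bool:
    """A chunk keeps reading order if it is a contiguous substring of the reference."""
    return " ".join(chunk.split()) in reference


html_content = """
<h1>Section 1</h1>
<p>This is some long text that <strong>should</strong> be split into multiple chunks due to the
small chunk size.</p>
"""

reference = in_order_text(html_content)

# First chunk as reported in this issue (reordered) vs. the expected first chunk.
actual_chunk = "should This is some long text that be split into"
expected_chunk = "This is some long text that should be split into"

print(chunk_preserves_order(actual_chunk, reference))    # False
print(chunk_preserves_order(expected_chunk, reference))  # True
```

A check like this could be adapted into a regression test by running it over the `page_content` of every returned document. Note that the reference string here also includes the header text, which the splitter stores in metadata rather than in the chunk body.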
System Info
System Information
OS: Linux
OS Version: #100-Ubuntu SMP Tue May 27 21:41:06 UTC 2025
Python Version: 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0]
Package Information
langchain_core: 1.4.8
langsmith: 0.9.1
langchain_protocol: 0.0.18
langchain_text_splitters: 1.1.2
Optional packages not installed
deepagents
deepagents-cli
Other Dependencies
anyio: 4.14.0
distro: 1.9.0
httpx: 0.28.1
jsonpatch: 1.33
opentelemetry-api: 1.27.0
opentelemetry-sdk: 1.27.0
orjson: 3.11.9
packaging: 24.1
pydantic: 2.8.2
pyyaml: 6.0.1
requests: 2.32.2
requests-toolbelt: 1.0.0
rich: 15.0.0
sniffio: 1.3.1
tenacity: 8.2.2
typing-extensions: 4.15.0
uuid-utils: 0.16.2
websockets: 16.0
wrapt: 1.14.1
xxhash: 3.7.0
zstandard: 0.25.0