HTMLSemanticPreservingSplitter reorders inline text around tags (e.g. <strong>, <a>, etc.) #38404

@evisser

Submission checklist

  • This is a bug, not a usage question.
  • I added a clear and descriptive title that summarizes this issue.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
  • This is not related to the langchain-community package.
  • I posted a self-contained, minimal, reproducible example. A maintainer can copy it and run it AS IS.

Package (Required)

  • langchain
  • langchain-openai
  • langchain-anthropic
  • langchain-classic
  • langchain-core
  • langchain-model-profiles
  • langchain-tests
  • langchain-text-splitters
  • langchain-chroma
  • langchain-deepseek
  • langchain-exa
  • langchain-fireworks
  • langchain-groq
  • langchain-huggingface
  • langchain-mistralai
  • langchain-nomic
  • langchain-ollama
  • langchain-openrouter
  • langchain-perplexity
  • langchain-qdrant
  • langchain-xai
  • Other / not sure / general

Related Issues / PRs

This bug is the result of the changes made in PR #34587.

Reproduction Steps / Example Code (Python)

from langchain_text_splitters import HTMLSemanticPreservingSplitter

html_content = """
    <h1>Section 1</h1>
    <p>This is some long text that <strong>should</strong> be split into multiple chunks due to the
    small chunk size.</p>
    """

splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=[("h1", "Header 1")], max_chunk_size=50, chunk_overlap=5
)
documents = splitter.split_text(html_content)

# This produces the following output, where the text inside the <strong> tag is moved to the beginning of the first chunk:
# [Document(metadata={'Header 1': 'Section 1'}, page_content='should This is some long text that be split into'),
# Document(metadata={'Header 1': 'Section 1'}, page_content='into multiple chunks due to the small chunk size.')]

# Expected:
# [Document(metadata={'Header 1': 'Section 1'}, page_content='This is some long text that should be split into'),
# Document(metadata={'Header 1': 'Section 1'}, page_content='into multiple chunks due to the small chunk size.')]

print(documents)
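A small order check like the one below could back a regression test for this. It is only a sketch: `chunks_out_of_order` is a hypothetical helper, not part of LangChain, and it works on plain strings (the `page_content` values copied from the output above), so it runs without the splitter installed. It assumes that every correctly ordered chunk must appear as a contiguous substring of the whitespace-normalized source text.

```python
def chunks_out_of_order(expected_text: str, chunks: list[str]) -> list[str]:
    """Return the chunks that are not contiguous substrings of expected_text.

    Hypothetical helper for illustration; whitespace is normalized on both
    sides so line breaks in the HTML source do not cause false positives.
    """
    normalized = " ".join(expected_text.split())
    return [c for c in chunks if " ".join(c.split()) not in normalized]


# Paragraph text in reading order, as written in the HTML above.
source_text = (
    "This is some long text that should be split into multiple chunks "
    "due to the small chunk size."
)

# page_content values reported in this issue.
actual_chunks = [
    "should This is some long text that be split into",
    "into multiple chunks due to the small chunk size.",
]

# page_content values expected from the splitter.
expected_chunks = [
    "This is some long text that should be split into",
    "into multiple chunks due to the small chunk size.",
]

print(chunks_out_of_order(source_text, actual_chunks))
print(chunks_out_of_order(source_text, expected_chunks))
```

With the reported output, only the first chunk is flagged; with the expected output, nothing is flagged.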

Error Message and Stack Trace (if applicable)

Description

When using HTMLSemanticPreservingSplitter, inline formatting tags cause text to be reordered before splitting.
In the example below, the word inside the <strong> tag is moved to the beginning of the first chunk:
<p>This is some long text that <strong>should</strong> be split into multiple chunks...</p>

This results in "should This is some long text that be split into ..." instead of "This is some long text that should be split into ..."
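For reference, the reading order I would expect can be reproduced with the standard library alone. This is a minimal sketch, not how HTMLSemanticPreservingSplitter is implemented: `InOrderTextExtractor` and `text_in_document_order` are hypothetical names, and the code only shows that visiting text nodes in source order keeps the <strong> word in place.

```python
from html.parser import HTMLParser


class InOrderTextExtractor(HTMLParser):
    """Collects text nodes in the order they appear in the source."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        # Called once per text node, in document order, regardless of
        # whether the node sits inside an inline tag such as <strong>.
        self.parts.append(data)


def text_in_document_order(html: str) -> str:
    parser = InOrderTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace, including the line break inside the <p>.
    return " ".join("".join(parser.parts).split())


paragraph = (
    "<p>This is some long text that <strong>should</strong> be split into "
    "multiple chunks due to the\n    small chunk size.</p>"
)
print(text_in_document_order(paragraph))
```

The printed text keeps "should" between "that" and "be", which is the order the chunks should follow.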

System Info

System Information

OS: Linux
OS Version: #100-Ubuntu SMP Tue May 27 21:41:06 UTC 2025
Python Version: 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0]

Package Information

langchain_core: 1.4.8
langsmith: 0.9.1
langchain_protocol: 0.0.18
langchain_text_splitters: 1.1.2

Optional packages not installed

deepagents
deepagents-cli

Other Dependencies

anyio: 4.14.0
distro: 1.9.0
httpx: 0.28.1
jsonpatch: 1.33
opentelemetry-api: 1.27.0
opentelemetry-sdk: 1.27.0
orjson: 3.11.9
packaging: 24.1
pydantic: 2.8.2
pyyaml: 6.0.1
requests: 2.32.2
requests-toolbelt: 1.0.0
rich: 15.0.0
sniffio: 1.3.1
tenacity: 8.2.2
typing-extensions: 4.15.0
uuid-utils: 0.16.2
websockets: 16.0
wrapt: 1.14.1
xxhash: 3.7.0
zstandard: 0.25.0

Metadata

Labels: bug, external, text-splitters
