
feat: upgrade URL ingestion to use MarkdownHeaderTextSplitter#1922

Open
ishaanv1709 wants to merge 2 commits into open-edge-platform:main from ishaanv1709:feature/markdown-splitter

Conversation

@ishaanv1709

@ishaanv1709 ishaanv1709 commented Mar 9, 2026

Resolves #1921
@bharagha @14pankaj @jgespino @krish918

Description

This PR upgrades the chunking pipeline in url.py from a character-based approach to a structural one.

  • Problem: Websites converted to Markdown often contain code blocks or tables. Using a character splitter blindly slices through these, destroying context for the LLM.
  • Solution: Implemented a two-stage process using MarkdownHeaderTextSplitter. It first groups content by headers (#, ##), then applies token limits. This ensures structural elements (like Python scripts) stay intact.
  • Scope: Narrowed strictly to Web Ingestion. .docx and .pdf pipelines in document.py remain unchanged as they lack native Markdown headers.
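The two-stage idea can be sketched without the langchain API. Below is a minimal, stdlib-only Python illustration of the approach described above; the function names, the fence handling, and the 1000-character limit are my own for demonstration, not the PR's actual code:

```python
import re

HEADER_RE = re.compile(r"^#{1,4}\s+(.*)$")

def split_on_headers(md: str):
    """Stage 1: group lines into sections under their nearest header,
    skipping header-like lines inside fenced code blocks."""
    sections, header, buf, in_fence = [], "", [], False
    for line in md.splitlines():
        if line.startswith("```"):
            in_fence = not in_fence
        m = None if in_fence else HEADER_RE.match(line)
        if m:
            if buf:
                sections.append((header, "\n".join(buf).strip()))
            header, buf = m.group(1), []
        else:
            buf.append(line)
    sections.append((header, "\n".join(buf).strip()))
    return [(h, t) for h, t in sections if t]

def enforce_size(sections, max_chars=1000):
    """Stage 2: only sections that exceed the limit are split further,
    here at blank-line (paragraph) boundaries as a stand-in for the
    token-limited character splitter."""
    chunks = []
    for header, text in sections:
        if len(text) <= max_chars:
            chunks.append({"header": header, "text": text})
            continue
        buf = ""
        for para in text.split("\n\n"):
            if buf and len(buf) + len(para) > max_chars:
                chunks.append({"header": header, "text": buf.strip()})
                buf = ""
            buf += para + "\n\n"
        if buf.strip():
            chunks.append({"header": header, "text": buf.strip()})
    return chunks
```

Because stage 2 only runs on oversized sections, a short section containing a fenced code block is emitted as a single chunk instead of being cut mid-block, which is the behavior the character-only splitter could not guarantee.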

Testing

Verified via local string-splitting evaluations using pytest. The previous logic shattered code blocks across chunks; the new logic successfully retains them as unified semantic units.

[Screenshot: pytest evaluation output, 2026-03-09]

@ishaanv1709 ishaanv1709 marked this pull request as ready for review March 9, 2026 12:38
@bharagha
Contributor

The document-ingestion service currently supports text-dominant URLs, Word documents, text documents, and PDFs. It is not intended to be a complete reference for a variety of content types, but illustrative of how Intel inference microservices can be used to quickly develop various applications. In your changes, you are supporting Markdown content in URLs. Can you connect it back to the overall use case and explain why this is relevant?

Secondly, please keep the changes in your forked repo for now. We will get back when it is ok to push back to main repo.

@ishaanv1709
Author

Greetings @bharagha sir, thanks for the feedback.

1st question: I completely agree that the document-ingestion microservice should remain simple and illustrative rather than an exhaustive parser.

My reasoning for this specific change is that when demonstrating ChatQnA applications, a very common quick start use case is ingesting web URLs from technical documentation, GitHub READMEs, or tutorials. These websites are Markdown-heavy (containing code blocks and structured tables). When the URL ingestion flow uses a basic character splitter, it often slices code blocks and tables right down the middle, destroying the context for the LLM.

By utilizing the MarkdownHeaderTextSplitter specifically for the URL ingestion flow, we ensure the Intel inference microservices receive semantically whole chunks natively. It makes the ChatQnA sample application generate noticeably smarter, more accurate answers out-of-the-box for tech-focused URLs.

To address the complexity concern: this is a very lightweight change scoped strictly to url.py (only +22/-7 lines of code changes). It simply leverages a different splitter from the langchain-text-splitters library that the project is already using, meaning it adds zero new external dependencies or architectural overhead to the microservice.

Regarding your second point: Completely understood sir. I will leave these changes sitting in my fork for now.

@bharagha
Contributor

Thanks for the response. I have reviewed the code and see no issues with bringing it in. We will check pulling this specific PR to the mainline.

My question on why this PR is relevant was mainly in the context of GraphRAG problem statement. Let's ensure that cycles are more spent on that problem statement. I also looped in @naitik-2006 on this topic to collaborate with you.

@ishaanv1709
Author

> Thanks for the response. I have reviewed the code and see no issues with bringing it in. We will check pulling this specific PR to the mainline.
>
> My question on why this PR is relevant was mainly in the context of GraphRAG problem statement. Let's ensure that cycles are more spent on that problem statement. I also looped in @naitik-2006 on this topic to collaborate with you.

Noted sir, I will focus further contributions around that problem statement and will loop in @naitik-2006 too.


# Initialize text splitter and embedder once
text_splitter = RecursiveCharacterTextSplitter(
headers_to_split_on = [
Contributor

@Supriya-Krishnamurthi Supriya-Krishnamurthi Mar 27, 2026


The implementation assumes that all ingested content contains Markdown headers. In the current pipeline, content is derived from HTML and may not always preserve header structure in Markdown form. In such cases, it falls back to the existing character-based splitting. The code doesn't break for non-Markdown content, but it doesn't provide much benefit in such cases either.

And when no headers are present, this fallback is implicit and not transparent. Would it be better to explicitly detect the presence of structure and fall back to RecursiveCharacterTextSplitter when headers are not found, to make the behavior more predictable?

Since this is a sample application, should we introduce a config flag (e.g., enabling/disabling structured chunking) to keep the default pipeline simple and easier to reason about? It could default to disabled, and the user can enable it when they are sure the retrieved content is structured.
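One way the explicit detection and opt-in flag suggested above could look; the `STRUCTURED_CHUNKING` variable name and the returned strategy labels are illustrative, not the PR's actual code:

```python
import os
import re

# A line starting with 1-6 '#' characters, a space, and some text.
MD_HEADER_RE = re.compile(r"^#{1,6}\s+\S", re.MULTILINE)

def has_markdown_headers(text: str) -> bool:
    """Detect whether structure-aware splitting has anything to work with."""
    return bool(MD_HEADER_RE.search(text))

def choose_splitter(content: str) -> str:
    """Return which strategy would run, making the fallback explicit.

    Structured chunking is opt-in via an env flag and defaults to off,
    so the default pipeline stays the simple character splitter.
    """
    enabled = os.getenv("STRUCTURED_CHUNKING", "false").lower() == "true"
    if enabled and has_markdown_headers(content):
        return "markdown_header_splitter"
    return "recursive_character_splitter"
```

With this shape, headerless HTML-derived content never silently takes the Markdown path, and the default behavior is unchanged unless the user opts in.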

("###", "Header 3"),
("####", "Header 4"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
Contributor


Header-based splitting alone may not fully preserve semantic units like code blocks or tables. Not a blocker, but we will still miss context for other such structured components in the content, which may not fully align with the intent of this PR.

# Chunk + embed
chunks = text_splitter.split_text(content)
metadata = [{"url": url}] * len(chunks)
md_header_splits = markdown_splitter.split_text(content)
Contributor


The hierarchical relationship between headers may not be fully preserved. While content is grouped under headers, the linkage between parent and child sections (e.g., subheaders B, C under A and E, F under D) is not explicitly maintained beyond individual chunks. Although MarkdownHeaderTextSplitter may initially capture this hierarchy in metadata, it can be lost after applying RecursiveCharacterTextSplitter and subsequently updating metadata (e.g., doc.metadata["url"] = url), which may overwrite existing fields. Please validate such scenarios and share the report.
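One way to guard against the overwrite concern is to merge the URL into the existing metadata non-destructively instead of assigning over it. A small sketch, using plain dicts as a stand-in for langchain `Document.metadata`; the `attach_url` helper is hypothetical, not part of the PR:

```python
def attach_url(metadata: dict, url: str) -> dict:
    """Add the source URL without clobbering header-hierarchy keys.

    setdefault only writes keys that are absent, so metadata such as
    {"Header 1": ..., "Header 2": ...} produced by the header split
    survives the update.
    """
    merged = dict(metadata)  # copy, so per-chunk metadata stays independent
    merged.setdefault("url", url)
    return merged
```

In the loop over split documents this would be `doc.metadata = attach_url(doc.metadata, url)`, keeping whatever hierarchy keys the header splitter produced alongside the new `url` key.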

docs = char_splitter.split_documents(md_header_splits)

for doc in docs:
doc.metadata["url"] = url
Contributor


Double-check the metadata handling here, as the metadata from the markdown split can be overwritten. Maybe after checking the flag for structured versus plain text splitting, we can retain this metadata accordingly.

chunks = text_splitter.split_text(content)
metadata = [{"url": url}] * len(chunks)
md_header_splits = markdown_splitter.split_text(content)
docs = char_splitter.split_documents(md_header_splits)
Contributor

@Supriya-Krishnamurthi Supriya-Krishnamurthi Mar 27, 2026


Just checking: if a large amount of content exists under a single header without any sub-headers, the subsequent character-based splitting can still split the large context in much the same way as the current approach.
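This observation is easy to confirm by simulating the secondary size limit on one long, sub-header-free section; the naive fixed-size splitter and the 100-character limit below are illustrative stand-ins for the character splitter, not the PR's code:

```python
def size_split(text: str, max_chars: int = 100):
    """Naive fixed-size secondary split, standing in for the
    character splitter that runs after header grouping."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# One header, one very long body: header grouping yields a single
# section, so the size limit still has to cut inside it, just as the
# purely character-based approach would.
section_text = ("word " * 100).strip()  # 499 characters under one header
pieces = size_split(section_text)
```

So the structural splitter only helps where headers actually break the content up; within a single oversized section, behavior converges back to character splitting.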

@bharagha
Contributor

@ishaanv1709 : Please respond to the new comments added.

@ishaanv1709
Author

> @ishaanv1709 : Please respond to the new comments added.

Sure sir, I will check them right away and get back to you tonight.



Development

Successfully merging this pull request may close these issues.

[ENHANCEMENT] Improve Web/URL ingestion with structural MarkdownHeaderTextSplitter

3 participants