feat: upgrade URL ingestion to use MarkdownHeaderTextSplitter #1922
ishaanv1709 wants to merge 2 commits into open-edge-platform:main from
Conversation
The document-ingestion microservice currently supports text-dominant URLs, Word documents, text documents, and PDFs. It is not intended to be a complete reference for a variety of content types, but illustrative of how Intel inference microservices can be used to quickly develop various applications. In your changes, you are supporting Markdown content in URLs. Can you connect it back to the overall use case and explain why this is relevant? Secondly, please keep the changes in your forked repo for now. We will get back to you when it is OK to push back to the main repo.
Greetings @bharagha sir, thanks for the feedback.

1st question: I completely agree that the document-ingestion microservice should remain simple and illustrative rather than an exhaustive parser. My reasoning for this specific change is that when demonstrating ChatQnA applications, a very common quick-start use case is ingesting web URLs from technical documentation, GitHub READMEs, or tutorials. These websites are Markdown-heavy (containing code blocks and structured tables). When the URL ingestion flow uses a basic character splitter, it often slices code blocks and tables right down the middle, destroying the context for the LLM. By utilizing the MarkdownHeaderTextSplitter specifically for the URL ingestion flow, we ensure the Intel inference microservices receive semantically whole chunks natively. It makes the ChatQnA sample application generate noticeably smarter, more accurate answers out of the box for tech-focused URLs.

To assure you on the complexity: this is a very lightweight change scoped strictly to url.py (only +22 −7 lines of code changes). It simply leverages a different splitter from the langchain-text-splitters library that the project is already using, meaning it adds zero new external dependencies or architectural overhead to the microservice.

Regarding your second point: completely understood, sir. I will leave these changes sitting in my fork for now.
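The slicing problem described here can be sketched with toy stand-ins. The `char_split` and `header_split` helpers below are illustrative substitutes for the character-based and header-aware splitters, not the actual langchain implementations:

```python
import re

# Illustrative stand-ins for the two chunking strategies under discussion.
# Neither is the real langchain splitter; they only mimic the behavior.
def char_split(text: str, size: int) -> list[str]:
    """Naive fixed-size splitter, mimicking basic character chunking."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def header_split(text: str) -> list[str]:
    """Header-aware splitter: cut immediately before each '## ' heading."""
    return [p for p in re.split(r"(?m)^(?=## )", text) if p]

fence = "`" * 3  # a Markdown code fence, built indirectly to keep this sample readable
doc = f"## Setup\n{fence}python\ncode_line = 1\n{fence}\n## Usage\nRun it."

# Character splitting slices the fenced code block across chunk boundaries,
# so no single chunk contains both the opening and closing fence.
char_chunks = char_split(doc, 30)
code_intact_after_char = any(c.count(fence) == 2 for c in char_chunks)

# Header-aware splitting keeps the whole "## Setup" section, fence and all,
# inside one chunk.
header_chunks = header_split(doc)
code_intact_after_header = any(s.count(fence) == 2 for s in header_chunks)
```

Under these toy splitters, `code_intact_after_char` comes out `False` and `code_intact_after_header` comes out `True`, which is the behavior difference the PR is after.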
Thanks for the response. I have reviewed the code and see no issues with bringing it in. We will look into pulling this specific PR into the mainline. My question on why this PR is relevant was mainly in the context of the GraphRAG problem statement. Let's ensure that cycles are spent more on that problem statement. I have also looped in @naitik-2006 on this topic to collaborate with you.
Noted sir, I will do further contributions around the problem statement and will loop in @naitik-2006 too.
```python
# Initialize text splitter and embedder once
text_splitter = RecursiveCharacterTextSplitter(
headers_to_split_on = [
```
The implementation assumes that all ingested content contains Markdown headers. In the current pipeline, content is derived from HTML and may not always preserve header structure in Markdown form. In such cases, it falls back to the existing character-based splitting. The code doesn't break for non-Markdown content, but it doesn't provide much benefit in those cases either.
And when no headers are present, this fallback is implicit and not transparent. It would be better to explicitly detect the presence of structure and fall back to RecursiveCharacterTextSplitter when headers are not found, to make the behavior more predictable.
Since this is a sample application, should we introduce a config flag (e.g., enabling/disabling structured chunking) to keep the default pipeline simple and easier to reason about? It could default to disabled, and users can enable it when they are sure the retrieved content is structured.
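A minimal sketch of the explicit-detection-plus-flag idea. The helper names and the `structured_chunking` flag are hypothetical, not part of the current url.py:

```python
import re

# Match ATX-style Markdown headers (# through ####) at the start of a line.
HEADER_RE = re.compile(r"^#{1,4}\s+\S", re.MULTILINE)

def has_markdown_headers(content: str) -> bool:
    """Return True if the content contains at least one ATX-style header."""
    return bool(HEADER_RE.search(content))

def choose_splitter(content: str, structured_chunking: bool = False) -> str:
    """Pick a chunking strategy explicitly.

    structured_chunking mirrors the proposed config flag (defaulted to
    disabled to keep the sample pipeline simple); the returned string
    names which splitter path the real code would take.
    """
    if structured_chunking and has_markdown_headers(content):
        return "markdown"   # MarkdownHeaderTextSplitter path
    return "recursive"      # explicit RecursiveCharacterTextSplitter fallback
```

With this shape, the fallback is a visible branch rather than an implicit side effect, and the default pipeline stays identical to today's behavior unless the flag is turned on.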
| ("###", "Header 3"), | ||
| ("####", "Header 4"), | ||
| ] | ||
| markdown_splitter = MarkdownHeaderTextSplitter( |
Header-based splitting alone may not fully preserve semantic units like code blocks or tables. Not a blocker, but we will still miss context for other structured components in the content, which may not fully align with the intent of this PR.
```python
# Chunk + embed
chunks = text_splitter.split_text(content)
metadata = [{"url": url}] * len(chunks)
md_header_splits = markdown_splitter.split_text(content)
```
The hierarchical relationship between headers may not be fully preserved. While content is grouped under headers, the linkage between parent and child sections (e.g., subheaders B, C under A and E, F under D) is not explicitly maintained beyond individual chunks. Although MarkdownHeaderTextSplitter may initially capture this hierarchy in metadata, it can be lost after applying RecursiveCharacterTextSplitter and subsequently updating metadata (e.g., doc.metadata["url"] = url), which may overwrite existing fields. Please validate such scenarios and share the report.
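One quick way to validate the overwrite concern, sketched with a minimal stand-in `Doc` class (not the actual langchain Document):

```python
# Minimal stand-in for a langchain Document, just enough to validate
# whether header metadata survives the url-tagging step.
class Doc:
    def __init__(self, page_content: str, metadata: dict):
        self.page_content = page_content
        self.metadata = metadata

# Metadata as the Markdown header split would have produced it.
docs = [Doc("body text", {"Header 1": "A", "Header 2": "B"})]

for doc in docs:
    doc.metadata["url"] = "https://example.com"  # item assignment, not dict replacement
```

Item assignment only adds the `url` key, so the header hierarchy captured by the first split survives; a full replacement like `doc.metadata = {"url": url}` would drop it. The validation report should confirm which pattern the code ends up with after the secondary split.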
```python
docs = char_splitter.split_documents(md_header_splits)
```

```python
for doc in docs:
    doc.metadata["url"] = url
```
Double-check the metadata handling here, as the metadata from the Markdown split can be overwritten. Maybe after checking the flag for structured-content splitting vs. plain text splitting, we can retain this metadata accordingly.
```python
chunks = text_splitter.split_text(content)
metadata = [{"url": url}] * len(chunks)
md_header_splits = markdown_splitter.split_text(content)
docs = char_splitter.split_documents(md_header_splits)
```
Just checking: if a large amount of content exists under a single header without any sub-headers, the subsequent character-based splitting can still split the large context in much the same way as the current approach.
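This point is easy to confirm with a toy example; `char_split` below is a stand-in for the secondary character-based splitter, not the langchain class:

```python
def char_split(text: str, size: int) -> list[str]:
    """Stand-in for the secondary character-based splitter."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# One header, then a long body with no sub-headers: a header-aware split
# yields a single big section, so the character splitter still slices it,
# exactly as the original approach would.
section = "# Only Header\n" + "x" * 500
pieces = char_split(section, 200)
```

The 514-character section comes out as three pieces, and only the first piece carries the header, so downstream chunks lose their structural anchor just as in the character-only pipeline.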
|
@ishaanv1709: Please respond to the new comments added.
Sure sir, I will check them right away and get back to you tonight.
Resolves #1921
@bharagha @14pankaj @jgespino @krish918
Description
This PR upgrades the chunking pipeline in url.py from a character-based approach to a structural one using MarkdownHeaderTextSplitter. It first groups content by headers (#, ##), then applies token limits. This ensures structural elements (like Python scripts) stay intact. The .docx and .pdf pipelines in document.py remain unchanged as they lack native Markdown headers.
Testing
Verified via local string-splitting evaluations using a pytest evaluation. The previous logic shattered code blocks across chunks; the new logic successfully retains them as unified semantic units.