feat: upgrade URL ingestion to use MarkdownHeaderTextSplitter #1922
ishaanv1709 wants to merge 2 commits into open-edge-platform:main from
Conversation
The document-ingestion microservice currently supports text-dominant URLs, Word documents, text documents, and PDFs. It is not intended to be a complete reference for a variety of content types, but illustrative of how Intel inference microservices can be used to quickly develop various applications. In your changes, you are supporting Markdown content in URLs. Can you connect it back to the overall use case and explain why this is relevant? Secondly, please keep the changes in your forked repo for now. We will get back to you when it is OK to push back to the main repo.
Greetings @bharagha sir, thanks for the feedback.

1st question: I completely agree that the document-ingestion microservice should remain simple and illustrative rather than an exhaustive parser. My reasoning for this specific change is that when demonstrating ChatQnA applications, a very common quick-start use case is ingesting web URLs from technical documentation, GitHub READMEs, or tutorials. These websites are Markdown-heavy (containing code blocks and structured tables). When the URL ingestion flow uses a basic character splitter, it often slices code blocks and tables right down the middle, destroying the context for the LLM. By utilizing the MarkdownHeaderTextSplitter specifically for the URL ingestion flow, we ensure the Intel inference microservices receive semantically whole chunks natively. It makes the ChatQnA sample application generate noticeably smarter, more accurate answers out of the box for tech-focused URLs.

To assure you on the complexity: this is a very lightweight change scoped strictly to url.py (only +22 −7 lines of code changes). It simply leverages a different splitter from the langchain-text-splitters library that the project is already using, meaning it adds zero new external dependencies or architectural overhead to the microservice.

Regarding your second point: completely understood, sir. I will leave these changes sitting in my fork for now.
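The slicing problem described here can be sketched with toy stand-ins. The `char_split` and `header_split` helpers below are illustrative substitutes for the character-based and header-aware splitters, not the actual langchain implementations:

```python
import re

# Illustrative stand-ins for the two chunking strategies under discussion.
# Neither is the real langchain splitter; they only mimic the behavior.
def char_split(text: str, size: int) -> list[str]:
    """Naive fixed-size splitter, mimicking basic character chunking."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def header_split(text: str) -> list[str]:
    """Header-aware splitter: cut immediately before each '## ' heading."""
    return [p for p in re.split(r"(?m)^(?=## )", text) if p]

fence = "`" * 3  # a Markdown code fence, built indirectly to keep this sample readable
doc = f"## Setup\n{fence}python\ncode_line = 1\n{fence}\n## Usage\nRun it."

# Character splitting slices the fenced code block across chunk boundaries,
# so no single chunk contains both the opening and closing fence.
char_chunks = char_split(doc, 30)
code_intact_after_char = any(c.count(fence) == 2 for c in char_chunks)

# Header-aware splitting keeps the whole "## Setup" section, fence and all,
# inside one chunk.
header_chunks = header_split(doc)
code_intact_after_header = any(s.count(fence) == 2 for s in header_chunks)
```

Under these toy splitters, `code_intact_after_char` comes out `False` and `code_intact_after_header` comes out `True`, which is the behavior difference the PR is after.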
Thanks for the response. I have reviewed the code and see no issues with bringing it in. We will look into pulling this specific PR into the mainline. My question on why this PR is relevant was mainly in the context of the GraphRAG problem statement. Let's ensure that cycles are spent more on that problem statement. I have also looped in @naitik-2006 on this topic to collaborate with you.
Noted sir, I will do further contributions around the problem statement and will loop in @naitik-2006 too.
```python
# Initialize text splitter and embedder once
text_splitter = RecursiveCharacterTextSplitter(
headers_to_split_on = [
```
The implementation assumes that all ingested content contains Markdown headers. In the current pipeline, content is derived from HTML and may not always preserve header structure in Markdown form. In such cases, it falls back to the existing character-based splitting. The code doesn't break for non-Markdown content, but it doesn't provide much benefit in those cases either.
And when no headers are present, this fallback is implicit and not transparent. It would be better to explicitly detect the presence of structure and fall back to RecursiveCharacterTextSplitter when headers are not found, to make the behavior more predictable.
Since this is a sample application, should we introduce a config flag (e.g., enabling/disabling structured chunking) to keep the default pipeline simple and easier to reason about? It could default to disabled, and users can enable it when they are sure the retrieved content is structured.
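A minimal sketch of the explicit-detection-plus-flag idea. The helper names and the `structured_chunking` flag are hypothetical, not part of the current url.py:

```python
import re

# Match ATX-style Markdown headers (# through ####) at the start of a line.
HEADER_RE = re.compile(r"^#{1,4}\s+\S", re.MULTILINE)

def has_markdown_headers(content: str) -> bool:
    """Return True if the content contains at least one ATX-style header."""
    return bool(HEADER_RE.search(content))

def choose_splitter(content: str, structured_chunking: bool = False) -> str:
    """Pick a chunking strategy explicitly.

    structured_chunking mirrors the proposed config flag (defaulted to
    disabled to keep the sample pipeline simple); the returned string
    names which splitter path the real code would take.
    """
    if structured_chunking and has_markdown_headers(content):
        return "markdown"   # MarkdownHeaderTextSplitter path
    return "recursive"      # explicit RecursiveCharacterTextSplitter fallback
```

With this shape, the fallback is a visible branch rather than an implicit side effect, and the default pipeline stays identical to today's behavior unless the flag is turned on.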
| ("###", "Header 3"), | ||
| ("####", "Header 4"), | ||
| ] | ||
| markdown_splitter = MarkdownHeaderTextSplitter( |
Header-based splitting alone may not fully preserve semantic units like code blocks or tables. Not a blocker, but we will still miss context for other structured components in the content, which may not fully align with the intent of this PR.
```python
# Chunk + embed
chunks = text_splitter.split_text(content)
metadata = [{"url": url}] * len(chunks)
md_header_splits = markdown_splitter.split_text(content)
```
The hierarchical relationship between headers may not be fully preserved. While content is grouped under headers, the linkage between parent and child sections (e.g., subheaders B, C under A and E, F under D) is not explicitly maintained beyond individual chunks. Although MarkdownHeaderTextSplitter may initially capture this hierarchy in metadata, it can be lost after applying RecursiveCharacterTextSplitter and subsequently updating metadata (e.g., doc.metadata["url"] = url), which may overwrite existing fields. Please validate such scenarios and share the report.
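One quick way to validate the overwrite concern, sketched with a minimal stand-in `Doc` class (not the actual langchain Document):

```python
# Minimal stand-in for a langchain Document, just enough to validate
# whether header metadata survives the url-tagging step.
class Doc:
    def __init__(self, page_content: str, metadata: dict):
        self.page_content = page_content
        self.metadata = metadata

# Metadata as the Markdown header split would have produced it.
docs = [Doc("body text", {"Header 1": "A", "Header 2": "B"})]

for doc in docs:
    doc.metadata["url"] = "https://example.com"  # item assignment, not dict replacement
```

Item assignment only adds the `url` key, so the header hierarchy captured by the first split survives; a full replacement like `doc.metadata = {"url": url}` would drop it. The validation report should confirm which pattern the code ends up with after the secondary split.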
```python
docs = char_splitter.split_documents(md_header_splits)
```

```python
for doc in docs:
    doc.metadata["url"] = url
```
Double-check the metadata handling here, as the metadata from the Markdown split can be overwritten. Maybe after checking the flag for structured-content splitting vs. plain text splitting, we can retain this metadata accordingly.
```python
chunks = text_splitter.split_text(content)
metadata = [{"url": url}] * len(chunks)
md_header_splits = markdown_splitter.split_text(content)
docs = char_splitter.split_documents(md_header_splits)
```
Just checking: if a large amount of content exists under a single header without any sub-headers, the subsequent character-based splitting can still split the large context in much the same way as the current approach.
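This point is easy to confirm with a toy example; `char_split` below is a stand-in for the secondary character-based splitter, not the langchain class:

```python
def char_split(text: str, size: int) -> list[str]:
    """Stand-in for the secondary character-based splitter."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# One header, then a long body with no sub-headers: a header-aware split
# yields a single big section, so the character splitter still slices it,
# exactly as the original approach would.
section = "# Only Header\n" + "x" * 500
pieces = char_split(section, 200)
```

The 514-character section comes out as three pieces, and only the first piece carries the header, so downstream chunks lose their structural anchor just as in the character-only pipeline.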
|
@ishaanv1709: Please respond to the new comments added.
Sure sir, I will check them right away and get back to you tonight.
Resolves #1921
@bharagha @14pankaj @jgespino @krish918
Description
This PR upgrades the chunking pipeline in url.py from a character-based approach to a structural one using MarkdownHeaderTextSplitter. It first groups content by headers (#, ##), then applies token limits. This ensures structural elements (like Python scripts) stay intact. The .docx and .pdf pipelines in document.py remain unchanged as they lack native Markdown headers.
Testing
Verified via local string-splitting evaluations using a pytest evaluation. The previous logic shattered code blocks across chunks; the new logic successfully retains them as unified semantic units.