Hi @bharagha @14pankaj @krish918 @jgespino, I have a proposal for the following enhancement.
Description of the Enhancement
Currently, the document-ingestion/pgvector microservice uses a RecursiveCharacterTextSplitter to chunk scraped web content. While functional for plain text, this approach is "blind" to document structure. When scraping documentation sites (which are converted to Markdown via Html2TextTransformer), the character splitter frequently slices through the middle of code blocks or multi-row tables to remain within character limits.
This results in "shattered" context where imports are separated from their functions or table headers are separated from their data, leading to hallucinations in RAG-based systems.
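To make the failure mode concrete, here is a minimal stdlib-only sketch (not the actual `RecursiveCharacterTextSplitter`, just a stand-in for its size-driven behavior) showing how a blind fixed-size split separates an import from the code that uses it. The Markdown snippet and chunk size are illustrative assumptions.

```python
# Illustration only: a naive fixed-size splitter standing in for
# size-driven character splitting, slicing through a fenced code block.
md = (
    "## Install\n"
    "Run the following script:\n"
    "```python\n"
    "import requests\n"
    "resp = requests.get(\"https://example.com\")\n"
    "print(resp.status_code)\n"
    "```\n"
)

def naive_split(text: str, chunk_size: int) -> list[str]:
    """Blind character splitting: ignores Markdown structure entirely."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = naive_split(md, 70)
# The import statement and the call that depends on it land in different
# chunks, so a retriever can return code with its imports missing.
```

With this split, `chunks[0]` ends just after `import requests` while `requests.get(...)` starts `chunks[1]`: exactly the "shattered" context described above.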
Proposed Solution
Introduce a two-stage chunking pipeline for URL ingestion:
- Structural Splitting: Use `MarkdownHeaderTextSplitter` to logically group content beneath its respective `#` and `##` headers. This ensures that semantic units (such as an entire installation step or a Python script) are physically grouped together.
- Constraint Splitting: Apply the existing token/character limits on top of these structural groups to ensure chunks fit within model context windows without losing their semantic integrity.
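The two stages above can be sketched with only the standard library; this is an illustrative assumption of the flow, whereas the real implementation would chain LangChain's `MarkdownHeaderTextSplitter` with the existing size-based splitter.

```python
# Sketch of the proposed two-stage pipeline (stdlib only, for illustration).
import re

def split_by_headers(md: str) -> list[str]:
    """Stage 1 (structural): group content under its `#`/`##` headers."""
    sections, current = [], []
    for line in md.splitlines(keepends=True):
        # Start a new section at each top- or second-level header.
        if re.match(r"^#{1,2} ", line) and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections

def constrain(sections: list[str], max_chars: int) -> list[str]:
    """Stage 2 (constraint): apply size limits on top of the groups."""
    chunks = []
    for sec in sections:
        if len(sec) <= max_chars:
            chunks.append(sec)  # section already fits: keep it whole
        else:
            chunks.extend(sec[i:i + max_chars]
                          for i in range(0, len(sec), max_chars))
    return chunks

md = "# Setup\npip install foo\n\n## Usage\nimport foo\nfoo.run()\n"
chunks = constrain(split_by_headers(md), 200)
# Each header stays attached to its body; the size limit only kicks in
# when a single section exceeds the model's context budget.
```

The design point is the ordering: structure first, size second, so a chunk is only ever cut inside a section when that section alone exceeds the limit.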
Impact
- Improved Retrieval: Semantic units remain intact, significantly reducing LLM hallucinations caused by partial context.
- Better Developer Experience: Technical documentation (code/configs) becomes far more useful in the ChatQnA application.
- Non-Destructive: This specifically targets the web-scraping pipeline where Markdown headers are guaranteed to exist, avoiding unnecessary overhead for raw PDF/Docx text ingestion.