[ENHANCEMENT] Improve Web/URL ingestion with structural MarkdownHeaderTextSplitter #1921

@ishaanv1709

Hi @bharagha @14pankaj @krish918 @jgespino, I have a proposal for the following enhancement.

Description of the Enhancement

Currently, the document-ingestion/pgvector microservice uses a RecursiveCharacterTextSplitter to chunk scraped web content. This works for plain text, but it is blind to document structure. When scraping documentation sites (whose HTML is converted to Markdown via Html2TextTransformer), the character splitter frequently cuts through the middle of code blocks or multi-row tables to stay within the character limit.

This results in "shattered" context: imports end up in a different chunk from the functions that use them, and table headers are separated from their data rows, which can lead to hallucinations in RAG-based systems.
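
To illustrate the failure mode, here is a minimal, dependency-free sketch. It is not the service's code: `naive_chunks` is a hypothetical fixed-width splitter standing in for any purely size-bounded strategy, and the table content is made up for the example. RecursiveCharacterTextSplitter prefers to break on separators rather than mid-row, but it can still leave later rows in a chunk that no longer contains the table header.

```python
def naive_chunks(text: str, size: int) -> list[str]:
    """Cut text into consecutive windows of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical scraped page, already converted to Markdown.
page = (
    "## Supported ports\n"
    "\n"
    "| Service | Port |\n"
    "|---------|------|\n"
    "| ingest  | 8000 |\n"
    "| search  | 8001 |\n"
    "| chat    | 8002 |\n"
)

header_row = "| Service | Port |"
chunks = naive_chunks(page, size=80)

# Chunks that hold table rows but have lost the header row.
orphaned = [c for c in chunks if "| chat" in c and header_row not in c]

for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---")
    print(chunk)
```

The second chunk carries the `chat` row without the `Service`/`Port` header, so a retriever that returns only that chunk gives the LLM numbers with no column meaning.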

Proposed Solution

Introduce a two-stage chunking pipeline for URL ingestion:

  1. Structural Splitting: Use MarkdownHeaderTextSplitter to group content under its respective # and ## headers. This keeps semantic units (such as an entire installation step or a Python script) together in one group.
  2. Constraint Splitting: Apply the existing token/character limits within each structural group, so chunks fit the model context window while staying inside a single section.
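
The two stages above can be sketched as follows. This is a dependency-free illustration of the idea, not the proposed implementation: `split_by_headers`, `constrain`, and `two_stage_chunks` are hypothetical helpers that stand in for MarkdownHeaderTextSplitter (stage 1) and the existing size-bounded splitter (stage 2), and the `max_chars` default is an arbitrary assumption.

```python
import re

# Matches only level-1 and level-2 ATX headers ("# Title", "## Title").
HEADER_RE = re.compile(r"^(#{1,2})\s+(.*)$")


def split_by_headers(markdown: str) -> list[dict]:
    """Stage 1: group lines under the nearest '#' / '##' header.

    Simplification: '#' lines inside fenced code blocks are not
    special-cased here; a real splitter should skip them.
    """
    sections = []
    h1, h2, lines = None, None, []
    for line in markdown.splitlines() + ["# <end>"]:  # sentinel flushes the tail
        match = HEADER_RE.match(line)
        if not match:
            lines.append(line)
            continue
        body = "\n".join(lines).strip()
        if body:
            sections.append({"h1": h1, "h2": h2, "text": body})
        lines = []
        if len(match.group(1)) == 1:
            h1, h2 = match.group(2).strip(), None
        else:
            h2 = match.group(2).strip()
    return sections


def constrain(text: str, max_chars: int) -> list[str]:
    """Stage 2: pack whole paragraphs into chunks of at most max_chars."""
    chunks, buf = [], ""
    for para in text.split("\n\n"):
        if not para.strip():
            continue
        candidate = f"{buf}\n\n{para}" if buf else para
        if len(candidate) <= max_chars:
            buf = candidate
            continue
        if buf:
            chunks.append(buf)
        # Fallback: hard-slice a paragraph that alone exceeds the limit.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        buf = para
    if buf:
        chunks.append(buf)
    return chunks


def two_stage_chunks(markdown: str, max_chars: int = 200) -> list[dict]:
    """Run stage 1, then stage 2 inside each section, keeping header metadata."""
    return [
        {"h1": sec["h1"], "h2": sec["h2"], "text": piece}
        for sec in split_by_headers(markdown)
        for piece in constrain(sec["text"], max_chars)
    ]


sample = (
    "# Install\n\n"
    "Run the installer.\n\n"
    "## Configure\n\n"
    "Set the port.\n\n"
    "Restart the service.\n"
)

for chunk in two_stage_chunks(sample, max_chars=200):
    print(chunk)
```

Because stage 2 runs per section, no chunk spans two headers, and every chunk keeps the header titles as metadata that can be stored alongside the embedding. With a tighter limit (for example `max_chars=20`), the `Configure` section is split into two chunks that both still carry `h2="Configure"`.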

Impact

  • Improved Retrieval: Semantic units remain intact, which should reduce LLM hallucinations caused by partial context.
  • Better Developer Experience: Technical documentation (code/configs) becomes far more useful in the ChatQnA application.
  • Non-Destructive: This targets only the web-scraping pipeline, where the converted content is expected to contain Markdown headers, and avoids unnecessary overhead for raw PDF/Docx text ingestion.
