Hi @bharagha @14pankaj @krish918 @jgespino, I have a proposal for the following enhancement.
Description of the Enhancement
Currently, the document-ingestion/pgvector microservice uses a RecursiveCharacterTextSplitter to chunk scraped web content. While functional for plain text, this approach is "blind" to document structure. When scraping documentation sites (which are converted to Markdown via Html2TextTransformer), the character splitter frequently slices through the middle of code blocks or multi-row tables to remain within character limits.
This results in "shattered" context where imports are separated from their functions or table headers are separated from their data, leading to hallucinations in RAG-based systems.
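To make the failure mode concrete, here is a minimal stdlib-only sketch (not the actual `RecursiveCharacterTextSplitter`, just a stand-in for its size-driven behavior) showing how a blind fixed-size split separates an import from the code that uses it. The Markdown snippet and chunk size are illustrative assumptions.

```python
# Illustration only: a naive fixed-size splitter standing in for
# size-driven character splitting, slicing through a fenced code block.
md = (
    "## Install\n"
    "Run the following script:\n"
    "```python\n"
    "import requests\n"
    "resp = requests.get(\"https://example.com\")\n"
    "print(resp.status_code)\n"
    "```\n"
)

def naive_split(text: str, chunk_size: int) -> list[str]:
    """Blind character splitting: ignores Markdown structure entirely."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = naive_split(md, 70)
# The import statement and the call that depends on it land in different
# chunks, so a retriever can return code with its imports missing.
```

With this split, `chunks[0]` ends just after `import requests` while `requests.get(...)` starts `chunks[1]`: exactly the "shattered" context described above.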
Proposed Solution
Introduce a two-stage chunking pipeline for URL ingestion:
- Structural Splitting: Use `MarkdownHeaderTextSplitter` to logically group content beneath its respective `#` and `##` headers. This ensures that semantic units (such as an entire installation step or a Python script) are physically grouped together.
- Constraint Splitting: Apply the existing token/character limits on top of these structural groups to ensure chunks fit within model context windows without losing their semantic integrity.
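The two stages above can be sketched with only the standard library; this is an illustrative assumption of the flow, whereas the real implementation would chain LangChain's `MarkdownHeaderTextSplitter` with the existing size-based splitter.

```python
# Sketch of the proposed two-stage pipeline (stdlib only, for illustration).
import re

def split_by_headers(md: str) -> list[str]:
    """Stage 1 (structural): group content under its `#`/`##` headers."""
    sections, current = [], []
    for line in md.splitlines(keepends=True):
        # Start a new section at each top- or second-level header.
        if re.match(r"^#{1,2} ", line) and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections

def constrain(sections: list[str], max_chars: int) -> list[str]:
    """Stage 2 (constraint): apply size limits on top of the groups."""
    chunks = []
    for sec in sections:
        if len(sec) <= max_chars:
            chunks.append(sec)  # section already fits: keep it whole
        else:
            chunks.extend(sec[i:i + max_chars]
                          for i in range(0, len(sec), max_chars))
    return chunks

md = "# Setup\npip install foo\n\n## Usage\nimport foo\nfoo.run()\n"
chunks = constrain(split_by_headers(md), 200)
# Each header stays attached to its body; the size limit only kicks in
# when a single section exceeds the model's context budget.
```

The design point is the ordering: structure first, size second, so a chunk is only ever cut inside a section when that section alone exceeds the limit.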
Impact
- Improved Retrieval: Semantic units remain intact, significantly reducing LLM hallucinations caused by partial context.
- Better Developer Experience: Technical documentation (code/configs) becomes far more useful in the ChatQnA application.
- Non-Destructive: This specifically targets the web-scraping pipeline where Markdown headers are guaranteed to exist, avoiding unnecessary overhead for raw PDF/Docx text ingestion.