Question
Thank you for a great library.
I am using Docling 2.57.0.
I am importing web page data with Docling (through the Haiku RAG library), and I have encountered a page that is 1) not very complicated (3.5 MB) yet 2) causes the Docling chunker to hang.
The page triggers pathological behaviour: HybridChunker hangs and never returns, even on a powerful MacBook M3 laptop. I assume this is because the page uses the legacy HTML <table> element extensively for layout.
The page in question is this Hacker News discussion page.
Below is Python code to reproduce the issue.
My goal is that the software never hangs, under any circumstances.
My questions are:
- Is this a bug or a feature?
- If it is a feature, how could I improve Docling's robustness so that I have a fallback for a chunker/converter that does not hang when a page like this is detected? (A rough sketch of the kind of fallback I mean follows below.)
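As a stopgap, I am considering running the conversion and chunking in a separate process with a hard timeout, so that a hung worker can be killed and the caller can fall back to a simpler splitter. A minimal sketch of what I mean (`CHUNK_TIMEOUT_S` is a made-up budget, and I return plain strings via `chunk.text` only to keep the payload picklable; neither is a Docling requirement):

```python
import multiprocessing as mp
import queue
from pathlib import Path

CHUNK_TIMEOUT_S = 60.0  # made-up per-document budget, not a Docling setting


def _convert_and_chunk(path_str: str, out: mp.Queue) -> None:
    # Imports live in the worker so the child process builds its own state.
    from docling.chunking import HybridChunker
    from docling.document_converter import DocumentConverter

    document = DocumentConverter().convert(path_str).document
    chunker = HybridChunker()
    # Plain strings keep the result picklable across the process boundary.
    out.put([chunk.text for chunk in chunker.chunk(document)])


def chunk_with_timeout(path: Path) -> list[str] | None:
    """Return chunk texts, or None if the worker exceeds the budget."""
    result_q: mp.Queue = mp.Queue()
    proc = mp.Process(target=_convert_and_chunk, args=(str(path), result_q))
    proc.start()
    try:
        return result_q.get(timeout=CHUNK_TIMEOUT_S)
    except queue.Empty:
        return None  # caller falls back to a plain-text splitter
    finally:
        if proc.is_alive():
            proc.terminate()  # unlike a thread, a hung process can be killed
        proc.join()
```

On macOS the caller needs to sit under an `if __name__ == "__main__":` guard, since processes start via spawn there. Is something along these lines the recommended way to sandbox Docling, or is there a built-in limit I am missing?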
Code to repeat:
"""A discussion heavy Hacker News web page hangs/overloads Docling."""
from pathlib import Path
import tiktoken
from docling.chunking import HybridChunker # type: ignore
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer
from docling_core.types import DoclingDocument
chunk_size = 256
source_path = Path.cwd() / "deps" / "haiku.rag" / "examples" / "samples" / "hacker-news-discussion.html"
print(f"File size is {source_path.stat().st_size / 1024:.2f} KB")
tokenizer = OpenAITokenizer(
tokenizer=tiktoken.encoding_for_model("gpt-4o"),
max_tokens=chunk_size
)
chunker = HybridChunker(tokenizer=tokenizer)
converter = DocumentConverter()
conversion_result = converter.convert(source_path)
docling_document: DoclingDocument = conversion_result.document
print(f"Docling document has {len(docling_document.texts)} texts, {len(docling_document.tables)} tables, {len(docling_document.pictures)} pictures.")
# Too much for chunker to handle
print("Starting chunking...")
chunks = list(chunker.chunk(docling_document))
print(f"Generated {len(chunks)} chunks")Example run:
2025-10-22 19:06:20,267 - INFO - detected formats: [<InputFormat.HTML: 'html'>]
2025-10-22 19:06:21,771 - INFO - Going to convert document batch..., docs in batch: 1, doc_batch_concurrency: 1, doc_batch_size: 1
2025-10-22 19:06:21,771 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2025-10-22 19:06:21,778 - INFO - Loading plugin 'docling_defaults'
2025-10-22 19:06:21,778 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-10-22 19:06:21,778 - INFO - Processing document hacker-news-discussion.html
2025-10-22 19:06:42,491 - INFO - Finished converting document hacker-news-discussion.html in 22.23 sec.
Docling document has 22395 texts, 1904 tables, 7622 pictures.
Starting chunking...
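Note the 1904 tables and 7622 pictures in the converted document above. If the hang really is driven by layout tables, a cheap pre-flight check on the raw HTML could route pages like this to a plain-text fallback before Docling ever sees them. A minimal sketch, with a made-up threshold:

```python
from html.parser import HTMLParser
from pathlib import Path

MAX_TABLES = 500  # made-up cutoff for "layout-table-heavy" pages; tune empirically


class _TableCounter(HTMLParser):
    """Counts <table> start tags without building a DOM."""

    def __init__(self) -> None:
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1


def looks_table_heavy(path: Path) -> bool:
    counter = _TableCounter()
    counter.feed(path.read_text(encoding="utf-8", errors="replace"))
    return counter.tables > MAX_TABLES
```

With a check like `if looks_table_heavy(source_path):` the pipeline could skip DocumentConverter entirely for such pages, but I would rather understand why the chunker blows up in the first place.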
The original haiku.rag issue is here: ggozad/haiku.rag#112
CC @ggozad
The page in question is attached as an HTML doc: