Chunking Hacker News discussion page: HybridChunker hangs #2510

@miohtama

Description

Question

Thank you for a great library.

I am using Docling 2.57.0.

I am attempting to import some web page data using Docling (through the haiku.rag library). I have encountered a web page that is 1) not very complicated (3.5 MB) and 2) causes the Docling chunker to hang.

The page triggers some pathological behaviour, and HybridChunker hangs (never returns). I am running on a powerful MacBook M3 laptop. I assume this is because the Hacker News page in question uses the legacy HTML <table> element extensively in its layout.
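If the table-heavy layout is indeed the trigger, one possible workaround is to flatten layout tables before handing the HTML to Docling. Below is a minimal stdlib-only sketch (not part of Docling, and whether it preserves enough structure for RAG is an assumption to verify) that drops `<table>`-family tags while keeping their inner content:

```python
# Sketch: strip table markup from HTML while keeping the text inside it.
# This is a workaround idea, not a Docling feature; verify the output
# still carries the structure you need before using it in a pipeline.
from html.parser import HTMLParser

LAYOUT_TAGS = {"table", "tbody", "thead", "tfoot", "tr", "td", "th"}

class TableFlattener(HTMLParser):
    """Re-emit HTML with table-family tags removed, content kept."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in LAYOUT_TAGS:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag not in LAYOUT_TAGS:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in LAYOUT_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def flatten_tables(html: str) -> str:
    parser = TableFlattener()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The flattened HTML could then be fed to `DocumentConverter` instead of the raw page, at the cost of losing any genuinely tabular data.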

This particular page is this Hacker News discussion page.

Below is Python code to reproduce the issue.

My goal is for the software not to hang under any circumstances.

My questions are:

  • Is this a bug or a feature?
  • If it is a feature, how could I improve Docling's robustness so that there is some sort of fallback for the chunker/converter, one that does not hang when we detect a web page that is dangerous to feed into it?
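As a defensive measure while the root cause is open, one option is to guard the chunking call with a deadline. This is a hedged sketch using a worker thread, not a Docling feature; note that a timed-out thread keeps running in the background, so a subprocess would be needed to actually reclaim the work:

```python
# Sketch of a deadline guard around a possibly-hanging call.
# run_with_deadline and its default timeout are illustrative, not Docling API.
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def run_with_deadline(fn, *args, timeout=30.0):
    """Run fn(*args); return its result, or None if it misses the deadline.

    Caveat: on timeout the worker thread is abandoned, not killed; for a
    true kill switch, run the call in a subprocess instead of a thread.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, *args).result(timeout=timeout)
    except FutureTimeout:
        return None
    finally:
        pool.shutdown(wait=False)
```

Chunking would then become something like `run_with_deadline(lambda d: list(chunker.chunk(d)), docling_document, timeout=60)`, with a simpler fallback chunker used when `None` comes back.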

hacker-news-discussion.html

Code to reproduce:

"""A discussion heavy Hacker News web page hangs/overloads Docling."""
from pathlib import Path

import tiktoken
from docling.chunking import HybridChunker  # type: ignore
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer
from docling_core.types import DoclingDocument

chunk_size = 256
source_path = Path.cwd() / "deps" / "haiku.rag" / "examples" / "samples" / "hacker-news-discussion.html"

print(f"File size is {source_path.stat().st_size / 1024:.2f} KB")

tokenizer = OpenAITokenizer(
    tokenizer=tiktoken.encoding_for_model("gpt-4o"),
    max_tokens=chunk_size
)
chunker = HybridChunker(tokenizer=tokenizer)

converter = DocumentConverter()
conversion_result = converter.convert(source_path)
docling_document: DoclingDocument = conversion_result.document

print(f"Docling document has {len(docling_document.texts)} texts, {len(docling_document.tables)} tables, {len(docling_document.pictures)} pictures.")

# Too much for chunker to handle
print("Starting chunking...")
chunks = list(chunker.chunk(docling_document))

print(f"Generated {len(chunks)} chunks")

Example run:

2025-10-22 19:06:20,267 - INFO - detected formats: [<InputFormat.HTML: 'html'>]
2025-10-22 19:06:21,771 - INFO - Going to convert document batch..., docs in batch: 1, doc_batch_concurrency: 1, doc_batch_size: 1
2025-10-22 19:06:21,771 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2025-10-22 19:06:21,778 - INFO - Loading plugin 'docling_defaults'
2025-10-22 19:06:21,778 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-10-22 19:06:21,778 - INFO - Processing document hacker-news-discussion.html
2025-10-22 19:06:42,491 - INFO - Finished converting document hacker-news-discussion.html in 22.23 sec.
Docling document has 22395 texts, 1904 tables, 7622 pictures.
Starting chunking...
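Those counts (22395 texts, 1904 tables, 7622 pictures from a single page) suggest a cheap pre-flight guard on the application side: refuse or downgrade chunking when a converted document looks pathological. A sketch, with made-up thresholds that are guesses rather than Docling recommendations:

```python
# Heuristic pre-flight check; thresholds are illustrative guesses,
# not tuned values or Docling defaults.
def looks_pathological(n_texts: int, n_tables: int, n_pictures: int,
                       max_texts: int = 10_000,
                       max_tables: int = 500,
                       max_pictures: int = 2_000) -> bool:
    """Flag converted documents likely to overload the chunker."""
    return (n_texts > max_texts
            or n_tables > max_tables
            or n_pictures > max_pictures)
```

Called as `looks_pathological(len(doc.texts), len(doc.tables), len(doc.pictures))` right after conversion, this would let the caller route such pages to a plain-text fallback instead of HybridChunker.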

The original haiku.rag issue here: ggozad/haiku.rag#112

CC @ggozad

Metadata

Labels

  • bug: Something isn't working
  • html: issue related to html backend
  • question: Further information is requested
