Question
Thank you for a great library.
I am using Docling 2.57.0.
I am importing web page data with Docling (through the Haiku RAG library), and I have encountered a page that is 1) not very complicated (3.5 MB) yet 2) causes the Docling chunker to hang.
The page triggers pathological behaviour: HybridChunker hangs and never returns, even on a powerful MacBook M3 laptop. I assume this is because the page uses the legacy HTML <table> element extensively for layout.
The page in question is this Hacker News discussion page.
Below is Python code to reproduce the issue.
My goal is that the software never hangs, under any circumstances.
My questions are:
- Is this a bug or a feature?
- If it is a feature, how could I improve Docling's robustness so that I have a fallback for a chunker/converter that does not hang when a page like this is detected? (A rough sketch of the kind of fallback I mean follows below.)
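As a stopgap, I am considering running the conversion and chunking in a separate process with a hard timeout, so that a hung worker can be killed and the caller can fall back to a simpler splitter. A minimal sketch of what I mean (`CHUNK_TIMEOUT_S` is a made-up budget, and I return plain strings via `chunk.text` only to keep the payload picklable; neither is a Docling requirement):

```python
import multiprocessing as mp
import queue
from pathlib import Path

CHUNK_TIMEOUT_S = 60.0  # made-up per-document budget, not a Docling setting


def _convert_and_chunk(path_str: str, out: mp.Queue) -> None:
    # Imports live in the worker so the child process builds its own state.
    from docling.chunking import HybridChunker
    from docling.document_converter import DocumentConverter

    document = DocumentConverter().convert(path_str).document
    chunker = HybridChunker()
    # Plain strings keep the result picklable across the process boundary.
    out.put([chunk.text for chunk in chunker.chunk(document)])


def chunk_with_timeout(path: Path) -> list[str] | None:
    """Return chunk texts, or None if the worker exceeds the budget."""
    result_q: mp.Queue = mp.Queue()
    proc = mp.Process(target=_convert_and_chunk, args=(str(path), result_q))
    proc.start()
    try:
        return result_q.get(timeout=CHUNK_TIMEOUT_S)
    except queue.Empty:
        return None  # caller falls back to a plain-text splitter
    finally:
        if proc.is_alive():
            proc.terminate()  # unlike a thread, a hung process can be killed
        proc.join()
```

On macOS the caller needs to sit under an `if __name__ == "__main__":` guard, since processes start via spawn there. Is something along these lines the recommended way to sandbox Docling, or is there a built-in limit I am missing?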
Code to repeat:
"""A discussion heavy Hacker News web page hangs/overloads Docling."""
from pathlib import Path
import tiktoken
from docling.chunking import HybridChunker # type: ignore
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer
from docling_core.types import DoclingDocument
chunk_size = 256
source_path = Path.cwd() / "deps" / "haiku.rag" / "examples" / "samples" / "hacker-news-discussion.html"
print(f"File size is {source_path.stat().st_size / 1024:.2f} KB")
tokenizer = OpenAITokenizer(
tokenizer=tiktoken.encoding_for_model("gpt-4o"),
max_tokens=chunk_size
)
chunker = HybridChunker(tokenizer=tokenizer)
converter = DocumentConverter()
conversion_result = converter.convert(source_path)
docling_document: DoclingDocument = conversion_result.document
print(f"Docling document has {len(docling_document.texts)} texts, {len(docling_document.tables)} tables, {len(docling_document.pictures)} pictures.")
# Too much for chunker to handle
print("Starting chunking...")
chunks = list(chunker.chunk(docling_document))
print(f"Generated {len(chunks)} chunks")Example run:
2025-10-22 19:06:20,267 - INFO - detected formats: [<InputFormat.HTML: 'html'>]
2025-10-22 19:06:21,771 - INFO - Going to convert document batch..., docs in batch: 1, doc_batch_concurrency: 1, doc_batch_size: 1
2025-10-22 19:06:21,771 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2025-10-22 19:06:21,778 - INFO - Loading plugin 'docling_defaults'
2025-10-22 19:06:21,778 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-10-22 19:06:21,778 - INFO - Processing document hacker-news-discussion.html
2025-10-22 19:06:42,491 - INFO - Finished converting document hacker-news-discussion.html in 22.23 sec.
Docling document has 22395 texts, 1904 tables, 7622 pictures.
Starting chunking...
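Note the 1904 tables and 7622 pictures in the converted document above. If the hang really is driven by layout tables, a cheap pre-flight check on the raw HTML could route pages like this to a plain-text fallback before Docling ever sees them. A minimal sketch, with a made-up threshold:

```python
from html.parser import HTMLParser
from pathlib import Path

MAX_TABLES = 500  # made-up cutoff for "layout-table-heavy" pages; tune empirically


class _TableCounter(HTMLParser):
    """Counts <table> start tags without building a DOM."""

    def __init__(self) -> None:
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1


def looks_table_heavy(path: Path) -> bool:
    counter = _TableCounter()
    counter.feed(path.read_text(encoding="utf-8", errors="replace"))
    return counter.tables > MAX_TABLES
```

With a check like `if looks_table_heavy(source_path):` the pipeline could skip DocumentConverter entirely for such pages, but I would rather understand why the chunker blows up in the first place.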
The original haiku.rag issue is here: ggozad/haiku.rag#112
CC @ggozad
The page in question is attached as an HTML doc: