Closed
Labels
bug (Something isn't working)
Description
Bug
After upgrading to Docling 2.55.0, around 30% of my HTML pages fail to parse. This did not happen with Docling versions prior to 2.55.0. Below is a simple step-by-step scenario that reproduces the issue. The HTML data is anonymized and reduced to the part needed to reproduce the issue. There may be other failing cases that I am not yet aware of. I suspect the issue may be related to #2324.
Steps to reproduce
The Python script to parse HTML page:
import logging
import json
from pathlib import Path
from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import InputDocument
from docling.backend.html_backend import HTMLDocumentBackend
file_path = "./bad.html"
file = Path(file_path)
in_doc = InputDocument(
    path_or_stream=file,
    format=InputFormat.HTML,
    backend=HTMLDocumentBackend,
    filename=file_path,
)
backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=file)
try:
    dl_doc = backend.convert()
    docling_text = json.dumps(dl_doc.export_to_dict())
except Exception:
    logging.exception(f"Unable to parse {file_path} using Docling")
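To check how widespread the failure is across a set of pages (the report mentions roughly 30%), the single-file script above could be wrapped in a loop. The following is only a minimal sketch and not part of the original report: `failure_rate` and the `convert` callable are hypothetical names, and `convert` is assumed to wrap the `HTMLDocumentBackend` call from the script above.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def failure_rate(
    paths: Iterable[Path], convert: Callable[[Path], object]
) -> Tuple[float, Dict[str, str]]:
    """Run `convert` on every path and report which ones raised.

    `convert` is a hypothetical callable supplied by the caller, for
    example a function that builds the backend and calls `.convert()`.
    Returns (share of failed files, {path: "ErrorType: message"}).
    """
    failures: Dict[str, str] = {}
    total = 0
    for path in paths:
        total += 1
        try:
            convert(path)
        except Exception as exc:  # broad on purpose: we only classify here
            failures[str(path)] = f"{type(exc).__name__}: {exc}"
    rate = len(failures) / total if total else 0.0
    return rate, failures
```

Collecting the error messages per file makes it easier to see whether all failures share the same `ValueError` or whether several distinct problems are involved.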
The bad.html:
<div class="table-wrap">
<table class="wrapped fixed-table confluenceTable">
<colgroup>
<col style="width: 87.0px;" />
<col style="width: 99.0px;" />
<col style="width: 459.0px;" />
<col style="width: 547.0px;" />
<col style="width: 579.0px;" />
</colgroup>
<tbody>
<tr>
<th class="confluenceTh">...</th>
<th class="confluenceTh">
<p>Screen</p>
</th>
<th class="confluenceTh">...</th>
<th class="confluenceTh">...</th>
<th class="confluenceTh">...</th>
</tr>
<tr>
<td class="confluenceTd">
<h2 id="...">...</h2>
</td>
<td class="confluenceTd"><br /></td>
<td class="confluenceTd"><br /></td>
<td class="confluenceTd"><br /></td>
<td class="confluenceTd"><br /></td>
</tr>
<tr>
<td class="confluenceTd">...</td>
<td class="confluenceTd">...</td>
<td class="confluenceTd">
<div class="content-wrapper">
<p><span class="confluence-embedded-file-wrapper confluence-embedded-manual-size"><img
class="confluence-embedded-image" draggable="false" height="250"
src="..."></span>
</p>
<p><span class="confluence-embedded-file-wrapper confluence-embedded-manual-size"><img
class="confluence-embedded-image" draggable="false" height="250"
src="..."></span>
</p>
<p><span class="confluence-embedded-file-wrapper confluence-embedded-manual-size"><img
class="confluence-embedded-image confluence-thumbnail" draggable="false" height="65"
src="..."></span><span
class="confluence-embedded-file-wrapper confluence-embedded-manual-size"><img
class="confluence-embedded-image confluence-thumbnail" draggable="false" height="67"
src="..."></span>
</p>
</div>
</td>
<td class="confluenceTd">
<p>...</p>
<ul>
<li>...</li>
<li>...</li>
</ul>
<p>...</p>
<ul>
<li>...</li>
<li>...</li>
</ul>
</td>
<td class="confluenceTd"><br /></td>
</tr>
<tr>
<td class="confluenceTd">...</td>
<td class="confluenceTd">
<h3 id="...">...</h3>
</td>
<td class="confluenceTd">
<div class="content-wrapper">
<p><span class="confluence-embedded-file-wrapper confluence-embedded-manual-size"><img
class="confluence-embedded-image" draggable="false" height="250"
src="..."></span>
</p>
</div>
</td>
<td class="confluenceTd">
<p>...<span style="color: rgb(255,0,0);">...
<img class="emoticon emoticon-question"
src="..."
data-emoticon-name="question" alt="(question)"
data-emoji-short-name=":question:" /></span>...</p>
<h4 id="...">...</h4>
<ul>
<li>...</li>
<li><span style="color: rgb(255,0,0);">...<img class="emoticon emoticon-question"
src="..."
data-emoticon-name="question" alt="(question)" data-emoji-short-name=":question:" />
...</span></li>
</ul>
<h4 id="...">...</h4>
<p>...<span
style="color: rgb(255,0,0);">...</span> <img class="emoticon emoticon-question"
src="..."
data-emoticon-name="question" alt="(question)" data-emoji-short-name=":question:" />...</p>
<p><span style="color: rgb(255,0,0);">...<img class="emoticon emoticon-question"
src="..."
data-emoticon-name="question" alt="(question)"
data-emoji-short-name=":question:" /></span></p>
<h4 id="...">...</h4>
<div class="table-wrap">
<table class="wrapped confluenceTable">
<tbody>
<tr>
<th class="confluenceTh">...</th>
<th class="confluenceTh">...</th>
<th class="confluenceTh">...</th>
</tr>
<tr>
<td class="confluenceTd">...</td>
<td class="confluenceTd">...<span
style="color: rgb(255,0,0);">...</span></td>
<td class="confluenceTd">...</td>
</tr>
<tr>
<td class="confluenceTd">...</td>
<td class="confluenceTd">...<span
style="color: rgb(255,0,0);">...</span></td>
<td class="confluenceTd">...</td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
<tr>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">
<h3 id="...">...</h3>
</td>
<td colspan="1" class="confluenceTd">
<div class="content-wrapper">
<p><span class="confluence-embedded-file-wrapper confluence-embedded-manual-size"><img
class="confluence-embedded-image" draggable="false" height="250"
src="..."></span>
</p>
</div>
</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd"><br /></td>
</tr>
</tbody>
</table>
</div>
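The sample above has table cells that hold block-level content (headings, paragraphs, lists) and one table nested inside a cell. The traceback in this report passes through `process_rich_table_cells`, so such cells may be involved, although that is not confirmed here. As a triage aid, the sketch below counts such cells using only the standard library. `RichCellFinder`, `scan`, and the choice of `BLOCK_TAGS` are hypothetical and not taken from Docling, and the parser assumes cells are closed with explicit `</td>` / `</th>` tags.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class RichCellFinder(HTMLParser):
    """Count cells with block-level content and tables nested in cells."""

    CELL_TAGS = {"td", "th"}
    # Assumption: which tags count as "block-level" is an illustrative choice.
    BLOCK_TAGS = {"table", "ul", "ol", "p", "div",
                  "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self._cells: List[bool] = []  # one flag per open cell: already rich?
        self.rich_cells = 0
        self.nested_tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table" and self._cells:
            self.nested_tables += 1  # a table opened inside an open cell
        if tag in self.CELL_TAGS:
            self._cells.append(False)
        elif tag in self.BLOCK_TAGS and self._cells and not self._cells[-1]:
            self._cells[-1] = True  # count each cell at most once
            self.rich_cells += 1

    def handle_endtag(self, tag):
        if tag in self.CELL_TAGS and self._cells:
            self._cells.pop()


def scan(html: str) -> Tuple[int, int]:
    """Return (rich cell count, nested table count) for an HTML string."""
    finder = RichCellFinder()
    finder.feed(html)
    finder.close()
    return finder.rich_cells, finder.nested_tables
```

Running `scan` over both failing and passing pages would show whether the counts correlate with the failures; if they do not, the trigger is likely something else in the markup.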
The error:
ERROR:root:Unable to parse ./bad.html using Docling
Traceback (most recent call last):
  File "/home/docqa/test_docling.py", line 21, in <module>
    dl_doc = backend.convert()
  File "/home/docqa/.venv/lib/python3.10/site-packages/docling/backend/html_backend.py", line 281, in convert
    self._walk(content, doc)
  File "/home/docqa/.venv/lib/python3.10/site-packages/docling/backend/html_backend.py", line 521, in _walk
    wk3 = self._walk(node, doc)
  File "/home/docqa/.venv/lib/python3.10/site-packages/docling/backend/html_backend.py", line 517, in _walk
    blk = self._handle_block(node, doc)
  File "/home/docqa/.venv/lib/python3.10/site-packages/docling/backend/html_backend.py", line 1038, in _handle_block
    self.parse_table_data(tag, doc, docling_table, num_rows, num_cols)
  File "/home/docqa/.venv/lib/python3.10/site-packages/docling/backend/html_backend.py", line 401, in parse_table_data
    HTMLDocumentBackend.process_rich_table_cells(
  File "/home/docqa/.venv/lib/python3.10/site-packages/docling/backend/html_backend.py", line 327, in process_rich_table_cells
    doc.delete_items(node_items=[pr_item])
  File "/home/docqa/.venv/lib/python3.10/site-packages/docling_core/types/doc/document.py", line 2057, in delete_items
    self._delete_items(refs=refs)
  File "/home/docqa/.venv/lib/python3.10/site-packages/docling_core/types/doc/document.py", line 2259, in _delete_items
    raise ValueError(
ValueError: Cannot find all provided RefItems in doc: ['#/texts/0']
Docling version
2025-10-01 14:11:57,666 - INFO - Loading plugin 'docling_defaults'
2025-10-01 14:11:57,667 - INFO - Registered ocr engines: ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
Docling version: 2.55.0
Docling Core version: 2.48.3
Docling IBM Models version: 3.9.1
Docling Parse version: 4.5.0
Python: cpython-310 (3.10.12)
Platform: Linux-6.8.0-79-generic-x86_64-with-glibc2.35
Python version
Python 3.10.12
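When comparing behavior before and after an upgrade, it can help to record the installed package versions from within the same environment that runs the script. This is a small sketch using only the standard library. `installed_version` and `version_report` are hypothetical helper names, and the distribution names in `PACKAGES` are assumed to match the names the packages are installed under.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional

# Assumption: distribution names as they would appear in `pip list`.
PACKAGES = ("docling", "docling-core", "docling-ibm-models", "docling-parse")


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version string, or None if not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def version_report(names: Iterable[str] = PACKAGES) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in names}
```

Logging `version_report()` next to each parse failure makes it clear which combination of `docling` and `docling-core` produced a given traceback.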
maxmnemonic