~30% of my HTML pages are not converted after upgrade to Docling 2.55.0 #2360

@tysonite

Bug

Right after I upgraded to Docling 2.55.0, around 30% of my HTML pages fail to parse. This behavior was not observed on Docling versions prior to 2.55.0. Below is a simple step-by-step scenario to reproduce the issue. The HTML data is anonymized and reduced to the scenario where the issue reproduces. There may be other failing cases that I am not yet aware of. I also suspect the issue might be related to #2324.

Steps to reproduce

The Python script used to parse the HTML page:

import logging
import json
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import InputDocument
from docling.backend.html_backend import HTMLDocumentBackend

file_path = "./bad.html"
file = Path(file_path)


in_doc = InputDocument(
    path_or_stream=file,
    format=InputFormat.HTML,
    backend=HTMLDocumentBackend,
    filename=file_path,
)
backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=file)
try:
    dl_doc = backend.convert()
    docling_text = json.dumps(dl_doc.export_to_dict())
except Exception:
    logging.exception(f"Unable to parse {file_path} using Docling")

The bad.html:


<div class="table-wrap">
    <table class="wrapped fixed-table confluenceTable">
        <colgroup>
            <col style="width: 87.0px;" />
            <col style="width: 99.0px;" />
            <col style="width: 459.0px;" />
            <col style="width: 547.0px;" />
            <col style="width: 579.0px;" />
        </colgroup>
        <tbody>
            <tr>
                <th class="confluenceTh">...</th>
                <th class="confluenceTh">
                    <p>Screen</p>
                </th>
                <th class="confluenceTh">...</th>
                <th class="confluenceTh">...</th>
                <th class="confluenceTh">...</th>
            </tr>
            <tr>
                <td class="confluenceTd">
                    <h2 id="...">...</h2>
                </td>
                <td class="confluenceTd"><br /></td>
                <td class="confluenceTd"><br /></td>
                <td class="confluenceTd"><br /></td>
                <td class="confluenceTd"><br /></td>
            </tr>
            <tr>
                <td class="confluenceTd">...</td>
                <td class="confluenceTd">...</td>
                <td class="confluenceTd">
                    <div class="content-wrapper">
                        <p><span class="confluence-embedded-file-wrapper confluence-embedded-manual-size"><img
                                    class="confluence-embedded-image" draggable="false" height="250"
                                    src="..."></span>
                        </p>
                        <p><span class="confluence-embedded-file-wrapper confluence-embedded-manual-size"><img
                                    class="confluence-embedded-image" draggable="false" height="250"
                                    src="..."></span>
                        </p>
                        <p><span class="confluence-embedded-file-wrapper confluence-embedded-manual-size"><img
                                    class="confluence-embedded-image confluence-thumbnail" draggable="false" height="65"
                                    src="..."></span><span
                                class="confluence-embedded-file-wrapper confluence-embedded-manual-size"><img
                                    class="confluence-embedded-image confluence-thumbnail" draggable="false" height="67"
                                    src="..."></span>
                        </p>
                    </div>
                </td>
                <td class="confluenceTd">
                    <p>...</p>
                    <ul>
                        <li>...</li>
                        <li>...</li>
                    </ul>
                    <p>...</p>
                    <ul>
                        <li>...</li>
                        <li>...</li>
                    </ul>
                </td>
                <td class="confluenceTd"><br /></td>
            </tr>
            <tr>
                <td class="confluenceTd">...</td>
                <td class="confluenceTd">
                    <h3 id="...">...</h3>
                </td>
                <td class="confluenceTd">
                    <div class="content-wrapper">
                        <p><span class="confluence-embedded-file-wrapper confluence-embedded-manual-size"><img
                                    class="confluence-embedded-image" draggable="false" height="250"
                                    src="..."></span>
                        </p>
                    </div>
                </td>
                <td class="confluenceTd">
                    <p>...<span style="color: rgb(255,0,0);">...
                            <img class="emoticon emoticon-question"
                                src="..."
                                data-emoticon-name="question" alt="(question)"
                                data-emoji-short-name=":question:" /></span>...</p>
                    <h4 id="...">...</h4>
                    <ul>
                        <li>...</li>
                        <li><span style="color: rgb(255,0,0);">...<img class="emoticon emoticon-question"
                                    src="..."
                                    data-emoticon-name="question" alt="(question)" data-emoji-short-name=":question:" />
                                ...</span></li>
                    </ul>
                    <h4 id="...">...</h4>
                    <p>...<span
                            style="color: rgb(255,0,0);">...</span> <img class="emoticon emoticon-question"
                            src="..."
                            data-emoticon-name="question" alt="(question)" data-emoji-short-name=":question:" />...</p>
                    <p><span style="color: rgb(255,0,0);">...<img class="emoticon emoticon-question"
                                src="..."
                                data-emoticon-name="question" alt="(question)"
                                data-emoji-short-name=":question:" /></span></p>
                    <h4 id="...">...</h4>
                    <div class="table-wrap">
                        <table class="wrapped confluenceTable">
                            <tbody>
                                <tr>
                                    <th class="confluenceTh">...</th>
                                    <th class="confluenceTh">...</th>
                                    <th class="confluenceTh">...</th>
                                </tr>
                                <tr>
                                    <td class="confluenceTd">...</td>
                                    <td class="confluenceTd">...<span
                                            style="color: rgb(255,0,0);">...</span></td>
                                    <td class="confluenceTd">...</td>
                                </tr>
                                <tr>
                                    <td class="confluenceTd">...</td>
                                    <td class="confluenceTd">...<span
                                            style="color: rgb(255,0,0);">...</span></td>
                                    <td class="confluenceTd">...</td>
                                </tr>
                            </tbody>
                        </table>
                    </div>
                </td>
            </tr>
            <tr>
                <td colspan="1" class="confluenceTd">...</td>
                <td colspan="1" class="confluenceTd">
                    <h3 id="...">...</h3>
                </td>
                <td colspan="1" class="confluenceTd">
                    <div class="content-wrapper">
                        <p><span class="confluence-embedded-file-wrapper confluence-embedded-manual-size"><img
                                    class="confluence-embedded-image" draggable="false" height="250"
                                    src="..."></span>
                        </p>
                    </div>
                </td>
                <td colspan="1" class="confluenceTd">...</td>
                <td colspan="1" class="confluenceTd"><br /></td>
            </tr>
        </tbody>
    </table>
</div>
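
Since the error reported further down is raised from `process_rich_table_cells`, one plausible but unconfirmed guess is that table cells holding headings, lists, images, or nested tables are involved. The following is a minimal triage sketch under that assumption, using only the standard library. The names `RICH_TAGS`, `RichCellScanner`, and `find_rich_cell_tags` are hypothetical helpers for illustration, and the tag set is not Docling's own definition of a "rich" cell.

```python
from html.parser import HTMLParser

# Tags treated as "rich" cell content in this sketch.
# This set is an assumption for triage, not Docling's definition.
RICH_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "table", "img"}


class RichCellScanner(HTMLParser):
    """Record rich tags that appear inside <td>/<th> cells."""

    def __init__(self):
        super().__init__()
        self.cell_depth = 0        # > 0 while inside at least one table cell
        self.rich_tags_found = []  # rich tags seen inside cells, in document order

    def handle_starttag(self, tag, attrs):
        if self.cell_depth > 0 and tag in RICH_TAGS:
            self.rich_tags_found.append(tag)
        if tag in ("td", "th"):
            self.cell_depth += 1

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell_depth > 0:
            self.cell_depth -= 1


def find_rich_cell_tags(html):
    """Return the rich tags found inside table cells of the given HTML string."""
    scanner = RichCellScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.rich_tags_found
```

Running `find_rich_cell_tags(Path("bad.html").read_text())` over a set of pages could help check whether the failing pages are the ones with such cells. A non-empty result is only a hint, not proof that the page triggers the bug.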

The error:

ERROR:root:Unable to parse ./bad.html using Docling
Traceback (most recent call last):
  File "/home/docqa/test_docling.py", line 21, in <module>
    dl_doc = backend.convert()
  File "/home/docqa/.venv/lib/python3.10/site-packages/docling/backend/html_backend.py", line 281, in convert
    self._walk(content, doc)
  File "/home/docqa/.venv/lib/python3.10/site-packages/docling/backend/html_backend.py", line 521, in _walk
    wk3 = self._walk(node, doc)
  File "/home/docqa/.venv/lib/python3.10/site-packages/docling/backend/html_backend.py", line 517, in _walk
    blk = self._handle_block(node, doc)
  File "/home/docqa/.venv/lib/python3.10/site-packages/docling/backend/html_backend.py", line 1038, in _handle_block
    self.parse_table_data(tag, doc, docling_table, num_rows, num_cols)
  File "/home/docqa/.venv/lib/python3.10/site-packages/docling/backend/html_backend.py", line 401, in parse_table_data
    HTMLDocumentBackend.process_rich_table_cells(
  File "/home/docqa/.venv/lib/python3.10/site-packages/docling/backend/html_backend.py", line 327, in process_rich_table_cells
    doc.delete_items(node_items=[pr_item])
  File "/home/docqa/.venv/lib/python3.10/site-packages/docling_core/types/doc/document.py", line 2057, in delete_items
    self._delete_items(refs=refs)
  File "/home/docqa/.venv/lib/python3.10/site-packages/docling_core/types/doc/document.py", line 2259, in _delete_items
    raise ValueError(
ValueError: Cannot find all provided RefItems in doc: ['#/texts/0']
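
To check how many pages in a larger set hit this error, a small batch harness along these lines may help. It is a sketch only: `collect_failures` is a hypothetical helper, and the `convert` argument stands in for whatever callable performs the conversion (for example, a function wrapping the `InputDocument` and `HTMLDocumentBackend` steps from the script above).

```python
import logging
from pathlib import Path
from typing import Callable, Dict, Iterable


def collect_failures(
    paths: Iterable[Path], convert: Callable[[Path], object]
) -> Dict[str, str]:
    """Run `convert` on every path and map each failing path to its error text."""
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            convert(path)
        except Exception as exc:
            # Log and keep going so one bad page does not stop the batch.
            logging.warning("Unable to parse %s: %s", path, exc)
            failures[str(path)] = f"{type(exc).__name__}: {exc}"
    return failures
```

Dividing `len(failures)` by the number of input pages gives the failure rate, and grouping the stored messages shows whether all failures share the same `ValueError`.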

Docling version

2025-10-01 14:11:57,666 - INFO - Loading plugin 'docling_defaults'
2025-10-01 14:11:57,667 - INFO - Registered ocr engines: ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
Docling version: 2.55.0
Docling Core version: 2.48.3
Docling IBM Models version: 3.9.1
Docling Parse version: 4.5.0
Python: cpython-310 (3.10.12)
Platform: Linux-6.8.0-79-generic-x86_64-with-glibc2.35
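
For comparing environments before and after the upgrade, the same package versions can also be read programmatically with the standard library. This is a minimal sketch: `installed_versions` is a hypothetical helper, and the distribution names in `PACKAGES` are assumed to match the packages listed above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names assumed to correspond to the versions listed above.
PACKAGES = ("docling", "docling-core", "docling-ibm-models", "docling-parse")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or 'not installed'."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver}")
```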

Python version

Python 3.10.12

Labels: bug
