
@Leg0shii Leg0shii commented Jul 29, 2025

Description

This PR enhances the DeclarativeDocumentBackend to support configurable backend options and significantly improves HTML image handling capabilities in Docling.

Key Changes:

  1. Made DeclarativeDocumentBackend generic and configurable:

    • Added generic type parameter TBackendOptions to support backend-specific options
    • Integrated backend options into FormatOption with automatic defaults
  2. Introduced configurable image handling for HTML:

    • ImageOptions.NONE: Images as placeholders only
    • ImageOptions.REFERENCED: Images with URI references
    • ImageOptions.EMBEDDED: Images embedded as base64 data
  3. Enhanced image source support:

    • HTTP/HTTPS URLs, data URLs, local files, protocol-relative URLs (//example.com)
    • Relative path resolution for both web and local contexts
  4. Improved test infrastructure:

    • Tests now validate all three image handling modes
    • Reference data organized by image option type
    • Maintains portable relative paths in test outputs
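The image source kinds listed under point 3 can be told apart with the standard library alone. The following is a minimal sketch, not the code from this PR: the function name `resolve_image_src` and the labels `"data"`, `"remote"` and `"local"` are made up for illustration, and other schemes such as `file:` are ignored.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse


def resolve_image_src(src: str, base: str) -> tuple[str, str]:
    """Return (kind, resolved) for the value of an <img src> attribute.

    `base` is the location of the HTML document, either a full URL or a
    POSIX-style local path. Function name and kind labels are hypothetical.
    """
    if src.startswith("data:"):
        return "data", src  # already self-contained, nothing to resolve
    if src.startswith("//"):
        # Protocol-relative URL: reuse the scheme of the page, else https.
        scheme = urlparse(base).scheme
        if scheme not in ("http", "https"):
            scheme = "https"
        return "remote", f"{scheme}:{src}"
    if urlparse(src).scheme in ("http", "https"):
        return "remote", src  # absolute web URL
    if urlparse(base).scheme in ("http", "https"):
        return "remote", urljoin(base, src)  # relative to a web page
    # Path relative to (or absolute on) the local file system.
    return "local", str(PurePosixPath(base).parent / src)
```

For instance, `resolve_image_src("img/a.png", "/docs/page.html")` resolves against the folder that contains the HTML file, while the same `src` with a web page as `base` resolves against that page's URL.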

Usage Example:

from docling.datamodel.base_models import InputFormat
from docling.backend.html_backend import HTMLBackendOptions, ImageOptions
from docling.document_converter import DocumentConverter, HTMLFormatOption

# Configure HTML backend with embedded images
converter = DocumentConverter(
    format_options={
        InputFormat.HTML: HTMLFormatOption(
            backend_options=HTMLBackendOptions(
                image_options=ImageOptions.EMBEDDED
            )
        )
    }
)
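To make the difference between the three modes concrete, here is a rough sketch of what each one might store for a picture. It is only an illustration: the enum below is a stand-in that reuses the member names from this PR (the string values are assumed), and `picture_uri` is a hypothetical helper, not part of Docling.

```python
import base64
from enum import Enum
from typing import Optional


class ImageOptions(str, Enum):
    """Stand-in for the enum proposed in this PR; the values are assumed."""

    NONE = "none"
    REFERENCED = "referenced"
    EMBEDDED = "embedded"


def picture_uri(
    mode: ImageOptions,
    src: str,
    raw: Optional[bytes],
    mime: str = "image/png",
) -> Optional[str]:
    """Return the URI a picture item would carry under each mode (hypothetical)."""
    if mode is ImageOptions.NONE:
        return None  # placeholder only, no image data or reference
    if mode is ImageOptions.REFERENCED:
        return src  # keep the original reference untouched
    if raw is None:
        return None  # embedding requested but the bytes could not be loaded
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

With `EMBEDDED`, the document carries the bytes itself as a base64 data URI; with `REFERENCED`, a consumer still needs access to the original location.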

Breaking Changes:

None - backward compatible with optional backend options.

Remaining Issues

  • resolve_source_to_stream returns only the end portion of the URL (e.g., "about") from full URLs like https://www.website.com/section/about. This prevents the HTML backend from properly resolving relative image paths since the full base URL is needed for correct image downloading.
  • SVG images are not supported: PIL/Pillow cannot open SVG files, so they are skipped.
  • The error message for failed image embedding is incorrect for HTML documents, showing: <!-- 🖼️❌ Image not available. Please use PdfPipelineOptions(generate_picture_images=True) --> (This might occur for relative path images, SVG images, or for other reasons when an image can't be opened.)
  • Some images can't be loaded from Wikipedia due to: 403 Client Error: Forbidden. Please comply with the User-Agent policy: https://meta.wikimedia.org/wiki/User-Agent_policy
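The first issue in the list can be reproduced with `urllib.parse.urljoin` alone. This small sketch does not call `resolve_source_to_stream`; it only assumes, as described above, that the backend ends up with `"about"` instead of the full URL as its base. The image path `images/logo.png` is a made-up example.

```python
from urllib.parse import urljoin

# Base as it should be versus the truncated name described in the issue.
full_base = "https://www.website.com/section/about"
truncated_base = "about"
src = "images/logo.png"  # hypothetical relative <img src> value

# With the full URL, the relative path resolves to a downloadable address.
resolved_ok = urljoin(full_base, src)

# With only the last segment, scheme and host are lost, so the result is
# still a bare relative path that cannot be downloaded.
resolved_bad = urljoin(truncated_base, src)

print(resolved_ok)
print(resolved_bad)
```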

I believe that these issues fall outside the scope of this PR and should be handled in future PRs.

Issue resolved by this Pull Request:
Resolves #1963

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Contributor

github-actions bot commented Jul 29, 2025

DCO Check Passed

Thanks @Leg0shii, all your commits are properly signed off. 🎉


mergify bot commented Jul 29, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
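A PR title can be checked locally against the same pattern before pushing. The sketch below copies the regular expression from the rule above into Python's `re` module; the helper name `is_conventional` and the sample titles are illustrative only.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional("feat(html): support backend options"))
print(is_conventional("Support backend options"))
```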

@Leg0shii
Author

Are documentation and examples necessary? If so, where would I add them?

@ceberam ceberam requested review from ceberam and dolfim-ibm July 29, 2025 21:54
@ceberam ceberam self-assigned this Jul 29, 2025
@ceberam ceberam added the enhancement (New feature or request) and html (issue related to html backend) labels Jul 29, 2025
@ceberam
Contributor

ceberam commented Jul 29, 2025

Hello @Leg0shii
Thanks again for your willingness to improve Docling, we really appreciate your effort.
This PR is connected to the feature of image handling in HTML but the design of this feature expands beyond this backend parser and therefore we need to check the implications of all the suggested changes very carefully.
We have some feedback that should be addressed.

In terms of design:

  • By default, Docling should never access external sites without explicit instructions from the user for several reasons, including security and user privacy. The same applies to local resources. Otherwise, converting HTML pages with malicious relative paths could expose sensitive data through the parsing and the export functions.
  • We suggest introducing 2 new flag variables in the AbstractDocumentBackend class:
    • enable_remote_fetch, to allow fetching remote images (or other resources), and
    • enable_local_fetch, to allow fetching local images (or other resources).
  • These flag variables should default to False and they would be used by the backend implementations. The idea is similar to the option enable_remote_services, which requires users to explicitly opt in to communicating with external services. Check Using remote services in the documentation for more details.
  • The HTMLDocumentBackend should never pull images by default (remotely or locally). The backend should check if the options enable_remote_fetch and enable_local_fetch have been set to True to enable that functionality. If the backend attempts to fetch images without an explicit option, an OperationNotAllowed exception should be thrown.
  • Turning DeclarativeDocumentBackend into a generic model is a good idea for handling backend options. For the type variable, we prefer the naming convention like BackendOptionsT instead of TBackendOptions. These options should be optional in the backend constructors.
  • The class BackendOptions should have a string field (e.g., kind) to distinguish the subclasses such as HTMLBackendOptions. We foresee the use of unions in type annotations and therefore having discriminated unions with str discriminators will be more efficient. You can find the same approach in BaseVlmOptions or BaseAsrOptions.
  • The class HTMLBackendOptions should have the field image_fetch (instead of image_options) of type boolean. If False (default), the backend will not access remote or local resources to fetch images. If True, the backend will try to fetch those resources and embed them in DoclingDocument. The first case corresponds to ImageOptions.NONE and the second to ImageOptions.EMBEDDED in your suggested implementation. We therefore drop the 3rd scenario (ImageOptions.REFERENCED), since we believe that DoclingDocument should be self-contained and keeping just image references should be the task of the serializers.
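The design points above could fit together roughly as follows. This is a sketch under stated assumptions, not Docling code: plain dataclasses stand in for the pydantic models, `DeclarativeBackend` and `HTMLBackend` stand in for the real backend classes, `OperationNotAllowed` is redefined locally, and the method `may_fetch` is hypothetical.

```python
from dataclasses import dataclass
from typing import Generic, Literal, Optional, TypeVar


class OperationNotAllowed(Exception):
    """Local stand-in for the exception named in the review."""


@dataclass
class BackendOptions:
    kind: str = "base"  # string discriminator shared by all option classes


@dataclass
class HTMLBackendOptions(BackendOptions):
    kind: Literal["html"] = "html"
    image_fetch: bool = False  # off by default: never pull images


BackendOptionsT = TypeVar("BackendOptionsT", bound=BackendOptions)


class DeclarativeBackend(Generic[BackendOptionsT]):
    """Stand-in base class: options are optional, fetch flags default to False."""

    def __init__(
        self,
        options: Optional[BackendOptionsT] = None,
        *,
        enable_remote_fetch: bool = False,
        enable_local_fetch: bool = False,
    ) -> None:
        self.options = options
        self.enable_remote_fetch = enable_remote_fetch
        self.enable_local_fetch = enable_local_fetch


class HTMLBackend(DeclarativeBackend[HTMLBackendOptions]):
    def may_fetch(self, src: str) -> bool:
        """Return whether `src` may be fetched; raise if the opt-in flag is missing."""
        options = self.options or HTMLBackendOptions()
        if not options.image_fetch:
            return False  # images stay placeholders
        if src.startswith("data:"):
            return True  # data URLs need no external access
        is_remote = src.startswith(("http://", "https://", "//"))
        allowed = self.enable_remote_fetch if is_remote else self.enable_local_fetch
        if not allowed:
            flag = "enable_remote_fetch" if is_remote else "enable_local_fetch"
            raise OperationNotAllowed(f"Fetching {src!r} requires {flag}=True")
        return True
```

Keeping `image_fetch` (what the user wants) separate from the two `enable_*` flags (what the user permits) means a document with unexpected remote or local paths fails loudly instead of silently accessing resources.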

Other technical aspects:

  • Check your development environment since many modules (including some that are not related to this PR) show as full diff changes on git (e.g., test_backend_jats.py ). We cannot provide a proper PR review in this situation.
  • Ensure backwards compatibility (the new backend options should be optional). Therefore, all the test modules of the declarative backends (except test_backend_html.py) should not be modified.
  • Even though it is not enforced by the pre-commit hooks, please try to add docstrings to the new classes and functions, following the Google docstring convention.
  • In particular, provide some documentation on the HTMLBackendOptions fields through pydantic's Field function and its description argument.
  • Avoid remote calls in regression tests, since we do not want to put extra burden on our CI/CD pipelines. Consider using unittest.mock for the embedded image option of the HTMLDocumentBackend.
  • Rebase the PR on the main branch and resolve the conflicts, since we merged some commits today.
  • You may want to add @vaaale as co-author in some commits, since the HTML image handling is based on their initial implementation.
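As an illustration of the `unittest.mock` suggestion, a test can replace the network call with canned bytes. The sketch assumes the download goes through `urllib.request.urlopen`; the helper `fetch_image_bytes` is hypothetical, and the backend in this PR may use a different HTTP client, in which case the patch target would change accordingly.

```python
import io
import urllib.request
from unittest import mock


def fetch_image_bytes(url: str) -> bytes:
    """Hypothetical helper: download the raw bytes of an image."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def test_fetch_image_without_network() -> None:
    # BytesIO works as a context manager and has read(), like a response.
    fake_response = io.BytesIO(b"\x89PNG fake bytes")
    with mock.patch(
        "urllib.request.urlopen", return_value=fake_response
    ) as patched:
        data = fetch_image_bytes("https://example.com/logo.png")
    # The real network is never touched; only the patched call is recorded.
    patched.assert_called_once_with("https://example.com/logo.png")
    assert data == b"\x89PNG fake bytes"
```

Because the patch is applied only inside the `with` block, other tests in the same module still see the real `urlopen`.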

Further improvements, out of the scope of this task, and besides those that you already listed:

  • Enable backend options in the CLI
  • Backend options should be allowed to be extended with HTTP request headers (like User-Agent) to comply with remote service policies (e.g., User-Agent policy to avoid the 403 error messages that you pointed out).
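The second point could look roughly like this with the standard library. It is only a sketch: `build_image_request`, the `headers` parameter and the default User-Agent string are assumptions for illustration, not options that exist in Docling. No request is actually sent.

```python
import urllib.request
from typing import Mapping, Optional

# Hypothetical default; a real deployment would identify its own application.
DEFAULT_HEADERS = {"User-Agent": "my-docling-app/0.1 (https://example.com/contact)"}


def build_image_request(
    url: str, headers: Optional[Mapping[str, str]] = None
) -> urllib.request.Request:
    """Build (but do not send) a request carrying user-supplied headers."""
    merged = dict(DEFAULT_HEADERS)
    merged.update(headers or {})  # user-supplied values win over defaults
    return urllib.request.Request(url, headers=merged)


request = build_image_request(
    "https://upload.wikimedia.org/example.png",
    headers={"User-Agent": "demo-client/1.0"},
)
# urllib stores header names capitalized, hence "User-agent" here.
print(request.get_header("User-agent"))
```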

Contributor

@ceberam ceberam left a comment

Please, see the comment above

@ceberam
Contributor

ceberam commented Jul 29, 2025

> Are documentation and examples necessary? If so, where would I add them?

Besides docstrings, no further documentation is needed in this PR.

@Leg0shii
Author

Leg0shii commented Aug 3, 2025

Hello, thank you for the feedback, I really appreciate it!
I already took a look at the suggested changes, but sadly I got sick and cannot implement them at the moment.
Maybe someone else can take over from here?

@ceberam
Contributor

ceberam commented Aug 18, 2025

> Hello, thank you for the feedback, I really appreciate it! I already took a look at the suggested changes, but sadly I got sick and cannot implement them at the moment. Maybe someone else can take over from here?

@Leg0shii I will take it from there this week. Wish you a good recovery!
If you give me permissions to push to your fork, that would be helpful.

@Leg0shii
Author

I'm sorry for the late reply. I have invited you, @ceberam.

@irajank

irajank commented Sep 23, 2025

Waiting for this PR to be Merged @ceberam @dolfim-ibm
Please <3

@ceberam
Contributor

ceberam commented Sep 23, 2025

> Waiting for this PR to be Merged @ceberam @dolfim-ibm Please <3

@irajank Thanks for your interest in this feature. We had to do an extensive refactoring of the initial implementation in this PR, but we are in the testing phase at the moment, so hopefully we will release it within this week.

@irajank

irajank commented Sep 23, 2025

> > Waiting for this PR to be Merged @ceberam @dolfim-ibm Please <3
>
> @irajank Thanks for your interest in this feature. We had to do an extensive refactoring of the initial implementation in this PR, but we are in the testing phase at the moment, so hopefully we will release it within this week.

Hey, thanks for the speedy response. Looking forward to the merge ASAP.
Thanks again.

@punit1108

@ceberam @dolfim-ibm Any idea when we can expect this PR to be merged?

@ceberam
Contributor

ceberam commented Sep 29, 2025

> @ceberam @dolfim-ibm Any idea when we can expect this PR to be merged?

@punit1108 We expect to have it merged by the end of today

Leg0shii and others added 6 commits October 10, 2025 17:36
Co-authored-by: Cesar Berrospi Ramis <[email protected]>
Signed-off-by: Leg0shii <[email protected]>
Signed-off-by: Cesar Berrospi Ramis <[email protected]>
…ove HTML image handling

Co-authored-by: Cesar Berrospi Ramis <[email protected]>
Signed-off-by: Leg0shii <[email protected]>
Signed-off-by: Cesar Berrospi Ramis <[email protected]>
Co-authored-by: Cesar Berrospi Ramis <[email protected]>
Signed-off-by: Leg0shii <[email protected]>
Signed-off-by: Cesar Berrospi Ramis <[email protected]>
…are set correctly

Co-authored-by: Cesar Berrospi Ramis <[email protected]>
Signed-off-by: Leg0shii <[email protected]>
Signed-off-by: Cesar Berrospi Ramis <[email protected]>
…le paths

Co-authored-by: Cesar Berrospi Ramis <[email protected]>
Signed-off-by: Leg0shii <[email protected]>
Signed-off-by: Cesar Berrospi Ramis <[email protected]>
Co-authored-by: Cesar Berrospi Ramis <[email protected]>
Signed-off-by: Leg0shii <[email protected]>
Signed-off-by: Cesar Berrospi Ramis <[email protected]>
@ceberam ceberam force-pushed the allow-parameters-in-declarative-document-backend branch from c9d3efc to 25f8bc3 Compare October 10, 2025 15:43
@ceberam ceberam marked this pull request as draft October 10, 2025 15:56
Co-authored-by: Cesar Berrospi Ramis <[email protected]>
Signed-off-by: Leg0shii <[email protected]>
Signed-off-by: Cesar Berrospi Ramis <[email protected]>
@ceberam ceberam force-pushed the allow-parameters-in-declarative-document-backend branch from 25f8bc3 to 2826b8c Compare October 10, 2025 16:01
Labels

enhancement New feature or request html issue related to html backend

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow parameters in HTML backend or any DeclarativeDocumentBackend implementation

4 participants