feat(backend): add generic options support and HTML image handling modes #2011
✅ DCO Check Passed. Thanks @Leg0shii, all your commits are properly signed off. 🎉
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates: This rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Are documentation and examples necessary? If so, where would I add them?
Hello @Leg0shii In terms of design:
Other technical aspects:
Further improvements, out of the scope of this task, and besides those that you already listed:
Please see the comment above.
Besides docstrings, no further documentation is needed in this PR.
Hello, thank you for the feedback, I really appreciate it!
@Leg0shii I will take it from there this week. Wish you a good recovery!
I'm sorry for the late reply. I have invited you, @ceberam.
Waiting for this PR to be merged, @ceberam @dolfim-ibm.
@irajank Thanks for your interest in this feature. We had to do an extensive refactoring of the initial implementation in this PR, but we are in the testing phase at the moment, so hopefully we will release it within this week.
Hey, thanks for the speedy response. Looking forward to the merge as soon as possible.
@ceberam @dolfim-ibm Any idea when we can expect this PR to be merged?
@punit1108 We expect to have it merged by the end of today.
Co-authored-by: Cesar Berrospi Ramis <[email protected]> Signed-off-by: Leg0shii <[email protected]> Signed-off-by: Cesar Berrospi Ramis <[email protected]>
…ove HTML image handling
…are set correctly
…le paths
c9d3efc to 25f8bc3
25f8bc3 to 2826b8c
Description
This PR enhances the DeclarativeDocumentBackend to support configurable backend options and significantly improves HTML image handling capabilities in Docling.
Key Changes:
Made DeclarativeDocumentBackend generic and configurable:
- TBackendOptions to support backend-specific options
- FormatOption with automatic defaults
Introduced configurable image handling for HTML:
Enhanced image source support:
- sources such as //example.com
Improved test infrastructure:
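To make the "generic and configurable" idea concrete, here is a minimal, self-contained sketch of a backend parameterized by an options type. It does not import Docling and is not the PR's implementation: only the name `TBackendOptions` comes from the description above, while `ImageMode`, its member names, `BaseBackendOptions`, `HtmlOptions`, `DeclarativeBackend`, `HtmlBackend`, and `render_image` are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Generic, Optional, TypeVar


class ImageMode(str, Enum):
    """Hypothetical image handling modes; the PR's real mode names may differ."""

    PLACEHOLDER = "placeholder"
    EMBEDDED = "embedded"
    REFERENCED = "referenced"


@dataclass
class BaseBackendOptions:
    """Hypothetical base class for backend-specific options."""


@dataclass
class HtmlOptions(BaseBackendOptions):
    """Hypothetical options for an HTML backend."""

    image_mode: ImageMode = ImageMode.PLACEHOLDER


# The name appears in the PR description; this definition is an assumption.
TBackendOptions = TypeVar("TBackendOptions", bound=BaseBackendOptions)


class DeclarativeBackend(Generic[TBackendOptions]):
    """Stand-in for a backend that is generic over its options type."""

    def __init__(self, options: Optional[TBackendOptions] = None) -> None:
        # Fall back to defaults so callers that pass no options keep working.
        self.options: TBackendOptions = (
            options if options is not None else self.default_options()
        )

    @classmethod
    def default_options(cls) -> TBackendOptions:
        raise NotImplementedError


class HtmlBackend(DeclarativeBackend[HtmlOptions]):
    @classmethod
    def default_options(cls) -> HtmlOptions:
        return HtmlOptions()

    def render_image(self, src: str) -> str:
        """Return a text stand-in for an image according to the configured mode."""
        mode = self.options.image_mode
        if mode is ImageMode.REFERENCED:
            return f"[image: {src}]"
        if mode is ImageMode.EMBEDDED:
            return f"[embedded image from {src}]"
        return "[image placeholder]"
```

Making the options argument optional, with a per-backend default, is one way to get the backward compatibility the description mentions: existing callers that construct a backend without options see no change in behavior.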
Usage Example:
Breaking Changes:
None - backward compatible with optional backend options.
Remaining Issues
- resolve_source_to_stream returns only the last segment of the URL (e.g., "about") from full URLs like https://www.website.com/section/about. This prevents the HTML backend from resolving relative image paths, since the full base URL is needed to download images correctly.
- <!-- 🖼️❌ Image not available. Please use PdfPipelineOptions(generate_picture_images=True)--> (This might occur for images with relative paths, SVG images, or other cases where an image can't be opened.)
- 403 Client Error: Forbidden. Please comply with the User-Agent policy: https://meta.wikimedia.org/wiki/User-Agent_policy
I believe that these issues fall outside the scope of this PR and should be handled in future PRs.
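The first remaining issue can be illustrated with the standard library alone. The sketch below assumes nothing about Docling internals: `resolve_image_src` and `is_fetchable` are hypothetical helpers, and the image paths are made up. It shows that a relative or protocol-relative `src` resolves to a downloadable URL only when the full page URL is available as the base.

```python
from urllib.parse import urljoin, urlparse

FULL_BASE = "https://www.website.com/section/about"
TRUNCATED_BASE = "about"  # what remains if only the last URL segment is kept


def resolve_image_src(base: str, src: str) -> str:
    """Resolve an image src against the URL of the page it appeared on."""
    return urljoin(base, src)


def is_fetchable(url: str) -> bool:
    """An HTTP(S) download needs both a scheme and a host in the URL."""
    parts = urlparse(url)
    return bool(parts.scheme and parts.netloc)


# Relative path: resolved against the directory of the full page URL.
relative = resolve_image_src(FULL_BASE, "images/logo.png")

# Protocol-relative source: inherits only the scheme from the base.
protocol_relative = resolve_image_src(FULL_BASE, "//example.com/logo.png")

# With the truncated base there is no scheme or host to inherit.
broken = resolve_image_src(TRUNCATED_BASE, "images/logo.png")

print(relative)
print(protocol_relative)
print(is_fetchable(broken))
```

With the full base, both sources resolve to absolute HTTPS URLs; with the truncated base, the result has no scheme or host, so there is nothing to download from.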
Issue resolved by this Pull Request:
Resolves #1963
Checklist: