feat(service): retrieve convert()/convert_all() results via presigned artifacts#3578
Merged
… convert()/convert_all()

convert() and convert_all() previously hard-coded InBodyTarget, which returns the full document JSON (and embedded images) inline on every result fetch; this is prohibitively expensive in transport for hosted deployments. They now mirror submit()'s auto-target behaviour: try PresignedUrlTarget first and fall back to InBodyTarget only when the server rejects presigned output for configuration/policy reasons.

When the server serves presigned artifacts, the client downloads them and rebuilds a self-contained ConversionResult:
- resource_bundle (REFERENCED images): download the ZIP, load the document JSON, and inline the referenced picture and page images so the result carries no on-disk dependency.
- json artifact (EMBEDDED/PLACEHOLDER): load the self-contained JSON directly.

Artifact downloads are hardened: a dedicated client, a streamed size cap, and an SSRF guard that validates the initial URL and every redirect hop against globally-routable addresses. allow_private_artifact_urls opts into private/loopback storage endpoints.

Download or reconstruction failures degrade into a FAILURE ConversionResult so convert_all() keeps processing the remaining inputs.

Also removes the experimental warning from DoclingServiceClient (the SDK is final) and replaces the InBody-target-as-JSON-flag indirection with an explicit helper.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
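The download hardening described in this commit message can be illustrated with a minimal sketch. It is not the code from this PR: `validate_artifact_url`, `validate_hops`, `read_capped` and the injectable `resolve` callback are hypothetical names chosen for illustration, and `ArtifactDownloadError` here is only a local stand-in for the exported exception. The sketch assumes the guard works by rejecting any host that maps to a non-globally-routable address and by counting bytes while streaming.

```python
import ipaddress
from typing import Callable, Iterable, List
from urllib.parse import urlsplit


class ArtifactDownloadError(Exception):
    """Local stand-in for the exception named in the PR; the real class may differ."""


def validate_artifact_url(
    url: str,
    resolve: Callable[[str], List[str]],
    allow_private: bool = False,
) -> None:
    """Raise unless the URL is http(s) and its host maps only to allowed addresses.

    `resolve` is a hypothetical hook returning the IP strings for a hostname.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ArtifactDownloadError(f"unsupported artifact URL: {url!r}")
    try:
        # IP literal in the URL: no lookup needed.
        addresses = [ipaddress.ip_address(parts.hostname)]
    except ValueError:
        addresses = [ipaddress.ip_address(a) for a in resolve(parts.hostname)]
    if not addresses:
        raise ArtifactDownloadError(f"cannot resolve host: {parts.hostname}")
    if not allow_private and not all(a.is_global for a in addresses):
        raise ArtifactDownloadError(f"host is not globally routable: {parts.hostname}")


def validate_hops(
    urls: Iterable[str],
    resolve: Callable[[str], List[str]],
    allow_private: bool = False,
) -> None:
    """Re-validate the initial URL and every redirect target, in order."""
    for url in urls:
        validate_artifact_url(url, resolve, allow_private)


def read_capped(chunks: Iterable[bytes], max_bytes: int) -> bytes:
    """Accumulate streamed chunks, aborting before the cap is exceeded."""
    buffer = bytearray()
    for chunk in chunks:
        if len(buffer) + len(chunk) > max_bytes:
            raise ArtifactDownloadError(f"artifact exceeds {max_bytes} bytes")
        buffer.extend(chunk)
    return bytes(buffer)
```

In a real client the `resolve` hook would typically wrap a DNS lookup, and the check is only meaningful if the connection is then made to the address that was validated. Passing `allow_private=True` corresponds to the opt-in for private/loopback storage endpoints.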
dolfim-ibm approved these changes Jun 11, 2026
haladamateusz pushed a commit to haladamateusz/docling that referenced this pull request Jun 12, 2026
… artifacts (docling-project#3578)
* feat(service): materialize ConversionResult from presigned targets in convert()/convert_all()
* Fixes for correctness and cleanup
* Drop unnecessary test units
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Summary
DoclingServiceClient.convert() and convert_all() now retrieve results over presigned artifact storage instead of always inlining them in the HTTP response. Inlining (InBodyTarget) returns the full document JSON and embedded images on every result fetch, which is too expensive in transport for hosted services. These high-level methods now use the same auto-target flow as submit(): PresignedUrlTarget first, falling back to InBodyTarget only when the server rejects presigned output for configuration/policy reasons. The presigned path is fully materialized back into a ConversionResult, so callers see no API change.

What changed
- Presigned results are materialized into a ConversionResult. For a resource_bundle (REFERENCED image mode) the ZIP is downloaded and extracted, the document JSON is loaded, and referenced picture/page images are inlined into the document so the result has no on-disk dependency. For a json artifact (EMBEDDED/PLACEHOLDER) the self-contained JSON is loaded directly.
- Artifact downloads use a dedicated client (so X-Api-Key is never sent to external storage), enforce a streamed size cap, and follow redirects manually so every hop is re-validated. allow_private_artifact_urls=True opts into private/loopback storage endpoints (e.g. on-prem or local MinIO).
- Download or reconstruction failures degrade into a FAILURE ConversionResult with an explanatory error item; convert_all() continues with the remaining inputs and convert() surfaces it via raises_on_error.
- Removed the experimental warning from DoclingServiceClient and the matching notices in the SDK examples.
- Replaced the InBody-target-as-JSON-flag indirection with an explicit _with_json_output_format helper used by every path that reconstructs a document.

Public API
- DoclingServiceClient(..., allow_private_artifact_urls=False, artifact_download_timeout=60.0, max_artifact_download_bytes=512 MiB): new constructor options.
- ArtifactDownloadError: new exception type (exported), used internally for graceful degradation.
- submit()/submit_batch()/submit_and_retrieve_each() are unchanged and still return raw presigned responses.

Behaviour change
Against a server with artifact storage configured, convert()/convert_all() now route through presigned download instead of inline transport. Servers without artifact storage transparently fall back to InBodyTarget, so existing behaviour is preserved there.
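
The routing and degradation behaviour described above can be sketched in a few lines. This is a simplified model and not the client's implementation: `Result`, `PresignedOutputRejected`, `convert_one` and the two transport callbacks are hypothetical, and `ArtifactDownloadError` is a local stand-in for the exported exception. It assumes a rejected presigned request triggers the in-body fallback, while a failed artifact download becomes a FAILURE result instead of aborting the batch.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List


class PresignedOutputRejected(Exception):
    """Hypothetical: the server refuses presigned output for config/policy reasons."""


class ArtifactDownloadError(Exception):
    """Local stand-in for the exception named in the PR."""


@dataclass
class Result:
    """Hypothetical, much-reduced stand-in for ConversionResult."""
    source: str
    status: str  # "SUCCESS" or "FAILURE"
    document: dict = field(default_factory=dict)
    errors: List[str] = field(default_factory=list)


Transport = Callable[[str], dict]


def convert_one(source: str, via_presigned: Transport, via_in_body: Transport) -> Result:
    """Try the presigned path first; fall back to in-body only on rejection."""
    try:
        document = via_presigned(source)
    except PresignedOutputRejected:
        document = via_in_body(source)
    except ArtifactDownloadError as exc:
        # Degrade into a failed result so a batch can keep going.
        return Result(source, "FAILURE", errors=[str(exc)])
    return Result(source, "SUCCESS", document=document)


def convert_all(
    sources: Iterable[str], via_presigned: Transport, via_in_body: Transport
) -> Iterator[Result]:
    """Yield one result per input; a failed download does not stop the loop."""
    for source in sources:
        yield convert_one(source, via_presigned, via_in_body)
```

Under this model a server without artifact storage raises the rejection on every request, so every input takes the in-body path and callers observe the previous behaviour.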