feat(service): retrieve convert()/convert_all() results via presigned artifacts#3578

Merged
dolfim-ibm merged 3 commits into main from cau/service-client-populate-response-from-url on Jun 11, 2026

@cau-git cau-git commented Jun 10, 2026

Member

Summary

DoclingServiceClient.convert() and convert_all() now retrieve results over presigned artifact storage instead of always inlining them in the HTTP response. Inlining (InBodyTarget) returns the full document JSON and embedded images on every result fetch, which is too expensive in transport for hosted services. These high-level methods now use the same auto-target flow as submit(): PresignedUrlTarget first, falling back to InBodyTarget only when the server rejects presigned output for configuration/policy reasons. The presigned path is fully materialized back into a ConversionResult, so callers see no API change.

What changed

  • Presigned materialization. When the server returns presigned artifacts, the client downloads them and rebuilds a self-contained ConversionResult. For a resource_bundle (REFERENCED image mode) the ZIP is downloaded and extracted, the document JSON is loaded, and referenced picture/page images are inlined into the document so the result has no on-disk dependency. For a json artifact (EMBEDDED/PLACEHOLDER) the self-contained JSON is loaded directly.
  • Hardened downloads. Artifact fetches use a dedicated HTTP client (the service X-Api-Key is never sent to external storage), enforce a streamed size cap, and follow redirects manually so every hop is re-validated.
  • SSRF guard. Artifact URLs (and each redirect target) must resolve to a globally-routable address; private, loopback, link-local, reserved, multicast and unspecified addresses are rejected. allow_private_artifact_urls=True opts into private/loopback storage endpoints (e.g. on-prem or local MinIO).
  • Graceful failure. A failed download or reconstruction yields a FAILURE ConversionResult with an explanatory error item; convert_all() continues with the remaining inputs and convert() surfaces it via raises_on_error.
  • Final API. Removed the experimental warning emitted by DoclingServiceClient and the matching notices in the SDK examples.
  • Cleanup. Replaced the implicit "pass InBodyTarget to force JSON" pattern with an explicit _with_json_output_format helper used by every path that reconstructs a document.
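The SSRF guard and the manual redirect handling described above can be sketched with the standard library alone. This is a minimal illustration, not the code in `docling/service_client/client.py`: the names `is_public_address`, `validate_artifact_url`, and `resolve_final_url` are hypothetical, and `next_location` stands in for an HTTP request made with automatic redirects disabled.

```python
import ipaddress
import socket
from typing import Callable, Optional
from urllib.parse import urljoin, urlsplit


def is_public_address(address: str) -> bool:
    """Return True only for globally routable addresses."""
    ip = ipaddress.ip_address(address.split("%", 1)[0])  # drop any IPv6 zone id
    blocked = (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    )
    return ip.is_global and not blocked


def validate_artifact_url(url: str, allow_private: bool = False) -> None:
    """Raise ValueError if the URL host resolves to a non-public address."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"unsupported artifact URL: {url!r}")
    if allow_private:  # mirrors the opt-in flag for on-prem or local storage
        return
    try:
        # IP literals are checked directly, without a DNS lookup.
        addresses = [str(ipaddress.ip_address(parts.hostname))]
    except ValueError:
        # Host names are resolved; a failed lookup raises OSError to the caller.
        infos = socket.getaddrinfo(parts.hostname, parts.port or 443)
        addresses = [info[4][0] for info in infos]
    for address in addresses:
        if not is_public_address(address):
            raise ValueError(f"{url!r} resolves to non-public address {address}")


def resolve_final_url(
    url: str,
    next_location: Callable[[str], Optional[str]],
    max_hops: int = 5,
    allow_private: bool = False,
) -> str:
    """Follow redirects by hand, re-validating every hop before it is used."""
    for _ in range(max_hops + 1):
        validate_artifact_url(url, allow_private)
        target = next_location(url)  # None means "no further redirect"
        if target is None:
            return url
        url = urljoin(url, target)  # handles relative Location values
    raise ValueError("too many redirects")
```

Checking the explicit flags in addition to `is_global` matters because `ipaddress` can report multicast addresses as global. A production guard would also need to make sure the address it validated is the one it connects to, which this sketch does not attempt.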

Public API

  • DoclingServiceClient(..., allow_private_artifact_urls=False, artifact_download_timeout=60.0, max_artifact_download_bytes=512 MiB) — new constructor options.
  • ArtifactDownloadError — new exception type (exported), used internally for graceful degradation.
  • submit() / submit_batch() / submit_and_retrieve_each() are unchanged and still return raw presigned responses.
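For reference, the three new constructor options and their stated defaults can be written out as plain keyword arguments. The option names and values come from the list above; the remaining constructor arguments are elided in this PR description, so the call itself is left as a comment rather than guessed.

```python
MIB = 1024 * 1024

# Defaults as listed in this PR; 512 MiB is expressed in bytes.
artifact_options = {
    "allow_private_artifact_urls": False,  # True opts into private/loopback storage
    "artifact_download_timeout": 60.0,
    "max_artifact_download_bytes": 512 * MIB,
}

# Assumed usage; the other constructor arguments are not shown in the PR text:
# client = DoclingServiceClient(..., **artifact_options)
```

A deployment that stores artifacts on a local MinIO instance would presumably flip only `allow_private_artifact_urls` and keep the other two defaults.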

Behaviour change

Against a server with artifact storage configured, convert()/convert_all() now route through presigned download instead of inline transport. Servers without artifact storage transparently fall back to InBodyTarget, so existing behaviour is preserved there.
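The routing described here is a try-then-fall-back pattern. A minimal sketch, assuming the server signals a configuration/policy rejection in a way the client can map to a dedicated exception: `PresignedOutputRejected`, `Outcome`, and `convert_with_auto_target` are hypothetical names for illustration, not part of the SDK.

```python
from dataclasses import dataclass
from typing import Callable


class PresignedOutputRejected(Exception):
    """Hypothetical: the server refuses presigned output for config/policy reasons."""


@dataclass
class Outcome:
    target: str   # which transport produced the payload
    payload: str


def convert_with_auto_target(run: Callable[[str], str]) -> Outcome:
    """Try the presigned target first; fall back to inline only on rejection."""
    try:
        return Outcome("presigned", run("presigned"))
    except PresignedOutputRejected:
        # Only this specific rejection triggers the fallback. Any other
        # error propagates, so real failures are not silently retried inline.
        return Outcome("in_body", run("in_body"))
```

Keeping the fallback narrow is the point of the design: a server without artifact storage behaves as before, while network or conversion errors still surface to the caller.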

… convert()/convert_all()

convert() and convert_all() previously hard-coded InBodyTarget, which returns
the full document JSON (and embedded images) inline on every result fetch —
prohibitively expensive in transport for hosted deployments.

They now mirror submit()'s auto-target behaviour: try PresignedUrlTarget first
and fall back to InBodyTarget only when the server rejects presigned output for
configuration/policy reasons. When the server serves presigned artifacts, the
client downloads them and rebuilds a self-contained ConversionResult:

- resource_bundle (REFERENCED images): download the ZIP, load the document JSON,
  and inline the referenced picture and page images so the result carries no
  on-disk dependency.
- json artifact (EMBEDDED/PLACEHOLDER): load the self-contained JSON directly.
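The resource_bundle path can be illustrated with an in-memory ZIP. This is a sketch under assumed conventions: the file name `document.json`, the `pictures` list, and the `uri` field are placeholders chosen for the example and may not match the real bundle layout or the document schema.

```python
import base64
import io
import json
import mimetypes
import zipfile


def materialize_bundle(zip_bytes: bytes, doc_name: str = "document.json") -> dict:
    """Load the document JSON from a bundle and inline its referenced images."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as bundle:
        doc = json.loads(bundle.read(doc_name))
        for picture in doc.get("pictures", []):
            uri = picture.get("uri", "")
            if not uri or uri.startswith("data:"):
                continue  # nothing to inline, or already embedded
            raw = bundle.read(uri)  # KeyError if the referenced file is missing
            mime = mimetypes.guess_type(uri)[0] or "application/octet-stream"
            encoded = base64.b64encode(raw).decode("ascii")
            picture["uri"] = f"data:{mime};base64,{encoded}"
    return doc
```

After this step the returned dictionary no longer refers to any file inside the archive, which is what makes the result self-contained.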

Artifact downloads are hardened: a dedicated client, a streamed size cap, and an
SSRF guard that validates the initial URL and every redirect hop against
globally-routable addresses.
allow_private_artifact_urls opts into private/loopback storage endpoints.
Download or reconstruction failures degrade into a FAILURE ConversionResult so
convert_all() keeps processing the remaining inputs.
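A streamed size cap and the degrade-to-FAILURE behaviour can be sketched together. The names `DownloadTooLarge`, `SketchResult`, and `convert_all_sketch` are hypothetical, and `open_stream` stands in for whatever yields the response body in chunks; the real client reports failures through a ConversionResult and its own error types.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List


class DownloadTooLarge(Exception):
    """Hypothetical error raised once the streamed body exceeds the cap."""


def read_capped(chunks: Iterable[bytes], max_bytes: int) -> bytes:
    """Accumulate chunks, aborting as soon as the cap would be exceeded."""
    received = bytearray()
    for chunk in chunks:
        if len(received) + len(chunk) > max_bytes:
            raise DownloadTooLarge(f"artifact exceeds {max_bytes} bytes")
        received.extend(chunk)
    return bytes(received)


@dataclass
class SketchResult:
    source: str
    status: str  # "SUCCESS" or "FAILURE"
    payload: bytes = b""
    errors: List[str] = field(default_factory=list)


def convert_all_sketch(
    sources: Iterable[str],
    open_stream: Callable[[str], Iterable[bytes]],
    max_bytes: int,
) -> Iterator[SketchResult]:
    """Yield one result per input; a failed download does not stop the loop."""
    for source in sources:
        try:
            payload = read_capped(open_stream(source), max_bytes)
        except (DownloadTooLarge, OSError) as exc:
            # Record the reason as an error item and move on to the next input.
            yield SketchResult(source, "FAILURE", errors=[str(exc)])
            continue
        yield SketchResult(source, "SUCCESS", payload=payload)
```

Checking the size while streaming, rather than trusting a declared content length, means an oversized body is cut off without being buffered in full.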

Also removes the experimental warning from DoclingServiceClient — the SDK is
final — and replaces the InBody-target-as-JSON-flag indirection with an explicit
helper.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
@cau-git cau-git requested a review from dolfim-ibm June 10, 2026 11:44
github-actions Bot commented Jun 10, 2026

Contributor

DCO Check Passed

Thanks @cau-git, all your commits are properly signed off. 🎉

mergify Bot commented Jun 10, 2026

Contributor

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

dolfim-ibm
dolfim-ibm previously approved these changes Jun 10, 2026

@dolfim-ibm dolfim-ibm left a comment

Member

lgtm

codecov Bot commented Jun 10, 2026


Codecov Report

❌ Patch coverage is 68.62745% with 64 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
docling/service_client/client.py 68.47% 64 Missing ⚠️

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

@PeterStaar-IBM PeterStaar-IBM left a comment

Member

lgtm!

@dolfim-ibm dolfim-ibm merged commit 521e86b into main Jun 11, 2026
26 checks passed
@dolfim-ibm dolfim-ibm deleted the cau/service-client-populate-response-from-url branch June 11, 2026 12:38
haladamateusz pushed a commit to haladamateusz/docling that referenced this pull request Jun 12, 2026
… artifacts (docling-project#3578)

* feat(service): materialize ConversionResult from presigned targets in convert()/convert_all()

* Fixes for correctness and cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Drop unnecessary test units

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
3 participants