feat: capture picture description API usage #3632

Open
FrigaZzz wants to merge 7 commits into docling-project:main from FrigaZzz:feature/token-usage-picture-description
Conversation

@FrigaZzz

This PR introduces standardized capture and propagation of raw usage metadata, such as token counters, from OpenAI/VLM-compatible picture description backends within Docling.

Context and Motivation

References: #2271, #2402, #2403, #2445

Docling can already describe document pictures through local and remote VLM backends, but API-backed picture description calls did not expose provider usage metadata to downstream users. This made it difficult to:

  • Monitor and optimize API costs
  • Debug token-related provider behavior
  • Implement rate limiting, usage quotas, or accounting outside Docling
  • Preserve provider-specific usage payloads for later validation

Initial work was validated as a third-party plugin (#2403) to test the end-to-end flow without modifying core. Based on that validation, this PR integrates usage capture directly into Docling's image API request and picture description runtime.

The implementation intentionally preserves the raw provider payload instead of forcing it into a Docling-specific token schema, because token accounting differs across providers.

What's Changed

1. Image API request results can carry usage metadata

  • Adds ApiImageRequestResult, a small result object for image API calls.
  • Preserves the historical 3-tuple behavior of api_image_request() for existing callers.
  • Adds optional usage metadata alongside generated text, token count, and stop reason.
  • Adds usage to VlmPrediction so VLM API runtimes can propagate provider usage as well.

2. Usage extraction from OpenAI-compatible responses

  • api_image_request() now parses the raw JSON response before validating the OpenAI-compatible completion payload.
  • By default, usage is extracted from the usage response field.
  • PictureDescriptionApiOptions.usage_response_key lets users select another response key or dotted path, such as providerUsage or meta.usage.
  • The plugin-style token_extract_key alias is still supported for compatibility with existing usage-capture experiments.
  • Total token count is derived from the captured usage payload when available, with fallback to the existing OpenAI-compatible usage model.
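
The extraction rules above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `extract_usage` and `total_tokens` are hypothetical helper names, and the fallback of summing `prompt_tokens` and `completion_tokens` is an assumption based on the OpenAI-compatible usage fields.

```python
from typing import Any, Optional


def extract_usage(response: dict[str, Any], key: str = "usage") -> Optional[Any]:
    """Walk a key or dotted path (e.g. 'meta.usage') through a decoded JSON response."""
    current: Any = response
    for part in key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # path does not resolve, so no usage is reported
        current = current[part]
    return current


def total_tokens(usage: Any) -> Optional[int]:
    """Prefer 'total_tokens'; otherwise (assumption) sum prompt and completion tokens."""
    if not isinstance(usage, dict):
        return None
    total = usage.get("total_tokens")
    if isinstance(total, int):
        return total
    prompt = usage.get("prompt_tokens")
    completion = usage.get("completion_tokens")
    if isinstance(prompt, int) and isinstance(completion, int):
        return prompt + completion
    return None
```

Returning `None` for an unresolved path mirrors the rule stated in this PR that usage is only attached when the provider response contains a matching payload.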

3. Picture description metadata stores captured usage

  • PictureDescriptionBaseModel now accepts either plain strings or ApiImageRequestResult outputs from _annotate_images().
  • Existing implementations that return str continue to work unchanged.
  • When usage metadata is present, it is stored as custom metadata on the picture description field:
picture.meta.description.get_custom_part()["docling__usage"]
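
For downstream accounting, the stored payloads could be collected along these lines. This is a sketch under assumptions: `collect_usage` is a hypothetical helper, the pictures below are stand-in objects rather than real `PictureItem` instances, and `get_custom_part()` is assumed to return a dict-like mapping (the accessor above only shows that it supports subscripting).

```python
from types import SimpleNamespace
from typing import Any, Iterable

USAGE_KEY = "docling__usage"  # custom-metadata key named in this PR


def collect_usage(pictures: Iterable[Any]) -> list[dict]:
    """Gather usage payloads from pictures whose description carries one."""
    payloads = []
    for picture in pictures:
        meta = getattr(picture, "meta", None)
        description = getattr(meta, "description", None)
        if description is None:
            continue  # picture has no description metadata
        usage = description.get_custom_part().get(USAGE_KEY)
        if isinstance(usage, dict):
            payloads.append(usage)
    return payloads


def _fake_picture(custom: dict) -> SimpleNamespace:
    # Stand-in for a picture item; real objects come from a converted document.
    description = SimpleNamespace(get_custom_part=lambda: custom)
    return SimpleNamespace(meta=SimpleNamespace(description=description))


pictures = [
    _fake_picture({USAGE_KEY: {"total_tokens": 42}}),
    _fake_picture({}),           # described, but the backend reported no usage
    SimpleNamespace(meta=None),  # no metadata at all
]
payloads = collect_usage(pictures)
total = sum(p.get("total_tokens", 0) for p in payloads)
```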

4. Runtime integration across API picture description paths

  • PictureDescriptionApiModel passes usage_response_key through to api_image_request().
  • API VLM inference plumbing propagates usage from API responses into VlmPrediction.
  • API-backed VLM pipeline code now reads generated text from the structured response object.

5. Documentation and examples

  • Adds a usage documentation section under picture description enrichments.
  • Adds an end-to-end example: docs/examples/picture_description_api_usage.py.
  • Adds the example to the MkDocs navigation under picture annotation examples.
  • Adds focused example tests for Azure endpoint construction and empty URL handling.

Backward Compatibility

This PR is designed to avoid behavior changes for existing users.

  • Backends that do not report usage continue to produce the same picture description output.
  • Existing _annotate_images() implementations that return str continue to work.
  • Existing callers that unpack api_image_request() as a 3-tuple continue to work because ApiImageRequestResult preserves tuple-like iteration, indexing, length, and tuple equality for the historical fields.
  • Usage metadata is only attached when the provider response includes a matching usage payload.
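
To illustrate the tuple-compatibility guarantee, here is a minimal sketch of a result object that still unpacks as a 3-tuple. `ImageResult` is a hypothetical stand-in and not Docling's `ApiImageRequestResult`; the real class may be implemented differently, and the stop reason is a plain string here for simplicity.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass(eq=False)
class ImageResult:
    """Hypothetical stand-in showing tuple-like behavior for three legacy fields."""

    text: str
    num_tokens: Optional[int]
    stop_reason: str
    usage: Optional[dict] = None  # extra field, invisible to tuple-style callers

    def _legacy(self) -> tuple:
        return (self.text, self.num_tokens, self.stop_reason)

    def __iter__(self) -> Iterator[Any]:
        return iter(self._legacy())

    def __len__(self) -> int:
        return 3

    def __getitem__(self, index):
        return self._legacy()[index]

    def __eq__(self, other: object) -> bool:
        if isinstance(other, tuple):
            return self._legacy() == other  # compare against the historical 3-tuple
        if isinstance(other, ImageResult):
            return self._legacy() == other._legacy() and self.usage == other.usage
        return NotImplemented


result = ImageResult("a cat", 7, "end_of_sequence", usage={"total_tokens": 42})
text, num_tokens, stop_reason = result  # existing 3-tuple unpacking keeps working
```

Callers that want the new data read `result.usage`; callers that unpack, index, or compare against a tuple never see it.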

Breaking Changes

No intentional user-facing breaking changes.

Custom picture description subclasses may optionally return ApiImageRequestResult when they want to provide usage metadata, but returning plain strings remains supported.

Example:

from docling.datamodel.base_models import ApiImageRequestResult, VlmStopReason


def _annotate_images(self, images):
    for image in images:
        yield ApiImageRequestResult(
            text=self._describe(image),
            num_tokens=None,
            stop_reason=VlmStopReason.END_OF_SEQUENCE,
            usage={"total_tokens": 42},
        )

Limitations and Next Steps

Current limitation: usage is stored as namespaced custom metadata on PictureItem.meta.description, because the canonical description annotation data model lives in docling_core.

Potential follow-ups:

  1. Add an optional usage field to the canonical description metadata model in docling_core.
  2. Adopt the new core release in docling and migrate away from custom metadata storage if desired.
  3. Add CLI/debug utilities to print picture description usage per annotation.
  4. Add more provider-specific examples for usage keys beyond OpenAI-compatible usage.

Testing

  • Added tests for usage extraction from OpenAI-compatible responses.
  • Added tests for custom and dotted usage response keys.
  • Added tests for the token_extract_key compatibility alias.
  • Added tests for preserving historical tuple-like behavior of image API responses.
  • Added tests for storing usage metadata on picture description custom metadata.
  • Added tests for Azure endpoint construction in the new docs example.

Validated locally with:

uv run pytest tests/test_api_image_request.py tests/test_picture_description_base_model.py tests/test_picture_description_api_usage_example.py
make validate

Documentation

  • Updated docs/usage/enrichments.md with usage metadata documentation.
  • Added docs/examples/picture_description_api_usage.py as a runnable and renderable example.
  • Updated the examples index and MkDocs navigation.

Checklist

  • Runtime and types updated consistently
  • Backward-compatible behavior preserved for existing callers
  • Usage propagation covered by tests
  • Public usage documentation added
  • Example code added
  • Local validation passed

@github-actions

github-actions Bot commented Jun 17, 2026

Contributor

DCO Check Passed

Thanks @FrigaZzz, all your commits are properly signed off. 🎉

@mergify

mergify Bot commented Jun 17, 2026

Contributor

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
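
A title can be checked against this rule locally before pushing. The sketch below reuses the pattern shown above with Python's `re` module; `is_conventional` is a hypothetical helper, and Mergify's own matching semantics may differ in details.

```python
import re

# Pattern copied from the merge protection rule above.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True when the title starts with a conventional-commit prefix."""
    return TITLE_RULE.search(title) is not None
```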

Signed-off-by: frigazzz <frigato.luca97@gmail.com>
@codecov

codecov Bot commented Jun 17, 2026

Codecov Report

❌ Patch coverage is 94.11765% with 6 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling/utils/api_image_request.py | 89.65% | 6 Missing ⚠️ |

Signed-off-by: frigazzz <frigato.luca97@gmail.com>
Comment thread docling/datamodel/base_models.py Outdated
Comment thread tests/test_api_usage_propagation.py
Comment thread tests/test_picture_description_api_usage_example.py Outdated
Signed-off-by: frigazzz <frigato.luca97@gmail.com>
dolfim-ibm
dolfim-ibm previously approved these changes Jun 17, 2026

@dolfim-ibm dolfim-ibm left a comment

Member

thanks, lgtm

@FrigaZzz

Author

> thanks, lgtm

There was a test conflict, so I went ahead and resolved it.

@FrigaZzz FrigaZzz requested a review from dolfim-ibm June 19, 2026 09:33
cau-git
cau-git previously approved these changes Jun 19, 2026
FrigaZzz added 2 commits June 20, 2026 08:15
`picture_description_api_usage.py` required a positional PDF arg and a
reachable VLM endpoint, so the light examples job exited with code 2
and failed CI whenever it was selected.

- Add `picture_description_api_usage` to EXAMPLES_UNSUPPORTED_IN_CI in
  `.github/workflows/checks.yml`, matching the convention used for the
  other API-only examples (`pictures_description_api`,
  `vlm_pipeline_api_model`, ...).
- Make the example safe to run standalone: the `pdf` arg is now
  optional and defaults to `tests/data/pdf/2206.01062.pdf`, and
  `main()` exits 0 with a warning when neither
  `PICTURE_DESCRIPTION_API_URL` nor `AZURE_API_BASE` is set.

Signed-off-by: frigazzz <frigato.luca97@gmail.com>
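
The standalone-safe behavior described in this commit could look roughly like the sketch below. It is not the actual example file: `parse_args` and `main` are written here for illustration, the `env` parameter is an assumption added to make the check easy to test, and the conversion step is replaced by a print.

```python
import argparse
import os
import sys
from typing import Mapping, Optional, Sequence

DEFAULT_PDF = "tests/data/pdf/2206.01062.pdf"
ENDPOINT_VARS = ("PICTURE_DESCRIPTION_API_URL", "AZURE_API_BASE")


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Picture description usage example")
    # nargs="?" makes the positional optional, so a bare invocation no longer exits with code 2.
    parser.add_argument("pdf", nargs="?", default=DEFAULT_PDF)
    return parser.parse_args(argv)


def main(
    argv: Optional[Sequence[str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    env = os.environ if env is None else env
    args = parse_args(argv)
    if not any(env.get(name) for name in ENDPOINT_VARS):
        # No endpoint configured: warn and exit cleanly instead of failing.
        print("warning: no picture description endpoint configured, skipping", file=sys.stderr)
        return 0
    print(f"would convert {args.pdf}")  # placeholder for the real conversion
    return 0
```

A script would typically end with `sys.exit(main())` under an `if __name__ == "__main__":` guard, omitted here to keep the sketch import-safe.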
@FrigaZzz

Author

Hi! Had to add a small CI fix: run-examples-light was failing on the new picture_description_api_usage.py example because it needs a PDF arg and an API endpoint, so it exited with code 2 in CI.
I excluded it via EXAMPLES_UNSUPPORTED_IN_CI in checks.yml @dolfim-ibm @cau-git

3 participants