feat: capture picture description API usage by FrigaZzz · Pull Request #3632 · docling-project/docling

FrigaZzz · 2026-06-17T01:04:58Z

This PR introduces standardized capture and propagation of raw usage metadata, such as token counters, from OpenAI/VLM-compatible picture description backends within Docling.

Context and Motivation

References: #2271, #2402, #2403, #2445

Docling can already describe document pictures through local and remote VLM backends, but API-backed picture description calls did not expose provider usage metadata to downstream users. This made it difficult to:

Monitor and optimize API costs
Debug token-related provider behavior
Implement rate limiting, usage quotas, or accounting outside Docling
Preserve provider-specific usage payloads for later validation

Initial work was validated as a third-party plugin (#2403) to test the end-to-end flow without modifying core. Based on that validation, this PR integrates usage capture directly into Docling's image API request and picture description runtime.

The implementation intentionally preserves the raw provider payload instead of forcing it into a Docling-specific token schema, because token accounting differs across providers.

What's Changed

1. Image API request results can carry usage metadata

Adds ApiImageRequestResult, a small result object for image API calls.
Preserves the historical 3-tuple behavior of api_image_request() for existing callers.
Adds optional usage metadata alongside generated text, token count, and stop reason.
Adds usage to VlmPrediction so VLM API runtimes can propagate provider usage as well.

2. Usage extraction from OpenAI-compatible responses

api_image_request() now parses the raw JSON response before validating the OpenAI-compatible completion payload.
By default, usage is extracted from the usage response field.
PictureDescriptionApiOptions.usage_response_key lets users select another response key or dotted path, such as providerUsage or meta.usage.
The plugin-style token_extract_key alias is still supported for compatibility with existing usage-capture experiments.
Total token count is derived from the captured usage payload when available, with fallback to the existing OpenAI-compatible usage model.
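The key lookup described above can be sketched as follows. This is a minimal illustration, not the code in api_image_request(): the helper names extract_usage and total_tokens are hypothetical, and it assumes the response is plain nested JSON objects with dots only used as path separators.

```python
from typing import Any, Optional


def extract_usage(response: dict[str, Any], key: str = "usage") -> Optional[Any]:
    """Walk a key or dotted path (e.g. "meta.usage") through nested dicts.

    Hypothetical helper: returns None when any path segment is missing.
    """
    current: Any = response
    for part in key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # path does not exist, so no usage was reported
        current = current[part]
    return current


def total_tokens(usage: Any) -> Optional[int]:
    """Read total_tokens from a captured payload when it is a real integer."""
    if isinstance(usage, dict):
        value = usage.get("total_tokens")
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, int) and not isinstance(value, bool):
            return value
    return None
```

Returning None instead of raising keeps responses without a usage payload on the same code path as before, which matches the stated goal of only attaching usage when the provider reports it.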

3. Picture description metadata stores captured usage

PictureDescriptionBaseModel now accepts either plain strings or ApiImageRequestResult outputs from _annotate_images().
Existing implementations that return str continue to work unchanged.
When usage metadata is present, it is stored as custom metadata on the picture description field:

picture.meta.description.get_custom_part()["docling__usage"]
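For cost monitoring, the per-picture payloads usually need to be added up. The sketch below assumes each payload has already been read from the custom metadata key shown above into a plain dict; sum_usage is a hypothetical helper, and it only sums top-level integer counters because nested details differ across providers.

```python
from collections import Counter
from typing import Any, Iterable


def sum_usage(payloads: Iterable[dict[str, Any]]) -> dict[str, int]:
    """Sum top-level integer counters across raw usage payloads.

    Hypothetical helper: nested or non-integer fields are skipped.
    """
    totals: Counter = Counter()
    for payload in payloads:
        for name, value in payload.items():
            # bool is a subclass of int, so exclude it explicitly
            if isinstance(value, int) and not isinstance(value, bool):
                totals[name] += value
    return dict(totals)
```

Because the raw payload is preserved, any provider-specific fields that this helper skips remain available on the individual pictures for later inspection.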

4. Runtime integration across API picture description paths

PictureDescriptionApiModel passes usage_response_key through to api_image_request().
API VLM inference plumbing propagates usage from API responses into VlmPrediction.
API-backed VLM pipeline code now reads generated text from the structured response object.

5. Documentation and examples

Adds a usage documentation section under picture description enrichments.
Adds an end-to-end example: docs/examples/picture_description_api_usage.py.
Adds the example to the MkDocs navigation under picture annotation examples.
Adds focused example tests for Azure endpoint construction and empty URL handling.

Backward Compatibility

This PR is designed to avoid behavior changes for existing users.

Backends that do not report usage continue to produce the same picture description output.
Existing _annotate_images() implementations that return str continue to work.
Existing callers that unpack api_image_request() as a 3-tuple continue to work because ApiImageRequestResult preserves tuple-like iteration, indexing, length, and tuple equality for the historical fields.
Usage metadata is only attached when the provider response includes a matching usage payload.
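The tuple-compatibility idea can be illustrated with a small stand-in class. This is a sketch of the technique only, not the actual ApiImageRequestResult: the class name ImageResultSketch is hypothetical, and stop_reason is a plain string here rather than VlmStopReason.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass(eq=False)
class ImageResultSketch:
    """Hypothetical result object that still behaves like the old 3-tuple."""

    text: str
    num_tokens: Optional[int]
    stop_reason: str
    usage: Optional[dict[str, Any]] = None  # new field, invisible to tuple callers

    def _legacy(self) -> tuple:
        # The three historical fields, in their historical order.
        return (self.text, self.num_tokens, self.stop_reason)

    def __iter__(self) -> Iterator[Any]:
        return iter(self._legacy())

    def __getitem__(self, index):
        return self._legacy()[index]

    def __len__(self) -> int:
        return 3

    def __eq__(self, other: object) -> bool:
        if isinstance(other, tuple):
            return self._legacy() == other
        if isinstance(other, ImageResultSketch):
            return (self._legacy(), self.usage) == (other._legacy(), other.usage)
        return NotImplemented
```

With this shape, a caller written as `text, num_tokens, stop_reason = result` keeps working, while newer code can read `result.usage` directly.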

Breaking Changes

No intentional user-facing breaking changes.

Custom picture description subclasses may optionally return ApiImageRequestResult when they want to provide usage metadata, but returning plain strings remains supported.

Example:

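from docling.datamodel.base_models import ApiImageRequestResult, VlmStopReason


# Method of a custom picture description subclass.
def _annotate_images(self, images):
    for image in images:
        # Yield a structured result instead of a plain string to attach usage.
        # self._describe() stands in for the subclass's own captioning call.
        yield ApiImageRequestResult(
            text=self._describe(image),
            num_tokens=None,
            stop_reason=VlmStopReason.END_OF_SEQUENCE,
            usage={"total_tokens": 42},
        )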

Limitations and Next Steps

Current limitation: usage is stored as namespaced custom metadata on PictureItem.meta.description, because the canonical description annotation data model lives in docling_core.

Potential follow-ups:

Add an optional usage field to the canonical description metadata model in docling_core.
Adopt the new core release in docling and migrate away from custom metadata storage if desired.
Add CLI/debug utilities to print picture description usage per annotation.
Add more provider-specific examples for usage keys beyond OpenAI-compatible usage.

Testing

Added tests for usage extraction from OpenAI-compatible responses.
Added tests for custom and dotted usage response keys.
Added tests for the token_extract_key compatibility alias.
Added tests for preserving historical tuple-like behavior of image API responses.
Added tests for storing usage metadata on picture description custom metadata.
Added tests for Azure endpoint construction in the new docs example.

Validated locally with:

uv run pytest tests/test_api_image_request.py tests/test_picture_description_base_model.py tests/test_picture_description_api_usage_example.py
make validate

Documentation

Updated docs/usage/enrichments.md with usage metadata documentation.
Added docs/examples/picture_description_api_usage.py as a runnable and renderable example.
Updated the examples index and MkDocs navigation.

Checklist

Runtime and types updated consistently
Backward-compatible behavior preserved for existing callers
Usage propagation covered by tests
Public usage documentation added
Example code added
Local validation passed

References

#2271 (Get the total tokens used in the image description): early validation example
#2402 (Add documentation and examples for custom plugin development): feature discussion and motivation
#2403 (docs(custom-plugins): retrieve TOKEN USAGE for Image Processing with Custom PictureDescriptionApiModel): plugin proof-of-concept

github-actions · 2026-06-17T01:05:08Z

✅ DCO Check Passed

Thanks @FrigaZzz, all your commits are properly signed off. 🎉

mergify · 2026-06-17T01:05:34Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
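To check a title locally before pushing, the pattern above can be tried with Python's re module. This is a sketch that assumes the rule is applied as a regular-expression search on the PR title; follows_rule is a hypothetical helper name.

```python
import re

# Pattern copied from the merge protection rule above.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def follows_rule(title: str) -> bool:
    """Return True when the title starts with a conventional-commit prefix."""
    return TITLE_RULE.search(title) is not None
```

For example, `follows_rule("feat: capture picture description API usage")` holds, while a title such as "Merge upstream main into feature branch" does not match.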


codecov · 2026-06-17T06:36:14Z

Codecov Report

❌ Patch coverage is 94.11765% with 6 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
docling/utils/api_image_request.py	89.65%	6 Missing ⚠️



dolfim-ibm

thanks, lgtm

FrigaZzz · 2026-06-18T19:04:47Z

> thanks, lgtm

There was a test conflict, so I went ahead and resolved it.


`picture_description_api_usage.py` required a positional PDF arg and a reachable VLM endpoint, so the light examples job exited with code 2 and failed CI whenever it was selected.

- Add `picture_description_api_usage` to EXAMPLES_UNSUPPORTED_IN_CI in `.github/workflows/checks.yml`, matching the convention used for the other API-only examples (`pictures_description_api`, `vlm_pipeline_api_model`, ...).
- Make the example safe to run standalone: the `pdf` arg is now optional and defaults to `tests/data/pdf/2206.01062.pdf`, and `main()` exits 0 with a warning when neither `PICTURE_DESCRIPTION_API_URL` nor `AZURE_API_BASE` is set.

Signed-off-by: frigazzz <frigato.luca97@gmail.com>
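The standalone-safe startup described in this commit message could look roughly like the following. It is a sketch under stated assumptions, not the contents of the example file: parse_args, the env parameter, and the printed messages are hypothetical, and only the default path and the two environment variable names come from the commit message.

```python
import argparse
import os
import sys
from typing import Mapping, Optional, Sequence

DEFAULT_PDF = "tests/data/pdf/2206.01062.pdf"
ENDPOINT_VARS = ("PICTURE_DESCRIPTION_API_URL", "AZURE_API_BASE")


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Picture description usage demo")
    # nargs="?" makes the positional argument optional, so a bare run
    # no longer fails with argparse's exit code 2.
    parser.add_argument("pdf", nargs="?", default=DEFAULT_PDF)
    return parser.parse_args(argv)


def main(
    argv: Optional[Sequence[str]] = None,
    env: Mapping[str, str] = os.environ,
) -> int:
    args = parse_args(argv)
    if not any(env.get(name) for name in ENDPOINT_VARS):
        # No endpoint configured: warn and report success instead of failing.
        print("warning: no API endpoint configured, skipping", file=sys.stderr)
        return 0
    # Placeholder for the real conversion, which is not reproduced here.
    print(f"would convert {args.pdf}")
    return 0
```

Passing the environment as a parameter is only a convenience for testing the two branches without touching the real process environment.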


FrigaZzz · 2026-06-20T06:20:58Z

Hi! I had to add a small CI fix: run-examples-light was failing on the new picture_description_api_usage.py example because it needs a PDF argument and a reachable API endpoint, so it exited with code 2 in CI.
I excluded it via EXAMPLES_UNSUPPORTED_IN_CI in checks.yml. @dolfim-ibm @cau-git

feat: capture picture description API usage

afad01f

Signed-off-by: frigazzz <frigato.luca97@gmail.com>

FrigaZzz force-pushed the feature/token-usage-picture-description branch from 11f5ae4 to afad01f on June 17, 2026 01:09

FrigaZzz mentioned this pull request Jun 17, 2026

feat(models): add API usage to picture descriptions; unify response type #2445

Closed

5 tasks

test: cover API usage propagation

c91b94f

Signed-off-by: frigazzz <frigato.luca97@gmail.com>

dolfim-ibm reviewed Jun 17, 2026


Comment thread docling/datamodel/base_models.py Outdated

Comment thread tests/test_api_usage_propagation.py

Comment thread tests/test_picture_description_api_usage_example.py Outdated

test: simplify API usage coverage

476653a

Signed-off-by: frigazzz <frigato.luca97@gmail.com>

dolfim-ibm previously approved these changes Jun 17, 2026


Merge upstream main into feature branch

568e9b3

FrigaZzz dismissed dolfim-ibm’s stale review via 568e9b3 June 18, 2026 17:30

Merge branch 'docling-project:main' into feature/token-usage-picture-…

c23ad5a

…description

FrigaZzz requested a review from dolfim-ibm June 19, 2026 09:33

cau-git previously approved these changes Jun 19, 2026


FrigaZzz added 2 commits June 20, 2026 08:15

Merge branch 'feature/token-usage-picture-description' of https://git…

0ed38cb

…hub.com/FrigaZzz/docling into feature/token-usage-picture-description

FrigaZzz dismissed cau-git’s stale review via 0ed38cb June 20, 2026 06:16

FrigaZzz requested a review from cau-git June 20, 2026 06:18

FrigaZzz wants to merge 7 commits into docling-project:main from FrigaZzz:feature/token-usage-picture-description

3 participants
