FrigaZzz

This PR introduces standardized capture and propagation of usage metadata (token counters, etc.) from OpenAI/VLM-compatible picture description backends within docling.

Context and motivation

References: #2271, #2402, #2403

Currently, docling has no built-in way to track resource consumption (tokens, API calls) when using picture description models. This makes it difficult for users to:

  • Monitor and optimize API costs
  • Debug performance issues
  • Implement rate limiting or usage quotas

Initial work was validated as a third-party plugin (#2403) to test the end-to-end flow without modifying core. Based on positive feedback, this PR integrates usage tracking directly into docling's model runtime.

Long-term plan: Move the usage field into docling_core so it becomes part of the canonical annotation data model. This PR focuses on the runtime wiring in docling to keep changes reviewable and unblock immediate usage capture needs.

What's changed

1. New response types for image API calls

  • ApiImageResponse: Carries both generated text and optional usage metadata from image APIs
  • OpenAiResponseUsage: Represents token usage (input_tokens, output_tokens, total_tokens, etc.) from OpenAI-compatible backends
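
As a rough illustration of the two types above, the sketch below models them as plain dataclasses. The real classes in the PR may well be Pydantic models with more fields; the `parse_response` helper and the fallback from `prompt_tokens`/`completion_tokens` to `input_tokens`/`output_tokens` are assumptions made for this example, not code from the PR.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class OpenAiResponseUsage:
    """Token counters reported by an OpenAI-compatible backend (sketch)."""

    input_tokens: int = 0
    output_tokens: int = 0
    total_tokens: int = 0


@dataclass(frozen=True)
class ApiImageResponse:
    """Generated text plus optional usage metadata for one image request."""

    text: str
    usage: Optional[OpenAiResponseUsage] = None


def _first_int(raw: Dict[str, Any], *keys: str) -> int:
    """Return the first key present in `raw` as an int, or 0 if none is."""
    for key in keys:
        if key in raw:
            return int(raw[key])
    return 0


def parse_response(payload: Dict[str, Any]) -> ApiImageResponse:
    """Hypothetical helper: build an ApiImageResponse from a decoded
    chat-completions style JSON body."""
    text = payload["choices"][0]["message"]["content"]
    raw_usage = payload.get("usage")
    if raw_usage is None:
        # Backend did not report usage: leave the field as None.
        return ApiImageResponse(text=text)
    usage = OpenAiResponseUsage(
        input_tokens=_first_int(raw_usage, "input_tokens", "prompt_tokens"),
        output_tokens=_first_int(raw_usage, "output_tokens", "completion_tokens"),
        total_tokens=_first_int(raw_usage, "total_tokens"),
    )
    return ApiImageResponse(text=text, usage=usage)
```

Keeping `usage` optional is what lets backends that report nothing flow through the same type.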

2. Usage metadata storage

  • DescriptionAnnotationWithUsage: Temporary wrapper enabling the runtime to attach usage metadata to each Description annotation produced by picture description models

3. Runtime integration

  • PictureDescriptionBaseModel._annotate_images() now returns Iterable[ApiImageResponse] (previously plain text strings)
  • API-backed and VLM-backed picture description models updated to use the new response type and propagate usage
  • ApiVlmModel updated to decode using response.text instead of raw response objects
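
The wiring described in this list might look roughly like the sketch below: a base class consumes `ApiImageResponse` objects from `_annotate_images()` and copies both text and usage onto the annotation it creates. `PictureDescriberSketch`, `DescriptionWithUsage`, `describe()` and `FakeBackend` are hypothetical stand-ins for illustration, not docling's actual classes or methods.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional


@dataclass
class ApiImageResponse:
    text: str
    usage: Optional[Dict[str, int]] = None


@dataclass
class DescriptionWithUsage:
    """Stand-in for DescriptionAnnotationWithUsage."""

    text: str
    provenance: str
    usage: Optional[Dict[str, int]] = None


class PictureDescriberSketch(ABC):
    """Stand-in for PictureDescriptionBaseModel."""

    provenance = "sketch-model"

    @abstractmethod
    def _annotate_images(self, images: Iterable[Any]) -> Iterable[ApiImageResponse]:
        """Backends return one response object per image."""

    def describe(self, images: Iterable[Any]) -> List[DescriptionWithUsage]:
        # The runtime reads .text and .usage instead of a bare string.
        return [
            DescriptionWithUsage(
                text=response.text,
                provenance=self.provenance,
                usage=response.usage,
            )
            for response in self._annotate_images(images)
        ]


class FakeBackend(PictureDescriberSketch):
    """Toy backend that fabricates a description and a token count."""

    def _annotate_images(self, images: Iterable[Any]) -> Iterable[ApiImageResponse]:
        for index, _ in enumerate(images):
            yield ApiImageResponse(
                text=f"image {index}", usage={"total_tokens": 10 + index}
            )
```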

4. Backward compatibility

  • No behavior change if a backend doesn't report usage: the usage field remains None and pipeline output is unchanged
  • Existing pipelines continue to work without modification
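
Because `usage` may be `None`, any consumer that totals token counts has to skip annotations without it. A minimal sketch of such an aggregation follows; `Usage`, `Annotation` and `total_usage` are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0


@dataclass
class Annotation:
    text: str
    usage: Optional[Usage] = None


def total_usage(annotations: Iterable[Annotation]) -> Usage:
    """Sum token counters, ignoring annotations whose backend reported none."""
    total = Usage()
    for annotation in annotations:
        if annotation.usage is None:
            continue
        total.input_tokens += annotation.usage.input_tokens
        total.output_tokens += annotation.usage.output_tokens
    return total
```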

Breaking changes

⚠️ Subclasses of PictureDescriptionBaseModel must update _annotate_images() to return Iterable[ApiImageResponse] instead of Iterable[str].

Migration example:

# Before
def _annotate_images(self, images: List[Image]) -> Iterable[str]:
    return [self._describe(img) for img in images]

# After
def _annotate_images(self, images: List[Image]) -> Iterable[ApiImageResponse]:
    return [
        ApiImageResponse(text=self._describe(img), usage=self._get_usage())
        for img in images
    ]

Documentation

Limitations and next steps

Current limitation: The canonical Description annotation type lives in docling_core. In this PR, usage is temporarily attached via DescriptionAnnotationWithUsage in docling to validate the wiring.

Proposed follow-up PRs:

  1. docling_core: Add optional usage field to the canonical DescriptionAnnotation
  2. docling: Adopt the new core release, remove temporary DescriptionAnnotationWithUsage, and complete end-to-end integration
  3. Documentation: Add "Usage telemetry" section for picture descriptions; optionally add CLI/debug utilities to print usage per annotation

Testing

Checklist

  • Commit messages follow conventional commits
  • Runtime and types updated consistently
  • Breaking changes documented with migration guide
  • Public documentation (pending docling_core integration)
  • Example code updates (pending docling_core integration)

References

FrigaZzz and others added 2 commits October 11, 2025 13:38
- Introduce ApiImageResponse and OpenAiResponseUsage to carry usage metadata from image API calls
- Add DescriptionAnnotationWithUsage to store usage alongside description text
- Change _annotate_images to return Iterable[ApiImageResponse]; update API and VLM models to comply
- Fix ApiVlmModel to decode responses using response.text instead of the raw response object

Why: enables tracking/reporting of OpenAI/VLM token usage in picture description annotations.

BREAKING CHANGE: subclasses of PictureDescriptionBaseModel must update _annotate_images() to return ApiImageResponse
Signed-off-by: FrigaZzz <[email protected]>

DCO Check Passed

Thanks @FrigaZzz, all your commits are properly signed off. 🎉

dosubot bot commented Oct 11, 2025

Related Documentation

Checked 2 published document(s). No updates required.

mergify bot commented Oct 11, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
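
To check a title locally before pushing, the pattern above can be tried with Python's `re` module. This sketch assumes the `~=` operator behaves like a regular-expression search; `is_conventional` is a hypothetical helper name.

```python
import re

# Pattern copied from the merge protection rule above.
TITLE_PATTERN = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return TITLE_PATTERN.search(title) is not None
```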

@FrigaZzz changed the title from "feat(models): add API usage to picture descriptions; unify response type; fix VLM decoding" to "feat(models): add API usage to picture descriptions; unify response type" on Oct 11, 2025
@dolfim-ibm
Contributor

@FrigaZzz my proposal is to simplify even further and add this directly to the current class. Do you see any issue with it?

@cau-git any other thought?

codecov bot commented Oct 13, 2025

Codecov Report

❌ Patch coverage is 43.58974% with 22 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| docling/utils/api_image_request.py | 28.57% | 15 Missing ⚠️ |
| docling/models/picture_description_vlm_model.py | 44.44% | 5 Missing ⚠️ |
| docling/models/api_vlm_model.py | 0.00% | 1 Missing ⚠️ |
| docling/models/picture_description_base_model.py | 83.33% | 1 Missing ⚠️ |

@FrigaZzz
Author

FrigaZzz commented Oct 13, 2025

> @FrigaZzz my proposal is to simplify even further and add this directly to the current class. Do you see any issue with it?
>
> @cau-git any other thought?

Hi!

The main issue I see with moving the DescriptionAnnotation class from docling-core to the docling package (and adding the usage metadata there) is that it would break the serialization logic in docling-core, causing problems with the export functionality.

The export logic (HTML and Markdown serializers) relies on type checking to verify that annotations derive from the DescriptionAnnotation base type. If we moved this class to the docling package, the isinstance() checks in docling-core would fail unless docling-core depended on docling, which would reverse the intended dependency direction.

These checks are implemented in several places:

They all check: isinstance(annotation, DescriptionAnnotation)

I created DescriptionAnnotationWithUsage as an extension of the base DescriptionAnnotation (which remains in docling-core). Through inheritance, all the internal instance checks in docling-core continue to work seamlessly without any modifications.
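
The inheritance argument can be illustrated with a small sketch. The classes below are simplified dataclass stand-ins (the real types are Pydantic models in docling-core and docling), and `export_descriptions` is a hypothetical function that only mimics the kind of `isinstance()` filter a serializer might apply.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional


@dataclass
class DescriptionAnnotation:
    """Stand-in for the base type that stays in docling-core."""

    text: str
    provenance: str
    kind: str = "description"


@dataclass
class DescriptionAnnotationWithUsage(DescriptionAnnotation):
    """Stand-in for the extended type defined in docling."""

    usage: Optional[Dict[str, int]] = None


def export_descriptions(annotations: Iterable[Any]) -> List[str]:
    """Mimic a serializer that only knows the base type."""
    return [
        annotation.text
        for annotation in annotations
        if isinstance(annotation, DescriptionAnnotation)
    ]
```

The subclass passes the same check as the base type, so code that only imports the base class needs no changes.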

It's actually not a bad idea to keep the fundamental DescriptionAnnotation type in the core package while having a more feature-rich, extended version directly in the docling package. This follows good separation of concerns: docling-core provides the basic primitives, and docling extends them with additional functionality (like usage tracking).

The trade-off: This does introduce some ambiguity for end users. When developing plugin extensions or working with PictureDescriptionBaseModel, they might encounter DescriptionAnnotation from docling-core or DescriptionAnnotationWithUsage from docling, which can be confusing.

The cleanest approach would still be to add the usage field directly to DescriptionAnnotation in docling-core. This would:

  • Avoid having docling-core depend on docling (maintaining proper dependency direction)
  • Eliminate ambiguity by having a single, authoritative annotation type
  • Provide a cleaner API for all users

However, it requires coordinated releases of both packages (docling-core first, then docling), which is why I opted for the current approach to unblock usage tracking as soon as possible.

@dolfim-ibm
Contributor

All correct. I initially missed that the usage is going into the docling-core DescriptionAnnotation class. Given that, I still think it is worth simplifying to a single model runtime that is potentially aware of the token usage count.

I would also like to make it more explicit that usage is model metadata. What about storing it as:

class DescriptionAnnotation(BaseAnnotation):
    """DescriptionAnnotation."""

    kind: Literal["description"] = "description"
    text: str
    provenance: str

    inference_details: dict[str, Any] = {}  # add usage here, or we could make it a BaseModel with usage and potentially other details
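
The typed alternative mentioned in the comment above could look something like the sketch below. It uses dataclasses to stay dependency-free, whereas the proposal refers to a Pydantic BaseModel; `InferenceDetails` and `TokenUsage` are hypothetical names, and the base class is omitted for brevity.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class TokenUsage:
    input_tokens: int = 0
    output_tokens: int = 0


@dataclass
class InferenceDetails:
    """Typed container for model metadata; usage is one possible detail."""

    usage: Optional[TokenUsage] = None


@dataclass
class DescriptionAnnotation:
    text: str
    provenance: str
    kind: str = "description"
    # default_factory gives every annotation its own details object.
    inference_details: InferenceDetails = field(default_factory=InferenceDetails)
```

A typed container keeps the door open for other details later without callers having to guess dictionary keys.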
