feat(models): add API usage to picture descriptions; unify response type #2445
- Introduce ApiImageResponse and OpenAiResponseUsage to carry usage metadata from image API calls
- Add DescriptionAnnotationWithUsage to store usage alongside description text
- Change _annotate_images to return Iterable[ApiImageResponse]; update API and VLM models to comply
- Fix ApiVlmModel to decode responses using response.text instead of the raw response object
Why: enables tracking/reporting of OpenAI/VLM token usage in picture description annotations.
BREAKING CHANGE: subclasses of PictureDescriptionBaseModel must update _annotate_images() to return ApiImageResponse
Signed-off-by: FrigaZzz <[email protected]>
…description-model-2271
✅ DCO Check Passed
Thanks @FrigaZzz, all your commits are properly signed off. 🎉
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Hi! The main issue I see with moving the The export logic (HTML and Markdown serializers) relies on type checking to verify that annotations derive from the These checks are implemented in several places:
They all check: I created It's actually not a bad idea to keep the fundamental The trade-off: This does introduce some ambiguity for end users. When developing plugin extensions or working with The cleanest approach would still be to add the
All correct. I initially missed that the usage is going in the docling-core
I would anyway like to emphasize a bit more the fact that

    class DescriptionAnnotation(BaseAnnotation):
        """DescriptionAnnotation."""

        kind: Literal["description"] = "description"
        text: str
        provenance: str
        # Add usage here, or make this a BaseModel with usage and potentially other details.
        inference_details: dict[str, Any] = {}
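To make the suggestion above concrete, here is a minimal sketch of the "make it a BaseModel with usage and potentially other details" variant. It uses standard-library dataclasses as a stand-in for the real models so that it runs on its own. The names UsageDetails, InferenceDetails, and DescriptionAnnotationSketch are hypothetical and not part of docling or docling-core; only the token field names are taken from the PR description.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class UsageDetails:
    """Hypothetical token counters; field names follow the PR description."""

    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None
    total_tokens: Optional[int] = None


@dataclass
class InferenceDetails:
    """Hypothetical typed alternative to a bare dict[str, Any]."""

    usage: Optional[UsageDetails] = None
    extra: dict[str, Any] = field(default_factory=dict)


@dataclass
class DescriptionAnnotationSketch:
    """Stand-in for DescriptionAnnotation; NOT the docling-core class."""

    text: str
    provenance: str
    kind: str = "description"
    inference_details: InferenceDetails = field(default_factory=InferenceDetails)


def is_description(annotation: object) -> bool:
    # Mimics a serializer-style isinstance check. Because usage is stored in
    # a field and not in a separate class, the check needs no changes.
    return isinstance(annotation, DescriptionAnnotationSketch)


# Usage: one annotation with usage attached, one without.
with_usage = DescriptionAnnotationSketch(
    text="A bar chart",
    provenance="some-model",
    inference_details=InferenceDetails(usage=UsageDetails(12, 5, 17)),
)
without_usage = DescriptionAnnotationSketch(text="A photo", provenance="some-model")
```

Under these assumptions, both annotations pass the same type check, and callers that do not care about usage can ignore inference_details entirely.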
This PR introduces standardized capture and propagation of usage metadata (token counters, etc.) from OpenAI/VLM-compatible picture description backends within docling.
Context and motivation
References: #2271, #2402, #2403
Currently, docling has no built-in way to track resource consumption (tokens, API calls) when using picture description models. This makes it difficult for users to:
Initial work was validated as a third-party plugin (#2403) to test the end-to-end flow without modifying core. Based on positive feedback, this PR integrates usage tracking directly into docling's model runtime.
Long-term plan: Move the usage field into docling_core so it becomes part of the canonical annotation data model. This PR focuses on the runtime wiring in docling to keep changes reviewable and unblock immediate usage capture needs.
What's changed
1. New response types for image API calls
- ApiImageResponse: Carries both generated text and optional usage metadata from image APIs
- OpenAiResponseUsage: Represents token usage (input_tokens, output_tokens, total_tokens, etc.) from OpenAI-compatible backends
2. Usage metadata storage
- DescriptionAnnotationWithUsage: Temporary wrapper enabling the runtime to attach usage metadata to each Description annotation produced by picture description models
3. Runtime integration
- PictureDescriptionBaseModel._annotate_images() now returns Iterable[ApiImageResponse] (previously plain text strings)
- ApiVlmModel updated to decode using response.text instead of raw response objects
4. Backward compatibility
- usage field remains None and pipeline output is unchanged
Breaking changes
- Subclasses of PictureDescriptionBaseModel must update _annotate_images() to return ApiImageResponse instead of str.
Migration example:
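The change for a subclass might look roughly like the sketch below. It is illustrative, not the PR's actual code. ApiImageResponse and OpenAiResponseUsage are re-declared here as plain dataclasses so the snippet runs on its own, and the real classes may carry more fields. The two model classes are hypothetical and do not inherit from PictureDescriptionBaseModel, and the images parameter is typed loosely as an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class OpenAiResponseUsage:
    """Stand-in for the PR's usage type; the real class may have more fields."""

    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None
    total_tokens: Optional[int] = None


@dataclass
class ApiImageResponse:
    """Stand-in: generated text plus optional usage metadata."""

    text: str
    usage: Optional[OpenAiResponseUsage] = None


class MyDescriptionModelBefore:
    """Hypothetical subclass under the old contract: yields plain strings."""

    def _annotate_images(self, images: Iterable[object]) -> Iterator[str]:
        for _ in images:
            yield "a placeholder description"


class MyDescriptionModelAfter:
    """Hypothetical subclass under the new contract: yields ApiImageResponse."""

    def _annotate_images(
        self, images: Iterable[object]
    ) -> Iterator[ApiImageResponse]:
        for _ in images:
            # A backend that reports no usage data simply leaves usage as None.
            yield ApiImageResponse(text="a placeholder description")


# Usage: the generated text is the same; only the wrapper type changes.
old_results = list(MyDescriptionModelBefore()._annotate_images([object()]))
new_results = list(MyDescriptionModelAfter()._annotate_images([object()]))
```

A backend that does report token counts would pass usage=OpenAiResponseUsage(...) when building each response.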
Documentation
- docling_core integration to document the usage field on description annotations
Limitations and next steps
Current limitation: The canonical Description annotation type lives in docling_core. In this PR, usage is temporarily attached via DescriptionAnnotationWithUsage in docling to validate the wiring.
Proposed follow-up PRs:
- docling_core: Add optional usage field to the canonical DescriptionAnnotation
- docling: Adopt the new core release, remove temporary DescriptionAnnotationWithUsage, and complete end-to-end integration
Testing
Checklist
- docling_core integration)
- docling_core integration)
References