Open
Labels
enhancement: New feature or request
Description
Requested feature
Piloting `do_picture_description` with `docling==2.55.1`'s default picture description model `HuggingFaceTB/SmolVLM-256M-Instruct`, I am seeing that almost all descriptions are not useful. Here's a bulleted list of the descriptions I get when processing a big PDF:
- In
- This image is a diagram showing different types of sensors and their response functions. The diagram is labeled with the names of the different types of sensors and their response functions. The diagram is divided into two main sections: the left section shows the response function of the sensor, while the right section shows the response function of the sensor. The response function of the sensor is represented by a line graph, and the response function of the sensor is represented by a curve.
- This
- In
- This
- In
- This
- This
- This
- In
- The
- This
- The x-axis label is "Time (s)."
- The y-axis label is "Membrane
- This
- In
Most of them are a single word. Setting aside `HuggingFaceTB/SmolVLM-256M-Instruct`'s poor performance, this shows that Docling could benefit from a feature where:
- If some predicate determines the response is insufficient
  - For example, checking common failure modes: `lambda x: x.strip().lower() in {"in", "this", "the"}`
- The picture description model can be re-prompted with failure details
  - For example: `Describe this image in a few sentences. Your previous description of "This" is insufficient.`
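To make the proposal concrete, here is a minimal sketch of the predicate-plus-re-prompt loop. It is not Docling's API: `describe` is a hypothetical callable standing in for whatever runs the picture description model on one image, and `is_insufficient`, `describe_with_retry`, `min_words`, and `max_retries` are names I made up for illustration.

```python
from typing import Callable

# Degenerate outputs taken from the list of descriptions above.
FAILURE_WORDS = {"in", "this", "the"}


def is_insufficient(description: str, min_words: int = 3) -> bool:
    """Return True for known one-word failures or very short descriptions.

    The min_words threshold is an assumption, not something Docling defines.
    """
    text = description.strip()
    return text.lower() in FAILURE_WORDS or len(text.split()) < min_words


def describe_with_retry(
    describe: Callable[[str], str],  # hypothetical: prompt -> description
    base_prompt: str = "Describe this image in a few sentences.",
    max_retries: int = 2,
) -> str:
    """Call the model once, then re-prompt with failure details if needed."""
    description = describe(base_prompt)
    for _ in range(max_retries):
        if not is_insufficient(description):
            break
        # Tell the model what went wrong with its previous answer.
        prompt = (
            f"{base_prompt} Your previous description of "
            f'"{description.strip()}" is insufficient.'
        )
        description = describe(prompt)
    # May still be insufficient if every retry failed; the caller decides
    # whether to keep or drop it.
    return description
```

Capping the retries keeps the processing time bounded on a big PDF, at the cost of occasionally returning a description that still fails the predicate.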
Alternatives
Expanding the `PictureDescriptionVlmOptions.prompt`:
- Include examples of desirable descriptions
- Mention a minimum word length
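As a rough illustration of this alternative, the sketch below assembles an expanded prompt string. `build_prompt` is a hypothetical helper, the example description is invented for illustration, and the way the string reaches Docling is only indicated in a comment, since I have not verified the exact configuration against this version.

```python
def build_prompt(min_words: int, examples: list[str]) -> str:
    """Build a prompt that states a minimum length and shows good examples."""
    lines = [
        "Describe this image in a few sentences.",
        f"Use at least {min_words} words.",
    ]
    if examples:
        lines.append("Examples of good descriptions:")
        lines.extend(f"- {example}" for example in examples)
    return "\n".join(lines)


# Illustrative example text, not taken from any real output.
prompt = build_prompt(
    min_words=20,
    examples=[
        "A line graph of membrane potential over time, with the x-axis "
        "labeled in seconds and a single sharp peak near the start.",
    ],
)

# Assumption: the string would then be set as the `prompt` field of
# PictureDescriptionVlmOptions, as named in this issue.
```

A model as small as the default one may not follow such instructions reliably, so this complements the re-prompting idea and does not replace it.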