Supporting re-prompting VLM for picture description if the description is bad #2412

@jamesbraza

Requested feature

While piloting do_picture_description with docling==2.55.1's default picture description model, HuggingFaceTB/SmolVLM-256M-Instruct, I am seeing that almost none of the descriptions are useful. Here are the descriptions I get when processing a large PDF:

  • In
  • This image is a diagram showing different types of sensors and their response functions. The diagram is labeled with the names of the different types of sensors and their response functions. The diagram is divided into two main sections: the left section shows the response function of the sensor, while the right section shows the response function of the sensor. The response function of the sensor is represented by a line graph, and the response function of the sensor is represented by a curve.
  • This
  • In
  • This
  • In
  • This
  • This
  • This
  • In
  • The
  • This
  • The x-axis label is "Time (s)."
  • The y-axis label is "Membrane
  • This
  • In

Most of them (13 of the 16) are a single word. Setting aside HuggingFaceTB/SmolVLM-256M-Instruct's poor performance, this shows that Docling could benefit from a feature where:

  1. If some predicate determines the response is insufficient

    • For example, checking common failure modes: lambda x: x.strip().lower() in {"in", "this", "the"}
  2. The picture description model can be re-prompted with failure details

    Describe this image in a few sentences.
    
    Your previous description of "This" is insufficient.
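
The two steps above could be sketched roughly as follows. This is not Docling's API: describe is a hypothetical callable standing in for whatever sends a prompt (plus the image) to the picture description model, and describe_with_retries, is_insufficient, and max_retries are names made up for illustration. Only the predicate and the re-prompt wording come from this issue.

```python
from typing import Callable

BASE_PROMPT = "Describe this image in a few sentences."
# One-word outputs observed in this issue.
FAILURE_WORDS = {"in", "this", "the"}


def is_insufficient(description: str) -> bool:
    """Predicate from the issue: flag the known one-word failure outputs."""
    return description.strip().lower() in FAILURE_WORDS


def describe_with_retries(
    describe: Callable[[str], str],  # hypothetical: prompt in, description out
    prompt: str = BASE_PROMPT,
    is_bad: Callable[[str], bool] = is_insufficient,
    max_retries: int = 2,  # assumption: cap retries so a stuck model cannot loop forever
) -> str:
    """Call the model, then re-prompt with failure details while the output is bad."""
    description = describe(prompt)
    for _ in range(max_retries):
        if not is_bad(description):
            break
        retry_prompt = (
            f"{prompt}\n\n"
            f'Your previous description of "{description.strip()}" is insufficient.'
        )
        description = describe(retry_prompt)
    # May still be insufficient if every retry failed; the caller decides what to do.
    return description
```

With a stub such as replies = iter(["This", "A line graph."]) and describe=lambda prompt: next(replies), the first reply is rejected and the second is returned. Whether a model this small actually improves when re-prompted is untested here.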
    

Alternatives

Expanding the PictureDescriptionVlmOptions.prompt:

  • Include examples of desirable descriptions
  • Mention a minimum word length
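
As a rough illustration of this alternative, a prompt with both additions could be assembled like this. build_description_prompt, min_words, and the example description are hypothetical and purely illustrative; the resulting string is what one would pass as PictureDescriptionVlmOptions.prompt.

```python
def build_description_prompt(examples: list[str], min_words: int = 20) -> str:
    """Build a prompt that states a minimum length and shows desirable descriptions."""
    lines = [
        "Describe this image in a few sentences.",
        f"Use at least {min_words} words.",
    ]
    if examples:
        lines.append("Examples of good descriptions:")
        lines.extend(f"- {example}" for example in examples)
    return "\n".join(lines)


# Illustrative example text, not taken from any real document.
prompt = build_description_prompt(
    ["A line graph of membrane potential versus time in seconds."],
    min_words=15,
)
```

A longer prompt is cheaper than a retry loop, but it gives no way to detect that a given description is still unusable.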

Labels: enhancement (New feature or request)
