Image formats not generating picture descriptions, only OCR text extraction #2446

@JViktoRArtola

Description

When converting standalone images (PNG, JPG), Docling does not generate picture descriptions. It only extracts text via OCR when the image contains text. The pictures array stays empty for every image, with or without text, so no description of the visual content can be retrieved.

However, if the same image is embedded in a PDF file, Docling correctly generates picture descriptions and populates the pictures array. This inconsistency suggests that the image processing pipeline behaves differently for standalone image formats (PNG, JPG) versus images within PDF documents.

This image features a close-up of an adorable ginger tabby kitten with bright, curious blue eyes. The kitten has a soft, orange-striped coat and is gazing up with an expression of innocence and wonder. The background is softly blurred, drawing attention to the kitten’s sweet and delicate features.

Steps to reproduce

  1. Set up Docling with the following configuration:

    # Entry in the format_options mapping passed to the converter:
    InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options),
    # or InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),

    def _build_full_pipeline_options(self) -> PdfPipelineOptions:
        """Build the full/accurate PdfPipelineOptions configuration.

        Includes OCR, table structure, formula/code enrichment and picture description.
        """
        pipeline_options = PdfPipelineOptions()
        pipeline_options.do_ocr = True
        pipeline_options.do_table_structure = True
        pipeline_options.do_formula_enrichment = True
        pipeline_options.do_code_enrichment = True
        pipeline_options.generate_picture_images = True
        pipeline_options.enable_remote_services = True
        pipeline_options.do_picture_description = True
        pipeline_options.picture_description_options = self._picture_description_options  # OpenAI
        return pipeline_options

  2. Convert two versions of the same image:

    • Image with text (pussInBoots.png): Contains the text "am Puss in Boots"
    • Image without text (pussInBoots_no_text.png): Same image with text removed

    result = converter.convert(image_file)
    markdown_content = result.document.export_to_markdown(image_mode=ImageRefMode.REFERENCED)

  3. Inspect the resulting DoclingDocument objects:

    print(result.document.pictures)  # Returns empty list [] for both images
    print(result.document.texts)     # Returns text only for image with text
  4. Convert the same image embedded in a PDF file and observe that picture descriptions are correctly generated.
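
The inspection in step 3 can be condensed into a small helper that reports, per conversion, how many text items, picture items, and described pictures came back. This is only a minimal sketch. `summarize_document` is a hypothetical name, it assumes nothing beyond `texts` and `pictures` lists on the object, and it assumes that a description shows up as an entry in a picture's `annotations` list whose `kind` is `"description"`. The stand-in objects mimic the results reported in this issue so the sketch runs without Docling installed; in a real run you would pass `result.document` instead.

```python
from types import SimpleNamespace


def summarize_document(doc):
    """Count texts, pictures, and pictures that carry a description.

    Assumption: descriptions live in picture.annotations as entries whose
    `kind` attribute equals "description".
    """
    described = 0
    for picture in doc.pictures:
        annotations = getattr(picture, "annotations", None) or []
        if any(getattr(a, "kind", None) == "description" for a in annotations):
            described += 1
    return {
        "texts": len(doc.texts),
        "pictures": len(doc.pictures),
        "described_pictures": described,
    }


# Stand-ins shaped like the outputs reported in this issue (not real Docling objects).
image_with_text = SimpleNamespace(
    texts=[SimpleNamespace(text="am Puss in Boots")], pictures=[]
)
image_without_text = SimpleNamespace(texts=[], pictures=[])

print(summarize_document(image_with_text))
print(summarize_document(image_without_text))
```

Running the same helper on the PDF conversion from step 4 gives a side-by-side view of where the two paths diverge.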

Expected behavior: The pictures array should contain picture items with descriptions of the visual content (e.g., "A cat wearing boots and a hat") for all images, regardless of whether text is present and regardless of whether the image is standalone or embedded in a PDF.

Actual behavior:

  • For standalone images with text: Only OCR text extraction occurs (texts=['am Puss in Boots']), no picture description generated (pictures=[])
  • For standalone images without text: No content extracted at all (texts=[], pictures=[])
  • For images in PDF files: Picture descriptions are correctly generated and populate the pictures array
  • The image processing pipeline appears to focus exclusively on text extraction for standalone image formats and does not generate visual content descriptions
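
The gap between expected and actual behavior can be written down as a check that could later serve as a regression test. The sketch below is illustrative only: `unmet_expectations` is a hypothetical helper working on plain counts, and the counts for the PDF case are assumed values (this report states that descriptions are generated for PDFs but gives no numbers).

```python
def unmet_expectations(label, picture_count, described_count):
    """Return a list of problems for one conversion; empty means it looks fine."""
    problems = []
    if picture_count == 0:
        problems.append(f"{label}: pictures is empty")
    elif described_count < picture_count:
        missing = picture_count - described_count
        problems.append(f"{label}: {missing} picture(s) lack a description")
    return problems


# (label, number of pictures, number of described pictures)
cases = [
    ("standalone PNG with text", 0, 0),     # as reported: pictures=[]
    ("standalone PNG without text", 0, 0),  # as reported: pictures=[]
    ("same image inside a PDF", 1, 1),      # assumed counts, for illustration
]

for label, pictures, described in cases:
    for problem in unmet_expectations(label, pictures, described):
        print(problem)
```

With the values reported here, only the two standalone cases produce a problem line, which is exactly the inconsistency this issue describes.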

Example output for image WITH text:

    schema_name='DoclingDocument' version='1.7.0' name='pussInBoots'
    texts=[TextItem(..., orig='am Puss in Boots', text='am Puss in Boots', ...)]
    pictures=[]  # Empty! No image description generated

Example output for image WITHOUT text:

    schema_name='DoclingDocument' version='1.7.0' name='pussInBoots_no_text'
    texts=[]  # Empty, as expected
    pictures=[]  # Empty! Should contain image description

Docling version

Docling version: 2.55.1
Docling Core version: 2.48.4
Docling IBM Models version: 3.9.1
Docling Parse version: 4.5.0
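
For anyone trying to reproduce this with a matching environment, the installed versions can be read with the standard library. This is a minimal sketch: `collect_versions` is a hypothetical helper, and the names in `PACKAGES` are assumed to be the distribution names of the four components listed above.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names for the components listed above.
PACKAGES = ("docling", "docling-core", "docling-ibm-models", "docling-parse")


def collect_versions(packages=PACKAGES):
    """Map each distribution name to its installed version string."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


for name, value in collect_versions().items():
    print(f"{name}: {value}")
```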

Python version

Python 3.13.7

Attachments

images.zip

Logs

schema_name='DoclingDocument' version='1.7.0' name='pussInBoots' origin=DocumentOrigin(mimetype='application/pdf', binary_hash=16517824524666051744, filename='pussInBoots.png', uri=None) furniture=GroupItem(self_ref='#/furniture', parent=None, children=[], content_layer=<ContentLayer.FURNITURE: 'furniture'>, name='_root_', label=<GroupLabel.UNSPECIFIED: 'unspecified'>) body=GroupItem(self_ref='#/body', parent=None, children=[RefItem(cref='#/texts/0')], content_layer=<ContentLayer.BODY: 'body'>, name='_root_', label=<GroupLabel.UNSPECIFIED: 'unspecified'>) groups=[] texts=[TextItem(self_ref='#/texts/0', parent=RefItem(cref='#/body'), children=[], content_layer=<ContentLayer.BODY: 'body'>, label=<DocItemLabel.TEXT: 'text'>, prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=158.33333333333334, t=1115.6666666666667, r=824.6666666666666, b=1034.6666666666667, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 16))], orig='am Puss in Boots', text='am Puss in Boots', formatting=None, hyperlink=None)] pictures=[] tables=[] key_value_items=[] form_items=[] pages={1: PageItem(size=Size(width=1000.0, height=1250.0), image=None, page_no=1)}

Labels

bug (Something isn't working)
