Description
When converting images with Docling, the library does not generate picture descriptions for standalone image formats. It only performs OCR text extraction when text is present in the image. The `pictures` array remains empty (`pictures=[]`) for all images, regardless of whether they contain text, making it impossible to retrieve any description of the visual content.

However, if the same image is embedded in a PDF file, Docling correctly generates picture descriptions and populates the `pictures` array. This inconsistency suggests that the image processing pipeline behaves differently for standalone image formats (PNG, JPG) than for images within PDF documents.
This image features a close-up of an adorable ginger tabby kitten with bright, curious blue eyes. The kitten has a soft, orange-striped coat and is gazing up with an expression of innocence and wonder. The background is softly blurred, drawing attention to the kitten’s sweet and delicate features.
Steps to reproduce
- Set up Docling with the following configuration:

  ```python
  InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)
  # or InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),

  def _build_full_pipeline_options(self) -> PdfPipelineOptions:
      """Build the full/accurate PdfPipelineOptions configuration.

      Includes OCR, table structure, formula/code enrichment and picture description.
      """
      pipeline_options = PdfPipelineOptions()
      pipeline_options.do_ocr = True
      pipeline_options.do_table_structure = True
      pipeline_options.do_formula_enrichment = True
      pipeline_options.do_code_enrichment = True
      pipeline_options.generate_picture_images = True
      pipeline_options.enable_remote_services = True
      pipeline_options.do_picture_description = True
      pipeline_options.picture_description_options = self._picture_description_options  # OpenAI
      return pipeline_options
  ```
- Convert two versions of the same image:
  - Image with text (`pussInBoots.png`): contains the text "am Puss in Boots"
  - Image without text (`pussInBoots_no_text.png`): same image with the text removed

  ```python
  result = converter.convert(image_file)
  markdown_content = result.document.export_to_markdown(image_mode=ImageRefMode.REFERENCED)
  ```
- Inspect the resulting `DoclingDocument` objects:

  ```python
  print(result.document.pictures)  # Returns empty list [] for both images
  print(result.document.texts)  # Returns text only for image with text
  ```
- Convert the same image embedded in a PDF file and observe that picture descriptions are correctly generated.
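To make the inspection step easier to compare across runs, a small helper along these lines could condense a converted document into its texts and picture descriptions. This is only a sketch: `summarize` is a hypothetical name, and it assumes that picture items expose an `annotations` list whose entries carry a `text` attribute, which may not hold for every Docling Core version. It uses duck typing, so the example runs against a stand-in object rather than a real `DoclingDocument`.

```python
from types import SimpleNamespace


def summarize(doc):
    """Collect text strings and picture descriptions from a document-like object."""
    texts = [getattr(t, "text", "") for t in getattr(doc, "texts", [])]
    descriptions = []
    for pic in getattr(doc, "pictures", []):
        # Assumption: descriptions are stored as annotation entries with a `text` attribute.
        found = [a.text for a in getattr(pic, "annotations", []) if getattr(a, "text", None)]
        descriptions.append(found)
    return {
        "texts": texts,
        "picture_count": len(descriptions),
        "descriptions": descriptions,
    }


# Stand-in mirroring the reported result for pussInBoots.png (not a real DoclingDocument).
reported = SimpleNamespace(
    texts=[SimpleNamespace(text="am Puss in Boots")],
    pictures=[],
)
summary = summarize(reported)
print(summary)
```

With a real conversion result, the call would be `summarize(result.document)`; a `picture_count` of 0 corresponds to the empty `pictures` list reported in this issue.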
Expected behavior: The `pictures` array should contain picture items with descriptions of the visual content (e.g., "A cat wearing boots and a hat") for all images, regardless of whether text is present and regardless of whether the image is standalone or embedded in a PDF.
Actual behavior:
- For standalone images with text: only OCR text extraction occurs (`texts=['am Puss in Boots']`); no picture description is generated (`pictures=[]`)
- For standalone images without text: no content is extracted at all (`texts=[]`, `pictures=[]`)
- For images in PDF files: picture descriptions are correctly generated and populate the `pictures` array
- The image processing pipeline appears to focus exclusively on text extraction for standalone image formats and does not generate visual content descriptions
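The three outcomes above could be turned into a quick check that flags every input whose `pictures` list came back empty. The sketch below is illustrative only: `missing_picture_descriptions` is a hypothetical helper, the objects are stand-ins rather than real conversion results, and the PDF file name is assumed for the example.

```python
from types import SimpleNamespace


def missing_picture_descriptions(docs):
    """Return the names of documents whose `pictures` list is empty or absent."""
    return [name for name, doc in docs.items() if not getattr(doc, "pictures", [])]


# Stand-ins for the three reported outcomes; "pussInBoots.pdf" is a hypothetical name.
observed = {
    "pussInBoots.png": SimpleNamespace(texts=["am Puss in Boots"], pictures=[]),
    "pussInBoots_no_text.png": SimpleNamespace(texts=[], pictures=[]),
    "pussInBoots.pdf": SimpleNamespace(texts=[], pictures=[SimpleNamespace()]),
}
failing = missing_picture_descriptions(observed)
print(failing)  # ['pussInBoots.png', 'pussInBoots_no_text.png']
```

Under the expected behavior, the returned list would be empty for all three inputs.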
Example output for image WITH text:

```python
schema_name='DoclingDocument'
version='1.7.0'
name='pussInBoots'
texts=[TextItem(..., orig='am Puss in Boots', text='am Puss in Boots', ...)]
pictures=[]  # Empty! No image description generated
```

Example output for image WITHOUT text:

```python
schema_name='DoclingDocument'
version='1.7.0'
name='pussInBoots_no_text'
texts=[]  # Empty, as expected
pictures=[]  # Empty! Should contain image description
```
Docling version
Docling version: 2.55.1
Docling Core version: 2.48.4
Docling IBM Models version: 3.9.1
Docling Parse version: 4.5.0
Python version
Python 3.13.7
Attachments
Logs
```text
schema_name='DoclingDocument'
version='1.7.0'
name='pussInBoots'
origin=DocumentOrigin(mimetype='application/pdf', binary_hash=16517824524666051744, filename='pussInBoots.png', uri=None)
furniture=GroupItem(self_ref='#/furniture', parent=None, children=[], content_layer=<ContentLayer.FURNITURE: 'furniture'>, name='_root_', label=<GroupLabel.UNSPECIFIED: 'unspecified'>)
body=GroupItem(self_ref='#/body', parent=None, children=[RefItem(cref='#/texts/0')], content_layer=<ContentLayer.BODY: 'body'>, name='_root_', label=<GroupLabel.UNSPECIFIED: 'unspecified'>)
groups=[]
texts=[TextItem(self_ref='#/texts/0', parent=RefItem(cref='#/body'), children=[], content_layer=<ContentLayer.BODY: 'body'>, label=<DocItemLabel.TEXT: 'text'>, prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=158.33333333333334, t=1115.6666666666667, r=824.6666666666666, b=1034.6666666666667, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 16))], orig='am Puss in Boots', text='am Puss in Boots', formatting=None, hyperlink=None)]
pictures=[]
tables=[]
key_value_items=[]
form_items=[]
pages={1: PageItem(size=Size(width=1000.0, height=1250.0), image=None, page_no=1)}
```