Image formats not generating picture descriptions, only OCR text extraction #2446

When converting standalone images with Docling, the library **does not generate picture descriptions**. It only performs OCR text extraction, and only when the image contains text. The `pictures` list stays empty for every image, with or without text, so no description of the visual content can be retrieved.

**However**, if the **same image is embedded in a PDF file**, Docling correctly generates picture descriptions and populates the `pictures` array. This inconsistency suggests that the image processing pipeline behaves differently for standalone image formats (PNG, JPG) versus images within PDF documents.

> _This image features a close-up of an adorable ginger tabby kitten with bright, curious blue eyes. The kitten has a soft, orange-striped coat and is gazing up with an expression of innocence and wonder. The background is softly blurred, drawing attention to the kitten’s sweet and delicate features._

### Steps to reproduce
1. Set up Docling with the following configuration:
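   ```python
   from docling.datamodel.base_models import InputFormat
   from docling.datamodel.pipeline_options import PdfPipelineOptions
   from docling.document_converter import DocumentConverter, ImageFormatOption, PdfFormatOption

   # Method of the wrapper class that builds the pipeline options (excerpt).
   def _build_full_pipeline_options(self) -> PdfPipelineOptions:
       """Build the full/accurate PdfPipelineOptions configuration.

       Includes OCR, table structure, formula/code enrichment and picture description.
       """
       pipeline_options = PdfPipelineOptions()
       pipeline_options.do_ocr = True
       pipeline_options.do_table_structure = True
       pipeline_options.do_formula_enrichment = True
       pipeline_options.do_code_enrichment = True
       pipeline_options.generate_picture_images = True
       pipeline_options.enable_remote_services = True
       pipeline_options.do_picture_description = True
       pipeline_options.picture_description_options = self._picture_description_options  # OpenAI
       return pipeline_options

   # pipeline_options below is the value returned by _build_full_pipeline_options().
   format_options = {
       InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options),
       # or InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
   }
   converter = DocumentConverter(format_options=format_options)
   ```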

2. Convert two versions of the same image:
   - **Image with text** (`pussInBoots.png`): Contains the text "am Puss in Boots"
   - **Image without text** (`pussInBoots_no_text.png`): Same image with text removed

   ```python
   from docling_core.types.doc import ImageRefMode

   result = converter.convert(image_file)
   markdown_content = result.document.export_to_markdown(image_mode=ImageRefMode.REFERENCED)
   ```

3. Inspect the resulting `DoclingDocument` objects:
   ```python
   print(result.document.pictures)  # Returns empty list [] for both images
   print(result.document.texts)     # Returns text only for image with text
   ```

4. Convert the same image embedded in a PDF file and observe that picture descriptions are correctly generated.
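
To make the comparison in steps 3 and 4 easier to repeat, the counts can be collected per input file. This is only a minimal sketch: `count_items` and `format_report` are hypothetical helper names, and the `SimpleNamespace` objects are stand-ins that mimic the two results shown in this report rather than real `DoclingDocument` instances. The helpers read only the `texts` and `pictures` attributes, so a real `result.document` should work in place of the stand-ins.

```python
from types import SimpleNamespace


def count_items(doc):
    """Return (number of text items, number of picture items) for a document.

    Uses attribute access only, so a real DoclingDocument or a stand-in works.
    """
    return len(doc.texts), len(doc.pictures)


def format_report(results):
    """Build one line per input so standalone and PDF runs can be compared."""
    lines = []
    for name, doc in results.items():
        n_texts, n_pictures = count_items(doc)
        lines.append(f"{name}: texts={n_texts} pictures={n_pictures}")
    return lines


# Stand-ins that mimic the two standalone-image results from this report.
results = {
    "pussInBoots.png": SimpleNamespace(
        texts=[SimpleNamespace(text="am Puss in Boots")], pictures=[]
    ),
    "pussInBoots_no_text.png": SimpleNamespace(texts=[], pictures=[]),
}
report = format_report(results)
print("\n".join(report))
```

Adding the document converted from the PDF to the same `results` dict puts all three cases side by side.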

**Expected behavior:** The `pictures` array should contain picture items with descriptions of the visual content (e.g., "A cat wearing boots and a hat") for all images, regardless of whether text is present and regardless of whether the image is standalone or embedded in a PDF.
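
The expectation can also be phrased as a small check, sketched here under assumptions. `find_description_problems` is a hypothetical helper, and it assumes that a picture item carries its description in an `annotations` list whose entries have `kind == "description"` and a `text` attribute, which may differ between docling-core versions. The stand-in objects below are not real Docling types.

```python
from types import SimpleNamespace


def find_description_problems(doc):
    """Return a list of reasons why a document misses the expected descriptions."""
    problems = []
    pictures = list(getattr(doc, "pictures", []))
    if not pictures:
        problems.append("no picture items")
    for index, picture in enumerate(pictures):
        # Assumption: descriptions are annotations with kind == "description".
        texts = [
            getattr(annotation, "text", "")
            for annotation in getattr(picture, "annotations", [])
            if getattr(annotation, "kind", None) == "description"
        ]
        if not any(text.strip() for text in texts):
            problems.append(f"picture {index} has no description")
    return problems


# Stand-in for the reported standalone-image result: no picture items at all.
standalone = SimpleNamespace(texts=[], pictures=[])

# Stand-in for the expected result: one picture with a description.
expected = SimpleNamespace(
    texts=[],
    pictures=[
        SimpleNamespace(
            annotations=[
                SimpleNamespace(kind="description", text="A cat wearing boots and a hat")
            ]
        )
    ],
)

print(find_description_problems(standalone))
print(find_description_problems(expected))
```

An empty list means the document meets the expectation; any entry names what is missing.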

**Actual behavior:** 
- For **standalone images with text**: Only OCR text extraction occurs (`texts=['am Puss in Boots']`), no picture description generated (`pictures=[]`)
- For **standalone images without text**: No content extracted at all (`texts=[]`, `pictures=[]`)
- For **images in PDF files**: Picture descriptions are correctly generated and populate the `pictures` array
- The image processing pipeline appears to focus exclusively on text extraction for standalone image formats and does not generate visual content descriptions

**Example output for image WITH text:**

```python
schema_name='DoclingDocument'
version='1.7.0'
name='pussInBoots'
texts=[TextItem(..., orig='am Puss in Boots', text='am Puss in Boots', ...)]
pictures=[]  # Empty! No image description generated
```

**Example output for image WITHOUT text:**

```python
schema_name='DoclingDocument'
version='1.7.0'
name='pussInBoots_no_text'
texts=[]  # Empty, as expected
pictures=[]  # Empty! Should contain image description
```

### Docling version
```
Docling version: 2.55.1
Docling Core version: 2.48.4
Docling IBM Models version: 3.9.1
Docling Parse version: 4.5.0
```

### Python version
```
Python 3.13.7
```

### Attachments
[images.zip](https://github.com/user-attachments/files/22866808/images.zip)


### Logs

```python
schema_name='DoclingDocument' version='1.7.0' name='pussInBoots' origin=DocumentOrigin(mimetype='application/pdf', binary_hash=16517824524666051744, filename='pussInBoots.png', uri=None) furniture=GroupItem(self_ref='#/furniture', parent=None, children=[], content_layer=<ContentLayer.FURNITURE: 'furniture'>, name='_root_', label=<GroupLabel.UNSPECIFIED: 'unspecified'>) body=GroupItem(self_ref='#/body', parent=None, children=[RefItem(cref='#/texts/0')], content_layer=<ContentLayer.BODY: 'body'>, name='_root_', label=<GroupLabel.UNSPECIFIED: 'unspecified'>) groups=[] texts=[TextItem(self_ref='#/texts/0', parent=RefItem(cref='#/body'), children=[], content_layer=<ContentLayer.BODY: 'body'>, label=<DocItemLabel.TEXT: 'text'>, prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=158.33333333333334, t=1115.6666666666667, r=824.6666666666666, b=1034.6666666666667, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 16))], orig='am Puss in Boots', text='am Puss in Boots', formatting=None, hyperlink=None)] pictures=[] tables=[] key_value_items=[] form_items=[] pages={1: PageItem(size=Size(width=1000.0, height=1250.0), image=None, page_no=1)}
```
