[BUG] OOM with Large PDF #3345

Description

@cpldxx

std::bad_alloc OOM on large PDFs (700+ pages) with docling-parse backend

Bug

When processing large PDFs (700+ pages) with StandardPdfPipeline and the default docling-parse C++ backend, std::bad_alloc errors occur during the preprocessing stage, starting around page 300-345 and affecting the majority of remaining pages. The pipeline continues but layout analysis is silently skipped for failed pages.

Environment

  • Docling version: 2.82
  • Python version: 3.x
  • OS: Windows
  • RAM: 32GB
  • VRAM: 8GB
  • PDF: ~700 pages, math-heavy academic paper (linear algebra formulas, pictures, tables)

Steps to Reproduce

  1. Use a 700+ page PDF containing math formulas, images, and tables
  2. Convert with default settings:

    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert("large_academic_paper.pdf")

  3. Observe std::bad_alloc errors starting around page 300+

Error Output

ERROR:docling.pipeline.standard_pdf_pipeline:Stage preprocess failed for run 1, pages [345]: std::bad_alloc
ERROR:docling.pipeline.standard_pdf_pipeline:Stage preprocess failed for run 1, pages [346]: std::bad_alloc
... (continues for 300+ pages)

Root Cause Analysis

The docling-parse C++ backend accumulates memory internally across pages, bypassing the pipeline's streaming architecture and per-page resource cleanup. With 32GB RAM, memory is exhausted by ~page 345 on image/math-heavy PDFs.
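A diagnostic sketch for confirming the accumulation pattern (assumes psutil is installed; the monitor_rss helper and the one-second sample interval are illustrative, not part of docling). If RSS climbs roughly monotonically through the preprocessing stage, the growth is backend-side rather than per-page churn:

    import threading
    import time

    import psutil

    from docling.document_converter import DocumentConverter

    def monitor_rss(stop: threading.Event, samples: list) -> None:
        """Sample this process's resident set size (GiB) until stopped."""
        proc = psutil.Process()
        while not stop.is_set():
            samples.append(proc.memory_info().rss / 2**30)
            time.sleep(1.0)

    stop, samples = threading.Event(), []
    threading.Thread(target=monitor_rss, args=(stop, samples), daemon=True).start()

    result = DocumentConverter().convert("large_academic_paper.pdf")

    stop.set()
    print(f"peak RSS: {max(samples, default=0.0):.1f} GiB across {len(samples)} samples")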

Current Workarounds

  1. Page-range batching — process in 50-100 page chunks:

    results = []
    for start in range(1, 701, 100):
        end = min(start + 99, 700)  # page_range is 1-based and inclusive
        results.append(converter.convert("large.pdf", page_range=(start, end)))

    Problem: Tables spanning batch boundaries are split into separate TableItem objects with no continuation metadata, breaking semantic integrity.

  2. PyPdfiumDocumentBackend — avoids memory accumulation but significantly reduces table extraction quality for complex academic tables (configuration sketch after this list).

  3. Reducing batch sizes (ocr_batch_size=1, layout_batch_size=1) and disabling features — yields marginal improvement but doesn't solve the core issue.
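For workaround 2, a minimal configuration sketch using docling's format-options mechanism (StandardPdfPipeline remains the default pipeline; only the backend is swapped):

    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Swap the default docling-parse backend for pypdfium2, which does not
    # accumulate memory across pages (at the cost of table extraction quality).
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(backend=PyPdfiumDocumentBackend)
        }
    )
    result = converter.convert("large_academic_paper.pdf")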

Feature Request

A built-in mechanism to process large PDFs without accumulating memory in the C++ backend, ideally with:

  1. Incremental memory management — release C++ backend memory per-page or per-batch instead of accumulating across the entire document
  2. Cross-page table continuity — metadata or merging logic to detect and reconstruct tables that span page/batch boundaries (related: Continued table across pages is split into multiple tables #2976); a rough merge heuristic is sketched below
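A sketch of what that merging logic could look like (hypothetical: it only checks page adjacency and column count, while real continuation detection would also compare geometry and headers; attribute names follow docling_core's TableItem/TableData/TableCell and should be verified against the installed version):

    from docling_core.types.doc import TableItem

    def maybe_merge(first: TableItem, second: TableItem) -> bool:
        """Treat `second` as a continuation of `first` if it starts on the
        page right after `first` ends and the column counts match.
        Mutates `first` in place; returns True on merge."""
        if not first.prov or not second.prov:
            return False
        if second.prov[0].page_no != first.prov[-1].page_no + 1:
            return False
        if first.data.num_cols != second.data.num_cols:
            return False
        row_shift = first.data.num_rows
        for cell in second.data.table_cells:
            # Re-base the continuation's row indices under the first table.
            cell.start_row_offset_idx += row_shift
            cell.end_row_offset_idx += row_shift
            first.data.table_cells.append(cell)
        first.data.num_rows += second.data.num_rows
        first.prov.extend(second.prov)
        return True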

Related Issues/PRs

  • Continued table across pages is split into multiple tables #2976

Notes

The v2.74.0 fix for std::bad_alloc resolved some cases but does not address 700+ page image-heavy PDFs on 32GB RAM systems. The StandardPdfPipeline has proper streaming architecture with bounded queues and lazy page loading, but the docling-parse backend's internal memory accumulation bypasses these protections.
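Until a backend-side fix lands, memory can at least be bounded by limiting the backend's lifetime. A sketch combining workaround 1 with explicit cleanup (the chunk size is arbitrary; whether native allocations are released promptly depends on docling internals, so the gc.collect() call is best-effort):

    import gc

    from docling.document_converter import DocumentConverter

    TOTAL_PAGES, CHUNK = 700, 100

    docs = []
    for start in range(1, TOTAL_PAGES + 1, CHUNK):
        end = min(start + CHUNK - 1, TOTAL_PAGES)
        converter = DocumentConverter()  # fresh converter per chunk
        result = converter.convert("large_academic_paper.pdf", page_range=(start, end))
        docs.append(result.document)  # keep the DoclingDocument, drop the rest
        del result, converter  # release references that may pin the C++ backend
        gc.collect()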
