[BUG] OOM with Large PDF #3345

Description

@cpldxx

std::bad_alloc OOM on large PDFs (700+ pages) with docling-parse backend

Bug

When processing large PDFs (700+ pages) with StandardPdfPipeline and the default docling-parse C++ backend, std::bad_alloc errors occur during the preprocessing stage, starting around page 300-345 and affecting the majority of remaining pages. The pipeline continues but layout analysis is silently skipped for failed pages.

Environment

  • Docling version: 2.82
  • Python version: 3.x
  • OS: Windows
  • RAM: 32GB
  • VRAM: 8GB
  • PDF: ~700 pages, math-heavy academic paper (linear algebra formulas, pictures, tables)

Steps to Reproduce

  1. Use a 700+ page PDF containing math formulas, images, and tables
  2. Convert with default settings:

    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert("large_academic_paper.pdf")

  3. Observe std::bad_alloc errors starting around page 300+

Error Output

ERROR:docling.pipeline.standard_pdf_pipeline:Stage preprocess failed for run 1, pages [345]: std::bad_alloc
ERROR:docling.pipeline.standard_pdf_pipeline:Stage preprocess failed for run 1, pages [346]: std::bad_alloc
... (continues for 300+ pages)

Root Cause Analysis

The docling-parse C++ backend accumulates memory internally across pages, bypassing the pipeline's streaming architecture and per-page resource cleanup. With 32GB RAM, memory is exhausted by ~page 345 on image/math-heavy PDFs.
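A diagnostic sketch for confirming the accumulation pattern (assumes psutil is installed; the monitor_rss helper and the one-second sample interval are illustrative, not part of docling). If RSS climbs roughly monotonically through the preprocessing stage, the growth is backend-side rather than per-page churn:

    import threading
    import time

    import psutil

    from docling.document_converter import DocumentConverter

    def monitor_rss(stop: threading.Event, samples: list) -> None:
        """Sample this process's resident set size (GiB) until stopped."""
        proc = psutil.Process()
        while not stop.is_set():
            samples.append(proc.memory_info().rss / 2**30)
            time.sleep(1.0)

    stop, samples = threading.Event(), []
    threading.Thread(target=monitor_rss, args=(stop, samples), daemon=True).start()

    result = DocumentConverter().convert("large_academic_paper.pdf")

    stop.set()
    print(f"peak RSS: {max(samples, default=0.0):.1f} GiB across {len(samples)} samples")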

Current Workarounds

  1. Page-range batching — process in 50-100 page chunks:

    results = []
    for start in range(1, 701, 100):
        end = min(start + 99, 700)  # page_range is 1-based and inclusive
        results.append(converter.convert("large.pdf", page_range=(start, end)))

    Problem: Tables spanning batch boundaries are split into separate TableItem objects with no continuation metadata, breaking semantic integrity.

  2. PyPdfiumDocumentBackend — avoids memory accumulation but significantly reduces table extraction quality for complex academic tables (configuration sketch after this list).

  3. Reducing batch sizes (ocr_batch_size=1, layout_batch_size=1) and disabling features — yields marginal improvement but doesn't solve the core issue.
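For workaround 2, a minimal configuration sketch using docling's format-options mechanism (StandardPdfPipeline remains the default pipeline; only the backend is swapped):

    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Swap the default docling-parse backend for pypdfium2, which does not
    # accumulate memory across pages (at the cost of table extraction quality).
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(backend=PyPdfiumDocumentBackend)
        }
    )
    result = converter.convert("large_academic_paper.pdf")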

Feature Request

A built-in mechanism to process large PDFs without accumulating memory in the C++ backend, ideally with:

  1. Incremental memory management — release C++ backend memory per-page or per-batch instead of accumulating across the entire document
  2. Cross-page table continuity — metadata or merging logic to detect and reconstruct tables that span page/batch boundaries (related: Continued table across pages is split into multiple tables #2976); a rough merge heuristic is sketched below
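A sketch of what that merging logic could look like (hypothetical: it only checks page adjacency and column count, while real continuation detection would also compare geometry and headers; attribute names follow docling_core's TableItem/TableData/TableCell and should be verified against the installed version):

    from docling_core.types.doc import TableItem

    def maybe_merge(first: TableItem, second: TableItem) -> bool:
        """Treat `second` as a continuation of `first` if it starts on the
        page right after `first` ends and the column counts match.
        Mutates `first` in place; returns True on merge."""
        if not first.prov or not second.prov:
            return False
        if second.prov[0].page_no != first.prov[-1].page_no + 1:
            return False
        if first.data.num_cols != second.data.num_cols:
            return False
        row_shift = first.data.num_rows
        for cell in second.data.table_cells:
            # Re-base the continuation's row indices under the first table.
            cell.start_row_offset_idx += row_shift
            cell.end_row_offset_idx += row_shift
            first.data.table_cells.append(cell)
        first.data.num_rows += second.data.num_rows
        first.prov.extend(second.prov)
        return True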

Related Issues/PRs

  • Continued table across pages is split into multiple tables #2976

Notes

The v2.74.0 fix for std::bad_alloc resolved some cases but does not address 700+ page image-heavy PDFs on 32GB RAM systems. The StandardPdfPipeline has proper streaming architecture with bounded queues and lazy page loading, but the docling-parse backend's internal memory accumulation bypasses these protections.
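Until a backend-side fix lands, memory can at least be bounded by limiting the backend's lifetime. A sketch combining workaround 1 with explicit cleanup (the chunk size is arbitrary; whether native allocations are released promptly depends on docling internals, so the gc.collect() call is best-effort):

    import gc

    from docling.document_converter import DocumentConverter

    TOTAL_PAGES, CHUNK = 700, 100

    docs = []
    for start in range(1, TOTAL_PAGES + 1, CHUNK):
        end = min(start + CHUNK - 1, TOTAL_PAGES)
        converter = DocumentConverter()  # fresh converter per chunk
        result = converter.convert("large_academic_paper.pdf", page_range=(start, end))
        docs.append(result.document)  # keep the DoclingDocument, drop the rest
        del result, converter  # release references that may pin the C++ backend
        gc.collect()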
