std::bad_alloc OOM on large PDFs (700+ pages) with docling-parse backend
Bug
When processing large PDFs (700+ pages) with StandardPdfPipeline and the default docling-parse C++ backend, std::bad_alloc errors occur during the preprocessing stage, starting around pages 300-345 and affecting the majority of remaining pages. The pipeline continues, but layout analysis is silently skipped for the failed pages.

Environment
Steps to Reproduce
Observe std::bad_alloc errors starting around page 300+

Error Output

ERROR:docling.pipeline.standard_pdf_pipeline:Stage preprocess failed for run 1, pages [345]: std::bad_alloc
ERROR:docling.pipeline.standard_pdf_pipeline:Stage preprocess failed for run 1, pages [346]: std::bad_alloc
... (continues for 300+ pages)
Root Cause Analysis
The docling-parse C++ backend accumulates memory internally across pages, bypassing the pipeline's streaming architecture and per-page resource cleanup. With 32GB RAM, memory is exhausted by ~page 345 on image/math-heavy PDFs.
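One way to confirm the accumulation is a minimal sketch that samples process RSS from a background thread while a single full conversion runs; psutil, the file name, and the sampling interval are illustrative assumptions:

import threading, time
import psutil
from docling.document_converter import DocumentConverter

def sample_rss(stop: threading.Event, interval: float = 5.0) -> None:
    # Print resident set size periodically; it should climb steadily
    # with page count if the backend never releases per-page memory.
    proc = psutil.Process()
    while not stop.is_set():
        print(f"RSS: {proc.memory_info().rss / 2**30:.2f} GiB")
        time.sleep(interval)

stop = threading.Event()
threading.Thread(target=sample_rss, args=(stop,), daemon=True).start()
try:
    DocumentConverter().convert("large.pdf")  # hypothetical 700-page input
finally:
    stop.set()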
Current Workarounds

Page-range batching — process in 50-100 page chunks:
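A minimal sketch, assuming the page_range argument to DocumentConverter.convert available in recent docling releases; the chunk size, page count, and file name are illustrative:

from docling.document_converter import DocumentConverter

SOURCE, TOTAL_PAGES, CHUNK = "large.pdf", 700, 100

chunks = []
for start in range(1, TOTAL_PAGES + 1, CHUNK):
    end = min(start + CHUNK - 1, TOTAL_PAGES)
    # Fresh converter per chunk so the docling-parse backend's internal
    # allocations can be released between runs.
    converter = DocumentConverter()
    chunks.append(converter.convert(SOURCE, page_range=(start, end)))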
Problem: Tables spanning batch boundaries are split into separate TableItem objects with no continuation metadata, breaking semantic integrity.

PyPdfiumDocumentBackend — avoids memory accumulation but significantly reduces table extraction quality for complex academic tables.
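A sketch of the backend swap, with import paths as given in the current docling documentation (verify against your installed version); the file name is illustrative:

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

# Route PDF input through the pypdfium2-based backend instead of
# docling-parse, trading table-structure quality for bounded memory.
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(backend=PyPdfiumDocumentBackend)
    }
)
result = converter.convert("large.pdf")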
Reducing batch sizes (ocr_batch_size=1, layout_batch_size=1) and disabling features — marginal improvement; it doesn't solve the core issue.
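A sketch of the feature-disabling variant; do_ocr and do_table_structure are standard PdfPipelineOptions fields, while the ocr_batch_size/layout_batch_size settings named above are quoted from this report and may not exist under those names in every docling version:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Shrink the per-page working set by skipping optional stages.
opts = PdfPipelineOptions(do_ocr=False, do_table_structure=False)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
result = converter.convert("large.pdf")  # illustrative file name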
Feature Request

A built-in mechanism to process large PDFs without accumulating memory in the C++ backend, ideally with:
Related Issues/PRs
Notes
The v2.74.0 fix for std::bad_alloc resolved some cases but does not address 700+ page image-heavy PDFs on 32GB RAM systems. The StandardPdfPipeline has proper streaming architecture with bounded queues and lazy page loading, but the docling-parse backend's internal memory accumulation bypasses these protections.