fix: manage PDFium backend resource lifecycles to avoid SIGSEGV/SIGTRAP crashes #3180

Merged: cau-git merged 8 commits into main from cau/pdfium-managed-lifecycle on Mar 24, 2026

cau-git (Member) commented Mar 24, 2026

Summary

This PR fixes PDFium-backed resource lifecycle management to prevent leaked native handles and make cleanup deterministic across our PDF backends.

It introduces shared managed lifecycle helpers for PDFium document/page backends, updates both the pypdfium2 and docling-parse backends to use them, and explicitly closes rendered PdfBitmap instances after copying images into PIL.

Changes

  • add ManagedPdfiumDocumentBackend and ManagedPdfiumPageBackend as shared lifecycle wrappers
  • update pypdfium2 backend cleanup to explicitly close:
    • PDF documents
    • PDF pages
    • text pages
  • update docling-parse backend cleanup to explicitly close:
    • parser-backed pages/documents
    • native PDFium pages/documents
  • close temporary PdfBitmap render results after converting them to PIL images
  • simplify the managed lifecycle implementation by removing the previous live-page tracking approach
  • refresh uv.lock

Why

The previous lifecycle handling could leave native PDFium resources open longer than intended, and it relied on implicit cleanup (GC finalizers) rather than explicit close calls. This change makes ownership and teardown clearer, safer, and consistent across both PDF backends.

Notes

supersedes #3172

cau-git and others added 6 commits March 24, 2026 12:52
pypdfium2's to_pil() shares native buffer memory for RGBA/RGBX/L formats
via frombuffer(). The chained render().to_pil().resize() pattern allowed
the PdfBitmap to reach refcount 0 mid-expression, causing GC to invoke
FPDFBitmap_Destroy and free the native buffer while PIL still held a
dangling pointer to it — resulting in non-deterministic SIGSEGV crashes
in concurrent scenarios.

Fix: store the bitmap explicitly, copy the PIL image to detach it from
the shared native buffer, then close the bitmap under the lock before
proceeding with the resize on the independent copy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
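
A minimal sketch of the store, copy, close sequence described in the commit message above. This is not the code from the PR: the function name `render_detached` and the lock name `_pdfium_lock` are hypothetical, and `page` is only assumed to behave like a pypdfium2 `PdfPage` (a `render()` whose result offers `to_pil()` and `close()`).

```python
import threading

# Hypothetical module-level lock; the commit only refers to "the lock".
_pdfium_lock = threading.Lock()


def render_detached(page, scale, target_size):
    """Render a page and return a resized image that owns its pixel data.

    `page` is assumed to behave like a pypdfium2 PdfPage.
    """
    with _pdfium_lock:
        # Keep a named reference so the bitmap cannot be collected mid-expression.
        bitmap = page.render(scale=scale)
        try:
            shared = bitmap.to_pil()   # may alias the bitmap's native buffer
            detached = shared.copy()   # copy pixels into PIL-owned memory
        finally:
            bitmap.close()             # release the native buffer under the lock
    # The resize runs outside the lock, on the independent copy.
    return detached.resize(target_size)
```

The point of the ordering is that `close()` only runs after `copy()` has returned, so nothing that outlives the lock still refers to the native buffer.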
…live-page tracking

Introduces ManagedPdfiumDocumentBackend / ManagedPdfiumPageBackend base
classes that both PDF backends now inherit from. Key changes:

- Live pages are tracked in a set on the document; document unload waits
  for all pages to be released before tearing down native handles.
- Page and document unload now call explicit .close() on native PDFium
  objects under the lock, rather than just nulling Python references.
  This makes teardown deterministic rather than relying on GC finalizers
  which can fire from any thread without the lock.
- text_page is explicitly closed before _ppage to respect the PDFium
  parent/child handle hierarchy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
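
The teardown order from the last two bullets could look roughly like the sketch below. It only illustrates closing the child handle before its parent while holding a lock; the class name `PageHandles` and the lock name `_pdfium_lock` are assumptions, and the live-page tracking from the first bullet is left out.

```python
import threading

# Hypothetical lock name; stands in for the lock guarding PDFium calls.
_pdfium_lock = threading.Lock()


class PageHandles:
    """Hypothetical holder for one page's native handles."""

    def __init__(self, ppage, text_page=None):
        self._ppage = ppage
        self.text_page = text_page

    def unload(self):
        # Close explicitly under the lock instead of only dropping references,
        # so teardown does not depend on when a GC finalizer happens to run.
        with _pdfium_lock:
            if self.text_page is not None:
                self.text_page.close()   # child handle first
                self.text_page = None
            if self._ppage is not None:
                self._ppage.close()      # then the parent page
                self._ppage = None
```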
The Condition, Lock, _live_pages set, _closing flag, and owner back-ref
on pages were remnants of the Group-3b pipeline defensive shutdown that
was not included here. The pipeline always unloads page backends before
calling document.unload(), so _close_live_pages() was always a no-op
and notify_all() had zero waiters.

Reduced ManagedPdfiumDocumentBackend/ManagedPdfiumPageBackend to just
a _closed guard and the abstract _close_native_* dispatch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
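
As a rough illustration of what "a `_closed` guard and the abstract `_close_native_*` dispatch" can mean, here is a sketch of an idempotent base class. The class name `ManagedPageSketch` and the hook name `_close_native_page` are assumptions; the commit message does not spell out the actual method names or signatures.

```python
from abc import ABC, abstractmethod


class ManagedPageSketch(ABC):
    """Hypothetical base class: unload() closes native handles at most once."""

    def __init__(self):
        self._closed = False

    def unload(self):
        if self._closed:
            return                  # repeated unload is a no-op
        # Flip the flag first so a second call never touches a handle
        # that may already have been released.
        self._closed = True
        self._close_native_page()

    @abstractmethod
    def _close_native_page(self):
        """Backend-specific: close the native handles owned by this page."""
```

Each concrete backend would then supply only its own close logic, and the guard in the base class keeps a double `unload()` from closing a native handle twice.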
github-actions bot (Contributor) commented Mar 24, 2026

DCO Check Passed

Thanks @cau-git, all your commits are properly signed off. 🎉

mergify bot (Contributor) commented Mar 24, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
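
To see what the title rule accepts, the pattern can be tried locally. This sketch assumes the `~=` operator behaves like a regular-expression search, and the helper name `follows_conventional_commit` is made up for illustration.

```python
import re

# Pattern copied from the rule above.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def follows_conventional_commit(title):
    """Return True if the title starts with a conventional-commit prefix."""
    return TITLE_RULE.search(title) is not None
```

For example, this PR's title starts with `fix:` and passes, while a title such as `update lock file` would not.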

I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: b3f4e66
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 79b1894
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: b389c82
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 5e3510f

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
cau-git changed the title from "fix: manage PDFium backend resource lifecycles to avoid SIGSEGV/SIGTRAP occurences" to "fix: manage PDFium backend resource lifecycles to avoid SIGSEGV/SIGTRAP crashes" on Mar 24, 2026
cau-git marked this pull request as ready for review March 24, 2026 15:19
cau-git requested review from PeterStaar-IBM and dolfim-ibm and removed request for dolfim-ibm March 24, 2026 15:19
codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 96.51163% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling/backend/managed_pdfium_backend.py | 90.32% | 3 Missing ⚠️ |


Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
cau-git merged commit a0fc3c9 into main Mar 24, 2026
27 checks passed
cau-git deleted the cau/pdfium-managed-lifecycle branch March 24, 2026 16:59
dosubot bot mentioned this pull request Apr 21, 2026