fix: clear highres images and reset surya KV cache to prevent memory leak in PdfConverter #1044

Open
aashishkumar-tech wants to merge 1 commit into datalab-to:master from aashishkumar-tech:fix/memory-leak-pdf-converter
@aashishkumar-tech

Problem

When reusing PdfConverter across multiple PDFs in a loop, memory grows
unboundedly (~5-6 GB per PDF) and is never reclaimed, as reported in #1040.

Root Cause

Two issues were found:

  1. The highres_image attributes (192 DPI PIL images) stored on every page
    object were never cleared after all processors completed
  2. Surya FoundationPredictor.kv_cache was never reset between documents,
    causing torch attention buffers to accumulate

Fix

In marker/converters/pdf.py, after all processors complete:

  • Set page.highres_image = None for all pages
  • Set model.foundation_predictor.kv_cache = None for all surya models
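
A minimal sketch of the cleanup step described above. It is not the code from this PR: `release_conversion_state` is a hypothetical helper name, and the pages and models are `SimpleNamespace` stand-ins so the snippet runs without marker or surya installed. Only the attribute names `highres_image`, `foundation_predictor`, and `kv_cache` come from the description.

```python
from types import SimpleNamespace


def release_conversion_state(pages, models):
    """Drop per-document state so a reused converter does not keep it alive.

    Hypothetical helper: the name and signature are assumptions, not marker's API.
    """
    for page in pages:
        # Dropping the reference lets the image be garbage-collected.
        page.highres_image = None
    for model in models:
        # Not every model is assumed to carry a foundation predictor.
        predictor = getattr(model, "foundation_predictor", None)
        if predictor is not None:
            predictor.kv_cache = None


# Stand-in objects (assumptions for illustration only).
pages = [SimpleNamespace(highres_image=bytearray(1024)) for _ in range(3)]
models = [
    SimpleNamespace(foundation_predictor=SimpleNamespace(kv_cache=[0.0] * 16)),
    SimpleNamespace(),  # no foundation predictor: skipped by the helper
]

release_conversion_state(pages, models)
```

The `getattr` guard is a defensive choice in this sketch; whether every surya model exposes `foundation_predictor` is not stated in the description.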

Testing

Added two regression tests in tests/converters/test_memory_leak.py:

  • test_highres_images_freed_after_conversion — asserts highres images are None after build
  • test_memory_stable_across_multiple_pdfs — asserts <50MB growth across 3 PDFs

Both tests pass ✅
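
For illustration, a growth check in the spirit of test_memory_stable_across_multiple_pdfs could look roughly like the following. This is a sketch under assumptions: the actual test file is not shown here, `FakeConverter` and `growth_bytes` are hypothetical names, and the measurement uses the standard-library `tracemalloc` module (which only sees Python-level allocations) rather than whatever the real test uses.

```python
import gc
import tracemalloc


class FakeConverter:
    """Hypothetical stand-in for a reusable converter (not marker's PdfConverter)."""

    def __init__(self, cleanup):
        self.cleanup = cleanup
        self.retained = []  # simulates state that survives between documents

    def __call__(self, n_pages=4):
        # Each "page image" is a 256 KiB buffer.
        pages = [bytearray(256 * 1024) for _ in range(n_pages)]
        if not self.cleanup:
            # Leaky behaviour: keep references after the conversion finishes.
            self.retained.append(pages)
        return n_pages


def growth_bytes(converter, runs=3):
    """Return traced-memory growth across `runs` conversions after one warm-up."""
    gc.collect()
    tracemalloc.start()
    converter()  # warm-up so one-time allocations are not counted as growth
    gc.collect()
    baseline, _ = tracemalloc.get_traced_memory()
    for _ in range(runs):
        converter()
    gc.collect()
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current - baseline


leaky_growth = growth_bytes(FakeConverter(cleanup=False))
fixed_growth = growth_bytes(FakeConverter(cleanup=True))
print(f"leaky: {leaky_growth} bytes, fixed: {fixed_growth} bytes")
```

A check against real models would more likely need process-level (RSS) or GPU memory measurements, since torch buffers are not visible to `tracemalloc`.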

Closes #1040

…fConverter

When reusing PdfConverter across multiple PDFs, memory grew unboundedly
(~5-6 GB per PDF) due to two issues:
1. highres_image held on every page object after processing completes
2. Surya foundation model KV cache never reset between documents

Fixes: datalab-to#1040
@github-actions

github-actions Bot commented Jun 6, 2026

Contributor

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

@aashishkumar-tech

Author

CLA Assistant Lite bot: Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.

I have read the CLA Document and I hereby sign the CLA

You can retrigger this bot by commenting recheck in this Pull Request

I have read the CLA document and I hereby sign the CLA

@aashishkumar-tech

Author

I have read the CLA Document and I hereby sign the CLA

github-actions Bot added a commit that referenced this pull request Jun 6, 2026
@aashishkumar-tech

Author

recheck


Development

Successfully merging this pull request may close these issues.

Memory seems to grow unbounded when reusing PdfConverter across multiple PDFs

1 participant