fix: clear highres images and reset surya KV cache to prevent memory leak in PdfConverter by aashishkumar-tech · Pull Request #1044 · datalab-to/marker

aashishkumar-tech · 2026-06-06T00:54:54Z

Problem

When reusing PdfConverter across multiple PDFs in a loop, memory grows
unboundedly (~5-6 GB per PDF) and is never reclaimed, as reported in #1040.

Root Cause

Two issues found:

1. highres_image (192 DPI PIL images) was stored on every page object and never cleared after all processors completed.
2. Surya's FoundationPredictor.kv_cache was never reset between documents, so torch attention buffers accumulated.

Fix

In marker/converters/pdf.py, after all processors complete:

1. Set page.highres_image = None for all pages
2. Reset model.foundation_predictor.kv_cache = None for all surya models
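The cleanup described above can be sketched roughly as follows. This is a minimal illustration and not the actual diff. The helper name release_document_memory, the model dictionary keys, and the SimpleNamespace stand-ins are all hypothetical, used so the snippet runs without marker or surya installed. Only the attribute names highres_image, foundation_predictor, and kv_cache come from the description.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Mapping


def release_document_memory(pages: Iterable[Any], models: Mapping[str, Any]) -> None:
    """Drop per-document state that would otherwise outlive a conversion.

    Hypothetical helper: the real change lives in marker/converters/pdf.py.
    """
    for page in pages:
        # The high-resolution render is only needed while processors run.
        page.highres_image = None

    for model in models.values():
        predictor = getattr(model, "foundation_predictor", None)
        if predictor is not None and hasattr(predictor, "kv_cache"):
            # Dropping the reference lets the cached buffers be collected.
            predictor.kv_cache = None


# Stand-in objects (assumptions, not marker's real classes).
pages = [SimpleNamespace(highres_image=object()) for _ in range(3)]
models = {
    "recognition_model": SimpleNamespace(
        foundation_predictor=SimpleNamespace(kv_cache=object())
    ),
    "detection_model": SimpleNamespace(),  # a model without a foundation predictor
}

release_document_memory(pages, models)
```

Guarding with getattr and hasattr keeps the helper safe for models that have no foundation predictor. Whether the actual patch guards the same way is not stated in the description.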

Testing

Added two regression tests in tests/converters/test_memory_leak.py:

test_highres_images_freed_after_conversion — asserts highres images are None after build
test_memory_stable_across_multiple_pdfs — asserts <50MB growth across 3 PDFs
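For readers who want to see the shape of a memory-stability check, here is a minimal sketch under stated assumptions. The description does not say how the regression test measures memory, so tracemalloc is an assumption made here, and FakeConverter is a hypothetical stand-in for PdfConverter. Note that tracemalloc only sees allocations made through Python's allocators, so it would not capture GPU memory held by torch.

```python
import gc
import tracemalloc


class FakeConverter:
    """Hypothetical stand-in for PdfConverter; optionally retains a buffer per call."""

    def __init__(self, leak: bool) -> None:
        self.leak = leak
        self._retained = []

    def __call__(self, path: str) -> int:
        buf = bytearray(1_000_000)  # simulates per-document working memory
        if self.leak:
            self._retained.append(buf)  # simulates state that is never cleared
        return len(buf)


def memory_growth_bytes(converter, paths) -> int:
    """Return the growth in traced Python memory after converting every path."""
    gc.collect()
    tracemalloc.start()
    try:
        baseline, _ = tracemalloc.get_traced_memory()
        for path in paths:
            converter(path)
            gc.collect()
        current, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return current - baseline


paths = ["a.pdf", "b.pdf", "c.pdf"]  # placeholder names; no files are opened
stable_growth = memory_growth_bytes(FakeConverter(leak=False), paths)
leaky_growth = memory_growth_bytes(FakeConverter(leak=True), paths)
```

A regression test in this style would assert that the growth for the fixed converter stays below a chosen threshold, such as the 50 MB bound mentioned above.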

Both tests pass ✅

Closes #1040

…fConverter

When reusing PdfConverter across multiple PDFs, memory grew unboundedly (~5-6 GB per PDF) due to two issues:

1. highres_image held on every page object after processing completes
2. Surya foundation model KV cache never reset between documents

Fixes: datalab-to#1040

github-actions · 2026-06-06T00:55:03Z

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

aashishkumar-tech · 2026-06-06T00:57:34Z

CLA Assistant Lite bot: Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.

I have read the CLA Document and I hereby sign the CLA

You can retrigger this bot by commenting recheck in this Pull Request

I have read the CLA document and I hereby sign the CLA

aashishkumar-tech · 2026-06-06T00:59:23Z

I have read the CLA Document and I hereby sign the CLA

aashishkumar-tech · 2026-06-06T00:59:49Z

recheck

github-actions Bot added a commit that referenced this pull request Jun 6, 2026

@aashishkumar-tech has signed the CLA in #1044

4d9e7bd

aashishkumar-tech mentioned this pull request Jun 6, 2026

Memory seems to grow unbounded when reusing PdfConverter across multiple PDFs #1040

Open

aashishkumar-tech wants to merge 1 commit into datalab-to:master from aashishkumar-tech:fix/memory-leak-pdf-converter
