Memory seems to grow unbounded when reusing PdfConverter across multiple PDFs #1040

@statzhero

Version

  • marker-pdf 1.10.2
  • surya-ocr 0.17.1
  • pdftext 0.6.3
  • torch 2.12.0

Problem

Host memory (RSS) grows with each PDF when a single PdfConverter is reused in a loop. Processing 10 documents (200-400 pages each) on a CUDA device reaches roughly 60 GB peak RSS, and the memory is never reclaimed between documents.

Reproduction (stylized)

import gc

import torch
from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

config = ConfigParser({"strip_existing_ocr": "True"}).generate_config_dict()
model_dict = create_model_dict()
converter = PdfConverter(artifact_dict=model_dict, config=config)

for pdf in pdfs:  # 10 PDFs, 200-400 pages each
    document = converter.build_document(str(pdf))
    # ... render markdown + json ...
    del document
    gc.collect()
    torch.cuda.empty_cache()
    # RSS still grows ~5-6 GB per PDF and never drops
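
To see the growth per document rather than only the job-level peak, a small helper like the one below can be called at the end of each loop iteration. This is a minimal sketch that assumes Linux (it reads /proc/self/statm); `current_rss_bytes` and `log_rss` are hypothetical names and not part of marker.

```python
import os


def current_rss_bytes() -> int:
    """Return this process's current resident set size in bytes (Linux only)."""
    # /proc/self/statm reports sizes in pages; the second field is resident pages.
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def log_rss(label: str) -> int:
    """Print the current RSS in GiB next to a label and return the byte count."""
    rss = current_rss_bytes()
    print(f"{label}: RSS = {rss / 2**30:.2f} GiB")
    return rss
```

Calling `log_rss(str(pdf))` right after `torch.cuda.empty_cache()` gives one reading per PDF, which makes it easier to tell steady growth from a single large document.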

Environment

  • SLURM cluster, NVIDIA H200 GPUs
  • CUDA, Linux

What we tried

  • del document; gc.collect(); torch.cuda.empty_cache() after each PDF -- no effect
  • Freeing all result objects (markdown, json_data) after saving to disk -- no effect

Evidence

SLURM sacct MaxRSS for 10 PDFs in one task:

JobID             State    Elapsed     MaxRSS
------------ ---------- ---------- ----------
509368_0      COMPLETED   00:55:58
509368_0.ba+  COMPLETED   00:55:58  60006800K
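
As a sanity check on the "60 GB" figure, the MaxRSS value can be converted to bytes. The sketch below assumes the K suffix means 1024 bytes and accepts 1000 as an alternative reading; that unit interpretation is an assumption on my part, and `maxrss_to_bytes` is a hypothetical helper.

```python
def maxrss_to_bytes(value: str, bytes_per_k: int = 1024) -> int:
    """Convert a sacct-style value such as '60006800K' to bytes."""
    if not value.endswith("K"):
        raise ValueError(f"expected a value ending in 'K', got {value!r}")
    return int(value[:-1]) * bytes_per_k


peak = maxrss_to_bytes("60006800K")
print(f"{peak / 1e9:.1f} GB, {peak / 2**30:.1f} GiB")  # prints "61.4 GB, 57.2 GiB"
```

Under either reading of K, the peak is consistent with roughly 5-6 GB of growth per PDF across 10 PDFs.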

Related

I see in #487 that maxtasksperchild=1 was added to the multiprocessing pool to address this for the CLI. For users calling PdfConverter directly in a loop (e.g. SLURM array jobs), there is no equivalent fix.
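
A possible stopgap for direct callers is to mirror that approach by hand: run the conversions in a single worker process that the pool replaces after a fixed number of PDFs, so whatever the worker accumulated is returned to the OS when it exits. The sketch below is an assumption-laden illustration, not a confirmed fix. `init_worker`, `convert_one`, and `map_in_recycled_worker` are hypothetical names, and only the marker calls already shown in the reproduction are used.

```python
import multiprocessing as mp
from typing import Any, Callable, Iterable, List, Optional

_converter = None  # set once per worker process by init_worker()


def init_worker() -> None:
    """Load models once per worker; runs again each time the pool replaces it."""
    global _converter
    # Imported here so the parent process never loads models or touches CUDA.
    from marker.config.parser import ConfigParser
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict

    config = ConfigParser({"strip_existing_ocr": "True"}).generate_config_dict()
    _converter = PdfConverter(artifact_dict=create_model_dict(), config=config)


def convert_one(pdf_path: str) -> str:
    """Convert one PDF inside the worker and return only its path."""
    document = _converter.build_document(pdf_path)
    # ... render markdown + json and write them to disk here ...
    del document
    return pdf_path


def map_in_recycled_worker(
    worker: Callable[[Any], Any],
    items: Iterable[Any],
    *,
    tasks_per_worker: int = 1,
    initializer: Optional[Callable[[], None]] = None,
    start_method: str = "spawn",
) -> List[Any]:
    """Run worker(item) for each item in one child process at a time.

    The child is replaced after tasks_per_worker items, which frees all
    memory it held, at the cost of running the initializer again.
    """
    ctx = mp.get_context(start_method)
    with ctx.Pool(
        processes=1, initializer=initializer, maxtasksperchild=tasks_per_worker
    ) as pool:
        # chunksize=1 makes each item count as one task toward maxtasksperchild.
        return list(pool.imap(worker, items, chunksize=1))


# Hypothetical usage (place under `if __name__ == "__main__":` with "spawn"):
# done = map_in_recycled_worker(convert_one, [str(p) for p in pdfs],
#                               tasks_per_worker=3, initializer=init_worker)
```

With `tasks_per_worker=1` this reloads the models for every PDF, which is the cost the question below is trying to avoid. A larger value trades a higher per-worker peak for fewer reloads.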

Is there a recommended way to reset converter state between documents, or should we recreate the converter (and reload models) for each PDF?
