Version
- marker-pdf 1.10.2
- surya-ocr 0.17.1
- pdftext 0.6.3
- torch 2.12.0
Problem
Host RSS grows with each PDF when a single PdfConverter is reused in a loop. Processing 10 documents (200-400 pages each) reaches roughly 60 GB peak RSS with the models running on CUDA. The memory is never reclaimed between documents.
Reproduction (stylized)
import gc

import torch
from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

config = ConfigParser({"strip_existing_ocr": "True"}).generate_config_dict()
model_dict = create_model_dict()
converter = PdfConverter(artifact_dict=model_dict, config=config)

for pdf in pdfs:  # 10 PDFs, 200-400 pages each
    document = converter.build_document(str(pdf))
    # ... render markdown + json ...
    del document
    gc.collect()
    torch.cuda.empty_cache()
    # RSS still grows ~5-6 GB per PDF and never drops
Environment
- SLURM cluster, NVIDIA H200 GPUs
- CUDA, Linux
What we tried
- del document; gc.collect(); torch.cuda.empty_cache() after each PDF -- no effect
- Freeing all result objects (markdown, json_data) after saving to disk -- no effect
Evidence
SLURM sacct MaxRSS for 10 PDFs in one task:
JobID        State      Elapsed    MaxRSS
------------ ---------- ---------- ----------
509368_0     COMPLETED  00:55:58
509368_0.ba+ COMPLETED  00:55:58   60006800K
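To tie the sacct number to individual documents, the host RSS can be sampled from inside the loop. The following is a minimal sketch, assuming Linux (it reads /proc/self/statm and treats ru_maxrss as KiB); the helper name rss_mib is hypothetical and not part of marker.

```python
import os
import resource


def rss_mib():
    """Return (current, peak) host RSS of this process in MiB. Linux only."""
    # Second field of /proc/self/statm is the resident set size in pages.
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    current = resident_pages * os.sysconf("SC_PAGE_SIZE") / 2**20
    # On Linux, ru_maxrss is reported in KiB.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    return current, peak


current, peak = rss_mib()
print(f"RSS now {current:.0f} MiB, peak {peak:.0f} MiB")
```

Calling this after each document's cleanup would show whether the current value ever drops. Note that torch.cuda.empty_cache() only releases cached GPU memory, so it is not expected to change host RSS either way; torch.cuda.memory_allocated() can be logged alongside to compare device and host usage.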
Related
I see in #487 that maxtasksperchild=1 was added to the multiprocessing pool to address this for the CLI. For users calling PdfConverter directly in a loop (e.g. SLURM array jobs), there is no equivalent fix.
Is there a recommended way to reset converter state between documents, or should we recreate the converter (and reload models) for each PDF?
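One workaround in the spirit of maxtasksperchild=1 is to run each PDF in a fresh interpreter, so the OS reclaims everything when that process exits. This is a minimal sketch under stated assumptions, not a maintainer-recommended fix: WORKER and convert_isolated are hypothetical names, and the worker body here is a placeholder for code that would import marker, build the PdfConverter, convert sys.argv[1], and write the outputs to disk.

```python
import subprocess
import sys

# Hypothetical worker source. A real worker would load the models,
# convert the PDF given as sys.argv[1], and save markdown + json.
WORKER = "import sys; print('converted', sys.argv[1])"


def convert_isolated(pdf_path, worker_source=WORKER):
    """Run the worker in a new Python process and return its stdout."""
    done = subprocess.run(
        [sys.executable, "-c", worker_source, str(pdf_path)],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the worker fails
    )
    return done.stdout.strip()


for pdf in ["a.pdf", "b.pdf"]:
    print(convert_isolated(pdf))
```

The cost is exactly the one asked about above: the models are reloaded for every process. Passing a small batch of PDFs to each worker instead of one would amortize the load time while still bounding RSS growth.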