Memory seems to grow unbounded when reusing PdfConverter across multiple PDFs #1040

### Version
- marker-pdf 1.10.2
- surya-ocr 0.17.1
- pdftext 0.6.3
- torch 2.12.0

### Problem

Host memory (RSS) grows by roughly 5-6 GB with each PDF when a single `PdfConverter` is reused in a loop. Processing 10 documents (200-400 pages each) on CUDA reaches about 60 GB peak RSS, and the memory is never reclaimed between documents.

### Reproduction (stylized)

```python
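import gc
from pathlib import Path

import torch
from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

# Load the models once and reuse one converter for every PDF.
config = ConfigParser({"strip_existing_ocr": "True"}).generate_config_dict()
model_dict = create_model_dict()
converter = PdfConverter(artifact_dict=model_dict, config=config)

pdfs = sorted(Path("pdfs").glob("*.pdf"))  # placeholder input directory

for pdf in pdfs:  # 10 PDFs, 200-400 pages each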
    document = converter.build_document(str(pdf))
    # ... render markdown + json ...
    del document
    gc.collect()
    torch.cuda.empty_cache()
    # RSS still grows ~5-6 GB per PDF and never drops
```
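To put numbers on the per-PDF growth noted in the loop comment, RSS can be sampled after each iteration. The following is a minimal sketch and not part of marker: `current_rss_bytes` and `rss_deltas` are hypothetical helper names, and reading `/proc/self/statm` assumes Linux.

```python
import os


def current_rss_bytes():
    """Return this process's current resident set size in bytes (Linux only)."""
    with open("/proc/self/statm") as f:
        # Second field of statm is the resident size, counted in pages.
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def rss_deltas(samples):
    """Return the growth between consecutive RSS samples."""
    return [after - before for before, after in zip(samples, samples[1:])]


# Intended use inside the loop above (assumption: one sample per PDF):
#     samples = [current_rss_bytes()]
#     for pdf in pdfs:
#         ...convert, free, gc.collect()...
#         samples.append(current_rss_bytes())
#     print([d / 2**30 for d in rss_deltas(samples)])  # GiB gained per PDF
```

Sampling after `gc.collect()` and `torch.cuda.empty_cache()` separates memory that is still held from garbage that simply has not been collected yet.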

### Environment
- SLURM cluster, NVIDIA H200 GPUs
- CUDA, Linux

### What we tried
- `del document; gc.collect(); torch.cuda.empty_cache()` after each PDF -- no effect
- Freeing all result objects (markdown, json_data) after saving to disk -- no effect

### Evidence

SLURM `sacct` MaxRSS for 10 PDFs in one task:

```
JobID             State    Elapsed     MaxRSS
------------ ---------- ---------- ----------
509368_0      COMPLETED   00:55:58
509368_0.ba+  COMPLETED   00:55:58  60006800K
```
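As a sanity check on the figure above, the `MaxRSS` value can be converted to bytes and averaged over the 10 PDFs. This sketch assumes the `K` suffix in `sacct` output means 1024 bytes; `maxrss_to_bytes` is a hypothetical helper, not a SLURM or marker function.

```python
# Assumption: sacct size suffixes are powers of 1024.
UNIT_FACTORS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def maxrss_to_bytes(field):
    """Convert an sacct-style size such as '60006800K' to bytes."""
    suffix = field[-1].upper()
    if suffix in UNIT_FACTORS:
        return int(float(field[:-1]) * UNIT_FACTORS[suffix])
    return int(field)  # no suffix: already in bytes


peak = maxrss_to_bytes("60006800K")
print(f"{peak / 2**30:.1f} GiB")  # 57.2 GiB
print(f"{peak / 1e9:.1f} GB")     # 61.4 GB
# Naive average over 10 PDFs; it also includes the baseline held by the models.
print(f"{peak / 10 / 2**30:.1f} GiB per PDF")  # 5.7 GiB per PDF
```

The average of roughly 5.7 GiB per PDF is in line with the 5-6 GB growth per document seen in the loop.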

### Related

I see in #487 that `maxtasksperchild=1` was added to the multiprocessing pool to address this for the CLI. For users calling `PdfConverter` directly in a loop (e.g. SLURM array jobs), there is no equivalent fix.

Is there a recommended way to reset converter state between documents, or should we recreate the converter (and reload models) for each PDF?
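For comparison, the `maxtasksperchild=1` idea can be approximated for direct callers by running each conversion in a single-use worker process, so the OS reclaims all memory when the worker exits. This is a sketch under stated assumptions, not a maintainer-recommended fix: `run_in_fresh_process` and `convert_one` are hypothetical names, and it reloads the models for every PDF, which is exactly the cost the question above is trying to avoid.

```python
import multiprocessing as mp


def run_in_fresh_process(func, arg, start_method="spawn"):
    """Run func(arg) in a single-use worker process and return its result.

    "spawn" is the default here because forking a parent that has already
    initialized CUDA is unsafe; func must be a picklable top-level function.
    """
    ctx = mp.get_context(start_method)
    with ctx.Pool(processes=1, maxtasksperchild=1) as pool:
        return pool.apply(func, (arg,))


def convert_one(pdf_path):
    """Hypothetical worker: load models, convert one PDF, return its markdown."""
    # Imports stay inside the worker so the parent never loads models or CUDA.
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict
    from marker.output import text_from_rendered

    converter = PdfConverter(artifact_dict=create_model_dict())
    rendered = converter(pdf_path)
    text, _, _images = text_from_rendered(rendered)
    return text


# Intended use (with "spawn", keep this under an `if __name__ == "__main__":` guard):
#     for pdf in pdfs:
#         markdown = run_in_fresh_process(convert_one, str(pdf))
```

Passing a small batch of paths to each worker instead of a single path would spread the model-loading cost over several PDFs while still bounding peak RSS.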
