
perf: disable ONNX Runtime CPU memory arena to reduce idle memory#481

Closed
KRRT7 wants to merge 2 commits into Unstructured-IO:main from KRRT7:disable-onnx-cpu-mem-arena

Conversation

KRRT7 commented Mar 16, 2026

Summary

Disables ONNX Runtime's CPU memory arena (`enable_cpu_mem_arena = False`) for both the YoloX and Detectron2 ONNX sessions to reduce idle memory usage.

By default, the arena pre-allocates a memory pool on the first `session.run()` and never returns it to the OS (docs). Since we run one request at a time, the pool sits idle between inference calls with no benefit. Disabling it lets the OS reclaim memory after each call.

Motivation

ONNX Runtime's default arena extend strategy (`kNextPowerOfTwo`) can aggressively over-allocate; an ORT maintainer confirmed in microsoft/onnxruntime#11627 that this "might allocate more memory than needed, which could be a waste" (see also #11118, #13504, #22271).
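To make the over-allocation concrete, here is a toy sketch (pure Python, not ONNX Runtime internals; rounding a request up to the next power of two is a simplification of the real extend strategy):

```python
# Toy illustration: a power-of-two extend strategy rounds each arena
# extension up, so a request just past a power of two wastes almost
# half the chunk it triggers.

def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n."""
    p = 1
    while p < n:
        p <<= 1
    return p

req = 3 * 1024 * 1024           # a 3 MiB tensor buffer
chunk = next_power_of_two(req)  # the arena reserves 4 MiB
waste = chunk - req             # 1 MiB (~33% of the chunk) held but unused
```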

Benchmarks

We ran benchmarks across various PDFs. Representative results from a 16-page
document using memray in isolated subprocesses:

YoloX

| Metric | Arena ON | Arena OFF | Difference |
| --- | --- | --- | --- |
| Peak RSS | 1126.0 MB | 1176.3 MB | +50.3 MB (+4.5%) |
| Final RSS (idle) | 948.4 MB | 730.4 MB | -218.0 MB (-23.0%) |
| Peak heap (high watermark) | 549.5 MB | 462.4 MB | -87.1 MB (-15.9%) |
| Total time | 79.21 s | 89.12 s | |

YoloX Quantized

| Metric | Arena ON | Arena OFF | Difference |
| --- | --- | --- | --- |
| Peak RSS | 1004.3 MB | 1061.1 MB | +56.8 MB (+5.7%) |
| Final RSS (idle) | 931.0 MB | 682.1 MB | -248.9 MB (-26.7%) |
| Peak heap (high watermark) | 397.8 MB | 310.5 MB | -87.3 MB (-21.9%) |
| Total time | 38.07 s | 38.89 s | |

Key takeaways:

  • ~87 MB less peak heap (-16 to -22%)
  • 218-249 MB less idle RSS (memory returned to the OS between calls)

A short reproducible script is included below.

Changes

  • unstructured_inference/models/yolox.py — create SessionOptions with
    enable_cpu_mem_arena = False before creating the InferenceSession
  • unstructured_inference/models/detectron2onnx.py — same change

Benchmark script

benchmarks/bench_onnx_memory_arena.py:
"""Benchmark: ONNX Runtime CPU memory arena enabled vs disabled.

Runs process_file_with_model against a real PDF with the arena toggled
on and off, measuring memory via memray in isolated subprocesses.

Usage:
    uv run --with memray python benchmarks/bench_onnx_memory_arena.py
    uv run --with memray python benchmarks/bench_onnx_memory_arena.py --pdf sample-docs/layout-parser-paper.pdf
    uv run --with memray python benchmarks/bench_onnx_memory_arena.py --model yolox_quantized
"""

from __future__ import annotations

import argparse
import json
import os
import subprocess
import tempfile
import textwrap


def _run_scenario(*, model: str, pdf_path: str, arena_enabled: bool) -> dict:
    """Run a single benchmark scenario in an isolated subprocess."""
    output_path = tempfile.mktemp(suffix=".json")
    memray_path = tempfile.mktemp(suffix=".bin")

    script = textwrap.dedent(f"""\
        import gc
        import json
        import os
        import tempfile
        import time

        import memray
        import onnxruntime

        MODEL = {model!r}
        PDF_PATH = {pdf_path!r}
        ARENA_ENABLED = {arena_enabled!r}
        OUTPUT_PATH = {output_path!r}
        MEMRAY_PATH = {memray_path!r}

        # Monkey-patch InferenceSession to control the arena setting
        _OriginalSession = onnxruntime.InferenceSession

        class _PatchedSession(_OriginalSession):
            def __init__(self, *args, **kwargs):
                sess_options = kwargs.get("sess_options") or onnxruntime.SessionOptions()
                sess_options.enable_cpu_mem_arena = ARENA_ENABLED
                kwargs["sess_options"] = sess_options
                super().__init__(*args, **kwargs)

        onnxruntime.InferenceSession = _PatchedSession

        from unstructured_inference.inference.layout import process_file_with_model

        with memray.Tracker(MEMRAY_PATH, native_traces=True):
            t0 = time.perf_counter()
            layout = process_file_with_model(PDF_PATH, model_name=MODEL)
            elapsed = time.perf_counter() - t0
            num_pages = len(layout.pages)
            num_elements = sum(len(p.elements) for p in layout.pages)
            # Force cleanup of layout data, keep session alive
            del layout
            gc.collect()

        # Read memray results
        reader = memray.FileReader(MEMRAY_PATH)
        snapshots = list(reader.get_memory_snapshots())
        peak_rss = max(s.rss for s in snapshots) if snapshots else 0
        final_rss = snapshots[-1].rss if snapshots else 0

        hwm_records = list(reader.get_high_watermark_allocation_records())
        peak_heap = sum(r.size for r in hwm_records)

        reader.close()
        os.unlink(MEMRAY_PATH)

        results = {{
            "model": MODEL,
            "pdf": os.path.basename(PDF_PATH),
            "arena_enabled": ARENA_ENABLED,
            "num_pages": num_pages,
            "num_elements": num_elements,
            "elapsed_s": round(elapsed, 2),
            "peak_rss_mb": round(peak_rss / 1024 / 1024, 1),
            "final_rss_mb": round(final_rss / 1024 / 1024, 1),
            "peak_heap_mb": round(peak_heap / 1024 / 1024, 1),
        }}

        with open(OUTPUT_PATH, "w") as f:
            json.dump(results, f)
    """)

    result = subprocess.run(
        ["uv", "run", "--with", "memray", "python", "-c", script],
        capture_output=True,
        text=True,
        cwd=os.path.dirname(os.path.dirname(os.path.abspath(__file__))),
    )

    if result.returncode != 0:
        print(f"STDERR ({model}, arena={arena_enabled}):\n{result.stderr}")
        raise RuntimeError(f"Scenario failed: {model}, arena={arena_enabled}")

    with open(output_path) as f:
        data = json.load(f)
    os.unlink(output_path)
    return data


def _fmt_mb(val: float) -> str:
    return f"{val:>8.1f} MB"


def _fmt_diff(before: float, after: float) -> str:
    diff = after - before
    pct = (diff / before * 100) if before else 0
    sign = "+" if diff >= 0 else ""
    return f"{sign}{diff:.1f} MB ({sign}{pct:.1f}%)"


def main():
    parser = argparse.ArgumentParser(description="Benchmark ONNX CPU memory arena")
    parser.add_argument("--pdf", default="sample-docs/layout-parser-paper.pdf")
    parser.add_argument("--model", default="yolox")
    args = parser.parse_args()

    pdf_path = args.pdf
    model = args.model

    print(f"PDF:   {pdf_path}")
    print(f"Model: {model}")

    results = []
    for arena_enabled in [True, False]:
        label = "arena ON" if arena_enabled else "arena OFF"
        print(f"\nRunning {label}...", flush=True)
        r = _run_scenario(model=model, pdf_path=pdf_path, arena_enabled=arena_enabled)
        results.append(r)
        print(f"  Pages: {r['num_pages']}, Elements: {r['num_elements']}, Time: {r['elapsed_s']}s")

    on = results[0]
    off = results[1]

    print(f"\n{'=' * 72}")
    print(f"RESULTS: {model} on {os.path.basename(pdf_path)} ({on['num_pages']} pages)")
    print(f"{'=' * 72}")
    print(f"{'Metric':<28} {'Arena ON':>12} {'Arena OFF':>12} {'Difference':>20}")
    print(f"{'-' * 72}")
    for label, key in [
        ("Peak RSS", "peak_rss_mb"),
        ("Final RSS (idle)", "final_rss_mb"),
        ("Peak heap (high watermark)", "peak_heap_mb"),
    ]:
        print(
            f"{label:<28} "
            f"{_fmt_mb(on[key]):>12} "
            f"{_fmt_mb(off[key]):>12} "
            f"{_fmt_diff(on[key], off[key]):>20}"
        )
    print(
        f"{'Total time':<28} "
        f"{on['elapsed_s']:>10.2f} s "
        f"{off['elapsed_s']:>10.2f} s"
    )

    rss_saved = on["final_rss_mb"] - off["final_rss_mb"]
    heap_saved = on["peak_heap_mb"] - off["peak_heap_mb"]
    if rss_saved > 0:
        print(f"\n  -> Disabling arena saves {rss_saved:.1f} MB idle RSS")
    if heap_saved > 0:
        print(f"  -> Disabling arena saves {heap_saved:.1f} MB peak heap")


if __name__ == "__main__":
    main()

KRRT7 added 2 commits March 16, 2026 16:46:

> The CPU memory arena pre-allocates a pool that persists for the session lifetime, even between inference calls. Disabling it allows the OS to reclaim memory when models are idle.
KRRT7 (Author) commented Mar 17, 2026

Arena strategy alternatives — benchmark results

Tested 5 approaches on layout-parser-paper.pdf (16 pages, ORT 1.24.1) to see if we can get idle RSS savings without the peak RSS cost:

| Approach | Peak RSS vs baseline | Idle RSS vs baseline | Throughput |
| --- | --- | --- | --- |
| enable_cpu_mem_arena = False (this PR) | +3–7% | -13 to -16% | ~same |
| kSameAsRequested strategy | ~same | ~same | ~same |
| Session disposal after doc | ~same | unreliable | ~same |
| kSameAsRequested + arena shrinkage | +12% | -0 to -7% | 30–66% slower |

Arena OFF is the only approach that consistently delivers idle-memory savings; the alternatives either have no effect or are actively worse.

KRRT7 (Author) commented Mar 19, 2026

closing in favor of #484

@KRRT7 KRRT7 closed this Mar 19, 2026
@KRRT7 KRRT7 deleted the disable-onnx-cpu-mem-arena branch March 19, 2026 07:47