
perf: disable ONNX Runtime CPU memory arena to reduce idle memory#481

Closed
KRRT7 wants to merge 2 commits into Unstructured-IO:main from KRRT7:disable-onnx-cpu-mem-arena

Conversation

KRRT7 commented Mar 16, 2026

Summary

Disables ONNX Runtime's CPU memory arena (`enable_cpu_mem_arena = False`) for both the YoloX and Detectron2 ONNX sessions to reduce idle memory usage.

By default, the arena pre-allocates a memory pool on the first `session.run()` and never returns it to the OS (docs). Since we run one request at a time, the pool sits idle between inference calls with no benefit. Disabling it lets the OS reclaim memory after each call.

Motivation

ONNX Runtime's default arena extend strategy (`kNextPowerOfTwo`) can aggressively over-allocate; an ORT maintainer confirmed in microsoft/onnxruntime#11627 that this "might allocate more memory than needed, which could be a waste" (see also #11118, #13504, #22271).
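To make the over-allocation concrete, here is a toy sketch (pure Python, not ONNX Runtime internals; rounding a request up to the next power of two is a simplification of the real extend strategy):

```python
# Toy illustration: a power-of-two extend strategy rounds each arena
# extension up, so a request just past a power of two wastes almost
# half the chunk it triggers.

def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n."""
    p = 1
    while p < n:
        p <<= 1
    return p

req = 3 * 1024 * 1024           # a 3 MiB tensor buffer
chunk = next_power_of_two(req)  # the arena reserves 4 MiB
waste = chunk - req             # 1 MiB (~33% of the chunk) held but unused
```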

Benchmarks

We ran benchmarks across various PDFs. Representative results from a 16-page
document using memray in isolated subprocesses:

YoloX

| Metric | Arena ON | Arena OFF | Difference |
| --- | --- | --- | --- |
| Peak RSS | 1126.0 MB | 1176.3 MB | +50.3 MB (+4.5%) |
| Final RSS (idle) | 948.4 MB | 730.4 MB | -218.0 MB (-23.0%) |
| Peak heap (high watermark) | 549.5 MB | 462.4 MB | -87.1 MB (-15.9%) |
| Total time | 79.21 s | 89.12 s | |

YoloX Quantized

| Metric | Arena ON | Arena OFF | Difference |
| --- | --- | --- | --- |
| Peak RSS | 1004.3 MB | 1061.1 MB | +56.8 MB (+5.7%) |
| Final RSS (idle) | 931.0 MB | 682.1 MB | -248.9 MB (-26.7%) |
| Peak heap (high watermark) | 397.8 MB | 310.5 MB | -87.3 MB (-21.9%) |
| Total time | 38.07 s | 38.89 s | |

Key takeaways:

  • ~87 MB less peak heap (-16 to -22%)
  • 218-249 MB less idle RSS (memory returned to the OS between calls)

A short reproducible script is included below.

Changes

  • unstructured_inference/models/yolox.py — create SessionOptions with
    enable_cpu_mem_arena = False before creating the InferenceSession
  • unstructured_inference/models/detectron2onnx.py — same change

Benchmark script

benchmarks/bench_onnx_memory_arena.py:
"""Benchmark: ONNX Runtime CPU memory arena enabled vs disabled.

Runs process_file_with_model against a real PDF with the arena toggled
on and off, measuring memory via memray in isolated subprocesses.

Usage:
    uv run --with memray python benchmarks/bench_onnx_memory_arena.py
    uv run --with memray python benchmarks/bench_onnx_memory_arena.py --pdf sample-docs/layout-parser-paper.pdf
    uv run --with memray python benchmarks/bench_onnx_memory_arena.py --model yolox_quantized
"""

from __future__ import annotations

import argparse
import json
import os
import subprocess
import tempfile
import textwrap


def _run_scenario(*, model: str, pdf_path: str, arena_enabled: bool) -> dict:
    """Run a single benchmark scenario in an isolated subprocess."""
    output_path = tempfile.mktemp(suffix=".json")
    memray_path = tempfile.mktemp(suffix=".bin")

    script = textwrap.dedent(f"""\
        import gc
        import json
        import os
        import tempfile
        import time

        import memray
        import onnxruntime

        MODEL = {model!r}
        PDF_PATH = {pdf_path!r}
        ARENA_ENABLED = {arena_enabled!r}
        OUTPUT_PATH = {output_path!r}
        MEMRAY_PATH = {memray_path!r}

        # Monkey-patch InferenceSession to control the arena setting
        _OriginalSession = onnxruntime.InferenceSession

        class _PatchedSession(_OriginalSession):
            def __init__(self, *args, **kwargs):
                sess_options = kwargs.get("sess_options") or onnxruntime.SessionOptions()
                sess_options.enable_cpu_mem_arena = ARENA_ENABLED
                kwargs["sess_options"] = sess_options
                super().__init__(*args, **kwargs)

        onnxruntime.InferenceSession = _PatchedSession

        from unstructured_inference.inference.layout import process_file_with_model

        with memray.Tracker(MEMRAY_PATH, native_traces=True):
            t0 = time.perf_counter()
            layout = process_file_with_model(PDF_PATH, model_name=MODEL)
            elapsed = time.perf_counter() - t0
            num_pages = len(layout.pages)
            num_elements = sum(len(p.elements) for p in layout.pages)
            # Force cleanup of layout data, keep session alive
            del layout
            gc.collect()

        # Read memray results
        reader = memray.FileReader(MEMRAY_PATH)
        snapshots = list(reader.get_memory_snapshots())
        peak_rss = max(s.rss for s in snapshots) if snapshots else 0
        final_rss = snapshots[-1].rss if snapshots else 0

        hwm_records = list(reader.get_high_watermark_allocation_records())
        peak_heap = sum(r.size for r in hwm_records)

        reader.close()
        os.unlink(MEMRAY_PATH)

        results = {{
            "model": MODEL,
            "pdf": os.path.basename(PDF_PATH),
            "arena_enabled": ARENA_ENABLED,
            "num_pages": num_pages,
            "num_elements": num_elements,
            "elapsed_s": round(elapsed, 2),
            "peak_rss_mb": round(peak_rss / 1024 / 1024, 1),
            "final_rss_mb": round(final_rss / 1024 / 1024, 1),
            "peak_heap_mb": round(peak_heap / 1024 / 1024, 1),
        }}

        with open(OUTPUT_PATH, "w") as f:
            json.dump(results, f)
    """)

    result = subprocess.run(
        ["uv", "run", "--with", "memray", "python", "-c", script],
        capture_output=True,
        text=True,
        cwd=os.path.dirname(os.path.dirname(os.path.abspath(__file__))),
    )

    if result.returncode != 0:
        print(f"STDERR ({model}, arena={arena_enabled}):\n{result.stderr}")
        raise RuntimeError(f"Scenario failed: {model}, arena={arena_enabled}")

    with open(output_path) as f:
        data = json.load(f)
    os.unlink(output_path)
    return data


def _fmt_mb(val: float) -> str:
    return f"{val:>8.1f} MB"


def _fmt_diff(before: float, after: float) -> str:
    diff = after - before
    pct = (diff / before * 100) if before else 0
    sign = "+" if diff >= 0 else ""
    return f"{sign}{diff:.1f} MB ({sign}{pct:.1f}%)"


def main():
    parser = argparse.ArgumentParser(description="Benchmark ONNX CPU memory arena")
    parser.add_argument("--pdf", default="sample-docs/layout-parser-paper.pdf")
    parser.add_argument("--model", default="yolox")
    args = parser.parse_args()

    pdf_path = args.pdf
    model = args.model

    print(f"PDF:   {pdf_path}")
    print(f"Model: {model}")

    results = []
    for arena_enabled in [True, False]:
        label = "arena ON" if arena_enabled else "arena OFF"
        print(f"\nRunning {label}...", flush=True)
        r = _run_scenario(model=model, pdf_path=pdf_path, arena_enabled=arena_enabled)
        results.append(r)
        print(f"  Pages: {r['num_pages']}, Elements: {r['num_elements']}, Time: {r['elapsed_s']}s")

    on = results[0]
    off = results[1]

    print(f"\n{'=' * 72}")
    print(f"RESULTS: {model} on {os.path.basename(pdf_path)} ({on['num_pages']} pages)")
    print(f"{'=' * 72}")
    print(f"{'Metric':<28} {'Arena ON':>12} {'Arena OFF':>12} {'Difference':>20}")
    print(f"{'-' * 72}")
    for label, key in [
        ("Peak RSS", "peak_rss_mb"),
        ("Final RSS (idle)", "final_rss_mb"),
        ("Peak heap (high watermark)", "peak_heap_mb"),
    ]:
        print(
            f"{label:<28} "
            f"{_fmt_mb(on[key]):>12} "
            f"{_fmt_mb(off[key]):>12} "
            f"{_fmt_diff(on[key], off[key]):>20}"
        )
    print(
        f"{'Total time':<28} "
        f"{on['elapsed_s']:>10.2f} s "
        f"{off['elapsed_s']:>10.2f} s"
    )

    rss_saved = on["final_rss_mb"] - off["final_rss_mb"]
    heap_saved = on["peak_heap_mb"] - off["peak_heap_mb"]
    if rss_saved > 0:
        print(f"\n  -> Disabling arena saves {rss_saved:.1f} MB idle RSS")
    if heap_saved > 0:
        print(f"  -> Disabling arena saves {heap_saved:.1f} MB peak heap")


if __name__ == "__main__":
    main()

KRRT7 added 2 commits March 16, 2026 16:46:

> The CPU memory arena pre-allocates a pool that persists for the session lifetime, even between inference calls. Disabling it allows the OS to reclaim memory when models are idle.
KRRT7 (Author) commented Mar 17, 2026

Arena strategy alternatives — benchmark results

Tested 5 approaches on layout-parser-paper.pdf (16 pages, ORT 1.24.1) to see if we can get idle RSS savings without the peak RSS cost:

| Approach | Peak RSS vs baseline | Idle RSS vs baseline | Throughput |
| --- | --- | --- | --- |
| enable_cpu_mem_arena = False (this PR) | +3–7% | -13 to -16% | ~same |
| kSameAsRequested strategy | ~same | ~same | ~same |
| Session disposal after doc | ~same | unreliable | ~same |
| kSameAsRequested + arena shrinkage | +12% | -0 to -7% | 30–66% slower |

Arena OFF is the only approach that consistently delivers idle-memory savings; the alternatives either have no effect or are actively worse.

KRRT7 (Author) commented Mar 19, 2026

closing in favor of #484

@KRRT7 KRRT7 closed this Mar 19, 2026
@KRRT7 KRRT7 deleted the disable-onnx-cpu-mem-arena branch March 19, 2026 07:47