perf: disable ONNX Runtime CPU memory arena to reduce idle memory #481
Closed
KRRT7 wants to merge 2 commits into Unstructured-IO:main
The CPU memory arena pre-allocates a pool that persists for the session lifetime, even between inference calls. Disabling it allows the OS to reclaim memory when models are idle.
Author
Arena strategy alternatives: benchmark results
Tested 5 approaches on
Arena OFF is the only approach that consistently delivers idle savings. The alternatives either don't work or are actively worse.
Author
Closing in favor of #484.
Summary
Disables ONNX Runtime's CPU memory arena (`enable_cpu_mem_arena = False`) for both YoloX and Detectron2 ONNX sessions to reduce idle memory usage.
The arena pre-allocates a memory pool on the first `session.run()` and never returns it to the OS by default (docs). Since we run one request at a time, the pool sits idle between inference calls with no benefit. Disabling it lets the OS reclaim memory after each call.
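A minimal sketch of the setting described above, not the exact code from this PR. The helper names `make_session_options` and `create_session` are hypothetical, and pinning the session to `CPUExecutionProvider` is an assumption made for illustration.

```python
from typing import Any


def make_session_options() -> Any:
    """Return ONNX Runtime session options with the CPU memory arena disabled."""
    # Imported lazily so this module can be loaded where onnxruntime is absent.
    import onnxruntime as ort

    opts = ort.SessionOptions()
    # Without the arena, CPU allocations go through the default allocator,
    # so memory can be released after a run instead of staying pooled.
    opts.enable_cpu_mem_arena = False
    return opts


def create_session(model_path: str) -> Any:
    """Create a CPU inference session for the model at model_path (hypothetical helper)."""
    import onnxruntime as ort

    return ort.InferenceSession(
        model_path,
        sess_options=make_session_options(),
        providers=["CPUExecutionProvider"],  # assumption: CPU-only inference
    )
```

Disabling the arena can make individual allocations slower, so it is worth checking inference latency alongside idle memory before adopting this setting.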
Motivation
ONNX Runtime's default arena extend strategy (`kNextPowerOfTwo`) can aggressively over-allocate: an ORT maintainer confirmed in microsoft/onnxruntime#11627 that this "might allocate more memory than needed, which could be a waste" (#11118, #13504, #22271).
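To give a feel for the scale of the over-allocation, here is a toy model that rounds a requested size up to the next power of two. It is only an illustration of the arithmetic; ONNX Runtime's actual arena logic is more involved and is not reproduced here. The function names and the sizes used are made up for the example.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


def overallocation(requested: int) -> tuple[int, int]:
    """Return (reserved, slack) when a request is rounded up to a power of two."""
    reserved = next_power_of_two(requested)
    return reserved, reserved - requested


if __name__ == "__main__":
    # Illustrative request sizes in MiB, not measurements from this PR.
    for mib in (64, 129, 300):
        reserved, slack = overallocation(mib)
        print(f"request {mib} MiB -> reserve {reserved} MiB ({slack} MiB unused)")
```

In this model a request just above a power of two reserves almost twice what it needs: 129 MiB rounds up to 256 MiB.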
Benchmarks
We ran benchmarks across various PDFs. Representative results from a 16-page
document using memray in isolated subprocesses:
YoloX
YoloX Quantized
Key takeaways:
A short reproducible script is included below.
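Separately from the PR's own benchmark script, the idea of measuring in an isolated subprocess can be sketched with the standard library alone. This is an assumption-laden stand-in: it uses `resource.getrusage` instead of memray, works only on Unix-like systems, and reports peak RSS in platform-dependent units (KiB on Linux, bytes on macOS). The helper name `peak_rss_of_snippet` is hypothetical.

```python
import subprocess
import sys

# Appended to every snippet so the child reports its own peak RSS on the last line.
_REPORT = (
    "\nimport resource\n"
    "print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)\n"
)


def peak_rss_of_snippet(snippet: str) -> int:
    """Run snippet in a fresh Python process and return that process's peak RSS."""
    result = subprocess.run(
        [sys.executable, "-c", snippet + _REPORT],
        capture_output=True,
        text=True,
        check=True,
    )
    # The last line of stdout is the value printed by _REPORT.
    return int(result.stdout.strip().splitlines()[-1])


if __name__ == "__main__":
    baseline = peak_rss_of_snippet("pass")
    loaded = peak_rss_of_snippet("blob = b'x' * (64 * 1024 * 1024)")
    print(f"baseline={baseline} loaded={loaded} (units depend on the platform)")
```

Running each configuration in its own process keeps one measurement from inheriting memory that an earlier one left allocated, which matters when comparing arena on against arena off.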
Changes
- `unstructured_inference/models/yolox.py`: create `SessionOptions` with `enable_cpu_mem_arena = False` before creating the `InferenceSession`
- `unstructured_inference/models/detectron2onnx.py`: same change
Benchmark script
benchmarks/bench_onnx_memory_arena.py