mem: disable ONNX mem_pattern and cpu_mem_arena on inference sessions #484
Open
KRRT7 wants to merge 1 commit into Unstructured-IO:main
Conversation
Set enable_mem_pattern=False and enable_cpu_mem_arena=False on SessionOptions for both YoloX and Detectron2 ONNX sessions. These flags control pre-allocation strategies that trade memory for speed on repeated inference. With both disabled, peak memory drops ~36% (553→351 MB) on the YoloX model with negligible latency impact.
Disable enable_mem_pattern and enable_cpu_mem_arena on SessionOptions for both YoloX and Detectron2 ONNX sessions.
By default ONNX Runtime pre-allocates a memory arena and traces allocation patterns from the first Run() to reuse memory on subsequent calls. This trades higher idle memory for faster repeated inference. For our use-case (one model loaded per worker, infrequent re-inference on the same session), the arena and pattern buffer are mostly wasted: they keep ~200 MB of pre-allocated native memory alive between requests. With both disabled, idle session memory drops significantly with negligible latency impact on inference.
Benchmark
Measured with memray (memray run + memray stats --json), 3 inference iterations per configuration, on Apple M3 Max / Python 3.12. Uses the actual yolox_l0.05.onnx model with 1700x2200 input (letter page at 200 DPI).
Reproduce
bench_onnx_mem_pattern.py
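A sketch of the measurement commands, assuming memray is installed and bench_onnx_mem_pattern.py runs the inference iterations for one configuration (the capture file name is illustrative):

```shell
# Trace allocations while running the benchmark; --native also captures
# ONNX Runtime's native (C++) allocations, which dominate here.
memray run --native -o bench.bin bench_onnx_mem_pattern.py

# Summarize the capture (peak memory, allocation counts) as JSON so the
# two configurations can be diffed.
memray stats --json bench.bin
```

Run once with the flags at their defaults and once with both disabled, then compare the peak-memory figures in the two JSON summaries.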