
Feature Request: Release GIL during session initialization to enable parallel multi-GPU loading #27063

@tolleybot

Description

When initializing multiple ONNX Runtime inference sessions in parallel using Python threads (e.g., one session per GPU), the sessions initialize mostly sequentially rather than in parallel. Our benchmarks suggest the GIL is held during session initialization, preventing true parallelism.

The run() methods already release the GIL with py::gil_scoped_release, but the session initialization path does not.
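For concreteness, the initialization pattern in question looks roughly like the sketch below. The model path is a placeholder, and a stand-in return value replaces the real `ort.InferenceSession` call (shown in comments) so the sketch is self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def init_session(gpu_id, model_path="model.onnx"):  # placeholder path
    # The real version would be:
    #   import onnxruntime as ort
    #   return ort.InferenceSession(
    #       model_path,
    #       providers=[("CUDAExecutionProvider", {"device_id": gpu_id})])
    # Stand-in result so this sketch runs without onnxruntime installed:
    return {"device_id": gpu_id, "model": model_path}

# One thread per GPU; today these constructor calls serialize on the GIL.
with ThreadPoolExecutor(max_workers=4) as pool:
    sessions = list(pool.map(init_session, range(4)))
```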

Question for maintainers: Are there known technical reasons why the GIL cannot be released during session initialization? We'd like to understand if this is feasible before investing more time.

Motivation

For multi-GPU deployments with large models, session initialization time is significant. Users expect that initializing N sessions on N GPUs using N threads would take roughly the same time as initializing 1 session, but currently it takes closer to N times as long.

Benchmark Results

Test environment: 8x NVIDIA A100-SXM4-80GB, ONNX Runtime 1.23.2, Python 3.x

Phi-3-mini model (7.2GB):

| Method | Wall time | Speedup | Efficiency |
| --- | --- | --- | --- |
| Sequential (4 GPUs) | 10.87 s | baseline | - |
| Threading (4 threads) | 7.48 s | 1.45x | 15.1% |
| Multiprocessing (4 processes) | 4.38 s | 2.48x | 49.4% |

Threading achieves only ~15% efficiency, versus the ~100% expected for an embarrassingly parallel workload. Multiprocessing works around the issue but adds complexity and memory overhead.
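For reference, the efficiency column appears to be the incremental-speedup metric (speedup - 1) / (N - 1), i.e. the fraction of the ideal extra speedup from the additional workers that was actually realized:

```python
def parallel_efficiency(t_seq: float, t_par: float, n_workers: int) -> float:
    """Fraction of the ideal extra speedup realized: (speedup - 1) / (N - 1)."""
    speedup = t_seq / t_par
    return (speedup - 1) / (n_workers - 1)

print(f"threading:       {parallel_efficiency(10.87, 7.48, 4):.1%}")   # 15.1%
print(f"multiprocessing: {parallel_efficiency(10.87, 4.38, 4):.1%}")   # 49.4%
```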

Code Analysis

Looking at onnxruntime/python/onnxruntime_pybind_state.cc:

GIL IS released for run():

{
  // release GIL to allow multiple python threads to invoke Run() in parallel.
  py::gil_scoped_release release;
  OrtPybindThrowIfError(sess->GetSessionHandle()->Run(...));
}

GIL is NOT released for session initialization: no corresponding py::gil_scoped_release appears on the constructor path that calls Load() and Initialize().

Why This Isn't a Trivial Fix

We recognize this isn't as simple as adding one line. The initialization code accesses Python objects before calling Load():

1. Session options access (lines 2507-2512):

if (CheckIfUsingGlobalThreadPool() && so.value.use_per_session_threads) { ... }

2. Custom op domains (lines 1303-1312):

static void RegisterCustomOpDomains(PyInferenceSession* sess, const PySessionOptions& so) {
  if (!so.custom_op_domains_.empty()) {  // Python object access
    for (size_t i = 0; i < so.custom_op_domains_.size(); ++i) {
      custom_op_domains.emplace_back(so.custom_op_domains_[i]);
    }
    ...
  }
}

A proper fix would likely require:

  1. Extract needed data from Python objects while holding the GIL
  2. Release GIL
  3. Call Load() and Initialize() (pure C++)
  4. GIL auto-reacquires on scope exit
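In Python terms, the restructuring amounts to the pattern below, using a `threading.Lock` as a stand-in for the GIL and a dict copy as a stand-in for extracting data from the pybind11 objects (all names here are hypothetical, not ONNX Runtime internals):

```python
import threading

gil = threading.Lock()  # stand-in for the interpreter lock

def initialize_session(py_options: dict) -> dict:
    # 1. Snapshot everything Load()/Initialize() need while holding the lock.
    with gil:
        snapshot = dict(py_options)
    # 2.-3. Heavy work proceeds without the lock, so sibling threads
    # initializing sessions for other GPUs are free to run concurrently.
    session_state = {"initialized": True, **snapshot}
    # 4. In the pybind11 version the GIL re-acquires automatically when
    #    the py::gil_scoped_release object goes out of scope.
    return session_state
```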

Additional Considerations

  • Custom ops: If custom ops can invoke Python callbacks during initialization, those callbacks would need the GIL
  • Thread safety: Internal state touched by Load()/Initialize() would need to be safe under concurrent calls
  • Free-threaded Python: We noticed py::mod_gil_not_used() support is being added; does this relate?

Questions for Maintainers

  1. Are there known blockers that prevent releasing the GIL during session initialization?
  2. Would this be considered a welcome contribution if someone submitted a PR?
  3. Is there internal thread-safety in InferenceSession::Load() and Initialize() that would support concurrent calls?

Precedent

Issue #11246 identified that run_with_iobinding() was missing gil_scoped_release, and it was fixed the same day in PR #11248. However, we understand session initialization is more complex.

Workaround

Currently, users must use multiprocessing:

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

# init_session must be a module-level (picklable) function that builds the
# per-GPU session; 'spawn' workers re-import it by name in each child process.
ctx = multiprocessing.get_context('spawn')
with ProcessPoolExecutor(max_workers=num_gpus, mp_context=ctx) as executor:
    sessions = list(executor.map(init_session, range(num_gpus)))

This works but adds complexity, memory overhead (full process per GPU), and requires shared memory for data transfer.

Environment

  • ONNX Runtime version: 1.23.2
  • Python version: 3.x
  • OS: Linux
  • Hardware: Multi-GPU (tested on 8x A100)
