Feature Request: Release GIL during session initialization to enable parallel multi-GPU loading
Description
When initializing multiple ONNX Runtime inference sessions in parallel using Python threads (e.g., one session per GPU), the sessions initialize mostly sequentially rather than in parallel. Our benchmarks suggest the GIL is held during session initialization, preventing true parallelism.
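For concreteness, this is a minimal sketch of the pattern we are benchmarking, assuming the CUDA execution provider and an illustrative model path rather than our exact benchmark code:

```python
import onnxruntime as ort
from concurrent.futures import ThreadPoolExecutor

def init_session(device_id):
    # One session per GPU; the CUDA EP and model path are placeholders.
    providers = [("CUDAExecutionProvider", {"device_id": device_id})]
    return ort.InferenceSession("model.onnx", providers=providers)

num_gpus = 4
with ThreadPoolExecutor(max_workers=num_gpus) as executor:
    # Expected: roughly the cost of one initialization.
    # Observed: closer to num_gpus sequential initializations.
    sessions = list(executor.map(init_session, range(num_gpus)))
```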
The run() methods already release the GIL with py::gil_scoped_release, but the session initialization path does not.
Question for maintainers: Are there known technical reasons why the GIL cannot be released during session initialization? We'd like to understand if this is feasible before investing more time.
Motivation
For multi-GPU deployments with large models, session initialization time is significant. Users expect that initializing N sessions on N GPUs using N threads would take roughly the same time as initializing 1 session, but currently it takes closer to N times as long.
Benchmark Results
Test environment: 8x NVIDIA A100-SXM4-80GB, ONNX Runtime 1.23.2, Python 3.x
Phi-3-mini model (7.2GB):
| Method | Wall Time | Speedup | Efficiency |
|---|---|---|---|
| Sequential (4 GPUs) | 10.87s | baseline | - |
| Threading (4 threads) | 7.48s | 1.45x | 15.1% |
| Multiprocessing (4 processes) | 4.38s | 2.48x | 49.4% |
Threading achieves only 15% efficiency (computed as (speedup - 1)/(N - 1); expected: ~100% for an embarrassingly parallel workload). Multiprocessing works around the issue but adds complexity and memory overhead.
Code Analysis
Looking at onnxruntime/python/onnxruntime_pybind_state.cc:
GIL IS released for `run()`:

```cpp
{
  // release GIL to allow multiple python threads to invoke Run() in parallel.
  py::gil_scoped_release release;
  OrtPybindThrowIfError(sess->GetSessionHandle()->Run(...));
}
```

GIL is NOT released for session initialization:

- `py::init` (model loading)
- `initialize_session` (EP registration, graph optimization)
Why This Isn't a Trivial Fix
We recognize this isn't as simple as adding one line. The initialization code accesses Python objects before calling Load():
1. Session options access (lines 2507-2512):

```cpp
if (CheckIfUsingGlobalThreadPool() && so.value.use_per_session_threads) { ... }
```

2. Custom op domains (lines 1303-1312):
```cpp
static void RegisterCustomOpDomains(PyInferenceSession* sess, const PySessionOptions& so) {
  if (!so.custom_op_domains_.empty()) {  // Python object access
    for (size_t i = 0; i < so.custom_op_domains_.size(); ++i) {
      custom_op_domains.emplace_back(so.custom_op_domains_[i]);
    }
    ...
  }
}
```

A proper fix would likely require (sketched after this list):
- Extract the needed data from Python objects while holding the GIL
- Release the GIL
- Call `Load()` and `Initialize()` (pure C++); the GIL auto-reacquires on scope exit
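A rough sketch of the shape we have in mind, assuming `Load()` and `Initialize()` do not call back into Python; the wrapper function name and exact call sites are illustrative, not the actual binding code:

```cpp
// Illustrative sketch only - not the current onnxruntime_pybind_state.cc code.
void InitializeSessionWithoutGil(PyInferenceSession* sess, const PySessionOptions& so) {
  // 1. Everything that touches Python-owned objects happens while the GIL is
  //    still held (custom op domains, per-session thread pool flags, ...).
  RegisterCustomOpDomains(sess, so);

  {
    // 2. Drop the GIL so other Python threads can initialize their own sessions.
    py::gil_scoped_release release;

    // 3. Pure C++ from here on; nothing in this scope may touch Python objects.
    OrtPybindThrowIfError(sess->GetSessionHandle()->Load(/* model path or bytes */));
    OrtPybindThrowIfError(sess->GetSessionHandle()->Initialize());
  }  // 4. The GIL is re-acquired automatically when 'release' goes out of scope.
}
```

If Python-defined custom ops can execute code during `Initialize()`, that path would need to re-acquire the GIL explicitly (e.g. `py::gil_scoped_acquire`) rather than assume it is held.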
Additional Considerations
- Custom ops: If custom ops can have Python callbacks during initialization, those would need the GIL
- Thread safety: Internal state in `Load()` / `Initialize()` would need to be thread-safe
- Free-threaded Python: We noticed `py::mod_gil_not_used()` support is being added - does this relate?
Questions for Maintainers
- Are there known blockers that prevent releasing the GIL during session initialization?
- Would this be considered a welcome contribution if someone submitted a PR?
- Is there internal thread safety in `InferenceSession::Load()` and `Initialize()` that would support concurrent calls?
Precedent
Issue #11246 identified that run_with_iobinding() was missing gil_scoped_release, and it was fixed the same day in PR #11248. However, we understand session initialization is more complex.
Workaround
Currently, users must use multiprocessing:
```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

ctx = multiprocessing.get_context('spawn')
with ProcessPoolExecutor(max_workers=num_gpus, mp_context=ctx) as executor:
    sessions = list(executor.map(init_session, range(num_gpus)))
```

This works but adds complexity and memory overhead (a full process per GPU), and requires shared memory for data transfer.
Environment
- ONNX Runtime version: 1.23.2
- Python version: 3.x
- OS: Linux
- Hardware: Multi-GPU (tested on 8x A100)