Feature Request: Release GIL during session initialization to enable parallel multi-GPU loading
Description
When initializing multiple ONNX Runtime inference sessions in parallel using Python threads (e.g., one session per GPU), the sessions initialize mostly sequentially rather than in parallel. Our benchmarks suggest the GIL is held during session initialization, preventing true parallelism.
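For concreteness, this is a minimal sketch of the pattern we are benchmarking, assuming the CUDA execution provider and an illustrative model path rather than our exact benchmark code:

```python
import onnxruntime as ort
from concurrent.futures import ThreadPoolExecutor

def init_session(device_id):
    # One session per GPU; the CUDA EP and model path are placeholders.
    providers = [("CUDAExecutionProvider", {"device_id": device_id})]
    return ort.InferenceSession("model.onnx", providers=providers)

num_gpus = 4
with ThreadPoolExecutor(max_workers=num_gpus) as executor:
    # Expected: roughly the cost of one initialization.
    # Observed: closer to num_gpus sequential initializations.
    sessions = list(executor.map(init_session, range(num_gpus)))
```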
The run() methods already release the GIL with py::gil_scoped_release, but the session initialization path does not.
Question for maintainers: Are there known technical reasons why the GIL cannot be released during session initialization? We'd like to understand if this is feasible before investing more time.
Motivation
For multi-GPU deployments with large models, session initialization time is significant. Users expect that initializing N sessions on N GPUs using N threads would take roughly the same time as initializing 1 session, but currently it takes closer to N times as long.
Benchmark Results
Test environment: 8x NVIDIA A100-SXM4-80GB, ONNX Runtime 1.23.2, Python 3.x
Phi-3-mini model (7.2GB):
| Method | Wall Time | Speedup | Efficiency |
|---|---|---|---|
| Sequential (4 GPUs) | 10.87s | baseline | - |
| Threading (4 threads) | 7.48s | 1.45x | 15.1% |
| Multiprocessing (4 processes) | 4.38s | 2.48x | 49.4% |
Threading achieves only 15% efficiency (computed as (speedup - 1)/(N - 1); expected: ~100% for an embarrassingly parallel workload). Multiprocessing works around the issue but adds complexity and memory overhead.
Code Analysis
Looking at onnxruntime/python/onnxruntime_pybind_state.cc:
GIL IS released for `run()`:

```cpp
{
  // release GIL to allow multiple python threads to invoke Run() in parallel.
  py::gil_scoped_release release;
  OrtPybindThrowIfError(sess->GetSessionHandle()->Run(...));
}
```

GIL is NOT released for session initialization:

- `py::init` (model loading)
- `initialize_session` (EP registration, graph optimization)
Why This Isn't a Trivial Fix
We recognize this isn't as simple as adding one line. The initialization code accesses Python objects before calling Load():
1. Session options access (lines 2507-2512):

```cpp
if (CheckIfUsingGlobalThreadPool() && so.value.use_per_session_threads) { ... }
```

2. Custom op domains (lines 1303-1312):
```cpp
static void RegisterCustomOpDomains(PyInferenceSession* sess, const PySessionOptions& so) {
  if (!so.custom_op_domains_.empty()) {  // Python object access
    for (size_t i = 0; i < so.custom_op_domains_.size(); ++i) {
      custom_op_domains.emplace_back(so.custom_op_domains_[i]);
    }
    ...
  }
}
```

A proper fix would likely require (sketched after this list):
- Extract the needed data from Python objects while holding the GIL
- Release the GIL
- Call `Load()` and `Initialize()` (pure C++); the GIL auto-reacquires on scope exit
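A rough sketch of the shape we have in mind, assuming `Load()` and `Initialize()` do not call back into Python; the wrapper function name and exact call sites are illustrative, not the actual binding code:

```cpp
// Illustrative sketch only - not the current onnxruntime_pybind_state.cc code.
void InitializeSessionWithoutGil(PyInferenceSession* sess, const PySessionOptions& so) {
  // 1. Everything that touches Python-owned objects happens while the GIL is
  //    still held (custom op domains, per-session thread pool flags, ...).
  RegisterCustomOpDomains(sess, so);

  {
    // 2. Drop the GIL so other Python threads can initialize their own sessions.
    py::gil_scoped_release release;

    // 3. Pure C++ from here on; nothing in this scope may touch Python objects.
    OrtPybindThrowIfError(sess->GetSessionHandle()->Load(/* model path or bytes */));
    OrtPybindThrowIfError(sess->GetSessionHandle()->Initialize());
  }  // 4. The GIL is re-acquired automatically when 'release' goes out of scope.
}
```

If Python-defined custom ops can execute code during `Initialize()`, that path would need to re-acquire the GIL explicitly (e.g. `py::gil_scoped_acquire`) rather than assume it is held.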
Additional Considerations
- Custom ops: If custom ops can have Python callbacks during initialization, those would need the GIL
- Thread safety: Internal state in `Load()` / `Initialize()` would need to be thread-safe
- Free-threaded Python: We noticed `py::mod_gil_not_used()` support is being added - does this relate?
Questions for Maintainers
- Are there known blockers that prevent releasing the GIL during session initialization?
- Would this be considered a welcome contribution if someone submitted a PR?
- Is there internal thread safety in `InferenceSession::Load()` and `Initialize()` that would support concurrent calls?
Precedent
Issue #11246 identified that run_with_iobinding() was missing gil_scoped_release, and it was fixed the same day in PR #11248. However, we understand session initialization is more complex.
Workaround
Currently, users must use multiprocessing:
```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

ctx = multiprocessing.get_context('spawn')
with ProcessPoolExecutor(max_workers=num_gpus, mp_context=ctx) as executor:
    sessions = list(executor.map(init_session, range(num_gpus)))
```

This works but adds complexity and memory overhead (a full process per GPU), and requires shared memory for data transfer.
Environment
- ONNX Runtime version: 1.23.2
- Python version: 3.x
- OS: Linux
- Hardware: Multi-GPU (tested on 8x A100)