Avoid repetitive creation of fp4/fp8 native-custom-op domains for NvTensorRtRtx EP#27192
Avoid repetitive creation of fp4/fp8 native-custom-op domains for NvTensorRtRtx EP#27192vishalpandya1990 wants to merge 4 commits intomicrosoft:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR refactors the NvTensorRTRTX execution provider's custom ops initialization to improve efficiency and correctness. It introduces a flag-based approach to avoid redundant initialization of FP4/FP8 native custom ops and removes potentially dangerous manual deletion of statically-allocated objects.
Changes:
- Added
native_custom_ops_initializedflag to track whether native custom ops (TRT_FP4DynamicQuantize, TRT_FP8QuantizeLinear, TRT_FP8DequantizeLinear) have been created - Restructured control flow to add already-initialized native ops to the domain list without re-creating them
- Made release functions no-ops since the custom op domains are static objects managed by unique_ptr with static storage duration
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider_custom_ops.cc
Outdated
Show resolved
Hide resolved
304db7c to
3e87df1
Compare
I have synced the branch. |
|
@yuslepukhin @tianleiwu Can I get a review for this? |
|
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider_custom_ops.cc
Outdated
Show resolved
Hide resolved
onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider_custom_ops.cc
Outdated
Show resolved
Hide resolved
onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider_custom_ops.cc
Outdated
Show resolved
Hide resolved
onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider_custom_ops.cc
Show resolved
Hide resolved
3e87df1 to
88bf125
Compare
tianleiwu
left a comment
There was a problem hiding this comment.
The changes look correct and robust.
Thread Safety: The use of static std::mutex effectively protects the initialization logic.
Resource Management: Removing the manual delete corresponds correctly to the static ownership model.
Logic Correctness: The separation of native_custom_ops_initialized check ensures that native ops are returned correctly even if custom_op_domain is empty, while preventing duplicate initialization.
The PR improves stability and prevents potential memory corruption.
|
Error in React Native CI Pipeline / React Native CI Android (pull_request) doesn't look related. Raw Logs - link
|
onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider_custom_ops.cc
Outdated
Show resolved
Hide resolved
| // Callers receive raw pointers via .get(). | ||
| // 1. Manually deleting them would cause a double-free when the static unique_ptrs are destroyed at program exit. | ||
| // 2. Resetting the static unique_ptrs is also unsafe because other EP instances or InferenceSession objects | ||
| // may still hold raw pointers to these same objects (handed out via domain_list). |
There was a problem hiding this comment.
Would this indicate a different problem of someone calling to destroy objects that are in-use? Should we fix that bug?
Another question, static objects would be destroyed just prior to this DLL being unloaded. We want to make sure that the entities being destroyed do not refer to another DLL that could potentially be unloaded first.
It is for the reason people usually introduce a special API to have control of the process and to destroy things at a safe time and not to delegate it to a OS dependent specifics when shared objects are unloaded and the order of static destruction.
There was a problem hiding this comment.
Would this indicate a different problem of someone calling to destroy objects that are in-use?
Yes, this is a potential use-after-free scenario. I think it should get mitigated with current change.
We want to make sure that the entities being destroyed do not refer to another DLL
usually introduce a special API to have control of the process and to destroy things at a safe time
I see your point. Usually, we could have ref-counted concerned objects for handling this (or, make them part of EP instance, or session to avoid shared usage). However, I believe no cross-DLL memory is actually accessed during destruction today.
I think it will be better to decouple current change about avoid-repetition handling with any potential design changes on this part. Please let me know if this sounds okay to you.
There was a problem hiding this comment.
Here my take on this. There is not a firmly defined policy here on handling these objects. I think we need to make a choice here:
- Remove the Release functions and give away shared_ptrs OR
- Use the Release functions so client code can destroy the objects when it KNOWS that raw pointers are no longer in use.
Until that happens, this is going to be never-ending chasing of the tail with different OS dependent issues.
88bf125 to
3506183
Compare
Description
Motivation and Context