Skip to content

Commit 879ec03

Browse files
authored
Add enable_profiling in runoptions (#26846)
### Description Support run-level profiling This PR adds support for profiling individual Run executions, similar to session-level profiling. Developers can enable run-level profiling by setting `enable_profiling` and `profile_file_prefix` in RunOptions. Once the run completes, a JSON profiling file will be saved using profile_file_prefix + timestamp. <img width="514" height="281" alt="png (2)" src="https://github.com/user-attachments/assets/8a997068-71d9-49ed-8a5c-00e0fa8853af" /> ### Key Changes 1. Introduced a local variable `run_profiler` in `InferenceSession::Run`, which is destroyed after the run completes. Using a dedicated profiler per run ensures that profiling data is isolated and prevents interleaving or corruption across runs. 2. To maintain accurate execution time when both session-level and run-level profiling are enabled, overloaded `Start` and `EndTimeAndRecordEvent` functions have been added. These allow the caller to provide timestamps instead of relying on `std::chrono::high_resolution_clock::now()`, avoiding potential timing inaccuracies. 3. Added a TLS variable `tls_run_profiler_` to support run-level profiling with WebGPU Execution Provider (EP). This ensures that when multiple threads enable run-level profiling, each thread logs only to its own WebGPU profiler, keeping thread-specific data isolated. 4. Use `HH:MM:SS.mm` instead of `HH:MM:SS`in the JSON filename to prevent conflicts when profiling multiple consecutive runs. ### Motivation and Context Previously, profiling only for session level. Sometimes developer want to profile for specfic run . so the PR comes. ### Some details When profiling is enabled via RunOptions, it should ideally collect two types of events: 1. Profiler events Used to calculate the CPU execution time of each operator. 2. Execution Provider (EP) profiler events Used to measure GPU kernel execution time. Unlike session-level, we need to ensure the collecting events is correct for multiple thread scenario. For 1, this can be supported easily(sequential_executor.cc). We use a thread-local storage (TLS) variable, RunLevelState (defined in profiler.h), to maintain run-level profiling state for each thread. For 2, each Execution Provider (EP) has its own profiler implementation, and each EP must ensure correct behavior under run-level profiling. This PR ensures that the WebGPU profiler works correctly with run-level profiling. # Test Cases | Scenario | Example | Expected Result | |---------|---------|-----------------| | Concurrent runs on the same session with different run-level profiling settings| t1: `sess1.Run({ enable_profiling: true })`<br>t2: `sess1.Run({ enable_profiling: false })`<br>t3: `sess1.Run({ enable_profiling: true })` | Two trace JSON files are generated: one for `t1` and one for `t3`. | | Run-level profiling enabled together with session-level profiling| `sess1 = OrtSession({ enable_profiling: true })`<br>`sess1.Run({ enable_profiling: true })` | Two trace JSON files are generated: one corresponding to session-level profiling and one corresponding to run-level profiling. |
1 parent a5ba2ba commit 879ec03

24 files changed

+331
-88
lines changed

include/onnxruntime/core/common/profiler_common.h

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -4,9 +4,9 @@
44
#pragma once
55

66
#include "core/common/common.h"
7+
#include "core/common/inlined_containers.h"
78

89
#include <string>
9-
#include <unordered_map>
1010

1111
namespace onnxruntime {
1212
namespace profiling {
@@ -35,7 +35,7 @@ struct EventRecord {
3535
std::string&& event_name,
3636
long long time_stamp,
3737
long long duration,
38-
std::unordered_map<std::string, std::string>&& event_args)
38+
InlinedHashMap<std::string, std::string>&& event_args)
3939
: cat(category),
4040
pid(process_id),
4141
tid(thread_id),
@@ -50,7 +50,7 @@ struct EventRecord {
5050
const std::string& event_name,
5151
long long time_stamp,
5252
long long duration,
53-
const std::unordered_map<std::string, std::string>& event_args)
53+
const InlinedHashMap<std::string, std::string>& event_args)
5454
: cat(category),
5555
pid(process_id),
5656
tid(thread_id),
@@ -70,7 +70,7 @@ struct EventRecord {
7070
std::string name{};
7171
long long ts = 0;
7272
long long dur = 0;
73-
std::unordered_map<std::string, std::string> args{};
73+
InlinedHashMap<std::string, std::string> args{};
7474
};
7575

7676
using Events = std::vector<EventRecord>;

include/onnxruntime/core/framework/run_options.h

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,14 @@ struct OrtRunOptions {
3535
// So it is possible that only some of the nodes are executed.
3636
bool only_execute_path_to_fetches = false;
3737

38+
// Set to 'true' to enable profiling for this run.
39+
bool enable_profiling = false;
40+
41+
// File prefix for profiling result for this run.
42+
// The actual filename will be: <profile_file_prefix>_<timestamp>.json
43+
// Only used when enable_profiling is true.
44+
std::basic_string<ORTCHAR_T> profile_file_prefix = ORT_TSTR("onnxruntime_run_profile");
45+
3846
#ifdef ENABLE_TRAINING
3947
// Used by onnxruntime::training::TrainingSession. This class is now deprecated.
4048
// Delete training_mode when TrainingSession is deleted.

include/onnxruntime/core/session/onnxruntime_c_api.h

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7195,6 +7195,28 @@ struct OrtApi {
71957195
* \since 1.24
71967196
*/
71977197
ORT_API_T(void, RunOptionsSetSyncStream, _Inout_ OrtRunOptions* options, _In_ OrtSyncStream* sync_stream);
7198+
7199+
/** \brief Enable profiling for this run
7200+
*
7201+
* \param[in] options
7202+
* \param[in] profile_file_prefix The prefix for the profile file. The actual filename will be:
7203+
* <profile_file_prefix>_<timestamp>.json
7204+
*
7205+
* \snippet{doc} snippets.dox OrtStatus Return Value
7206+
*
7207+
* \since Version 1.25.
7208+
*/
7209+
ORT_API2_STATUS(RunOptionsEnableProfiling, _Inout_ OrtRunOptions* options, _In_ const ORTCHAR_T* profile_file_prefix);
7210+
7211+
/** \brief Disable profiling for this run
7212+
*
7213+
* \param[in] options
7214+
*
7215+
* \snippet{doc} snippets.dox OrtStatus Return Value
7216+
*
7217+
* \since Version 1.25.
7218+
*/
7219+
ORT_API2_STATUS(RunOptionsDisableProfiling, _Inout_ OrtRunOptions* options);
71987220
};
71997221

72007222
/*

include/onnxruntime/core/session/onnxruntime_cxx_api.h

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1382,6 +1382,19 @@ struct RunOptions : detail::Base<OrtRunOptions> {
13821382
* \param stream The OrtSyncStream to associate with these run options. May be nullptr to clear.
13831383
*/
13841384
RunOptions& SetSyncStream(OrtSyncStream* stream);
1385+
1386+
/** \brief Enable profiling for this run
1387+
*
1388+
* Wraps OrtApi::RunOptionsEnableProfiling
1389+
* \param profile_file_prefix The prefix for the profile file
1390+
*/
1391+
RunOptions& EnableProfiling(const ORTCHAR_T* profile_file_prefix);
1392+
1393+
/** \brief Disable profiling for this run
1394+
*
1395+
* Wraps OrtApi::RunOptionsDisableProfiling
1396+
*/
1397+
RunOptions& DisableProfiling();
13851398
};
13861399

13871400
namespace detail {

include/onnxruntime/core/session/onnxruntime_cxx_inline.h

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1077,6 +1077,16 @@ inline RunOptions& RunOptions::SetSyncStream(OrtSyncStream* stream) {
10771077
return *this;
10781078
}
10791079

1080+
inline RunOptions& RunOptions::EnableProfiling(const ORTCHAR_T* profile_file_prefix) {
1081+
ThrowOnError(GetApi().RunOptionsEnableProfiling(p_, profile_file_prefix));
1082+
return *this;
1083+
}
1084+
1085+
inline RunOptions& RunOptions::DisableProfiling() {
1086+
ThrowOnError(GetApi().RunOptionsDisableProfiling(p_));
1087+
return *this;
1088+
}
1089+
10801090
inline ModelCompilationOptions::ModelCompilationOptions(const Env& env, const SessionOptions& session_options) {
10811091
ThrowOnError(GetCompileApi().CreateModelCompilationOptionsFromSessionOptions(env, session_options, &this->p_));
10821092
}

onnxruntime/core/common/profiler.cc

Lines changed: 7 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -71,16 +71,17 @@ template void Profiler::StartProfiling<char>(const std::basic_string<char>& file
7171
template void Profiler::StartProfiling<wchar_t>(const std::basic_string<wchar_t>& file_name);
7272
#endif
7373

74-
void Profiler::EndTimeAndRecordEvent(EventCategory category,
75-
const std::string& event_name,
76-
const TimePoint& start_time,
77-
const std::initializer_list<std::pair<std::string, std::string>>& event_args,
78-
bool /*sync_gpu*/) {
74+
void Profiler::EndTimeAndRecordEvent(
75+
EventCategory category,
76+
const std::string& event_name,
77+
const TimePoint& start_time,
78+
InlinedHashMap<std::string, std::string> event_args,
79+
bool /*sync_gpu*/) {
7980
long long dur = TimeDiffMicroSeconds(start_time);
8081
long long ts = TimeDiffMicroSeconds(profiling_start_time_, start_time);
8182

8283
EventRecord event(category, logging::GetProcessId(),
83-
logging::GetThreadId(), event_name, ts, dur, {event_args.begin(), event_args.end()});
84+
logging::GetThreadId(), event_name, ts, dur, std::move(event_args));
8485
if (profile_with_logger_) {
8586
custom_logger_->SendProfileEvent(event);
8687
} else {

onnxruntime/core/common/profiler.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -77,7 +77,7 @@ class Profiler {
7777
void EndTimeAndRecordEvent(EventCategory category,
7878
const std::string& event_name,
7979
const TimePoint& start_time,
80-
const std::initializer_list<std::pair<std::string, std::string>>& event_args = {},
80+
InlinedHashMap<std::string, std::string> event_args = {},
8181
bool sync_gpu = false);
8282

8383
/*

onnxruntime/core/framework/run_options.cc

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -84,3 +84,16 @@ ORT_API_STATUS_IMPL(OrtApis::RunOptionsAddActiveLoraAdapter, _Inout_ OrtRunOptio
8484
return nullptr;
8585
API_IMPL_END
8686
}
87+
88+
ORT_API_STATUS_IMPL(OrtApis::RunOptionsEnableProfiling, _Inout_ OrtRunOptions* options,
89+
_In_ const ORTCHAR_T* profile_file_prefix) {
90+
options->enable_profiling = true;
91+
options->profile_file_prefix = profile_file_prefix;
92+
return nullptr;
93+
}
94+
95+
ORT_API_STATUS_IMPL(OrtApis::RunOptionsDisableProfiling, _Inout_ OrtRunOptions* options) {
96+
options->enable_profiling = false;
97+
options->profile_file_prefix.clear();
98+
return nullptr;
99+
}

onnxruntime/core/framework/sequential_executor.cc

Lines changed: 76 additions & 36 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@
88
#include <vector>
99
#include <sstream>
1010
#include "core/common/common.h"
11+
#include "core/common/inlined_containers.h"
1112
#include "core/common/logging/logging.h"
1213
#include "core/framework/allocation_planner.h"
1314
#include "core/framework/execution_frame.h"
@@ -155,8 +156,8 @@ std::string ComposeSeriesName(const GraphViewer& graph_viewer) {
155156
class SessionScope {
156157
public:
157158
friend class KernelScope;
158-
SessionScope(const SessionState& session_state, const ExecutionFrame& frame)
159-
: session_state_(session_state)
159+
SessionScope(const SessionState& session_state, const ExecutionFrame& frame, profiling::Profiler* run_profiler)
160+
: session_state_(session_state), run_profiler_(run_profiler)
160161
#if !defined(ORT_MINIMAL_BUILD) && defined(ORT_MEMORY_PROFILE)
161162
,
162163
frame_(frame)
@@ -173,13 +174,10 @@ class SessionScope {
173174
#endif
174175
#ifdef DEBUG_NODE_INPUTS_OUTPUTS
175176
,
176-
dump_context_{
177-
session_state_.GetGraphExecutionCounter(), 0}
177+
dump_context_{session_state_.GetGraphExecutionCounter(), 0}
178178
#endif
179179
{
180-
if (session_state_.Profiler().IsEnabled()) {
181-
session_start_ = session_state.Profiler().Start();
182-
}
180+
session_start_ = StartProfilingIfEnabled();
183181

184182
auto& logger = session_state_.Logger();
185183
VLOGS(logger, 0) << "Begin execution";
@@ -225,9 +223,8 @@ class SessionScope {
225223
}
226224
#endif
227225

228-
if (session_state_.Profiler().IsEnabled()) {
229-
session_state_.Profiler().EndTimeAndRecordEvent(profiling::SESSION_EVENT, "SequentialExecutor::Execute", session_start_);
230-
}
226+
StopProfilingIfEnabled(profiling::SESSION_EVENT, "SequentialExecutor::Execute", session_start_);
227+
231228
#if !defined(ORT_MINIMAL_BUILD) && defined(ORT_MEMORY_PROFILE)
232229
auto& logger = session_state_.Logger();
233230
for (auto i : frame_.GetStaticMemorySizeInfo()) {
@@ -252,8 +249,45 @@ class SessionScope {
252249
}
253250
#endif
254251

252+
bool IsRunProfilingEnabled() const {
253+
return run_profiler_ && run_profiler_->IsEnabled();
254+
}
255+
256+
void StopProfilingIfEnabled(profiling::EventCategory category,
257+
const std::string& event_name,
258+
const TimePoint& start_time,
259+
InlinedHashMap<std::string, std::string> event_args = {}) {
260+
const bool session_profiling_enabled = session_state_.Profiler().IsEnabled();
261+
const bool run_profiling_enabled = IsRunProfilingEnabled();
262+
263+
if (session_profiling_enabled) {
264+
session_state_.Profiler().EndTimeAndRecordEvent(category,
265+
event_name,
266+
start_time,
267+
std::move(event_args));
268+
} else if (run_profiling_enabled) {
269+
run_profiler_->EndTimeAndRecordEvent(category,
270+
event_name,
271+
start_time,
272+
std::move(event_args));
273+
}
274+
}
275+
276+
TimePoint StartProfilingIfEnabled() {
277+
const bool session_profiling_enabled = session_state_.Profiler().IsEnabled();
278+
const bool run_profiling_enabled = IsRunProfilingEnabled();
279+
280+
if (session_profiling_enabled) {
281+
return session_state_.Profiler().Start();
282+
} else if (run_profiling_enabled) {
283+
return run_profiler_->Start();
284+
}
285+
return TimePoint{};
286+
}
287+
255288
private:
256289
const SessionState& session_state_;
290+
profiling::Profiler* run_profiler_;
257291
TimePoint session_start_;
258292
#if !defined(ORT_MINIMAL_BUILD) && defined(ORT_MEMORY_PROFILE)
259293
const ExecutionFrame& frame_;
@@ -340,16 +374,17 @@ class KernelScope {
340374
utils::DumpNodeInputs(dump_context_, kernel_context_, kernel_.Node(), session_state_, session_scope_.dump_analysis_);
341375
#endif
342376

343-
#ifdef ENABLE_NVTX_PROFILE
344-
node_compute_range_.Begin();
345-
#endif
377+
const bool session_profiling_enabled = session_state_.Profiler().IsEnabled();
378+
const bool run_profiling_enabled = session_scope_.IsRunProfilingEnabled();
346379

347-
if (session_state_.Profiler().IsEnabled()) {
380+
if (session_profiling_enabled || run_profiling_enabled) {
348381
auto& node = kernel.Node();
349382
node_name_ = node.Name().empty() ? MakeString(node.OpType(), "_", node.Index()) : node.Name();
350383
concurrency::ThreadPool::StartProfiling(session_state_.GetThreadPool());
351384
VLOGS(session_state_.Logger(), 1) << "Computing kernel: " << node_name_;
352-
kernel_begin_time_ = session_state_.Profiler().Start();
385+
386+
kernel_begin_time_ = session_scope_.StartProfilingIfEnabled();
387+
353388
CalculateTotalInputSizes(&kernel_context, &kernel_,
354389
input_activation_sizes_, input_parameter_sizes_,
355390
node_name_, input_type_shape_);
@@ -363,26 +398,30 @@ class KernelScope {
363398
node_compute_range_.End();
364399
#endif
365400

366-
if (session_state_.Profiler().IsEnabled()) {
367-
auto& profiler = session_state_.Profiler();
401+
const bool session_profiling_enabled = session_state_.Profiler().IsEnabled();
402+
const bool run_profiling_enabled = session_scope_.IsRunProfilingEnabled();
403+
404+
if (session_profiling_enabled || run_profiling_enabled) {
368405
std::string output_type_shape_;
369406
CalculateTotalOutputSizes(&kernel_context_, total_output_sizes_, node_name_, output_type_shape_);
370-
profiler.EndTimeAndRecordEvent(profiling::NODE_EVENT,
371-
node_name_ + "_kernel_time",
372-
kernel_begin_time_,
373-
// Log additional operation args / info.
374-
{
375-
{"op_name", kernel_.KernelDef().OpName()},
376-
{"provider", kernel_.KernelDef().Provider()},
377-
{"node_index", std::to_string(kernel_.Node().Index())},
378-
{"activation_size", std::to_string(input_activation_sizes_)},
379-
{"parameter_size", std::to_string(input_parameter_sizes_)},
380-
{"output_size", std::to_string(total_output_sizes_)},
381-
{"input_type_shape", input_type_shape_},
382-
{"output_type_shape", output_type_shape_},
383-
{"thread_scheduling_stats",
384-
concurrency::ThreadPool::StopProfiling(session_state_.GetThreadPool())},
385-
});
407+
408+
InlinedHashMap<std::string, std::string> event_args = {
409+
{"op_name", kernel_.KernelDef().OpName()},
410+
{"provider", kernel_.KernelDef().Provider()},
411+
{"node_index", std::to_string(kernel_.Node().Index())},
412+
{"activation_size", std::to_string(input_activation_sizes_)},
413+
{"parameter_size", std::to_string(input_parameter_sizes_)},
414+
{"output_size", std::to_string(total_output_sizes_)},
415+
{"input_type_shape", input_type_shape_},
416+
{"output_type_shape", output_type_shape_},
417+
{"thread_scheduling_stats",
418+
concurrency::ThreadPool::StopProfiling(session_state_.GetThreadPool())},
419+
};
420+
421+
session_scope_.StopProfilingIfEnabled(profiling::NODE_EVENT,
422+
node_name_ + "_kernel_time",
423+
kernel_begin_time_,
424+
std::move(event_args));
386425
}
387426

388427
#ifdef ONNXRUNTIME_ENABLE_INSTRUMENT
@@ -588,7 +627,8 @@ onnxruntime::Status ExecuteThePlan(const SessionState& session_state, gsl::span<
588627
#endif
589628
const bool& terminate_flag,
590629
const bool only_execute_path_to_fetches,
591-
bool single_thread_mode) {
630+
bool single_thread_mode,
631+
profiling::Profiler* run_profiler) {
592632
auto* execution_plan = session_state.GetExecutionPlan();
593633
VLOGS(logger, 0) << "Number of streams: " << execution_plan->execution_plan.size();
594634
int32_t valid_streams = 0;
@@ -631,7 +671,7 @@ onnxruntime::Status ExecuteThePlan(const SessionState& session_state, gsl::span<
631671
ORT_UNUSED_PARAMETER(only_execute_path_to_fetches);
632672
#endif
633673

634-
SessionScope session_scope(session_state, ctx.GetExecutionFrame());
674+
SessionScope session_scope(session_state, ctx.GetExecutionFrame(), run_profiler);
635675

636676
auto* tp = single_thread_mode ? nullptr : session_state.GetInterOpThreadPool();
637677

@@ -692,7 +732,7 @@ onnxruntime::Status PartialExecuteThePlan(const SessionState& session_state, gsl
692732

693733
ctx.SetCurrentRange(&state.GetProgramRegions(session_state));
694734

695-
SessionScope session_scope(session_state, ctx.GetExecutionFrame());
735+
SessionScope session_scope(session_state, ctx.GetExecutionFrame(), nullptr);
696736

697737
#if !defined(ORT_MINIMAL_BUILD) && defined(ORT_MEMORY_PROFILE)
698738
// Only flush memory info for the 2nd partial graph execution (since ORTModule runs this function twice).

onnxruntime/core/framework/sequential_executor.h

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -46,7 +46,8 @@ onnxruntime::Status ExecuteThePlan(const SessionState& session_state, gsl::span<
4646
#endif
4747
const bool& terminate_flag,
4848
const bool only_execute_path_to_fetches,
49-
bool single_thread_mode);
49+
bool single_thread_mode,
50+
profiling::Profiler* run_profiler = nullptr);
5051

5152
#ifdef ENABLE_TRAINING
5253
onnxruntime::Status PartialExecuteThePlan(const SessionState& session_state, gsl::span<const int> feed_mlvalue_idxs,

0 commit comments

Comments
 (0)