Description
When using on-demand profiling via dynolog and kineto, we noticed that, for a profiling request configured with a fixed number of iterations, the last profiled iteration takes noticeably longer than the others. The training process is blocked at `optimizer.step()`, which calls `step` in kineto; ultimately, inside `performRunLoop`, the call to `libkineto::api().client()->stop()` accounts for most of that time.
By contrast, `processTraceInternal` is executed asynchronously in `performRunLoop`, so it does not block the torch training process.
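
For reference, this is roughly how the stall shows up on our side: a simple per-iteration timer around `optimizer.step()` spikes on the last profiled iteration while an on-demand trace request is active. The sketch below is a minimal stand-in, not our actual workload; the model, data, iteration count, and the `KINETO_USE_DAEMON=1` setup detail are assumptions for illustration, and the trace request itself is issued externally via dynolog while the loop runs.

```python
import time
import torch
import torch.nn as nn

# Minimal stand-in training loop; the real workload is larger, but the timing
# pattern is the same. An on-demand trace is triggered externally via dynolog
# while this loop is running (assumes the process was launched with the kineto
# daemon integration enabled, e.g. KINETO_USE_DAEMON=1).
model = nn.Linear(1024, 1024)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(64, 1024)
target = torch.randn(64, 1024)

for it in range(1000):
    t0 = time.perf_counter()

    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()
    # With an active on-demand profiling request, the last profiled iteration
    # stalls here: kineto's step()/performRunLoop ends up waiting on
    # libkineto::api().client()->stop().
    opt.step()

    print(f"iter {it}: {(time.perf_counter() - t0) * 1000:.2f} ms")
```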
I'm wondering whether there is a plan to address this performance issue so that on-demand profiling adds minimal overhead to the PyTorch training process. It would be great if a plan or proposal already exists; if not, I'd like to put one together later.