Description
Context in facebookincubator/dynolog#286.
In pytorch, there are only two call sites of libkineto_init (https://github.com/search?q=repo%3Apytorch%2Fpytorch%20libkineto_init&type=code):
- One is used when we need to register ourselves to dynolog (KINETO_USE_DAEMON=1).
- One is for the programmable torch.profiler interface (https://github.com/pytorch/pytorch/blob/6371c25b91fb66756b5a2df67741d4e7a72a7261/torch/csrc/profiler/kineto_shim.cpp#L247).
The first one is triggered from global_kineto_init, but libkineto_init actually triggers a bunch of cupti API calls, which defeats the purpose of lazy cupti attach. As a result, we potentially affect the performance of the program until cuptiFinalize is called.
I think the right design is to decouple registration to dynolog from all the cupti initialization. At global_kineto_init, we only register to dynolog and make no cupti calls. Then, when we get a signal from dynolog, we trigger the on-demand profile and attach the cupti context, basically following the pattern of the second use case in pytorch above.
This requires:
- Some work to separate registration to dynolog from all the cupti initialization in libkineto.
- Providing an API in libkineto so that global_kineto_init in pytorch only triggers the registration-to-dynolog part.
- Changing the pytorch global_kineto_init code accordingly.
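
To make the proposed split concrete, here is a minimal, self-contained sketch of the control flow. Every name in it (`registerWithDaemon`, `attachCuptiIfNeeded`, `onDaemonProfileRequest`, `globalKinetoInit`, and the counters) is hypothetical and does not exist in libkineto or pytorch today; the cupti work is simulated with a counter rather than real CUPTI calls. It only illustrates the intended ordering: registration at init time, cupti attach deferred until dynolog asks for a profile.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-ins for illustration only; not real libkineto APIs.
namespace sketch {

std::atomic<bool> daemonRegistered{false}; // set once we register to dynolog
std::atomic<bool> cuptiAttached{false};    // set once cupti has been attached
std::atomic<int> cuptiInitCalls{0};        // counts simulated cupti init work

// Step 1: register to the daemon only. Must not touch cupti.
void registerWithDaemon() {
  daemonRegistered.store(true);
}

// Simulates the cupti initialization that libkineto_init performs eagerly
// today. The exchange() guard makes repeated calls a no-op.
void attachCuptiIfNeeded() {
  if (!cuptiAttached.exchange(true)) {
    ++cuptiInitCalls;
  }
}

// Step 2: runs when the daemon signals an on-demand profile request.
// Only at this point do we pay the cupti attach cost.
void onDaemonProfileRequest() {
  assert(daemonRegistered.load() && "must register to the daemon first");
  attachCuptiIfNeeded();
  // ... start the on-demand trace here ...
}

// What a decoupled global_kineto_init could look like: registration only.
void globalKinetoInit(bool useDaemon) {
  if (useDaemon) {
    registerWithDaemon();
  }
}

} // namespace sketch
```

With this shape, calling `sketch::globalKinetoInit(true)` leaves the simulated cupti counter at zero, and the counter only moves to one on the first `sketch::onDaemonProfileRequest()`. A real implementation would also need to decide when to detach again (for example via cuptiFinalize) after the on-demand trace completes, which this sketch leaves out.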