Description
We have built an on-demand profiling daemon for our large-scale cluster by integrating dynolog with kineto. The tool helps our users diagnose their training jobs, but we have run into a usability issue.
Users sometimes issue profiling requests when their training jobs are not actually running. Neither the dynolog CLI nor kineto currently handles this case gracefully, which leads to confusion and unnecessary waiting for our users.
Feature Request
We are requesting a feature that implements a process status check before attempting to profile. Specifically, the profiling tool should:
Verify if the training task's process is active.
If the process is not running or cannot be found, return a clear and specific error message or error code to the user or CLI tool.
Prevent the profiling request from proceeding, thereby saving resources and user time.
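To make the request concrete, here is a minimal sketch of the three steps above, written in Python for brevity. It is only an illustration of the idea, not a proposal for the actual dynolog or kineto code: `check_process`, `request_trace`, `ProcStatus`, and the exit codes are hypothetical names of our own, and the sketch assumes a POSIX host where the daemon already knows the PID of the training process. It relies on the standard behaviour of `kill` with signal 0, which performs the existence and permission checks without delivering a signal.

```python
import os
from enum import Enum

# Hypothetical exit codes for the CLI; not defined by dynolog or kineto.
EXIT_OK = 0
EXIT_NO_PROCESS = 3


class ProcStatus(Enum):
    RUNNING = "running"
    NOT_FOUND = "not_found"
    NO_PERMISSION = "no_permission"


def check_process(pid: int) -> ProcStatus:
    """Probe a PID with signal 0 (POSIX only): nothing is delivered,
    but the kernel still reports whether the target exists."""
    # PIDs <= 0 address process groups in kill(2), so reject them up front.
    if pid <= 0:
        return ProcStatus.NOT_FOUND
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return ProcStatus.NOT_FOUND
    except PermissionError:
        # The process exists but belongs to another user.
        return ProcStatus.NO_PERMISSION
    return ProcStatus.RUNNING


def request_trace(pid: int) -> tuple[int, str]:
    """Reject the profiling request early if the target PID is gone."""
    if check_process(pid) is ProcStatus.NOT_FOUND:
        return (EXIT_NO_PROCESS,
                f"error: no process with PID {pid}; profiling request rejected")
    # In a real daemon, the request would be forwarded to the profiler here.
    return EXIT_OK, f"PID {pid} is alive; forwarding profiling request"


if __name__ == "__main__":
    code, message = request_trace(os.getpid())
    print(code, message)
```

Note that this check only establishes that a PID exists. A zombie process still passes it, and a live process is not necessarily one that has kineto loaded and registered with the daemon, so a complete solution would probably also need a check on the daemon side.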
This check would improve the user experience and keep the profiling daemon from attempting traces that cannot succeed, making our on-demand profiling service more efficient overall.
Thank you for considering this feature. Any guidance on how to implement this check, or pointers to existing mechanisms we could leverage, would be greatly appreciated.