feat(trainer): Add PyTorch Profiler integration to CustomTrainer #352
SoumyaRaikwar wants to merge 1 commit into kubeflow:main
Signed-off-by: SoumyaRaikwar <somuraik@gmail.com>
@astefanutti @kramaranya @szaher, PTAL whenever you have a chance. Thanks!
@SoumyaRaikwar thanks, that's very useful!
I wonder if we should consider adding this on top of #308 and the new TorchTrainer as CustomTrainer doesn't guarantee it's a PyTorch runtime, WDYT?
@astefanutti You are correct: CustomTrainer is not tied to any framework, so injecting [...] I will rebase this on top of #308 once it's merged and move the profiler integration to the new [...]. Happy to coordinate with @szaher on this. Let me know if there's anything else. Thanks.
What this PR does / why we need it:
This PR introduces the ability to easily profile PyTorch-based CustomTrainer jobs using the official PyTorch Profiler. This significantly improves GPU observability and performance tuning capabilities for users running AI workloads with the Kubeflow SDK.
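A minimal sketch of the kind of wrapping described here, not the code from this PR. The helper name run_with_optional_profiler is hypothetical; enable_profiler, profiler_dir and the /artifacts/profile default are taken from the description. It assumes the backend holds the user's training function as a plain callable, and it skips profiling when PyTorch is not importable, since a CustomTrainer job is not guaranteed to run on a PyTorch runtime.

```python
import importlib.util
import os
from typing import Any, Callable


def run_with_optional_profiler(
    train_func: Callable[..., Any],
    enable_profiler: bool = False,
    profiler_dir: str = "/artifacts/profile",
    **kwargs: Any,
) -> Any:
    """Hypothetical helper: call train_func, profiling it when possible."""
    # Fall back to a plain call when profiling is disabled or PyTorch is
    # not installed in the runtime.
    if not enable_profiler or importlib.util.find_spec("torch") is None:
        return train_func(**kwargs)

    import torch
    from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler

    os.makedirs(profiler_dir, exist_ok=True)

    # Always record CPU activity; add CUDA only when a GPU is visible.
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)

    # The trace handler writes a trace file into profiler_dir when the
    # profiler stops at the end of the with-block.
    with profile(
        activities=activities,
        on_trace_ready=tensorboard_trace_handler(profiler_dir),
    ):
        return train_func(**kwargs)
```

With this shape, the training function's return value and exceptions pass through unchanged, so enabling the profiler should not alter the job's behaviour beyond the profiling overhead.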
Changes include:
- New enable_profiler (bool) and profiler_dir (string) options in the CustomTrainer configuration.
- The kubernetes and localprocess backends automatically wrap the user's training function call with torch.profiler.profile when enabled.
- The default output directory is /artifacts/profile, but this can be overridden via the profiler_dir parameter.
Checklist: