feat(trainer): Add PyTorch Profiler integration to CustomTrainer #352

Open
SoumyaRaikwar wants to merge 1 commit into kubeflow:main from SoumyaRaikwar:pytorch-profiler
@SoumyaRaikwar

What this PR does / why we need it:
This PR introduces the ability to easily profile PyTorch-based CustomTrainer jobs using the official PyTorch Profiler. This significantly improves GPU observability and performance tuning capabilities for users running AI workloads with the Kubeflow SDK.

Changes include:

  • Added enable_profiler (bool) and profiler_dir (string) configurations to the CustomTrainer configuration.
  • Updated get_command_using_train_func in kubernetes and localprocess backends to automatically wrap the user's training function call with torch.profiler.profile when enabled.
  • By default, traces are saved to /artifacts/profile, but this can be overridden via the profiler_dir parameter.
  • Added corresponding unit tests to verify the injected profiler code snippet structure.
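The wrapping described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: `build_train_call` is a hypothetical helper, and the real `get_command_using_train_func` may assemble the command differently. The generated snippet relies on `torch.profiler.profile` and `torch.profiler.tensorboard_trace_handler`, which are part of PyTorch's public profiler API.

```python
DEFAULT_PROFILER_DIR = "/artifacts/profile"  # default stated in the PR description


def build_train_call(
    func_name: str,
    enable_profiler: bool = False,
    profiler_dir: str = DEFAULT_PROFILER_DIR,
) -> str:
    """Return Python source that invokes the user's training function.

    Hypothetical helper for illustration; not the SDK's implementation.
    """
    call = f"{func_name}()"
    if not enable_profiler:
        # Profiling disabled: emit the plain call, unchanged.
        return call + "\n"

    # Profiling enabled: emit source that wraps the call in the profiler
    # context manager and writes traces to profiler_dir when it exits.
    lines = [
        "import torch",
        "from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler",
        "",
        "activities = [ProfilerActivity.CPU]",
        "if torch.cuda.is_available():",
        "    activities.append(ProfilerActivity.CUDA)",
        "",
        "with profile(",
        "    activities=activities,",
        f"    on_trace_ready=tensorboard_trace_handler({profiler_dir!r}),",
        "):",
        f"    {call}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_train_call("train", enable_profiler=True))
```

Generating the snippet as text keeps the helper itself free of a `torch` import, so only the training container needs PyTorch installed; a unit test can then check the snippet's structure without running it.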

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: SoumyaRaikwar <somuraik@gmail.com>
Copilot AI review requested due to automatic review settings March 4, 2026 00:01
@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Contributor

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@SoumyaRaikwar
Author

@astefanutti @kramaranya @szaher, PTAL whenever you have a chance. Thanks!

Contributor

@astefanutti astefanutti left a comment

@SoumyaRaikwar thanks, that's very useful!

I wonder if we should consider adding this on top of #308 and the new TorchTrainer as CustomTrainer doesn't guarantee it's a PyTorch runtime, WDYT?

cc @andreyvelich @szaher @kramaranya

@SoumyaRaikwar
Author

> @SoumyaRaikwar thanks, that's very useful!
>
> I wonder if we should consider adding this on top of #308 and the new TorchTrainer as CustomTrainer doesn't guarantee it's a PyTorch runtime, WDYT?
>
> cc @andreyvelich @szaher @kramaranya

@astefanutti You are correct: CustomTrainer is not tied to any framework, so injecting torch.profiler code there wouldn't be safe for non-PyTorch workloads.

I will rebase this on top of #308 once it's merged and move the profiler integration to the new TorchTrainer instead. That way, we can guarantee the runtime is PyTorch and the profiler injection is always valid.

Happy to coordinate with @szaher on this. Let me know if there's anything else, Thanks.
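
To make the concern concrete, here is a minimal sketch of the defensive guard that a framework-agnostic trainer would need if it kept the profiler option. The names `torch_available` and `maybe_profile` are hypothetical and not part of the SDK. The guard avoids an `ImportError` on non-PyTorch runtimes, but it silently skips profiling there, which is one reason tying the option to TorchTrainer looks cleaner.

```python
import contextlib
import importlib.util


def torch_available() -> bool:
    """Report whether the torch package can be imported in this runtime."""
    return importlib.util.find_spec("torch") is not None


def maybe_profile(enable_profiler: bool, profiler_dir: str = "/artifacts/profile"):
    """Return a profiler context manager, or a no-op one when profiling is not possible.

    Hypothetical helper for illustration; not the SDK's implementation.
    """
    if not enable_profiler or not torch_available():
        # No profiling requested, or no PyTorch in this runtime: do nothing.
        return contextlib.nullcontext()

    # Imported lazily so non-PyTorch runtimes never touch torch.
    from torch.profiler import profile, tensorboard_trace_handler

    return profile(on_trace_ready=tensorboard_trace_handler(profiler_dir))


def train():
    # Stand-in for the user's training function.
    return "done"


if __name__ == "__main__":
    with maybe_profile(enable_profiler=False):
        print(train())
```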
