feat(trainer): Add PyTorch Profiler integration to CustomTrainer by SoumyaRaikwar · Pull Request #352 · kubeflow/sdk

SoumyaRaikwar · 2026-03-04T00:01:39Z

What this PR does / why we need it:
This PR adds opt-in profiling of PyTorch-based CustomTrainer jobs using the official PyTorch Profiler (torch.profiler). The goal is to improve GPU observability and make performance tuning easier for users running AI workloads with the Kubeflow SDK.

Changes include:

Added enable_profiler (bool) and profiler_dir (string) options to the CustomTrainer configuration.
Updated get_command_using_train_func in the kubernetes and localprocess backends to automatically wrap the user's training function call with torch.profiler.profile when enable_profiler is set.
Traces are saved to /artifacts/profile by default; this can be overridden via the profiler_dir parameter.
Added corresponding unit tests to verify the injected profiler code snippet structure.
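
For illustration, the wrapping described above could look roughly like the sketch below. This is not the code from the PR: the helper name wrap_call_with_profiler and the exact shape of the generated snippet are assumptions, and only enable_profiler, profiler_dir, and the /artifacts/profile default come from the description. The sketch only builds and syntax-checks the snippet as text, so it runs without PyTorch installed.

```python
import textwrap

# Default trace directory as stated in the PR description.
DEFAULT_PROFILER_DIR = "/artifacts/profile"


def wrap_call_with_profiler(
    func_call: str,
    enable_profiler: bool = False,
    profiler_dir: str = DEFAULT_PROFILER_DIR,
) -> str:
    """Return Python source that runs func_call, optionally under torch.profiler.

    Hypothetical helper: the real change lives in get_command_using_train_func.
    """
    if not enable_profiler:
        return func_call + "\n"

    lines = [
        "import torch",
        "from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler",
        "",
        "activities = [ProfilerActivity.CPU]",
        "if torch.cuda.is_available():",
        "    activities.append(ProfilerActivity.CUDA)",
        "",
        "with profile(",
        "    activities=activities,",
        f"    on_trace_ready=tensorboard_trace_handler({profiler_dir!r}),",
        "):",
        # The user's training call becomes the body of the with-block.
        textwrap.indent(func_call, "    "),
    ]
    return "\n".join(lines) + "\n"


snippet = wrap_call_with_profiler("train_func()", enable_profiler=True)
compile(snippet, "<generated>", "exec")  # syntax check only; nothing is executed
print(snippet)
```

Generating the snippet as text mirrors how the training function is embedded into the job command, and it keeps the profiler import out of the SDK process itself.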

Checklist:

Docs included if any changes are user facing

Signed-off-by: SoumyaRaikwar <somuraik@gmail.com>

google-oss-prow · 2026-03-04T00:01:45Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2026-03-04T00:01:49Z

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Slack: Join our #kubeflow-ml-experience and #kubeflow-trainer Slack channels
Meetings: Attend the Kubeflow SDK and ML Experience bi-weekly meetings

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

SoumyaRaikwar · 2026-03-05T21:27:33Z

@astefanutti @kramaranya @szaher, PTAL whenever you have a chance. Thanks!

astefanutti

@SoumyaRaikwar thanks, that's very useful!

I wonder if we should consider adding this on top of #308 and the new TorchTrainer, since CustomTrainer doesn't guarantee a PyTorch runtime. WDYT?

cc @andreyvelich @szaher @kramaranya

SoumyaRaikwar · 2026-03-06T11:34:59Z

@astefanutti You are correct: CustomTrainer is not tied to any framework, so injecting torch.profiler code there wouldn't be safe for non-PyTorch workloads.

I will rebase this on top of #308 once it's merged and move the profiler integration to the new TorchTrainer instead. That way, we can guarantee the runtime is PyTorch and the profiler injection is always valid.
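
To illustrate why the runtime matters, here is a minimal sketch of a defensive check, assuming a hypothetical helper named is_importable. It is not part of this PR; it only shows that a framework-agnostic trainer cannot assume torch is present in the environment where the injected code will run.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# In a non-PyTorch runtime this is False, and injected torch.profiler code
# would fail with ImportError before the training function even starts.
if is_importable("torch"):
    print("torch found: profiler injection would be valid here")
else:
    print("torch not found: profiler injection would break this job")
```

Note that a check like this only describes the environment where it runs; the SDK client and the training container may differ, which is another reason to tie the feature to a trainer that guarantees PyTorch.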

Happy to coordinate with @szaher on this. Let me know if there's anything else. Thanks.

Added PyTorch Profiler integration to CustomTrainer

82d9822

Signed-off-by: SoumyaRaikwar <somuraik@gmail.com>

Copilot AI review requested due to automatic review settings March 4, 2026 00:01

google-oss-prow bot requested review from Electronic-Waste, kramaranya and szaher March 4, 2026 00:01

google-oss-prow bot added the size/M label Mar 4, 2026

Copilot started reviewing on behalf of SoumyaRaikwar March 4, 2026 00:02

Copilot AI reviewed Mar 4, 2026

SoumyaRaikwar mentioned this pull request Mar 4, 2026

Enhancing GPU Visibility for AI Workloads created with Kubeflow SDK #165

Open

astefanutti reviewed Mar 6, 2026

SoumyaRaikwar requested a review from astefanutti March 6, 2026 11:35

SoumyaRaikwar mentioned this pull request Mar 9, 2026

chore: Trainer: Specialized Trainers #308

Open

1 task

SoumyaRaikwar wants to merge 1 commit into kubeflow:main from SoumyaRaikwar:pytorch-profiler
