fix(examples): Verify TrainJob Completion#3331

Open
andreyvelich wants to merge 2 commits into kubeflow:master from andreyvelich:fix-torch-compile

@andreyvelich andreyvelich commented Mar 13, 2026

While running some of our examples locally, I noticed that a few of them are failing.
We need to ensure that the TrainJob is Complete before the notebook finishes.
For example, we have a bug after this PR: kubeflow/sdk#269
cc @briangallagher @Fiona-Waters @abhijeet-dhumal @kramaranya
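
A minimal sketch of such a completion check, under stated assumptions: `get_status` is a hypothetical callable standing in for whatever SDK call returns the job's current status, and the status strings "Complete" and "Failed" are assumed rather than taken from this PR. It is not the code the notebooks use.

```python
import time
from typing import Callable


def wait_until_complete(
    get_status: Callable[[], str],
    timeout: float = 600.0,
    polling_interval: float = 2.0,
) -> str:
    """Poll get_status() until the job reports "Complete".

    Raises RuntimeError if the job reports "Failed" and TimeoutError if
    the deadline passes first, so a notebook cell calling this cannot
    finish silently while the job is still running or has failed.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == "Complete":  # assumed name of the terminal success status
            return status
        if status == "Failed":  # assumed name of the terminal failure status
            raise RuntimeError("TrainJob failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"TrainJob still in status {status!r} after {timeout}s"
            )
        time.sleep(polling_interval)
```

Placing a call like this in the last cell makes the notebook run fail whenever the job does, instead of reporting success while the job is broken.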

ERROR: Failed to install Python packages: datasets 'transformers[torch]' 'cloudpathlib[all]'
  ERROR: Invalid requirement: "'transformers[torch]'": Expected package name at the start of dependency specifier
      'transformers[torch]'
      ^
  ERROR: Invalid requirement: "'transformers[torch]'": Expected package name at the start of dependency specifier
      'transformers[torch]'
      ^
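
The log above shows pip receiving a requirement that still contains literal single quotes. One way this can happen, offered as an assumption and not as a diagnosis of kubeflow/sdk#269, is that a shell-quoted package string is split on whitespace without a shell to strip the quotes. The sketch below only contrasts the two splitting strategies on the string from the log:

```python
import shlex

# The package string exactly as it appears in the error message above.
packages = "datasets 'transformers[torch]' 'cloudpathlib[all]'"

# Splitting on whitespace keeps the quote characters inside each token,
# so pip would see "'transformers[torch]'" as the requirement.
naive = packages.split()

# shlex.split applies POSIX shell quoting rules and strips the quotes.
parsed = shlex.split(packages)

print(naive)
print(parsed)
```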

Also, torch.compile(model) should only be called when CUDA is available.
Otherwise, it might fail in local environments where the g++ compiler is not available.

 [rank2]:     compiler = cpp_compiler_search(search)
  [rank2]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  [rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/cpp_builder.py", line 105, in cpp_compiler_search
  [rank2]:     raise exc.InvalidCxxCompiler
  [rank2]: torch._inductor.exc.InductorError: InvalidCxxCompiler: No working C++ compiler found in torch._inductor.config.cpp.cxx: (None, 'g++')
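
A rough sketch of the guard described above. The helper name `maybe_compile` is hypothetical, and the `try`/`except` around the import is there only so the snippet runs on machines without PyTorch. The notebooks in this PR may structure the check differently.

```python
try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None


def maybe_compile(model):
    """Return torch.compile(model) when CUDA is available, else the model as-is.

    On CPU-only machines this skips torch.compile entirely, so the
    Inductor backend never goes looking for a C++ compiler such as g++.
    """
    if torch is not None and torch.cuda.is_available():
        return torch.compile(model)
    return model
```

In a training function this would replace a bare `model = torch.compile(model)` with `model = maybe_compile(model)`.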

/assign @kubeflow/kubeflow-trainer-team @Ishtiyaque-Alam @akshaychitneni @robert-bell @kubeflow/kubeflow-sdk-team @akshaychitneni @jaiakash @Krishna-kg732 @XploY04 @vsoch

Copilot AI review requested due to automatic review settings March 13, 2026 22:40

@google-oss-prow
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Pull request overview

This PR guards torch.compile(model) calls with torch.cuda.is_available() checks across three example notebooks to prevent failures in local environments where the Inductor backend's required C++ compiler (g++) is not present.

Changes:

  • Wrapped torch.compile(model) in a torch.cuda.is_available() conditional in three MNIST example notebooks.
  • Reset a hardcoded execution_count to null in local-training-mnist.ipynb.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| examples/pytorch/image-classification/mnist.ipynb | Guard torch.compile with CUDA check |
| examples/local/local-training-mnist.ipynb | Guard torch.compile with CUDA check; reset execution count |
| examples/local/local-container-mnist.ipynb | Guard torch.compile with CUDA check |

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@google-oss-prow google-oss-prow bot added size/XXL and removed size/S labels Mar 13, 2026
@andreyvelich andreyvelich changed the title from "fix(examples): Set torch.compile(model) only when CUDA is available" to "fix(examples): Verify TrainJob Completion" Mar 13, 2026
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

@Krishna-kg732 Krishna-kg732 left a comment

lgtm
Thanks @andreyvelich
