fix(examples): Verify TrainJob Completion #3331
andreyvelich wants to merge 2 commits into kubeflow:master
[APPROVALNOTIFIER] This PR is NOT APPROVED.
Pull request overview
This PR guards torch.compile(model) calls with torch.cuda.is_available() checks across three example notebooks to prevent failures in local environments where the Inductor backend's required C++ compiler (g++) is not present.
Changes:
- Wrapped `torch.compile(model)` in a `torch.cuda.is_available()` conditional in three MNIST example notebooks.
- Reset a hardcoded `execution_count` to `null` in `local-training-mnist.ipynb`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `examples/pytorch/image-classification/mnist.ipynb` | Guard `torch.compile` with CUDA check |
| `examples/local/local-training-mnist.ipynb` | Guard `torch.compile` with CUDA check; reset execution count |
| `examples/local/local-container-mnist.ipynb` | Guard `torch.compile` with CUDA check |
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
42b76bd to d2c0664
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Krishna-kg732 left a comment:
lgtm
Thanks @andreyvelich
While running some of our examples locally, I noticed that a few of them are failing.
We need to ensure that the TrainJob is Complete before the notebook finishes.
For example, we have a bug after this PR: kubeflow/sdk#269
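As a rough illustration of what "verify completion" could look like in a notebook cell, here is a minimal polling sketch. It is not the code from this PR: `wait_until_complete` and the `get_status` callable are hypothetical names, and the `"Failed"` status string is an assumption (only `Complete` is named above). The Kubeflow SDK may already provide a built-in wait helper that is preferable to hand-rolled polling.

```python
import time


def wait_until_complete(get_status, timeout=600.0, interval=2.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns "Complete".

    get_status is a hypothetical zero-argument callable that returns the
    current TrainJob status as a string. "Failed" is an assumed terminal
    status; adjust it to whatever the client actually reports.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status == "Complete":
            return status
        if status == "Failed":
            raise RuntimeError("TrainJob finished with status Failed")
        if clock() >= deadline:
            raise TimeoutError(f"TrainJob still {status!r} after {timeout}s")
        sleep(interval)


# Hypothetical usage at the end of a notebook, so the run fails loudly
# if the job never completes:
#   wait_until_complete(lambda: client.get_job(job_name).status)
```

Raising on failure or timeout, rather than returning a flag, means an automated notebook run stops with an error instead of silently finishing while the job is still pending.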
cc @briangallagher @Fiona-Waters @abhijeet-dhumal @kramaranya
Also, `torch.compile(model)` should only be called when CUDA is available. Otherwise, it might fail in local environments where the g++ compiler is not available.
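The guard described here can be sketched as a small helper. This is an illustrative version, not the exact notebook change: `maybe_compile` is a hypothetical name, and the `cuda_available` and `compile_fn` parameters are added only so the logic can be exercised without a GPU or a PyTorch install. In the notebooks the equivalent is simply an `if torch.cuda.is_available():` around the `torch.compile(model)` call.

```python
try:
    import torch
except ImportError:  # lets the sketch be read and tested without PyTorch
    torch = None


def maybe_compile(model, cuda_available=None, compile_fn=None):
    """Return a compiled model only when CUDA is available.

    cuda_available and compile_fn are hypothetical overrides for testing;
    by default they resolve to torch.cuda.is_available() and torch.compile.
    """
    if cuda_available is None:
        cuda_available = torch is not None and torch.cuda.is_available()
    if not cuda_available:
        # Skip compilation on CPU-only machines, where the default
        # Inductor backend may need a C++ compiler such as g++.
        return model
    if compile_fn is None:
        compile_fn = torch.compile
    return compile_fn(model)


# Hypothetical usage inside a training function:
#   model = maybe_compile(model)
```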
/assign @kubeflow/kubeflow-trainer-team @Ishtiyaque-Alam @akshaychitneni @robert-bell @kubeflow/kubeflow-sdk-team @akshaychitneni @jaiakash @Krishna-kg732 @XploY04 @vsoch