Commit 56cbe60

saileshd1402, Bobbins228, seanlaii, varshaprasad96, andreyvelich authored
Added test for create-pytorchjob.ipynb python notebook (#2274)
* Added test for create-pytorchjob.ipynb Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * fix yaml syntax Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Fix uses path Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Add actions/checkout Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Add bash to action.yaml Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Install pip dependencies step Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Add quotes for args Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Add jupyter Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Add nbformat_minor: 5 to fix invalid format error Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Fix job name Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * test papermill-args-yaml Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * testing multi line args Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * testing multi line args1 Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * testing multi line args2 Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * testing multi line args3 Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Parameterize sdk install Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Remove unnecessary output Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * nbformat normailze Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * [SDK] Training Client Conditions related unit tests (#2253) * test: add unit test for get_job_conditions function of training client Signed-off-by: Bobbins228 <mcampbel@redhat.com> * test: add unit test for is_job_created function of training client Signed-off-by: Bobbins228 <mcampbel@redhat.com> * test: add unit test for is_job_running function of training client Signed-off-by: Bobbins228 <mcampbel@redhat.com> * test: add unit test for is_job_restarting function of training 
client Signed-off-by: Bobbins228 <mcampbel@redhat.com> * test: add unit test for is_job_failed function of training client Signed-off-by: Bobbins228 <mcampbel@redhat.com> * test: add unit test for is_job_succeded function of training client Signed-off-by: Bobbins228 <mcampbel@redhat.com> * test: improve job condition unit tests efficiency Signed-off-by: Bobbins228 <mcampbel@redhat.com> --------- Signed-off-by: Bobbins228 <mcampbel@redhat.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * [SDK] test: add unit test for list_jobs method of the training_client (#2267) Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273) Generate clientset, informers, listers and open api spec for v2alpha1 APIs. Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * [SDK] Use torchrun to create PyTorchJob from function (#2276) * [SDK] Use torchrun to create PyTorchJob from function Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update PyTorchJob SDK example Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add consts for entrypoint Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add check for num procs per worker Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * [SDK] test: add unit test for get_job_logs method of the training_client (#2275) Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * [v2alpha] Move GV related codebase (#2281) Move GV related codebase in v2alpha Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: 
Implement runtime framework (#2248) * KEP-2170: Implement runtime framework interfaces Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Remove grep dependency Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * KEP-2170: Implement ValidateObjects interface to the runtime framework Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * KEP-2170: Expose the TrainingRuntime and ClusterTrainingRuntime Kind Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * KEP-2170: Remove unneeded scheme field from the internal TrainingRuntime Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Rephrase the error message Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Distinguish TrainingRuntime and ClusterTrainingRuntime when creating indexes for the TrainJobs Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Propagate the TrainJob labels and annotations to the JobSet Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Remove PodAnnotations from the runtime info Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Implement TrainingRuntime ReplicatedJob validation Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Add TODO comments Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Replace queueSuspendedTrainJob with queueSuspendedTrainJobs Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> --------- Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Add DeepSpeed Example with Pytorch Operator (#2235) Signed-off-by: Syulin7 <735122171@qq.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283) * KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Rename RuntimeRef in runtime framework Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: sailesh duddupudi 
<saileshradar@gmail.com> * KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260) Signed-off-by: Akshay Chitneni <achitneni@apple.com> Co-authored-by: Akshay Chitneni <achitneni@apple.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Upgrade Deepspeed demo dependencies (#2294) Signed-off-by: Syulin7 <735122171@qq.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Add manifests for Kubeflow Training V2 (#2289) * KEP-2170: Add manifests for Kubeflow Training V2 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix invalid name for webhook config in cert Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix integration tests Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Move kubebuilder markers to runtime framework Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Use Kubernetes recommended labels Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286) * FSDP Example with PyTorchJob and T5 Fine-Tuning Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Modify text Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Implement TrainJob Reconciler to manage objects (#2295) * KEP-2170: Implement TrainJob Reconciler to manage objects Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Mode dep-crds to manifests/external-crds Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Rename run with runtime Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> --------- Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Remove 
Prometheus Monitoring doc (#2301) Signed-off-by: Sophie <sophy010017@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Decouple JobSet from TrainJob (#2296) Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304) Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Initialize runtimes before the manager starts (#2306) Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310) * Generate SDK models for the Training V2 APIs Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Create pyproject.toml config Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Remove comments Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix pre-commit Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Create model and dataset initializers (#2303) * KEP-2170: Create model and dataset initializers Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add abstract classes Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add storage URI to config Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update .gitignore Co-authored-by: Kevin Hannon <kehannon@redhat.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix the misspelling for initializer Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add .pt and .pth to ignore_patterns Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: 
Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Kevin Hannon <kehannon@redhat.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308) * KEP-2170: Implement JobSet and PlainML Plugins Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix nil pointer exception for Trainer Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix unit tests in runtime package Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix unit tests Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix integration tests Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix lint Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Implement Torch Plugin Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Use list for the Info envs Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix golang ci Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix Torch plugin Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Use K8s sets Update error return Use ptr.Deref() for nil values Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Use client.Object for Build() call Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Remove DeepCopy Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Remove MLPolicy and PodGroupPolicy from the Info object Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Inline error Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Remove SDK jar file Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add integration test for Torch plugin Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add TODO to calculate PodGroup values in unit tests Signed-off-by: Andrey Velichkevich 
<andrey.velichkevich@gmail.com> * Revert the change to add original Runtime Policies to Info Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Create const for the DefaultJobReplicas Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Check if PodLabels is empty Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Implement Initializer builders in the JobSet plugin (#2316) * KEP-2170: Implement Initializer builder in the JobSet plugin Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update the SDK models Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Remove Info from Initializer builder Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update manifests Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update pkg/constants/constants.go Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Use var for envs Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Remove check manifests from GitHub actions Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Move consts to JobSet plugin Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Add the TrainJob state transition design (#2298) * KEP-2170: Add the TrainJob state transition design Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Replace actual jobs with TrainJob Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Remove the JobSet conditions propagation and Add expanding runtime framework interfaces for each plugin Signed-off-by: Yuki 
Iwai <yuki.iwai.tz@gmail.com> * Expand the Creation Failed reasons Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Rename Completed to Complete Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> --------- Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Update tf job examples to tf v2 (#2270) * mnist with summaries updaetd to TF v2 Signed-off-by: yelias <yossi.elias@nokia.com> * tf_sample updaetd to TF v2 Signed-off-by: yelias <yossi.elias@nokia.com> * Add mnist_utils and update dist-mnist Signed-off-by: yelias <yossi.elias@nokia.com> * Add mnist_utils and update dist-mnist Signed-off-by: yelias <yossi.elias@nokia.com> * Remove old example - estimator-API, this example has been replaced by distribution_strategy Signed-off-by: yelias <yossi.elias@nokia.com> * Small fix Signed-off-by: yelias <yossi.elias@nokia.com> * Remove unsupported powerPC dockerfiles Signed-off-by: yelias <yossi.elias@nokia.com> * Fix typo in copyright Signed-off-by: yelias <yossi.elias@nokia.com> --------- Signed-off-by: yelias <yossi.elias@nokia.com> Co-authored-by: yelias <yossi.elias@nokia.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Add TrainJob conditions (#2322) * KEP-2170: Implement TrainJob conditions Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Fix API comments Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Make condition message constants Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Stop connecting condition type and reason in JobSet plugin Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> --------- Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Pin Gloo repository in JAX Dockerfile to a specific commit (#2329) This commit pins the Gloo repository to a specific commit (43b7acbf) in the JAX Dockerfile to prevent build failures caused by a recent bug introduced in the Gloo codebase. 
By locking the version of Gloo to a known working commit, we ensure that the JAX build remains stable and functional until the issue is resolved upstream. The build failure occurs when compiling the gloo/transport/tcp/buffer.cc file due to an undefined __NR_gettid constant, which was introduced after the pinned commit. By using this commit, we bypass the issue and allow the build to complete successfully. Signed-off-by: Sandipan Panda <samparksandipan@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * [fix] Resolve v2alpha API exceptions (#2317) Resolve v2alpha API exceptions by adding necessary listType validations. Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Upgrade Kubernetes to v1.30.7 (#2332) * Upgrade Kubernetes to v1.30.7 Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Use typed event handlers and predicates in job controllers Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Re-organize pkg/common/util/reconciler.go Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Update installation instructions in README Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> --------- Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Ignore cache exporting errors in the image building workflows (#2336) Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Add Torch Distributed Runtime (#2328) * KEP-2170: Add Torch Distributed Runtime Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add pip list Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Refine the server-side apply installation args (#2337) Signed-off-by: Yuki Iwai 
<yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Add openapi-generator CLI option to skip SDK v2 test generation (#2338) Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Upgrade kustomization files to Kustomize v5 (#2326) Signed-off-by: oksanabaza <obazylie@redhat.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Pin accelerate package version in trainer (#2340) * Pin accelerate package version in trainer Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com> * include new line to pass pre-commit hook Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com> --------- Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Replace papermill command with bash script Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Typo fix Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Move Checkout step outside action.yaml file Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Add newline EOF in script Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Pass python dependencies as args and pin versions Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Update Usage Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Install dependencies in yaml Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * fix ipynb Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * set bash flags Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Update script args and add more kubernetes versions for tests Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * add gang-scheduler-name to template Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * move go setup to template Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * remove -p parameter from script Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> 
--------- Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> Signed-off-by: Bobbins228 <mcampbel@redhat.com> Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com> Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: Syulin7 <735122171@qq.com> Signed-off-by: Akshay Chitneni <achitneni@apple.com> Signed-off-by: Sophie <sophy010017@gmail.com> Signed-off-by: yelias <yossi.elias@nokia.com> Signed-off-by: Sandipan Panda <samparksandipan@gmail.com> Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> Signed-off-by: oksanabaza <obazylie@redhat.com> Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com> Co-authored-by: Mark Campbell <mcampbel@redhat.com> Co-authored-by: Wei-Cheng Lai <qazwsx0939059006@gmail.com> Co-authored-by: Varsha <varshaprasad96@gmail.com> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Co-authored-by: yu lin <735122171@qq.com> Co-authored-by: Akshay Chitneni <akshayadatta@gmail.com> Co-authored-by: Akshay Chitneni <achitneni@apple.com> Co-authored-by: Sophie Hsu <112261858+sophie0730@users.noreply.github.com> Co-authored-by: Kevin Hannon <kehannon@redhat.com> Co-authored-by: YosiElias <73485442+YosiElias@users.noreply.github.com> Co-authored-by: yelias <yossi.elias@nokia.com> Co-authored-by: Sandipan Panda <87253083+sandipanpanda@users.noreply.github.com> Co-authored-by: Antonin Stefanutti <astefanutti@users.noreply.github.com> Co-authored-by: Oksana Bazylieva <61097730+oksanabaza@users.noreply.github.com> Co-authored-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>
1 parent 2392c36 commit 56cbe60

5 files changed: +204 -51 lines changed
.github/workflows/integration-tests.yaml

Lines changed: 4 additions & 32 deletions
@@ -58,40 +58,12 @@ jobs:
       - name: Checkout
         uses: actions/checkout@v4
 
-      - name: Free-Up Disk Space
-        uses: ./.github/workflows/free-up-disk-space
-
-      - name: Setup Python
-        uses: actions/setup-python@v5
+      - name: Setup E2E Tests
+        uses: ./.github/workflows/setup-e2e-test
         with:
+          kubernetes-version: ${{ matrix.kubernetes-version }}
           python-version: ${{ matrix.python-version }}
-
-      - name: Setup Go
-        uses: actions/setup-go@v5
-        with:
-          go-version-file: go.mod
-
-      - name: Create k8s Kind Cluster
-        uses: helm/kind-action@9fdad0686e6f19fcd572f62516f5e0436f562ee7
-        with:
-          node_image: kindest/node:${{ matrix.kubernetes-version }}
-          cluster_name: training-operator-cluster
-          kubectl_version: ${{ matrix.kubernetes-version }}
-
-      - name: Build training-operator
-        run: |
-          ./scripts/gha/build-image.sh
-        env:
-          TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test
-
-      - name: Deploy training operator
-        run: |
-          ./scripts/gha/setup-training-operator.sh
-        env:
-          KIND_CLUSTER: training-operator-cluster
-          TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test
-          GANG_SCHEDULER_NAME: ${{ matrix.gang-scheduler-name }}
-          KUBERNETES_VERSION: ${{ matrix.kubernetes-version }}
+          gang-scheduler-name: ${{ matrix.gang-scheduler-name }}
 
       - name: Run tests
         run: |
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+name: Setup E2E test template
+description: A composite action to setup e2e tests
+
+inputs:
+  kubernetes-version:
+    required: true
+    description: Kubernetes version
+  python-version:
+    required: true
+    description: Python version
+  gang-scheduler-name:
+    required: false
+    default: "none"
+    description: Gang scheduler name
+
+runs:
+  using: composite
+  steps:
+    - name: Free-Up Disk Space
+      uses: ./.github/workflows/free-up-disk-space
+
+    - name: Setup Python
+      uses: actions/setup-python@v5
+      with:
+        python-version: ${{ inputs.python-version }}
+
+    - name: Setup Go
+      uses: actions/setup-go@v5
+      with:
+        go-version-file: go.mod
+
+    - name: Create k8s Kind Cluster
+      uses: helm/kind-action@9fdad0686e6f19fcd572f62516f5e0436f562ee7
+      with:
+        node_image: kindest/node:${{ inputs.kubernetes-version }}
+        cluster_name: training-operator-cluster
+        kubectl_version: ${{ inputs.kubernetes-version }}
+
+    - name: Build training-operator
+      shell: bash
+      run: |
+        ./scripts/gha/build-image.sh
+      env:
+        TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test
+
+    - name: Deploy training operator
+      shell: bash
+      run: |
+        ./scripts/gha/setup-training-operator.sh
+        docker system prune -a -f
+        docker system df
+        df -h
+      env:
+        KIND_CLUSTER: training-operator-cluster
+        TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test
+        GANG_SCHEDULER_NAME: ${{ inputs.gang-scheduler-name }}
+        KUBERNETES_VERSION: ${{ inputs.kubernetes-version }}
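The composite action declares three inputs, and only `gang-scheduler-name` carries a default. The notebook workflow added in this commit calls the action without that input, so it relies on the default `"none"`. The following is a minimal Python sketch of that resolution, not GitHub's implementation: the helper names `resolve_inputs` and `deploy_env` are hypothetical, and the sketch assumes that an input with no value and no default expands to an empty string and that `required: true` is reported rather than enforced.

```python
from typing import Dict, List, Tuple

# Input declarations copied from the composite action in this commit.
ACTION_INPUTS: Dict[str, dict] = {
    "kubernetes-version": {"required": True},
    "python-version": {"required": True},
    "gang-scheduler-name": {"required": False, "default": "none"},
}


def resolve_inputs(provided: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Apply declared defaults and list required inputs that were not passed.

    Hypothetical helper: it only models the behavior described in the lead-in.
    """
    resolved: Dict[str, str] = {}
    missing: List[str] = []
    for name, spec in ACTION_INPUTS.items():
        if name in provided:
            resolved[name] = provided[name]
        elif "default" in spec:
            resolved[name] = spec["default"]
        else:
            # Assumption: an unset input without a default expands to "".
            resolved[name] = ""
            if spec.get("required"):
                missing.append(name)
    return resolved, missing


def deploy_env(inputs: Dict[str, str]) -> Dict[str, str]:
    """Environment of the "Deploy training operator" step, as set in the action."""
    return {
        "KIND_CLUSTER": "training-operator-cluster",
        "TRAINING_CI_IMAGE": "kubeflowtraining/training-operator:test",
        "GANG_SCHEDULER_NAME": inputs["gang-scheduler-name"],
        "KUBERNETES_VERSION": inputs["kubernetes-version"],
    }


# The notebook workflow passes only two inputs; the default fills the third.
resolved, missing = resolve_inputs(
    {"kubernetes-version": "v1.30.6", "python-version": "3.11"}
)
print(deploy_env(resolved), missing)
```

Under these assumptions, the deploy script sees `GANG_SCHEDULER_NAME=none` whenever a caller omits the input, which matches how the integration-test workflow and the notebook workflow can share one action.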
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+name: Test example notebooks
+
+on:
+  - pull_request
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  create-pytorchjob-notebook-test:
+    runs-on: ubuntu-latest
+    timeout-minutes: 30
+    strategy:
+      fail-fast: false
+      matrix:
+        kubernetes-version: ["v1.28.7", "v1.29.2", "v1.30.6"]
+        python-version: ["3.9", "3.10", "3.11"]
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Setup E2E Tests
+        uses: ./.github/workflows/setup-e2e-test
+        with:
+          kubernetes-version: ${{ matrix.kubernetes-version }}
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install Python Dependencies
+        run: |
+          pip install papermill==2.6.0 jupyter==1.1.1 ipykernel==6.29.5
+
+      - name: Run Jupyter Notebook with Papermill
+        shell: bash
+        run: |
+          ./scripts/run-notebook.sh \
+            -i ./examples/pytorch/image-classification/create-pytorchjob.ipynb \
+            -n default \
+            -k ./sdk/python
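The workflow's `strategy.matrix` has two axes with three values each, so one pull request starts nine notebook jobs, and `fail-fast: false` lets the remaining jobs continue when one fails. A small Python sketch of the expansion follows; `expand_matrix` is a hypothetical helper that models the cross product, not code from the repository.

```python
from itertools import product
from typing import Dict, List

# Matrix copied from the workflow above.
MATRIX: Dict[str, List[str]] = {
    "kubernetes-version": ["v1.28.7", "v1.29.2", "v1.30.6"],
    "python-version": ["3.9", "3.10", "3.11"],
}


def expand_matrix(matrix: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Return one dict per job: the cross product of all matrix axes."""
    keys = list(matrix)
    return [dict(zip(keys, values)) for values in product(*(matrix[k] for k in keys))]


jobs = expand_matrix(MATRIX)
for job in jobs:
    print(job)
print(len(jobs), "jobs")
```

The Python versions are quoted in the workflow for a reason: an unquoted `3.10` is a YAML float and would be read as `3.1`.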

examples/pytorch/image-classification/create-pytorchjob.ipynb

Lines changed: 33 additions & 19 deletions
@@ -24,6 +24,20 @@
     "The notebook shows how to use Kubeflow Training SDK to create, get, wait, check and delete PyTorchJob."
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": [
+     "parameters"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "training_python_sdk='kubeflow-training'\n",
+    "namespace='kubeflow-user-example-com'"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {
@@ -42,12 +56,13 @@
    "outputs": [],
    "source": [
     "# TODO (andreyvelich): Change to release version when SDK with the new APIs is published.\n",
-    "!pip install git+https://github.com/kubeflow/training-operator.git#subdirectory=sdk/python"
+    "# Install Kubeflow Python SDK\n",
+    "!pip install {training_python_sdk}"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": null,
    "metadata": {
     "pycharm": {
      "name": "#%%\n"
@@ -93,7 +108,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": null,
    "metadata": {
     "pycharm": {
      "name": "#%%\n"
@@ -102,12 +117,11 @@
    "outputs": [],
    "source": [
     "name = \"pytorch-dist-mnist-gloo\"\n",
-    "namespace = \"kubeflow-user-example-com\"\n",
     "container_name = \"pytorch\"\n",
     "\n",
     "container = V1Container(\n",
     "    name=container_name,\n",
-    "    image=\"gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0\",\n",
+    "    image=\"kubeflow/pytorch-dist-mnist:latest\",\n",
     "    args=[\"--backend\", \"gloo\"],\n",
     ")\n",
     "\n",
@@ -157,7 +171,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": null,
    "metadata": {
     "pycharm": {
      "name": "#%%\n"
@@ -176,8 +190,8 @@
     "# Namespace will be reused in every APIs.\n",
     "training_client = TrainingClient(namespace=namespace)\n",
     "\n",
-    "# If `job_kind` is not set in `TrainingClient`, we need to set it for each API.\n",
-    "training_client.create_job(pytorchjob, job_kind=constants.PYTORCHJOB_KIND)"
+    "# `job_kind` is set in `TrainingClient`\n",
+    "training_client.create_job(pytorchjob)"
    ]
   },
   {
@@ -195,7 +209,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": null,
    "metadata": {
     "pycharm": {
      "name": "#%%\n"
@@ -214,7 +228,7 @@
     }
    ],
    "source": [
-    "training_client.get_job(name, job_kind=constants.PYTORCHJOB_KIND).metadata.name"
+    "training_client.get_job(name).metadata.name"
    ]
   },
   {
@@ -230,7 +244,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": null,
    "metadata": {
     "pycharm": {
      "name": "#%%\n"
@@ -260,7 +274,7 @@
     }
    ],
    "source": [
-    "training_client.get_job_conditions(name=name, job_kind=constants.PYTORCHJOB_KIND)"
+    "training_client.get_job_conditions(name=name)"
    ]
   },
   {
@@ -276,7 +290,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": null,
    "metadata": {
     "pycharm": {
      "name": "#%%\n"
@@ -302,7 +316,7 @@
     }
    ],
    "source": [
-    "pytorchjob = training_client.wait_for_job_conditions(name=name, job_kind=constants.PYTORCHJOB_KIND)\n",
+    "pytorchjob = training_client.wait_for_job_conditions(name=name)\n",
     "\n",
     "print(f\"Succeeded number of replicas: {pytorchjob.status.replica_statuses['Master'].succeeded}\")"
    ]
@@ -320,7 +334,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": null,
    "metadata": {
     "pycharm": {
      "name": "#%%\n"
@@ -339,7 +353,7 @@
     }
    ],
    "source": [
-    "training_client.is_job_succeeded(name=name, job_kind=constants.PYTORCHJOB_KIND)"
+    "training_client.is_job_succeeded(name=name)"
    ]
   },
   {
@@ -355,7 +369,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": null,
    "metadata": {
     "pycharm": {
      "name": "#%%\n"
@@ -476,7 +490,7 @@
     }
    ],
    "source": [
-    "training_client.get_job_logs(name=name, job_kind=constants.PYTORCHJOB_KIND)"
+    "training_client.get_job_logs(name=name)"
    ]
   },
   {
@@ -492,7 +506,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": null,
    "metadata": {
     "pycharm": {
      "name": "#%%\n"

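The notebook change adds a code cell tagged `parameters` that holds the defaults for `training_python_sdk` and `namespace`. Papermill uses that tag to decide where to place the values passed with `-p`: it adds a new cell after the tagged one, so the injected assignments run later and override the defaults. The sketch below models that mechanism on a plain notebook dictionary, using only the standard library; `inject_parameters` is a hypothetical helper, not papermill's own code, and the sample notebook is reduced to the fields needed here.

```python
from typing import Any, Dict


def inject_parameters(notebook: Dict[str, Any], parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `notebook` with an overriding cell placed after the
    first cell tagged "parameters" (or at the top when no cell has the tag)."""
    cells = list(notebook["cells"])
    index = next(
        (
            i
            for i, cell in enumerate(cells)
            if "parameters" in cell.get("metadata", {}).get("tags", [])
        ),
        -1,  # no tagged cell: index + 1 == 0 inserts at the top
    )
    source = ["# Parameters\n"] + [f"{key} = {value!r}\n" for key, value in parameters.items()]
    injected = {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {"tags": ["injected-parameters"]},
        "outputs": [],
        "source": source,
    }
    cells.insert(index + 1, injected)
    return {**notebook, "cells": cells}


# Reduced stand-in for create-pytorchjob.ipynb.
sample = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["intro"]},
        {
            "cell_type": "code",
            "metadata": {"tags": ["parameters"]},
            "source": [
                "training_python_sdk='kubeflow-training'\n",
                "namespace='kubeflow-user-example-com'",
            ],
        },
        {"cell_type": "code", "metadata": {}, "source": ["!pip install {training_python_sdk}"]},
    ]
}

# Values the CI workflow passes through run-notebook.sh.
result = inject_parameters(sample, {"training_python_sdk": "./sdk/python", "namespace": "default"})
for cell in result["cells"]:
    print(cell["metadata"].get("tags", []), cell["source"])
```

This ordering is why the `namespace` assignment was removed from the job-definition cell in the diff: an assignment in a later cell would overwrite the injected value.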
scripts/run-notebook.sh

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
+#!/bin/bash
+
+# Copyright 2024 The Kubeflow Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This bash script is used to run the example notebooks
+
+set -o errexit
+set -o nounset
+set -o pipefail
+
+NOTEBOOK_INPUT=""
+NOTEBOOK_OUTPUT="-" # outputs to console
+NAMESPACE="default"
+TRAINING_PYTHON_SDK="./sdk/python"
+
+usage() {
+  echo "Usage: $0 -i <input_notebook> -o <output_notebook> [-p \"<param> <value>\"...] [-y <params.yaml>]"
+  echo "Options:"
+  echo " -i Input notebook (required)"
+  echo " -o Output notebook (required)"
+  echo " -k Kubeflow Training Operator Python SDK (optional)"
+  echo " -n Kubernetes namespace used by tests (optional)"
+  echo " -h Show this help message"
+  echo "NOTE: papermill, jupyter and ipykernel are required Python dependencies to run Notebooks"
+  exit 1
+}
+
+while getopts "i:o:p:k:n:r:d:h:" opt; do
+  case "$opt" in
+    i) NOTEBOOK_INPUT="$OPTARG" ;; # -i for notebook input path
+    o) NOTEBOOK_OUTPUT="$OPTARG" ;; # -o for notebook output path
+    k) TRAINING_PYTHON_SDK="$OPTARG" ;; # -k for training operator python sdk
+    n) NAMESPACE="$OPTARG" ;; # -n for kubernetes namespace used by tests
+    h) usage ;; # -h for help (usage)
+    *) usage; exit 1 ;;
+  esac
+done
+
+if [ -z "$NOTEBOOK_INPUT" ]; then
+  echo "Error: -i notebook input path is required."
+  exit 1
+fi
+
+papermill_cmd="papermill $NOTEBOOK_INPUT $NOTEBOOK_OUTPUT -p training_python_sdk $TRAINING_PYTHON_SDK -p namespace $NAMESPACE"
+
+if ! command -v papermill &> /dev/null; then
+  echo "Error: papermill is not installed. Please install papermill to proceed."
+  exit 1
+fi
+
+echo "Running command: $papermill_cmd"
+$papermill_cmd
+
+if [ $? -ne 0 ]; then
+  echo "Error: papermill execution failed." >&2
+  exit 1
+fi
+
+echo "Notebook execution completed successfully"
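A few details of the script as committed are worth noting when reading it. The usage text lists `-o` as required although the script defaults it to `-`. The usage text also mentions `-p` and `-y`, which the `case` statement does not handle. The command is assembled as one string and expanded unquoted, so a path containing spaces would be split into several arguments. With `errexit` set, a failing papermill run likely exits the script before the `$?` check is reached. The sketch below shows one way to avoid the word-splitting issue by building the command as an argument list. It is written in Python for illustration only; `build_papermill_argv` is a hypothetical helper and not part of the repository.

```python
import shlex
from typing import List


def build_papermill_argv(
    notebook_input: str,
    notebook_output: str = "-",  # "-" writes the executed notebook to stdout
    training_python_sdk: str = "./sdk/python",
    namespace: str = "default",
) -> List[str]:
    """Build the papermill command as a list so each value stays one argument."""
    if not notebook_input:
        raise ValueError("notebook input path is required")
    return [
        "papermill",
        notebook_input,
        notebook_output,
        "-p", "training_python_sdk", training_python_sdk,
        "-p", "namespace", namespace,
    ]


argv = build_papermill_argv(
    "./examples/pytorch/image-classification/create-pytorchjob.ipynb"
)
# shlex.join quotes arguments where needed, so the log line is copy-pasteable.
print("Running command:", shlex.join(argv))
# To execute it, the list could be passed to subprocess.run(argv, check=True),
# which raises on a non-zero exit status instead of relying on a later check.
```

In bash itself, the equivalent would be an array expanded as `"${cmd[@]}"`; the list form here plays the same role.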
