Closed
119 commits
b71a690
Add Changelog for Trainer v2.0.0-rc.0 (#2666)
kramaranya Jun 11, 2025
29ffc84
fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on Ope…
astefanutti Jun 27, 2025
582025b
chore(helm): Sync ClusterRule in Helm chart (#2686)
astefanutti Jun 27, 2025
a480baa
chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1 (#2685)
andreyvelich Jun 27, 2025
d3ada62
KEP-2170: Add the manifests overlay for Kubeflow Training V2 (#2382)
Doris-xm Jun 30, 2025
c65d4a7
fix(manifests): Update manifests to enable LLM fine-tuning workflow w…
Electronic-Waste Jun 30, 2025
514503d
fix(plugins): Fix some errors in torchtune mutation process. (#2675)
Electronic-Waste Jun 30, 2025
c17a162
chore: Remove the vendor specific parameters (#2691)
tenzen-y Jun 30, 2025
4d47314
chore: Replace the deprecated intstr.FromInt with intstr.FromInt32 (#…
tenzen-y Jun 30, 2025
f89bfd3
feat: Mutable PodSpecOverrides for suspended TrainJob (#2683)
astefanutti Jul 1, 2025
b93052a
feat(example): Add alpaca-trianjob-yaml.ipynb. (#2670)
Electronic-Waste Jul 2, 2025
b9c6635
chore: Copy generated CRDs into Helm charts (#2703)
astefanutti Jul 3, 2025
1ea2712
feat: Add schedulingGates to PodSpecOverrides (#2700)
astefanutti Jul 3, 2025
98c62a6
fix(module): Change Go module name to v2 (#2707)
andreyvelich Jul 3, 2025
f0d7ea3
chore(docs): Update Release Guide (#2710)
andreyvelich Jul 4, 2025
49c1518
chore(docs): Add Changelog for v2.0.0-rc.1 (#2709)
andreyvelich Jul 5, 2025
bc9365b
Add Red Hat to ADOPTERS.md (#2714)
terrytangyuan Jul 5, 2025
ed5f859
fix(manifests): fix position of labels of dataset-initializer from po…
rudeigerc Jul 8, 2025
cf3b420
chore: Upgrade JobSet to version 0.8.2 (#2726)
astefanutti Jul 11, 2025
2ff9aa6
feat(docs): Guide to report security vulnerability (#2718)
andreyvelich Jul 11, 2025
857e7cb
fix(manifests): add rbac config of events for event recorders (#2731)
rudeigerc Jul 14, 2025
b8500e5
chore(ci): Add GitHub action to verify PR titles (#2724)
andreyvelich Jul 14, 2025
853226a
fix: fix the command for fetching Kubeflow Trainer version in the iss…
rudeigerc Jul 14, 2025
705cdbf
feat(operator): force trainjob name to be compliant with RFC 1035 for…
rudeigerc Jul 16, 2025
f8d71ec
chore: update github runners to oci gh arc runners (#2739)
koksay Jul 17, 2025
009508c
chore(docs): Add Changelog for Kubeflow Trainer v2.0.0 (#2743)
andreyvelich Jul 21, 2025
d997dd9
feat(docs): Kubeflow Trainer ROADMAP 2025 (#2748)
andreyvelich Jul 25, 2025
45ce64f
Upgrade Kubernetes to v1.33 (#2756)
astefanutti Jul 28, 2025
8251450
fix(test): Fix Ginkgo command for integration tests (#2758)
astefanutti Jul 28, 2025
66da5ea
feat(runtimes): Remove command from the Runtimes with CustomTrainer (…
andreyvelich Jul 29, 2025
aa46e02
feat(runtimes): Add Framework Label to the Runtimes (#2761)
andreyvelich Jul 31, 2025
cfc25d5
feat(docs): KEP-2437-Support Volcano Scheduler in Kubeflow Trainer V2…
Doris-xm Aug 1, 2025
bc3a248
fix(examples): Update the argument for Runtime framework (#2766)
andreyvelich Aug 4, 2025
05b4c45
fix(api): Fix license path for Kubeflow Trainer Python API (#2771)
andreyvelich Aug 4, 2025
8f3a96d
feat(docs): Introduce latest news to the README (#2769)
andreyvelich Aug 4, 2025
5843849
fix(ci): Remove coverage from Go integration tests (#2773)
andreyvelich Aug 5, 2025
c38fe6d
fix(docs): update KEP-2401 according to current implementation. (#2765)
Electronic-Waste Aug 5, 2025
9011ad7
fix(runtimes): Set numProcPerNode: 1 in DeepSpeed Runtime (#2774)
andreyvelich Aug 5, 2025
58ff1b7
feat: Add security contexts to controller managers (#2759)
kunal-511 Aug 6, 2025
dc48b82
feat: run workflows on `/ok-to-test` label (#2639)
milinddethe15 Aug 6, 2025
f9ec85f
fix(api): update license path for kubeflow_trainer_api (#2778)
kramaranya Aug 7, 2025
2d4be6f
fix: update kubeflow sdk reference (#2780)
kramaranya Aug 7, 2025
4cd66b4
chore(runtimes): Update packages in DeepSpeed runtime and fix T5 exam…
andreyvelich Aug 7, 2025
2283173
feat(initializer): Updated base image to Debian image and changed ins…
Debabrata47 Aug 11, 2025
db48a78
feat(docs): How to release Python API modules (#2786)
andreyvelich Aug 11, 2025
952f943
feat(operator): enforce RFC 1035 validation for TrainJob name (#2767)
juniemariam Aug 13, 2025
f12a6d3
fix(manifests): Add missing permissions for the RuntimeClass and Limi…
tenzen-y Aug 13, 2025
0421018
fix(ci): disable `Unit and Integration Test - Go` gh action in forked…
milinddethe15 Aug 15, 2025
c534799
chore(runtimes): Remove MPI pi Runtime (#2760)
andreyvelich Aug 18, 2025
d705bb6
chore: Add unit tests for `pkg/apply` (#2479)
akagami-harsh Aug 18, 2025
4443f79
feat(runtimes): Support Distributed MLX on CUDA (#2790)
andreyvelich Aug 21, 2025
0e62c45
chore: Remove tool.hatch.build.targets.wheel from pyproject (#2803)
kramaranya Aug 28, 2025
a350fa0
feat: support affinity in TrainJob pod spec overrides (#2796)
toVersus Aug 28, 2025
a728b3f
chore(operator): Upgrade Kubernetes to v1.34 (#2804)
astefanutti Aug 28, 2025
c4165eb
fix(api): Regenerate TrainJob CRD (#2805)
astefanutti Aug 28, 2025
10a4723
chore(docs): Add license scan report and status (#2788)
fossabot Aug 29, 2025
5c9784a
feat: KEP-2432: GPU Testing for LLM Blueprints (#2689)
jaiakash Aug 29, 2025
377ce56
chore: deflake test to ensure runtime is created before creating trai…
toVersus Aug 29, 2025
d21e03e
chore: Nominate @astefanutti as Kubeflow Trainer approver (#2808)
andreyvelich Aug 29, 2025
85876b0
fix: teraform for oci gpu based vm (#2810)
jaiakash Sep 2, 2025
61dbc9a
fix(examples): Update get_job_logs() API in examples (#2813)
andreyvelich Sep 3, 2025
b9f0602
feat: support for managing gpu enabled self runner infra (#2762)
jaiakash Sep 4, 2025
0f485ed
fix: update examples to reflect func_args now being unpacked (#2815)
briangallagher Sep 7, 2025
f2e7e0d
feat: support imagePullSecrets in TrainJob pod spec overrides (#2806)
toVersus Sep 8, 2025
6f220ee
feat: add HF token and allow gpu workflow to run from pull request ta…
jaiakash Sep 8, 2025
0d3dd9e
chore(operator): Bump JobSet to v0.9.0 version (#2821)
andreyvelich Sep 9, 2025
eb2e36d
chore(runtimes): update torchtune CTRs with multiple dependson featur…
Electronic-Waste Sep 9, 2025
d13a831
chore: merge test cases using PodSpecOverrides into a single case (#2…
toVersus Sep 9, 2025
65b99a6
fix: read only permission for PRs (#2827)
jaiakash Sep 11, 2025
d239d6a
feat(ci): Add Trivy Vulnerability Scan (#2826)
andreyvelich Sep 11, 2025
c6997fc
fix: read only permission for PRs (#2829)
jaiakash Sep 11, 2025
aca6366
chore(test): add uts for coscheduling plugin. (#2582)
IRONICBo Sep 16, 2025
884be84
feat: qwen 2.5 1.5b runtime, example and fix gpu e2e test (#2835)
jaiakash Sep 20, 2025
59c6452
feat: Add a public function to create runtime info objects (#2837)
kaisoz Sep 22, 2025
bbaca24
feat: KEP-2655: Add data cache system (#2755)
akshaychitneni Sep 22, 2025
a4a4b70
chore(ci): Ignore generated files in .gitattributes (#2855)
andreyvelich Sep 25, 2025
01a0763
fix(ci): Add latest image tag only for the master branch (#2854)
andreyvelich Sep 26, 2025
2043af0
chore: Install released version of Kubeflow SDK (#2857)
kramaranya Sep 29, 2025
5e359ba
fix(docs): Update the release document to push all changes (#2865)
andreyvelich Sep 29, 2025
2bb5b9a
feat(docs): Add changelog for Kubeflow Trainer v2.0.1 (#2864)
andreyvelich Sep 29, 2025
b918411
feat(docs): Update Trainer diagram and SDK release (#2867)
andreyvelich Sep 30, 2025
b1b597d
chore(operator): Upgrade JobSet to v0.10.1 (#2875)
astefanutti Oct 3, 2025
70fc13a
chore(runtimes): Upgrade torchtune version to v0.6.1 (#2876)
Electronic-Waste Oct 3, 2025
8c36148
chore(test): Support e2e cluster setup with Podman (#2861)
astefanutti Oct 3, 2025
92c3f9b
feat(docs): KEP-2442-Support JAX Training Runtime (#2643)
mahdikhashan Oct 5, 2025
4e17ce5
feat: KEP-2437 - PodGroup Creation for Volcano Scheduler (#2729)
Doris-xm Oct 6, 2025
25022bd
fix: Allow multiple podSpec overrides to target the same TargetJob (#…
kaisoz Oct 7, 2025
c1bc94d
feat(api): Sync TrainJob JobsStatus from JobSet ReplicatedJobsStatus …
astefanutti Oct 7, 2025
d56715c
feat(api): Add PodTemplateOverrides API into TrainJob (#2785)
xigang Oct 9, 2025
2256a81
feat: Add PodTemplateOverrides into TrainJob V2 API (#2882)
xigang Oct 9, 2025
1581a07
feat(cache): KEP-2655: Adding cache initializer (#2793)
akshaychitneni Oct 13, 2025
3ac1307
fix(runtimes): fix missing dependency in torchtune trainer image. (#2…
Electronic-Waste Oct 13, 2025
71f39f5
feat(operator): add config api implementation (#2879)
kapil27 Oct 14, 2025
484ca7e
feat(runtimes): Add LoRA/QLoRA/DoRA support in LLM Trainer V2 (#2832)
Electronic-Waste Oct 15, 2025
b17c6c9
feat(runtimes): implement clusterTrainingRuntime deprecation process …
tdn21 Oct 16, 2025
4ee787f
fix: charts dependencies (#2892)
ls-2018 Oct 16, 2025
2732af6
chore(ci): Enable Kubernetes API Linter (#2858)
astefanutti Oct 17, 2025
17ed50d
fix(api): Fix lint errors for the config API (#2896)
astefanutti Oct 17, 2025
a95b459
fix(api): Keep mpiImplementation field a pointer (#2897)
astefanutti Oct 17, 2025
5b0e1ca
feat(cache): KEP-2655 - Add build pipeline and address vulnerabilitie…
akshaychitneni Oct 18, 2025
467322a
fix(docs): correct example usage in KEP-2437-Support-Volcano-Schedule…
Doris-xm Oct 18, 2025
3c062ac
feat(runtimes): add support for launcher resource allocation in MPI j…
jskswamy Oct 20, 2025
b02eb5b
feat(operator): Add validation for required containers in replicatedJ…
Electronic-Waste Oct 20, 2025
a2797af
feat: add controller manager configuration helm chart (#2895)
kapil27 Oct 21, 2025
fa5b878
fix(manifests): Add RBAC rules for Leases in Helm Charts (#2901)
astefanutti Oct 21, 2025
732d0fb
Kubeflow Trainer Official Release v2.1.0-rc.0
andreyvelich Oct 21, 2025
a4b50ba
fix(runtimes): Update pip version in the MLX runtime (#2910)
google-oss-robot Oct 31, 2025
6101971
[release-2.1] feat(initializer): add s3 model and dataset initializer…
google-oss-robot Oct 31, 2025
d573266
[release-2.1] chore(operator): Use SSA throughout runtime framework (…
google-oss-robot Oct 31, 2025
cba6e30
[release-2.1] fix(manifests): Fix boolean values defaulting in Helm c…
google-oss-robot Nov 3, 2025
008c75f
Cherry-pick #2906 #2915 #2916 to release-2.1 branch (#2917)
andreyvelich Nov 3, 2025
3a71dd0
Kubeflow Trainer Official Release v2.1.0-rc.1
andreyvelich Nov 3, 2025
3be3331
[release-2.1] feat(cache): KEP-2655 - Supporting readiness probes on …
google-oss-robot Nov 5, 2025
5516c11
[release-2.1] feat: Adding local execution example notebook (#2924)
google-oss-robot Nov 6, 2025
dcd690d
[release-2.1] fix(ci): Fix the Kubeflow SDK installation with Docker …
google-oss-robot Nov 6, 2025
fa39664
[release-2.1] feat(cache): KEP-2655: Adding default runtime with cach…
google-oss-robot Nov 7, 2025
73c9bec
Kubeflow Trainer Official Release v2.1.0
andreyvelich Nov 7, 2025
dc4c368
Revert all commits from release-2.0 not in master
sutaakar Nov 11, 2025
6202f4c
Merge remote-tracking branch 'origin/release-2.1' into odh-rebase
sutaakar Nov 11, 2025
2 changes: 1 addition & 1 deletion .flake8
@@ -1,3 +1,3 @@
 [flake8]
 max-line-length = 100
-extend-ignore = W503
+extend-ignore = W503, E203
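The diff does not say why E203 joins the ignore list. A common reason, assumed here rather than confirmed by this PR, is that Black puts spaces around the colon in slices with complex bounds, which pycodestyle reports as E203 (whitespace before ':'). A minimal illustration with hypothetical variable names:

```python
# Hypothetical data; only the slice formatting matters here.
values = list(range(10))
offset = 2

# Black formats complex slice bounds with spaces around the colon.
# pycodestyle reports that spacing as E203 unless the check is ignored.
window = values[offset + 1 : offset + 4]
print(window)
```

With E203 ignored, flake8 and Black no longer disagree about this line.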
3 changes: 3 additions & 0 deletions .gitattributes
@@ -0,0 +1,3 @@
**/zz_generated.*.go linguist-generated=true
api/** linguist-generated=true
pkg/client/** linguist-generated=true
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug_report.yaml
@@ -33,7 +33,7 @@ body:
 ```
 Kubeflow Trainer version:
 ```bash
-$ kubectl get pods -n kubeflow -l app.kubernetes.io/name=trainer -o jsonpath="{.items[*].spec.containers[*].image}"
+$ kubectl get pods -n kubeflow-system -l app.kubernetes.io/name=kubeflow-trainer -o jsonpath="{.items[*].spec.containers[*].image}"

 ```
 Kubeflow Python SDK version:
10 changes: 7 additions & 3 deletions .github/workflows/build-and-push-images.yaml
@@ -4,9 +4,9 @@ on:
 push:
 branches:
 - master
-- 'release-*'
+- "release-*"
 tags:
-- 'v*'
+- "v*"
 pull_request:

 jobs:
@@ -34,12 +34,16 @@ jobs:
 - component-name: deepspeed-runtime
 dockerfile: cmd/runtimes/deepspeed/Dockerfile
 platforms: linux/amd64,linux/arm64
+# TODO (andreyvelich): mlx[cuda] doesn't support arm at the moment: https://github.com/ml-explore/mlx/issues/2469
 - component-name: mlx-runtime
 dockerfile: cmd/runtimes/mlx/Dockerfile
-platforms: linux/arm64
+platforms: linux/amd64
 - component-name: torchtune-trainer
 dockerfile: cmd/trainers/torchtune/Dockerfile
 platforms: linux/amd64,linux/arm64
+- component-name: data-cache
+dockerfile: cmd/data_cache/Dockerfile
+platforms: linux/amd64,linux/arm64
 steps:
 - name: Checkout
 uses: actions/checkout@v4
44 changes: 44 additions & 0 deletions .github/workflows/check-pr-title.yaml
@@ -0,0 +1,44 @@
name: Check PR Title

on:
pull_request_target:
branches:
- master
types:
- edited
- opened
- reopened
- synchronize

jobs:
check:
name: Check PR Title
runs-on: ubuntu-latest
steps:
- uses: amannn/action-semantic-pull-request@v5.5.3
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
types: |
chore
fix
feat
revert

scopes: |
api
ci
docs
examples
exporter
initializer
manifests
operator
runtimes
scripts
test
cache
requireScope: false
ignoreLabels: |
do-not-merge/work-in-progress
dependencies
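To make the configured rule concrete, here is a minimal sketch of a title check using the types and scopes listed in the workflow above. It is not the implementation of `amannn/action-semantic-pull-request`; the regular expression and the function name `check_title` are simplifying assumptions for illustration only.

```python
import re

# Types and scopes copied from the workflow configuration above.
ALLOWED_TYPES = {"chore", "fix", "feat", "revert"}
ALLOWED_SCOPES = {
    "api", "ci", "docs", "examples", "exporter", "initializer",
    "manifests", "operator", "runtimes", "scripts", "test", "cache",
}

# Simplified pattern (an assumption): type, optional "(scope)", ": ", subject.
TITLE_RE = re.compile(r"^(?P<type>[a-z]+)(\((?P<scope>[^()]+)\))?: (?P<subject>.+)$")


def check_title(title: str) -> bool:
    """Return True if the title follows the simplified convention."""
    match = TITLE_RE.match(title)
    if match is None:
        return False
    if match.group("type") not in ALLOWED_TYPES:
        return False
    scope = match.group("scope")
    # requireScope is false in the workflow, so a missing scope is accepted.
    return scope is None or scope in ALLOWED_SCOPES


if __name__ == "__main__":
    for title in (
        "feat(runtimes): Add Framework Label to the Runtimes (#2761)",
        "chore: Upgrade JobSet to version 0.8.2 (#2726)",
        "Add Red Hat to ADOPTERS.md (#2714)",
    ):
        print(title, "->", check_title(title))
```

Under this sketch, older titles from the commit list with no type prefix, or with a scope outside the list such as `helm`, would be rejected.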
72 changes: 72 additions & 0 deletions .github/workflows/gh-workflow-approve.yaml
@@ -0,0 +1,72 @@
name: Approve Workflow Runs

permissions:
actions: write
contents: read

on:
pull_request_target:
types:
- labeled
- synchronize

concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.event.number }}
cancel-in-progress: true

jobs:
ok-to-test:
if: contains(github.event.pull_request.labels.*.name, 'ok-to-test') || github.event_name == 'pull_request_target'
runs-on: ubuntu-latest
continue-on-error: true

steps:
- name: Check if author is a Kubeflow GitHub member
id: membership-check
uses: actions/github-script@v7
with:
script: |
const username = context.payload.pull_request.user.login;
const org = context.repo.owner;
try {
const res = await github.rest.orgs.checkMembershipForUser({
org,
username
});
core.setOutput("is_member", true);
} catch (error) {
if (error.status === 404) {
// User is not a member
core.setOutput("is_member", false);
} else {
throw error;
}
}

- name: Approve Pending Workflow Runs
if: steps.membership-check.outputs.is_member == 'true' || contains(github.event.pull_request.labels.*.name, 'ok-to-test')
uses: actions/github-script@v7
with:
retries: 3
script: |
const request = {
owner: context.repo.owner,
repo: context.repo.repo,
event: "pull_request",
status: "action_required",
head_sha: context.payload.pull_request.head.sha,
}

core.info(`Getting workflow runs that need approval for commit ${request.head_sha}`)
const runs = await github.paginate(github.rest.actions.listWorkflowRunsForRepo, request)

core.info(`Found ${runs.length} workflow runs that need approval`)
for (const run of runs) {
core.info(`Approving workflow run ${run.id}`)
const request = {
owner: context.repo.owner,
repo: context.repo.repo,
run_id: run.id,
}
await github.rest.actions.approveWorkflowRun(request)
}
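The two `if:` conditions in this workflow can be read as plain boolean functions. The sketch below mirrors them in Python; the function names are hypothetical and nothing here calls the GitHub API. Because the workflow is triggered only by `pull_request_target`, the second clause of the job-level condition appears to be always true, so the step-level condition looks like the effective gate.

```python
from typing import Iterable


def job_runs(labels: Iterable[str], event_name: str) -> bool:
    """Mirror of the job-level `if:` expression on `ok-to-test`."""
    return "ok-to-test" in set(labels) or event_name == "pull_request_target"


def should_approve(is_member: bool, labels: Iterable[str]) -> bool:
    """Mirror of the `if:` on the "Approve Pending Workflow Runs" step."""
    return is_member or "ok-to-test" in set(labels)


if __name__ == "__main__":
    # Non-member without the label: the job runs, but nothing is approved.
    print(job_runs([], "pull_request_target"), should_approve(False, []))
    # Non-member with the label: pending runs are approved.
    print(should_approve(False, ["ok-to-test"]))
```

In the real workflow the membership result is a step output, which is a string, hence the comparison with `'true'` there.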
53 changes: 53 additions & 0 deletions .github/workflows/publish-helm-charts.yaml
@@ -0,0 +1,53 @@
name: Publish Helm Charts

on:
push:
branches:
- master
tags:
- "v*"

env:
CHART_PATH: charts/kubeflow-trainer

jobs:
publish:
permissions:
contents: read
packages: write

runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4

- name: Login to GitHub Container Registry
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}

- name: Update appVersion in Chart.yaml
run: |
if [[ "${{ github.ref }}" == refs/tags/* ]]; then
APP_VERSION="${GITHUB_REF#refs/tags/}"
else
APP_VERSION="sha-$(git rev-parse --short=7 HEAD)"
fi

echo "appVersion: ${APP_VERSION}" >> ${{ env.CHART_PATH }}/Chart.yaml
echo "Updated Chart.yaml with appVersion: ${APP_VERSION}"

- name: Build, package, and push Helm chart
run: |
if [[ "${{ github.ref }}" == refs/tags/* ]]; then
VERSION=$(grep '^version:' ${{ env.CHART_PATH }}/Chart.yaml | awk '{print $2}')
else
VERSION="0.0.0-sha-$(git rev-parse --short=7 HEAD)"
fi
echo "Chart version set to: ${VERSION}"

helm dependency build ${{ env.CHART_PATH }}
helm package ${{ env.CHART_PATH }} --version ${VERSION}
helm push kubeflow-trainer-${VERSION}.tgz oci://ghcr.io/kubeflow/charts
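The version logic in the two shell steps above can be summarised as one pure function. This is a sketch under stated assumptions: `derive_versions` is a hypothetical name, `short_sha` stands in for `git rev-parse --short=7 HEAD`, and `chart_version` stands in for the `version:` field read from `Chart.yaml`.

```python
from typing import Tuple

TAG_PREFIX = "refs/tags/"


def derive_versions(github_ref: str, short_sha: str, chart_version: str) -> Tuple[str, str]:
    """Return (appVersion, chart version) as the workflow steps compute them."""
    if github_ref.startswith(TAG_PREFIX):
        # Tag push: appVersion is the tag name, chart version comes from Chart.yaml.
        return github_ref[len(TAG_PREFIX):], chart_version
    # Branch push: both values are derived from the short commit SHA.
    return f"sha-{short_sha}", f"0.0.0-sha-{short_sha}"


if __name__ == "__main__":
    # The chart version "2.1.0" is an illustrative value, not taken from the diff.
    print(derive_versions("refs/tags/v2.1.0", "73c9bec", "2.1.0"))
    print(derive_versions("refs/heads/master", "73c9bec", "2.1.0"))
```

Note that the workflow appends `appVersion` to `Chart.yaml` with `>>` rather than replacing an existing key, which the sketch does not model.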
2 changes: 1 addition & 1 deletion .github/workflows/template-publish-image/action.yaml
@@ -82,7 +82,7 @@ runs:
 images: ${{ inputs.image }}
 tags: |
 type=ref,event=tag
-type=raw,latest
+type=raw,value=latest,enable={{is_default_branch}}
 type=sha

 - name: Build and Push
93 changes: 93 additions & 0 deletions .github/workflows/test-e2e-gpu.yaml
@@ -0,0 +1,93 @@
name: GPU E2E Test

on:
pull_request:
types: [opened, reopened, synchronize, labeled]

permissions:
contents: read
pull-requests: read

jobs:
gpu-e2e-test:
name: GPU E2E Test
runs-on: oracle-vm-16cpu-a10gpu-240gb

env:
GOPATH: ${{ github.workspace }}/go

defaults:
run:
working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer

strategy:
fail-fast: false
matrix:
kubernetes-version: ["1.33.1"]

steps:
- name: Check GPU label
id: check-label
run: |
if [[ "${{ join(github.event.pull_request.labels.*.name, ',') }}" != *"ok-to-test-gpu-runner"* ]]; then
echo "✅ Skipping GPU E2E tests (label not present)."
echo "skip=true" >> $GITHUB_OUTPUT
exit 0
else
echo "Label found. Requesting environment approval to run GPU tests."
echo "skip=false" >> $GITHUB_OUTPUT
fi

- name: Check out code
if: steps.check-label.outputs.skip == 'false'
uses: actions/checkout@v4
with:
ref: ${{ github.event.pull_request.head.sha }}
path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer

- name: Setup Go
if: steps.check-label.outputs.skip == 'false'
uses: actions/setup-go@v5
with:
go-version-file: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/go.mod

- name: Setup Python
if: steps.check-label.outputs.skip == 'false'
uses: actions/setup-python@v5
with:
python-version: 3.11

- name: Install dependencies
if: steps.check-label.outputs.skip == 'false'
run: |
pip install papermill==2.6.0 jupyter==1.1.1 ipykernel==6.29.5
pip install git+https://github.com/kubeflow/sdk.git@main

- name: Setup cluster with GPU support using nvidia/kind
if: steps.check-label.outputs.skip == 'false'
run: |
make test-e2e-setup-gpu-cluster K8S_VERSION=${{ matrix.kubernetes-version }}

- name: Run e2e test on GPU cluster
if: steps.check-label.outputs.skip == 'false'
run: |
mkdir -p artifacts/notebooks
make test-e2e-notebook NOTEBOOK_INPUT=./examples/torchtune/qwen2_5/qwen2.5-1.5B-with-alpaca.ipynb NOTEBOOK_OUTPUT=./artifacts/notebooks/${{ matrix.kubernetes-version }}_qwen2_5_with_alpaca-trainjob-yaml.ipynb TIMEOUT=900

- name: Upload Artifacts to GitHub
if: always()
uses: actions/upload-artifact@v4
with:
name: ${{ matrix.kubernetes-version }}
path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/artifacts/*
retention-days: 1

delete-kind-cluster:
name: Delete kind Cluster
runs-on: oracle-vm-16cpu-a10gpu-240gb
needs: [gpu-e2e-test]
if: always()
steps:
- name: Delete any existing kind cluster
run: |
sudo kind delete cluster --name kind-gpu && echo "kind cluster has been deleted" || echo "kind cluster doesn't exist"
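The "Check GPU label" step joins all PR labels with commas and then does a substring match. A minimal Python mirror of that test follows; `should_skip_gpu_tests` is a hypothetical name. It shows one consequence of the substring match: any label that merely contains `ok-to-test-gpu-runner` also enables the run.

```python
from typing import Iterable

GPU_LABEL = "ok-to-test-gpu-runner"


def should_skip_gpu_tests(labels: Iterable[str]) -> bool:
    """Mirror of the shell test: skip unless the joined labels contain GPU_LABEL."""
    joined = ",".join(labels)
    # Substring match, like `[[ "$joined" != *"ok-to-test-gpu-runner"* ]]` in the step.
    return GPU_LABEL not in joined


if __name__ == "__main__":
    print(should_skip_gpu_tests(["ok-to-test"]))
    print(should_skip_gpu_tests(["ok-to-test", "ok-to-test-gpu-runner"]))
```

An exact-match alternative would compare each label for equality instead of searching the joined string.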
8 changes: 5 additions & 3 deletions .github/workflows/test-e2e.yaml
@@ -17,7 +17,7 @@ jobs:
 fail-fast: false
 matrix:
 # Kubernetes versions for e2e tests on Kind cluster.
-kubernetes-version: ["1.29.14", "1.30.0", "1.31.0", "1.32.3"]
+kubernetes-version: ["1.31.0", "1.32.3", "1.33.1", "1.34.0"]

 steps:
 - name: Check out code
@@ -40,8 +40,8 @@ jobs:
 echo "Install Papermill"
 pip install papermill==2.6.0 jupyter==1.1.1 ipykernel==6.29.5

-echo "Install Kubeflow SDK"
-pip install git+https://github.com/kubeflow/sdk.git@main#subdirectory=python
+echo "Install Kubeflow SDK with Docker support for local Notebooks"
+pip install "kubeflow[docker] @ git+https://github.com/kubeflow/sdk.git@main"

 - name: Setup cluster
 run: |
@@ -56,6 +56,8 @@ jobs:
 mkdir -p artifacts/notebooks
 make test-e2e-notebook NOTEBOOK_INPUT=./examples/pytorch/image-classification/mnist.ipynb NOTEBOOK_OUTPUT=./artifacts/notebooks/${{ matrix.kubernetes-version }}_mnist.ipynb TIMEOUT=900
 make test-e2e-notebook NOTEBOOK_INPUT=./examples/pytorch/question-answering/fine-tune-distilbert.ipynb NOTEBOOK_OUTPUT=./artifacts/notebooks/${{ matrix.kubernetes-version }}_fine-tune-distilbert.ipynb TIMEOUT=900
+make test-e2e-notebook NOTEBOOK_INPUT=./examples/local/local-training-mnist.ipynb NOTEBOOK_OUTPUT=./artifacts/notebooks/${{ matrix.kubernetes-version }}_local-training-mnist.ipynb TIMEOUT=900
+make test-e2e-notebook NOTEBOOK_INPUT=./examples/local/local-container-mnist.ipynb NOTEBOOK_OUTPUT=./artifacts/notebooks/${{ matrix.kubernetes-version }}_local-container-mnist.ipynb TIMEOUT=900

 # TODO (andreyvelich): Discuss how we can upload artifacts for multiple Notebooks.
 - name: Upload Artifacts to GitHub
3 changes: 2 additions & 1 deletion .github/workflows/test-go.yaml
@@ -8,6 +8,7 @@ jobs:
 generate:
 name: Generate
 runs-on: ubuntu-latest
+if: ${{ github.repository == 'kubeflow/trainer' }}
 env:
 GOPATH: ${{ github.workspace }}/go
 defaults:
@@ -47,6 +48,7 @@ jobs:
 test:
 name: Test
 runs-on: ubuntu-latest
+if: ${{ github.repository == 'kubeflow/trainer' }}
 env:
 GOPATH: ${{ github.workspace }}/go
 defaults:
@@ -77,4 +79,3 @@ jobs:
 with:
 path-to-profile: cover.out
 working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer
-parallel: true