[async-job] E2E Test with Sample Job #1326

Crazyglue · 2025-07-16T17:42:40Z

Description

This adds a somewhat manual e2e test case for the provided sample job. It will, in order:

Start a KinD cluster and load all dependent services (MR, Minio, Container Registry, etc.)
Create some sample data in the MR Service (RegisteredModel, ModelVersion, and ModelArtifact)
Download a tiny .onnx file (mnist) to the local file system
Upload that .onnx file to the Minio instance within the KinD cluster
Using the results from above, create a Kustomize patch file to apply to the sample job manifest
Apply those manifests to kick off the Async Job
Await its completion (10m by default)
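
The last step, waiting for the job with a 10-minute default, could be sketched in Python roughly as follows. This only illustrates the polling pattern and is not the script's actual code: `wait_for_completion`, `is_complete`, and the 5-second poll interval are assumptions introduced here.

```python
import time
from typing import Callable


def wait_for_completion(
    is_complete: Callable[[], bool],
    timeout_s: float = 600.0,  # 10 minutes, matching the default in the description
    interval_s: float = 5.0,   # hypothetical poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_complete() until it returns True or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if is_complete():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

In a real run, `is_complete` would wrap whatever check reports the Job's status; the injectable `clock` and `sleep` exist only so the loop can be exercised without actually waiting.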

How Has This Been Tested?

Locally by running the script and make commands against brand new KinD clusters

Merge criteria:

All the commits have been signed-off (To pass the DCO check)

The commits have meaningful messages
Automated tests are provided as part of the PR for major new functionalities; testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has manually tested the changes and verified that the changes work.
Code changes follow the kubeflow contribution guidelines.
For first time contributors: Please reach out to the Reviewers to ensure all tests are being run, ensuring the label ok-to-test has been added to the PR.

If you have UI changes

The developer has added tests or explained why testing cannot be added.
Included any necessary screenshots or gifs if it was a UI change.
Verify that UI/UX changes conform the UX guidelines for Kubeflow.

Crazyglue · 2025-07-16T17:43:57Z

manifests/kustomize/base/model-registry-deployment.yaml

          readinessProbe:
            initialDelaySeconds: 10
-            periodSeconds: 60
+            periodSeconds: 20


Running an E2E case can take a long time because the readiness probe period was set to 60s, even though the server is ready much sooner than that. This is a slight adjustment to bring cycle times down.
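
As a rough illustration of why the period matters, the sketch below models probes firing at `initialDelaySeconds` and then every `periodSeconds`. It is a simplified model (no jitter, timeouts, or thresholds), and the 12-second readiness time is a made-up example value, not a measurement.

```python
import math


def first_successful_probe(ready_at_s: float, initial_delay_s: int, period_s: int) -> int:
    """Return the time (s) of the first probe that fires at or after ready_at_s.

    Simplified model: probes fire at initial_delay_s, then every period_s.
    """
    if ready_at_s <= initial_delay_s:
        return initial_delay_s
    periods = math.ceil((ready_at_s - initial_delay_s) / period_s)
    return initial_delay_s + periods * period_s


# Hypothetical server that becomes ready 12s after the container starts.
old = first_successful_probe(12, initial_delay_s=10, period_s=60)
new = first_successful_probe(12, initial_delay_s=10, period_s=20)
print(old, new)  # 70 30
```

Under this model, a server that just misses the first probe waits a full period before being marked ready, so a shorter period directly shortens each test cycle.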

jobs/async-upload/samples/setup-and-apply-job.sh

Crazyglue · 2025-07-16T17:47:12Z

jobs/async-upload/samples/patches/job-values.yaml

It's not ideal, but in order to pass in all of these variables, we are now dependent on the order of the env variables. We could adjust the sample to map a ConfigMap as an env var to help clean this up, but that would not be representative of the "typical" use case (where the job, and only the job, in its entirety, is created and applied to the cluster at a single point in time).
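
To illustrate why order dependence is fragile, here is a small Python sketch. It assumes the patch addresses env entries by list position; the patch file itself is not shown in this thread, so that is an assumption, and every variable name and value below is hypothetical.

```python
from copy import deepcopy

# Hypothetical env list from a job container spec.
env = [
    {"name": "MODEL_ID", "value": "PLACEHOLDER"},
    {"name": "SOURCE_KEY", "value": "PLACEHOLDER"},
]


def patch_by_index(env_list, index, value):
    """Position-based patch: silently hits the wrong entry if the list is reordered."""
    patched = deepcopy(env_list)
    patched[index]["value"] = value
    return patched


def patch_by_name(env_list, name, value):
    """Name-based patch: independent of ordering, fails loudly if the name is missing."""
    patched = deepcopy(env_list)
    for entry in patched:
        if entry["name"] == name:
            entry["value"] = value
            return patched
    raise KeyError(name)


reordered = list(reversed(env))
by_index = patch_by_index(reordered, 0, "42")        # meant for MODEL_ID, lands on SOURCE_KEY
by_name = patch_by_name(reordered, "MODEL_ID", "42")  # still finds MODEL_ID
```

The position-based variant keeps the job manifest self-contained, which matches the trade-off described above, at the cost of breaking quietly if someone reorders the env list.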

jonburdo

lgtm - works for me with:

IMG_VERSION=e2e make test-e2e-run-sample-job

Eventually this script could probably be a test written in python.

tarilabs

Looks to me like ./scripts/deploy_on_kind.sh is trying to load another MR server image into kind; likely some env variable/default settings?

jobs/async-upload/samples/setup-and-apply-job.sh

Crazyglue · 2025-07-17T13:01:43Z

looks to me ./scripts/deploy_on_kind.sh is trying to load in kind another MR server image, likely some env variable/default settings?

AFAIK, deploy_on_kind.sh checks whether the kind cluster exists, creates it if need be, and ensures the MR service is running in that cluster. It's used both in the mr-client e2e tests and in the async-job e2e tests. I guess it has some levers to control whether it pulls an existing image or builds the image from source before loading it into the kind cluster.

But yeah, it's doing a bunch of defaulting (database deployment, db secrets, etc.) just to get the service up, running, and usable.

tarilabs

thanks a lot for the update @Crazyglue !

Some comments for your consideration, which can also go in a follow-up PR if preferred!

jobs/async-upload/pyproject.toml

jobs/async-upload/tests/integration/test_integration_async_upload.py

tarilabs · 2025-07-22T17:57:55Z

.github/workflows/async-upload-test.yml


 env:
-  IMG_REGISTRY: ghcr.io
-  IMG_ORG: kubeflow


To ease maintenance in m/s, I would really prefer to have these at the top; that way we just adjust the OCI reference in a single place when porting. Can you restore these envs here, please?

Yeah, there are some issues with how the env vars are passed around to downstream make commands, etc., so I think a more comprehensive refactor of all the Makefiles will be needed. This was really done to get all the tests to actually use the variables they should be using. For example, even providing an IMG to the Makefile for the mr-server image will not always pick up the correct IMG_VERSION. There are some hard-coded instances in some of these Makefiles.

Let me take a look and see if I can do a minimal refactor with the env vars restored here.

tarilabs

one minor comment I'd like to get clarification on, before merging please?

tarilabs · 2025-07-23T09:55:33Z

jobs/async-upload/Makefile

-		IMG_VERSION=${IMG_VERSION} make image/build ARGS="--load$(if ${DEV_BUILD}, --target dev-build)" && \
+		IMG_VERSION=${IMG_VERSION} IMG=${IMG} make image/build ARGS="--load$(if ${DEV_BUILD}, --target dev-build)" && \
+	,\
+		docker pull $(IMG) && \


This is a bit suspicious: we enter this code path when BUILD_IMAGE is true, which is also the default given the BUILD_IMAGE ?= true definition. We don't need to docker pull, which may pull from a remote container registry, do we?

This second block (the docker pull ...) should be executed when BUILD_IMAGE=false (technically, when BUILD_IMAGE is anything but true).

I added this because the later step that loads the image into the kind registry would fail when the image was not present locally. I suppose this block is no longer needed, since I've changed the CI/CD to always build the image anyway. I will remove it.
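
The intended branching described in the reply (build when BUILD_IMAGE is exactly true, pull otherwise) can be sketched in Python as below. `image_steps` is a hypothetical helper for illustration; it simplifies the Makefile lines above by omitting IMG_VERSION and the optional dev-build target.

```python
def image_steps(build_image: str, img: str) -> list[str]:
    """Return the shell steps to obtain the image before loading it into kind."""
    if build_image == "true":
        # Build locally; --load makes the result available to the local Docker daemon.
        return [f'IMG={img} make image/build ARGS="--load"']
    # Anything other than "true": fetch the prebuilt image instead.
    return [f"docker pull {img}"]
```

Written this way, it is easy to see that the pull is only reachable when the build is skipped, which is the property the review comment was checking.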

tarilabs · 2025-07-23T09:59:12Z

jobs/async-upload/tests/integration/test_integration_async_upload.py

+        # Download the model
+        response = requests.get(
+            "https://github.com/onnx/models/raw/refs/heads/main/validated/vision/classification/mnist/model/mnist-8.onnx"
+        )
+        response.raise_for_status()
+
+        with open(model_file, "wb") as f:
+            f.write(response.content)


In a follow-up PR, I would place this file in tests/data to avoid the network call; we will take the onnx file from one of our examples.
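
One possible shape for that follow-up, sketched under assumptions: prefer a checked-in fixture and only fall back to the network when it is missing. `get_model_file` and its `download` callback are hypothetical names, not part of the existing test.

```python
from pathlib import Path
from typing import Callable


def get_model_file(data_dir: Path, name: str, download: Callable[[Path], None]) -> Path:
    """Return the path to a model fixture, downloading it only if it is not present."""
    path = data_dir / name
    if not path.exists():
        data_dir.mkdir(parents=True, exist_ok=True)
        download(path)  # network fallback; never called when the fixture exists
    return path
```

With the .onnx file committed under tests/data, the `download` callback would never run in CI, which removes the dependency on GitHub being reachable during the test.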

tarilabs · 2025-07-23T09:59:34Z

jobs/async-upload/tests/integration/test_integration_async_upload.py

+        # Validate the artifact was updated correctly
+        assert updated_ma.uri != "PLACEHOLDER", f"URI was not updated: {updated_ma.uri}"
+        assert updated_ma.state == ArtifactState.LIVE, f"State was not updated to LIVE: {updated_ma.state}"


awesome, thank you

Signed-off-by: Eric Dobroveanu <[email protected]>

tarilabs

Thanks a lot for the great work @Crazyglue !

Minor comments below, which I'd say can go in follow-up PRs/tickets, because I believe it is good to take this in and unblock further work.

/lgtm
/approve

tarilabs · 2025-07-24T06:30:55Z

jobs/async-upload/tests/integration/test_integration_async_upload.py

+
+    # Verify initial state
+    assert ma.uri == "PLACEHOLDER"
+    assert ma.state == ArtifactState.UNKNOWN


This makes me realize that in the MR Python client we should probably create artifacts with state LIVE by default (in a follow-up PR/ticket!), or at least check how this aligns with the pure REST API call.

tarilabs · 2025-07-24T06:34:58Z

Makefile

+ifdef IMG
+	IMG := ${IMG}
+else ifdef IMG_REGISTRY
+    IMG := ${IMG_REGISTRY}/${IMG_ORG}/${IMG_REPO}:${IMG_VERSION}


To me this is a great catch: previously, if IMG was defined, the Makefile would have ignored it.

cc @Al-Pragliola (if we want to do differently I'm also open to followups)
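
The precedence in the excerpt above can be mirrored in a short Python sketch. `resolve_img` is a hypothetical helper; the excerpt is cut off before its final branch, so that case is deliberately left unmodelled rather than guessed.

```python
from typing import Mapping, Optional


def resolve_img(env: Mapping[str, str]) -> Optional[str]:
    """Mirror the precedence in the excerpt: explicit IMG first, then composed parts."""
    if env.get("IMG"):
        # A non-empty IMG wins outright, like the `ifdef IMG` branch.
        return env["IMG"]
    if env.get("IMG_REGISTRY"):
        return (
            f'{env["IMG_REGISTRY"]}/{env.get("IMG_ORG", "")}/'
            f'{env.get("IMG_REPO", "")}:{env.get("IMG_VERSION", "")}'
        )
    # The excerpt ends before the remaining branch, so it is not modelled here.
    return None
```

For example, `resolve_img({"IMG": "a/b:c", "IMG_REGISTRY": "ghcr.io"})` returns the explicit `a/b:c` rather than a composed reference.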

tarilabs · 2025-07-24T06:37:05Z

jobs/async-upload/Makefile

+
+# MR Server Params
+IMG_VERSION ?= latest
+IMG ?= ghcr.io/kubeflow/model-registry/server:$(IMG_VERSION)


I'm really tempted to

Suggested change

IMG ?= ghcr.io/kubeflow/model-registry/server:$(IMG_VERSION)

IMG ?= $(JOB_IMG_REGISTRY)/$(JOB_IMG_ORG)/model-registry/server:$(IMG_VERSION)

but I don't want to diverge on the scope of the PR too much :)

google-oss-prow · 2025-07-24T06:38:28Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tarilabs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [tarilabs]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

* chore(async-job): add script to setup and run sample job
Signed-off-by: Eric Dobroveanu <[email protected]>
* chore: adjust readiness probe for faster tests
Signed-off-by: Eric Dobroveanu <[email protected]>
* test(async-job): convert bash-based test to python-based
Signed-off-by: Eric Dobroveanu <[email protected]>
* test(async-job): add readme for integration tests
Signed-off-by: Eric Dobroveanu <[email protected]>
* chore(async-job): ensure correct make target is run in GH action
Signed-off-by: Eric Dobroveanu <[email protected]>
* chore(async-job): update lockfile and convert to use boto3
Signed-off-by: Eric Dobroveanu <[email protected]>
* test(async-job): simplify the integration tests
Signed-off-by: Eric Dobroveanu <[email protected]>
* chore(async-job): remove unused job-values.yaml
Signed-off-by: Eric Dobroveanu <[email protected]>
* chore(async-job): ensure async job has a separate env var from mr service
Signed-off-by: Eric Dobroveanu <[email protected]>
* chore(async-job): adjust e2e tests to be able to build the images
Signed-off-by: Eric Dobroveanu <[email protected]>
* chore(async-job): move env vars to the top level
Signed-off-by: Eric Dobroveanu <[email protected]>
---------
Signed-off-by: Eric Dobroveanu <[email protected]>
Signed-off-by: Taj010 <[email protected]>

google-oss-prow bot requested review from andreyvelich and tarilabs July 16, 2025 17:42

github-actions bot added Area/GitHub Area/Manifests labels Jul 16, 2025

google-oss-prow bot added the size/L label Jul 16, 2025

jonburdo reviewed Jul 16, 2025

tarilabs reviewed Jul 17, 2025

google-oss-prow bot added size/XL and removed size/L labels Jul 18, 2025

tarilabs reviewed Jul 18, 2025

Crazyglue requested a review from tarilabs July 21, 2025 14:58

Crazyglue force-pushed the test/async-job-e2e-expansion branch from 42c28bf to 03bfa29 Compare July 21, 2025 15:04

Crazyglue added 8 commits July 22, 2025 09:28

chore(async-job): add script to setup and run sample job

5f30dc6

Signed-off-by: Eric Dobroveanu <[email protected]>

chore: adjust readiness probe for faster tests

180ed2b

Signed-off-by: Eric Dobroveanu <[email protected]>

test(async-job): convert bash-based test to python-based

f1683a0

Signed-off-by: Eric Dobroveanu <[email protected]>

test(async-job): add readme for integration tests

da6bd7a

Signed-off-by: Eric Dobroveanu <[email protected]>

chore(async-job): ensure correct make target is run in GH action

08a7afd

Signed-off-by: Eric Dobroveanu <[email protected]>

chore(async-job): update lockfile and convert to use boto3

04650d3

Signed-off-by: Eric Dobroveanu <[email protected]>

test(async-job): simplify the integration tests

d5535c9

Signed-off-by: Eric Dobroveanu <[email protected]>

chore(async-job): remove unused job-values.yaml

82b9161

Signed-off-by: Eric Dobroveanu <[email protected]>

Crazyglue force-pushed the test/async-job-e2e-expansion branch 6 times, most recently from bce9784 to e84e67a Compare July 22, 2025 14:11

Crazyglue force-pushed the test/async-job-e2e-expansion branch 6 times, most recently from 5af6854 to e8bd7f6 Compare July 22, 2025 14:48

chore(async-job): ensure async job has a separate env var from mr service

71969a0

Signed-off-by: Eric Dobroveanu <[email protected]>

Crazyglue force-pushed the test/async-job-e2e-expansion branch 2 times, most recently from da54d68 to 584204e Compare July 22, 2025 15:16

chore(async-job): adjust e2e tests to be able to build the images

26ddd2a

Signed-off-by: Eric Dobroveanu <[email protected]>

Crazyglue force-pushed the test/async-job-e2e-expansion branch from 584204e to 26ddd2a Compare July 22, 2025 17:36

tarilabs reviewed Jul 22, 2025

tarilabs reviewed Jul 23, 2025

Crazyglue force-pushed the test/async-job-e2e-expansion branch 3 times, most recently from 4fd086c to ffe7ebc Compare July 23, 2025 19:06

chore(async-job): move env vars to the top level

f28329e

Signed-off-by: Eric Dobroveanu <[email protected]>

Crazyglue force-pushed the test/async-job-e2e-expansion branch from ffe7ebc to f28329e Compare July 23, 2025 19:10

tarilabs approved these changes Jul 24, 2025

google-oss-prow bot assigned tarilabs Jul 24, 2025

google-oss-prow bot added the lgtm label Jul 24, 2025

google-oss-prow bot added the approved label Jul 24, 2025

google-oss-prow bot merged commit 8356c82 into kubeflow:main Jul 24, 2025
25 checks passed

This was referenced Jul 24, 2025

ci: fix root Make image/push #1372

Merged

periodic sync upstream KF to midstream ODH opendatahub-io/model-registry#291

Merged

Crazyglue deleted the test/async-job-e2e-expansion branch July 24, 2025 13:20

jonburdo mentioned this pull request Oct 24, 2025

add jonburdo as a reviewer #1796

Merged
