@Crazyglue
Contributor

Description

This adds a semi-manual e2e test case for the provided sample job. It will, in order:

  1. Start a KinD cluster and load all dependent services (MR, MinIO, container registry, etc.)
  2. Create some sample data in the MR service (RegisteredModel, ModelVersion, and ModelArtifact)
  3. Download a tiny .onnx file (mnist) to the local file system
  4. Upload that .onnx file to the MinIO instance within the KinD cluster
  5. Using the results from above, create a Kustomize patch file to apply to the sample job manifest
  6. Apply those manifests to kick off the async job
  7. Wait for its completion (10m by default)
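
For illustration only, the final "wait for completion" step can be thought of as a bounded polling loop. The sketch below is not the code in this PR; `wait_until` and its parameters are hypothetical names, and the `check` callable stands in for whatever the test uses to ask the cluster whether the job has finished. The 600-second default mirrors the 10m default mentioned above.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 600.0,   # 10 minutes, matching the default in the description
    interval_s: float = 5.0,    # hypothetical polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns True or timeout_s has elapsed.

    Returns True if the condition was met, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would pass a function that queries the job status and fail the test when `wait_until` returns `False`.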

How Has This Been Tested?

Locally, by running the script and make commands against brand-new KinD clusters.

Merge criteria:

  • All the commits have been signed-off (To pass the DCO check)
  • The commits have meaningful messages
  • Automated tests are provided as part of the PR for major new functionalities; testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.
  • Code changes follow the kubeflow contribution guidelines.
  • For first time contributors: Please reach out to the Reviewers to ensure all tests are being run, ensuring the label ok-to-test has been added to the PR.

If you have UI changes

  • The developer has added tests or explained why testing cannot be added.
  • Included any necessary screenshots or gifs if it was a UI change.
  • Verify that UI/UX changes conform the UX guidelines for Kubeflow.

  readinessProbe:
    initialDelaySeconds: 10
-   periodSeconds: 60
+   periodSeconds: 20
Contributor Author

Running an E2E case can take a long time because the readiness probe period was set to 60s, even though the server is ready much sooner than that. This is a slight adjustment to lower cycle times.
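
To make the effect concrete, here is a rough back-of-the-envelope model, not anything from the PR. It assumes probes fire at `initialDelaySeconds` and then every `periodSeconds`, ignores kubelet jitter, probe timeouts, and success thresholds, and uses a hypothetical server start-up time of 15 seconds.

```python
import math


def first_ready_at(server_ready_s: int, initial_delay_s: int, period_s: int) -> int:
    """Time (in seconds) of the first probe that fires at or after the server is ready.

    Simplified model: probes fire at initial_delay_s, initial_delay_s + period_s, ...
    """
    if server_ready_s <= initial_delay_s:
        return initial_delay_s
    probes_needed = math.ceil((server_ready_s - initial_delay_s) / period_s)
    return initial_delay_s + probes_needed * period_s


# Hypothetical server that becomes ready 15s after start.
print(first_ready_at(15, initial_delay_s=10, period_s=60))  # 70
print(first_ready_at(15, initial_delay_s=10, period_s=20))  # 30
```

Under these assumptions, a server that just misses the first probe waits a full period for the next one, so a shorter period directly shortens the wait.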

Contributor Author

It's not ideal, but in order to pass in all of these variables, we are now dependent on the order of the env variables. We could adjust the sample to map a ConfigMap as an env var to help clean this up, but that would not be representative of the "typical" use case (where the job, and only the job, is created in its entirety and applied to the cluster at a single point in time).
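
As a sketch of why ordering matters: a JSON Patch (RFC 6902) addresses list entries by position, so a patch against a container's `env` list has to know each variable's index. The helper below is hypothetical and not part of this PR; it assumes a `batch/v1` Job with the env list under `/spec/template/spec/containers/<n>/env`, and the variable names in the usage example are made up. It looks the index up by name so the position is computed rather than hard-coded.

```python
from typing import Dict, List


def env_replace_ops(
    env: List[Dict[str, str]],
    overrides: Dict[str, str],
    container_index: int = 0,
) -> List[Dict[str, str]]:
    """Build JSON Patch 'replace' operations for env values, resolving indexes by name."""
    index_by_name = {entry["name"]: i for i, entry in enumerate(env)}
    ops = []
    for name, value in overrides.items():
        if name not in index_by_name:
            raise KeyError(f"env var {name!r} not found in manifest")
        ops.append(
            {
                "op": "replace",
                # Assumed path layout for a Job's pod template.
                "path": f"/spec/template/spec/containers/{container_index}"
                        f"/env/{index_by_name[name]}/value",
                "value": value,
            }
        )
    return ops


# Hypothetical env list and override, for illustration only.
manifest_env = [
    {"name": "EXAMPLE_SOURCE_URI", "value": "PLACEHOLDER"},
    {"name": "EXAMPLE_MODEL_ID", "value": "PLACEHOLDER"},
]
print(env_replace_ops(manifest_env, {"EXAMPLE_MODEL_ID": "1"}))
```

This still depends on reading the manifest first, so it trades the ordering dependency for a parsing step.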

Member

@jonburdo jonburdo left a comment

lgtm - works for me with:

IMG_VERSION=e2e make test-e2e-run-sample-job 

Eventually this script could probably be a test written in Python.

Member

@tarilabs tarilabs left a comment

It looks to me like ./scripts/deploy_on_kind.sh is trying to load another MR server image into kind, likely due to some env variable or default setting?

@Crazyglue
Contributor Author

> looks to me ./scripts/deploy_on_kind.sh is trying to load in kind another MR server image, likely some env variable/default settings?

AFAIK, deploy_on_kind.sh checks whether the kind cluster exists, creates it if need be, and ensures the MR service is running in that cluster. It's used both in the mr-client e2e tests and in the async-job e2e tests. I guess it has some levers to change whether it pulls an existing image or builds the image from source before loading it into the kind cluster.

But yeah, it's doing a bunch of defaulting (database deployment, DB secrets, etc.) just to get the service up, running, and usable.

@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Jul 18, 2025
Member

@tarilabs tarilabs left a comment

thanks a lot for the update @Crazyglue !

Some comments for your consideration, also in a follow-up PR if preferred!

@Crazyglue Crazyglue requested a review from tarilabs July 21, 2025 14:58
@Crazyglue Crazyglue force-pushed the test/async-job-e2e-expansion branch from 42c28bf to 03bfa29 Compare July 21, 2025 15:04
@Crazyglue Crazyglue force-pushed the test/async-job-e2e-expansion branch 6 times, most recently from bce9784 to e84e67a Compare July 22, 2025 14:11
@Crazyglue Crazyglue force-pushed the test/async-job-e2e-expansion branch 6 times, most recently from 5af6854 to e8bd7f6 Compare July 22, 2025 14:48
@Crazyglue Crazyglue force-pushed the test/async-job-e2e-expansion branch 2 times, most recently from da54d68 to 584204e Compare July 22, 2025 15:16
@Crazyglue Crazyglue force-pushed the test/async-job-e2e-expansion branch from 584204e to 26ddd2a Compare July 22, 2025 17:36

env:
  IMG_REGISTRY: ghcr.io
  IMG_ORG: kubeflow
Member

To ease maintenance in m/s, I would really prefer to have these at the top; that way we only adjust the OCI reference in a single place when porting. Can you restore these envs here, please?

Contributor Author

Yeah, there are some issues with how the env vars are passed around to downstream make commands, so I think a more comprehensive refactor of all the makefiles will be needed. This was really done to get all the tests to actually use the variables they should be using. For example, even providing an IMG to the makefile for the mr-server image will not always pick up the correct IMG_VERSION. There are some hard-coded instances in some of these makefiles.

Let me take a look and see if I can do a minimal refactor with the env vars here restored.

Member

@tarilabs tarilabs left a comment

One minor comment I'd like to get clarification on before merging, please.

- IMG_VERSION=${IMG_VERSION} make image/build ARGS="--load$(if ${DEV_BUILD}, --target dev-build)" && \
+ IMG_VERSION=${IMG_VERSION} IMG=${IMG} make image/build ARGS="--load$(if ${DEV_BUILD}, --target dev-build)" && \
  ,\
  docker pull $(IMG) && \
Member

This is a bit suspicious: we enter this code path when BUILD_IMAGE is true, which is also the default given the BUILD_IMAGE ?= true definition. In that case we shouldn't need to docker pull, which may pull from a remote container registry?

Contributor Author

This second block (the docker pull ...) should be executed when BUILD_IMAGE=false (technically, when BUILD_IMAGE is anything but true).

I added this because the later step that loads the image into the kind registry would fail when the image was not present locally. I suppose this block is no longer needed since I've changed the CI/CD to always build the image anyway. I will remove it.

Comment on lines +285 to +292
# Download the model
response = requests.get(
    "https://github.com/onnx/models/raw/refs/heads/main/validated/vision/classification/mnist/model/mnist-8.onnx"
)
response.raise_for_status()

with open(model_file, "wb") as f:
    f.write(response.content)
Member

In a follow-up PR, I would place this file in tests/data to avoid the network call; we can take the ONNX file from one of our examples.
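
One possible shape for that follow-up, sketched under assumptions: `provide_model` is a hypothetical helper, the fixture location is whatever ends up under tests/data, and the optional `download` callable stands in for the existing `requests.get(...)` call. The test would prefer the checked-in file and only fall back to the network if one is explicitly supplied.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional


def provide_model(
    dest: Path,
    fixture: Path,
    download: Optional[Callable[[], bytes]] = None,
) -> Path:
    """Copy the model from a local fixture, or fall back to a download callable."""
    if fixture.is_file():
        # Preferred path: no network access needed.
        shutil.copyfile(fixture, dest)
        return dest
    if download is None:
        raise FileNotFoundError(f"fixture {fixture} is missing and no download fallback was given")
    dest.write_bytes(download())
    return dest
```

Keeping the fallback as an injected callable means the test itself never imports an HTTP client unless a caller opts in.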

Comment on lines +347 to +349
# Validate the artifact was updated correctly
assert updated_ma.uri != "PLACEHOLDER", f"URI was not updated: {updated_ma.uri}"
assert updated_ma.state == ArtifactState.LIVE, f"State was not updated to LIVE: {updated_ma.state}"
Member

awesome, thank you

@Crazyglue Crazyglue force-pushed the test/async-job-e2e-expansion branch 3 times, most recently from 4fd086c to ffe7ebc Compare July 23, 2025 19:06
@Crazyglue Crazyglue force-pushed the test/async-job-e2e-expansion branch from ffe7ebc to f28329e Compare July 23, 2025 19:10
Member

@tarilabs tarilabs left a comment

Thanks a lot for the great work @Crazyglue !

Minor comments below, which I'd say can go in follow-up PRs/tickets, because I believe this is good to take in to unblock further work.

/lgtm
/approve


# Verify initial state
assert ma.uri == "PLACEHOLDER"
assert ma.state == ArtifactState.UNKNOWN
Member

This makes me realize that in the MR Python client we should probably create artifacts with state LIVE by default (in a follow-up PR/ticket!), or at least check how this aligns with the pure REST API call.

Comment on lines +34 to +37
ifdef IMG
IMG := ${IMG}
else ifdef IMG_REGISTRY
IMG := ${IMG_REGISTRY}/${IMG_ORG}/${IMG_REPO}:${IMG_VERSION}
Member

To me this is a great catch: previously, if IMG was defined, the makefile would have ignored it.

cc @Al-Pragliola (if we want to do differently I'm also open to followups)
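
For readers less familiar with make conditionals, the precedence in the snippet above can be restated in Python. This is only a sketch of the two branches visible in the diff; `resolve_img` is a hypothetical name, and any further `else` branches of the makefile are not shown here, so the function returns `None` for them. Like `ifdef`, it treats an empty value as undefined.

```python
from typing import Mapping, Optional


def resolve_img(env: Mapping[str, str]) -> Optional[str]:
    """Mirror the visible ifdef chain: an explicit IMG wins, else compose from parts."""
    if env.get("IMG"):
        return env["IMG"]
    if env.get("IMG_REGISTRY"):
        # Missing parts expand to empty strings, as unset make variables would.
        return (
            f'{env["IMG_REGISTRY"]}/{env.get("IMG_ORG", "")}'
            f'/{env.get("IMG_REPO", "")}:{env.get("IMG_VERSION", "")}'
        )
    return None  # remaining branches are not shown in the snippet


print(resolve_img({"IMG": "example.test/org/repo:tag", "IMG_REGISTRY": "ghcr.io"}))
```

The usage line passes a made-up IMG value together with IMG_REGISTRY to show that the explicit IMG takes priority.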


# MR Server Params
IMG_VERSION ?= latest
IMG ?= ghcr.io/kubeflow/model-registry/server:$(IMG_VERSION)
Member

I'm really tempted to

Suggested change
- IMG ?= ghcr.io/kubeflow/model-registry/server:$(IMG_VERSION)
+ IMG ?= $(JOB_IMG_REGISTRY)/$(JOB_IMG_ORG)/model-registry/server:$(IMG_VERSION)

but I don't want to diverge on the scope of the PR too much :)

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tarilabs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 8356c82 into kubeflow:main Jul 24, 2025
25 checks passed
@Crazyglue Crazyglue deleted the test/async-job-e2e-expansion branch July 24, 2025 13:20
Taj010 pushed a commit to Taj010/model-registry that referenced this pull request Aug 8, 2025
* chore(async-job): add script to setup and run sample job

Signed-off-by: Eric Dobroveanu <[email protected]>

* chore: adjust readiness probe for faster tests

Signed-off-by: Eric Dobroveanu <[email protected]>

* test(async-job): convert bash-based test to python-based

Signed-off-by: Eric Dobroveanu <[email protected]>

* test(async-job): add readme for integration tests

Signed-off-by: Eric Dobroveanu <[email protected]>

* chore(async-job): ensure correct make target is run in GH action

Signed-off-by: Eric Dobroveanu <[email protected]>

* chore(async-job): update lockfile and convert to use boto3

Signed-off-by: Eric Dobroveanu <[email protected]>

* test(async-job): simplify the integration tests

Signed-off-by: Eric Dobroveanu <[email protected]>

* chore(async-job): remove unused job-values.yaml

Signed-off-by: Eric Dobroveanu <[email protected]>

* chore(async-job): ensure async job has a separate env var from mr service

Signed-off-by: Eric Dobroveanu <[email protected]>

* chore(async-job): adjust e2e tests to be able to build the images

Signed-off-by: Eric Dobroveanu <[email protected]>

* chore(async-job): move env vars to the top level

Signed-off-by: Eric Dobroveanu <[email protected]>

---------

Signed-off-by: Eric Dobroveanu <[email protected]>
Signed-off-by: Taj010 <[email protected]>
3 participants