-
Notifications
You must be signed in to change notification settings - Fork 151
[async-job] E2E Test with Sample Job #1326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[async-job] E2E Test with Sample Job #1326
Conversation
| readinessProbe: | ||
| initialDelaySeconds: 10 | ||
| periodSeconds: 60 | ||
| periodSeconds: 20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running a E2E case can take a long time since the readiness probe was set to 60s, despite the server being ready much sooner than that. This is a slight adjustment just to get cycle times lower
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its not ideal, but in order to pass in the all of these variables, we are now dependent on the order of the env variables. We could adjust the sample to map a ConfigMap as an ENV var to help clean this up, but that would not be representative of the "typical" use case (where the job and only the job, in its entirety, is created and applied to the cluster at a single point in time)
jonburdo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm - works for me with:
IMG_VERSION=e2e make test-e2e-run-sample-job
Eventually this script could probably be a test written in python.
tarilabs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks to me ./scripts/deploy_on_kind.sh is trying to load in kind another MR server image, likely some env variable/default settings?
afaik, but yeah its doing a bunch of defaulting (like database deployment, db secrets, etc) just to get the service up and running and usable. |
tarilabs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks a lot for the update @Crazyglue !
some comments for your consideration, also in follow\up PR if preferred!
jobs/async-upload/tests/integration/test_integration_async_upload.py
Outdated
Show resolved
Hide resolved
42c28bf to
03bfa29
Compare
Signed-off-by: Eric Dobroveanu <[email protected]>
Signed-off-by: Eric Dobroveanu <[email protected]>
Signed-off-by: Eric Dobroveanu <[email protected]>
Signed-off-by: Eric Dobroveanu <[email protected]>
Signed-off-by: Eric Dobroveanu <[email protected]>
Signed-off-by: Eric Dobroveanu <[email protected]>
Signed-off-by: Eric Dobroveanu <[email protected]>
Signed-off-by: Eric Dobroveanu <[email protected]>
bce9784 to
e84e67a
Compare
5af6854 to
e8bd7f6
Compare
…vice Signed-off-by: Eric Dobroveanu <[email protected]>
da54d68 to
584204e
Compare
Signed-off-by: Eric Dobroveanu <[email protected]>
584204e to
26ddd2a
Compare
|
|
||
| env: | ||
| IMG_REGISTRY: ghcr.io | ||
| IMG_ORG: kubeflow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To ease maintenance in m/s i would really prefer to have these at the top, this way we just adjust OCI reference in a single place when porting. Can you restore these Envs here, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there are some issues with how the ENV vars are being passed around to downstream make commands, etc. So I think a more comprehensive refactor of all the make files will be needed. This was really done to get all the tests to actually use the variables it should be using. For example, even providing an IMG to the make file for the mr-server image will not always take the correct IMG_VERSION. There are some hard-coded instances in some of these makefiles.
Let me take a look and see if I can do a minimal refactor with the env vars here restored
tarilabs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one minor comment I'd like to get clarification on, before merging please?
jobs/async-upload/Makefile
Outdated
| IMG_VERSION=${IMG_VERSION} make image/build ARGS="--load$(if ${DEV_BUILD}, --target dev-build)" && \ | ||
| IMG_VERSION=${IMG_VERSION} IMG=${IMG} make image/build ARGS="--load$(if ${DEV_BUILD}, --target dev-build)" && \ | ||
| ,\ | ||
| docker pull $(IMG) && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a bit suspicious; we're entering this code path when BUILD_IMAGE is true, which is also the default with BUILD_IMAGE ?= true definition; we don't have to docker pull which may pull from remote container registry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This second block (the docker pull ...) should be executed when BUILD_IMAGE=false (technically, when BUILD_IMAGE is anything but true).
I added this because during the later step to load the image into the kind registry, it would fail since the image was not present locally. I suppose this block should not be needed anymore since I've changed the ci/cd to always build the image anyways. I will remove them
| # Download the model | ||
| response = requests.get( | ||
| "https://github.com/onnx/models/raw/refs/heads/main/validated/vision/classification/mnist/model/mnist-8.onnx" | ||
| ) | ||
| response.raise_for_status() | ||
|
|
||
| with open(model_file, "wb") as f: | ||
| f.write(response.content) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in a follow-up PR, I would place this file in tests/data to avoid the network call, we will take onnx from one of our examples.
| # Validate the artifact was updated correctly | ||
| assert updated_ma.uri != "PLACEHOLDER", f"URI was not updated: {updated_ma.uri}" | ||
| assert updated_ma.state == ArtifactState.LIVE, f"State was not updated to LIVE: {updated_ma.state}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome, thank you
4fd086c to
ffe7ebc
Compare
Signed-off-by: Eric Dobroveanu <[email protected]>
ffe7ebc to
f28329e
Compare
tarilabs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the great work @Crazyglue !
Minor comment below I'd say in followup PRs/ticket because I believe this is good to take this in and unblock further work
/lgtm
/approve
|
|
||
| # Verify initial state | ||
| assert ma.uri == "PLACEHOLDER" | ||
| assert ma.state == ArtifactState.UNKNOWN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes me realize in the MR python client we should probably by default create Artifact State live ... (in a followup PR/ticket!) or check which alignment with the pure REST API call
| ifdef IMG | ||
| IMG := ${IMG} | ||
| else ifdef IMG_REGISTRY | ||
| IMG := ${IMG_REGISTRY}/${IMG_ORG}/${IMG_REPO}:${IMG_VERSION} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to me this is a great catch, if IMG was defined the makefile would have ignored previously
cc @Al-Pragliola (if we want to do differently I'm also open to followups)
|
|
||
| # MR Server Params | ||
| IMG_VERSION ?= latest | ||
| IMG ?= ghcr.io/kubeflow/model-registry/server:$(IMG_VERSION) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really tempted to
| IMG ?= ghcr.io/kubeflow/model-registry/server:$(IMG_VERSION) | |
| IMG ?= $(JOB_IMG_REGISTRY)/$(JOB_IMG_ORG)/model-registry/server:$(IMG_VERSION) |
but I don't want to diverge on the scope of the PR too much :)
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tarilabs The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* chore(async-job): add script to setup and run sample job Signed-off-by: Eric Dobroveanu <[email protected]> * chore: adjust readiness probe for faster tests Signed-off-by: Eric Dobroveanu <[email protected]> * test(async-job): convert bash-based test to python-based Signed-off-by: Eric Dobroveanu <[email protected]> * test(async-job): add readme for integration tests Signed-off-by: Eric Dobroveanu <[email protected]> * chore(async-job): ensure correct make target is run in GH action Signed-off-by: Eric Dobroveanu <[email protected]> * chore(async-job): update lockfile and convert to use boto3 Signed-off-by: Eric Dobroveanu <[email protected]> * test(async-job): simplify the integration tests Signed-off-by: Eric Dobroveanu <[email protected]> * chore(async-job): remove unused job-values.yaml Signed-off-by: Eric Dobroveanu <[email protected]> * chore(async-job): ensure async job has a separate env var from mr service Signed-off-by: Eric Dobroveanu <[email protected]> * chore(async-job): adjust e2e tests to be able to build the images Signed-off-by: Eric Dobroveanu <[email protected]> * chore(async-job): move env vars to the top level Signed-off-by: Eric Dobroveanu <[email protected]> --------- Signed-off-by: Eric Dobroveanu <[email protected]> Signed-off-by: Taj010 <[email protected]>
* chore(async-job): add script to setup and run sample job Signed-off-by: Eric Dobroveanu <[email protected]> * chore: adjust readiness probe for faster tests Signed-off-by: Eric Dobroveanu <[email protected]> * test(async-job): convert bash-based test to python-based Signed-off-by: Eric Dobroveanu <[email protected]> * test(async-job): add readme for integration tests Signed-off-by: Eric Dobroveanu <[email protected]> * chore(async-job): ensure correct make target is run in GH action Signed-off-by: Eric Dobroveanu <[email protected]> * chore(async-job): update lockfile and convert to use boto3 Signed-off-by: Eric Dobroveanu <[email protected]> * test(async-job): simplify the integration tests Signed-off-by: Eric Dobroveanu <[email protected]> * chore(async-job): remove unused job-values.yaml Signed-off-by: Eric Dobroveanu <[email protected]> * chore(async-job): ensure async job has a separate env var from mr service Signed-off-by: Eric Dobroveanu <[email protected]> * chore(async-job): adjust e2e tests to be able to build the images Signed-off-by: Eric Dobroveanu <[email protected]> * chore(async-job): move env vars to the top level Signed-off-by: Eric Dobroveanu <[email protected]> --------- Signed-off-by: Eric Dobroveanu <[email protected]> Signed-off-by: Taj010 <[email protected]>
Description
This adds a somewhat manually run e2e test case for the sample job provided. It will (in-order):
.onnxfile (mnist) to local file system.onnxfile to the minio instance within the KinD cluster10mby default)How Has This Been Tested?
Locally by running the script and make commands against brand new KinD clusters
Merge criteria:
DCOcheck)ok-to-testhas been added to the PR.If you have UI changes