Add support for Experiment tracking in Model Registry, fixes #1224 #1318

dhirajsb · 2025-07-15T04:22:48Z

Description

Add support for Experiment Tracking APIs similar to those of MLflow and other registries such as wandb.
This is required for experiment tracking with the Model Registry SDK, and it also forms the basis of the MLflow SDK integration feature #1225.

See the feature request #1224 for a detailed description of the enhancements in this PR.

How Has This Been Tested?

Added extensive unit and integration tests for every entity and endpoint for Experiment tracking.
Detailed unit tests for the filterQuery search enhancement.

Merge criteria:

All the commits have been signed-off (To pass the DCO check)

The commits have meaningful messages
Automated tests are provided as part of the PR for major new functionalities; testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has manually tested the changes and verified that the changes work.
Code changes follow the kubeflow contribution guidelines.
For first time contributors: Please reach out to the Reviewers to ensure all tests are being run, ensuring the label ok-to-test has been added to the PR.

If you have UI changes

The developer has added tests or explained why testing cannot be added.
Included any necessary screenshots or gifs if it was a UI change.
Verify that UI/UX changes conform to the UX guidelines for Kubeflow.

dhirajsb · 2025-07-15T06:43:05Z

I'm working on a couple more fixes for minor issues found while testing against the MLflow plugin POC.

tarilabs · 2025-07-15T06:51:08Z

/hold ref #1318 (comment)

dhirajsb · 2025-07-17T01:09:43Z

Fixed the last few issues with metric history. All tests now pass with the MLflow plugin, which is coming in another PR.

dhirajsb · 2025-07-17T03:28:53Z

Looks like the embedmd DB testing classes were modified on main. So, I need to rebase this PR again, hopefully tomorrow.

pboyd

This is bigger than I can effectively review in one sitting. Is it possible to split it up?

api/openapi/model-registry.yaml

api/openapi/catalog.yaml

syntaxsdev · 2025-07-17T13:55:55Z

Would it be worth it to remove the mr_openapi/ python src files and do a separate PR?

dhirajsb · 2025-07-17T15:33:57Z

@syntaxsdev I added the generated client files to pass a check that was failing. But they are not part of the MR client SDK and we'd still need the client SDK work you did in another PR after merging this PR.
Or, we'll have to disable that check if we want to merge this PR without the generated python client code.

clients/python/tests/test_api.py

fege

/lgtm

dhirajsb · 2025-08-08T03:21:53Z

Identified one more issue, detected by the fuzzer tests, that needs to be fixed. Working on a fix now. That should be the last one for all fuzzer tests.

dhirajsb · 2025-08-08T04:30:18Z

All fuzzer tests are now passing. The PR is ready to merge.

fege · 2025-08-08T09:58:51Z

/verified
/lgtm

rareddy · 2025-08-08T16:25:09Z

Any concerns @Al-Pragliola @tarilabs in moving forward and merging this?

Al-Pragliola · 2025-08-11T11:03:40Z

/lgtm

tarilabs · 2025-08-11T11:10:31Z

> Any concerns @Al-Pragliola @tarilabs in moving forward and merging this?

we should account for the unresolved comment and ensure the "edge cases" in the backend do not spawn further

the same goes for the general comment that doing a feature this way is not efficient

that said, we already discussed 1. Release, 2. merge this, 3. merge MLMD codepath removal PR; did you have another plan of steps in mind @rareddy ? 🤔

syntaxsdev · 2025-08-11T14:12:50Z

/lgtm

reviewed delta changes since last approval

tarilabs

again many thanks to @dhirajsb for leading this impressive work and the exceptional review efforts by all involved

per discussions
/approve

google-oss-prow · 2025-08-12T09:24:46Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tarilabs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [tarilabs]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rareddy · 2025-08-12T11:59:40Z

Great work by all the folks in seeing this through!

dhirajsb · 2025-08-12T14:56:33Z

Thanks for the detailed reviews on a big PR everyone. Great team work for a huge feature improvement in the next phase of this project.

…#1224 (kubeflow#1318)

* feat: initial version of experiments and runs API Signed-off-by: Dhiraj Bokde <[email protected]>
* feat: experiments and runs initial implementation (wip) Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: fixed failing unit tests for experiments and runs Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: added experiment and experimentrun tests Signed-off-by: Dhiraj Bokde <[email protected]>
* feat: added DataSet, Metric, and Parameter types Signed-off-by: Dhiraj Bokde <[email protected]>
* feat: added implementatio of DataSet, Metric, and Param, including service tests Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: replace int properties for timestamps with string because mlmd type properties only support int32, not int64 Signed-off-by: Dhiraj Bokde <[email protected]>
* feat: add support for artifactType query param to filter artifact types in artifact queries Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: add metrics history endpoint and metric history storage for experiment run metrics Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: fix artifactType query param type in generated service Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: fix go lint error in unit test Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: filter out metric history from artifacts endpoints Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: fix metric history name to use last update time to avoid name conflicts Signed-off-by: Dhiraj Bokde <[email protected]>
* feat: add filterQuery param on all context types to search by properties and custom properties Signed-off-by: Dhiraj Bokde <[email protected]>
* feat: initial version of experiment tracking implemented on embedmd, rebased on main Signed-off-by: Dhiraj Bokde <[email protected]>
* feat: add support for filterQuery parameter for all ListResponse endpoints for embedmd datastore Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: add support for stepIds query parameter in embedmd datastore Signed-off-by: Dhiraj Bokde <[email protected]>
* feat: refactor embedmd db service to use generic repository implementation to reduce code duplication Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: add support for artifactType query parameter for embedmd datastore Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: use mysql 8.3 in unit tests Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: refactor name mapping and default name handling in embedmd datastore Signed-off-by: Dhiraj Bokde <[email protected]>
* feat: support updating metrics and parameters by name, fix ignoring metric history when retrieving all artifacts for runs and versions Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: add missing generated openapi python client files for PR github action check Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: fix failing shared db tests Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: add support for metric and parameter description, add missing type property migraiton Signed-off-by: Dhiraj Bokde <[email protected]>
* chore: update files from main Signed-off-by: Alessio Pragliola <[email protected]>
* fix: added missing godoc comments in pkg/api/api.go Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: replace ambiguous ArtifactListReponse return type from GetExperimentRunMetricHistory with MetricListResponse Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: fixed incorrect artifactType in dataset response, added tests to verify all artifact types Signed-off-by: Dhiraj Bokde <[email protected]>
* feat: add validation for endTimeSinceEpoch property on experiment run updates Signed-off-by: Dhiraj Bokde <[email protected]>
* Replace value type validation map with a switch in query_translator.go Co-authored-by: Paul Boyd <[email protected]> Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: add service e2e tests for filterQuery, fix name query param handling, fix DB tests that didn't use parent id prefix Signed-off-by: Dhiraj Bokde <[email protected]>
* chore: code cleanup, replace interface{} with any, added vetting for internal/db/filter Signed-off-by: Dhiraj Bokde <[email protected]>
* chore: added flag vF for fixed string grep exclude Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: copied orderby and parameters back to registry and catalog to have different values Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: fixed mlmd query translator handling of escaped backslashes Signed-off-by: Dhiraj Bokde <[email protected]>
* chore: add test to verify parseCustomPropertyField won't panic with a property name ending in dot Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: sync generated python client code Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: readiness probe tests and new types Signed-off-by: Alessio Pragliola <[email protected]>
* chore: refactor readiness_test Signed-off-by: Alessio Pragliola <[email protected]>
* fix: ensure parentResourceId is used to filter resource lookup by params, add unit tests for duplicate child resource lookups Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: throw an error if a metric value is missing, add test to validate Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: fix http status error code for invalid ids Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: more id validation, fixed filterQuery passing to DB layer Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: fix failing unit test Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: validate experiment id when listing runs Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: fix failing validation test after fixing http status codes Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: avoid duplicate key errors if externalid is set in metric when creating metric history entries Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: add fuzzer tests for experiment runs and new artifact types Signed-off-by: Dhiraj Bokde <[email protected]>
* chore: code cleanup and format fuzzer tests Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: log error in fuzzer test Signed-off-by: Dhiraj Bokde <[email protected]>
* fix: handle null artifact names correctly on create Signed-off-by: Dhiraj Bokde <[email protected]>

---------

Signed-off-by: Dhiraj Bokde <[email protected]>
Signed-off-by: Alessio Pragliola <[email protected]>
Signed-off-by: Alessio Pragliola <[email protected]>
Co-authored-by: Alessio Pragliola <[email protected]>
Co-authored-by: Alessio Pragliola <[email protected]>
Co-authored-by: Paul Boyd <[email protected]>
Signed-off-by: syntaxsdev <[email protected]>

google-oss-prow bot requested review from Al-Pragliola and pboyd July 15, 2025 04:22

google-oss-prow bot added the size/XXL label Jul 15, 2025

github-actions bot added the Area/Go REST server label Jul 15, 2025

google-oss-prow bot added the do-not-merge/hold label Jul 15, 2025

dhirajsb force-pushed the feat/experiment-tracking-api branch from 61e46d2 to 1aa1492 July 17, 2025 00:23

github-actions bot added the Area/MR Python client label Jul 17, 2025

pboyd reviewed Jul 17, 2025

api/openapi/model-registry.yaml (outdated, resolved)

api/openapi/catalog.yaml (resolved)

dhirajsb mentioned this pull request Jul 17, 2025

feat: mlmd removal from codebase #1267

Merged

5 tasks

dhirajsb force-pushed the feat/experiment-tracking-api branch from ce11089 to f721ae7 July 18, 2025 06:23

dhirajsb added 14 commits July 18, 2025 12:59

feat: initial version of experiments and runs API

334cdd4

Signed-off-by: Dhiraj Bokde <[email protected]>

feat: experiments and runs initial implementation (wip)

c0ad3cd

Signed-off-by: Dhiraj Bokde <[email protected]>

fix: fixed failing unit tests for experiments and runs

e908bfb

Signed-off-by: Dhiraj Bokde <[email protected]>

fix: added experiment and experimentrun tests

3111574

Signed-off-by: Dhiraj Bokde <[email protected]>

feat: added DataSet, Metric, and Parameter types

4324ba7

Signed-off-by: Dhiraj Bokde <[email protected]>

feat: added implementatio of DataSet, Metric, and Param, including se…

dc1c492

…rvice tests Signed-off-by: Dhiraj Bokde <[email protected]>

fix: replace int properties for timestamps with string because mlmd t…

c7d079c

…ype properties only support int32, not int64 Signed-off-by: Dhiraj Bokde <[email protected]>
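
The int32 limit named in the commit above can be illustrated with a small sketch. It assumes the timestamps are milliseconds since the Unix epoch, which this thread does not state; the helper names `fits_int32`, `encode_timestamp`, and `decode_timestamp` are hypothetical and are not taken from the Model Registry code.

```python
from datetime import datetime, timezone

# Bounds of a signed 32-bit integer.
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def fits_int32(value: int) -> bool:
    """Return True if value can be stored in a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


def encode_timestamp(millis: int) -> str:
    """Hypothetical helper: store an epoch-millisecond value as a decimal string."""
    return str(millis)


def decode_timestamp(text: str) -> int:
    """Hypothetical helper: read the stored string back as an integer."""
    return int(text)


# The creation time of this PR, converted to epoch milliseconds (assumed unit).
opened = datetime(2025, 7, 15, 4, 22, 48, tzinfo=timezone.utc)
example_millis = int(opened.timestamp() * 1000)

print(fits_int32(example_millis))                        # too large for int32
print(decode_timestamp(encode_timestamp(example_millis)) == example_millis)
```

Under that assumption a present-day timestamp is far outside the int32 range, while a string property round-trips the value without loss.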

feat: add support for artifactType query param to filter artifact typ…

e1bd49d

…es in artifact queries Signed-off-by: Dhiraj Bokde <[email protected]>

fix: add metrics history endpoint and metric history storage for expe…

60c762f

…riment run metrics Signed-off-by: Dhiraj Bokde <[email protected]>

fix: fix artifactType query param type in generated service

0f311d0

Signed-off-by: Dhiraj Bokde <[email protected]>

fix: fix go lint error in unit test

122a529

Signed-off-by: Dhiraj Bokde <[email protected]>

fix: filter out metric history from artifacts endpoints

17d3f77

Signed-off-by: Dhiraj Bokde <[email protected]>

fix: fix metric history name to use last update time to avoid name co…

18b8ec1

…nflicts Signed-off-by: Dhiraj Bokde <[email protected]>

feat: add filterQuery param on all context types to search by propert…

f5891ed

…ies and custom properties Signed-off-by: Dhiraj Bokde <[email protected]>

fege reviewed Aug 7, 2025

clients/python/tests/test_api.py (resolved)

chore: code cleanup and format fuzzer tests

ce403b2

Signed-off-by: Dhiraj Bokde <[email protected]>

fege approved these changes Aug 7, 2025

google-oss-prow bot assigned fege Aug 7, 2025

google-oss-prow bot added the lgtm label Aug 7, 2025

fix: log error in fuzzer test

f99d1a5

Signed-off-by: Dhiraj Bokde <[email protected]>

google-oss-prow bot removed the lgtm label Aug 7, 2025

fix: handle null artifact names correctly on create

05ea1ce

Signed-off-by: Dhiraj Bokde <[email protected]>

google-oss-prow bot assigned Al-Pragliola Aug 11, 2025

google-oss-prow bot added the lgtm label Aug 11, 2025

tarilabs reviewed Aug 12, 2025

google-oss-prow bot added the approved label Aug 12, 2025

tarilabs removed the do-not-merge/hold label Aug 12, 2025

google-oss-prow bot merged commit 655a9d5 into kubeflow:main Aug 12, 2025
23 checks passed

dhirajsb deleted the feat/experiment-tracking-api branch August 12, 2025 14:25

mprahl mentioned this pull request Aug 13, 2025

KEP-897: Propose centralized experiment tracking in Kubeflow kubeflow/community#892

Open

Al-Pragliola mentioned this pull request Aug 14, 2025

periodic sync upstream KF to midstream ODH opendatahub-io/model-registry#303

Merged

8 tasks

jonburdo mentioned this pull request Oct 24, 2025

add jonburdo as a reviewer #1796

Merged


9 participants
