Skip to content

Conversation

@dhirajsb
Copy link
Contributor

Description

Add support for Experiment Tracking APIs similare to MLflow and other registries like wandb.
This is required for experiment tracking using model registry SDK as well as forms the basis of the MLflow SDK integration feature #1225.

See the feature request #1224 for detailed description of enhancements in this PR.

How Has This Been Tested?

Added extensive unit and integration tests for every entity and endpoint for Experiment tracking.
Detailed unit tests for the filterQuery search enhancement.

Merge criteria:

  • All the commits have been signed-off (To pass the DCO check)
  • The commits have meaningful messages
  • Automated tests are provided as part of the PR for major new functionalities; testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.
  • Code changes follow the kubeflow contribution guidelines.
  • For first time contributors: Please reach out to the Reviewers to ensure all tests are being run, ensuring the label ok-to-test has been added to the PR.

If you have UI changes

  • The developer has added tests or explained why testing cannot be added.
  • Included any necessary screenshots or gifs if it was a UI change.
  • Verify that UI/UX changes conform the UX guidelines for Kubeflow.

@dhirajsb
Copy link
Contributor Author

I'm working on a couple of more fixes for minor issues in testing against the MLflow plugin POC.

@tarilabs
Copy link
Member

/hold ref #1318 (comment)

@dhirajsb
Copy link
Contributor Author

Fixed the last few issues with metric history. All tests now pass with the mlflow plugin coming in another PR.

@dhirajsb
Copy link
Contributor Author

Looks like the embedmd DB testing classes were modified on main. So, I need to rebase this PR again, hopefully tomorrow.

Copy link
Member

@pboyd pboyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is bigger than I can effectively review in one sitting. Is it possible to split it up?

@syntaxsdev
Copy link
Contributor

Would it be worth it to remove the mr_openapi/ python src files and do a separate PR?

@dhirajsb
Copy link
Contributor Author

dhirajsb commented Jul 17, 2025

@syntaxsdev I added the generated client files to pass a check that was failing. But they are not part of the MR client SDK and we'd still need the client SDK work you did in another PR after merging this PR.
Or, we'll have to disable that check if we want to merge this PR without the generated python client code.

@dhirajsb dhirajsb mentioned this pull request Jul 17, 2025
5 tasks
@dhirajsb dhirajsb force-pushed the feat/experiment-tracking-api branch from ce11089 to f721ae7 Compare July 18, 2025 06:23
dhirajsb added 14 commits July 18, 2025 12:59
…ype properties only support int32, not int64

Signed-off-by: Dhiraj Bokde <[email protected]>
…ies and custom properties

Signed-off-by: Dhiraj Bokde <[email protected]>
Copy link
Contributor

@fege fege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Signed-off-by: Dhiraj Bokde <[email protected]>
@google-oss-prow google-oss-prow bot removed the lgtm label Aug 7, 2025
@dhirajsb
Copy link
Contributor Author

dhirajsb commented Aug 8, 2025

Identified one more thing that needs to be fixed and detected by fuzzy tests. Working on a fix now. That should be the last one for all fuzzy tests.

@dhirajsb
Copy link
Contributor Author

dhirajsb commented Aug 8, 2025

All fuzzer tests are now passing. The PR is ready to merge.

@fege
Copy link
Contributor

fege commented Aug 8, 2025

/verified
/lgtm

@rareddy
Copy link
Contributor

rareddy commented Aug 8, 2025

Any concerns @Al-Pragliola @tarilabs in moving forward and merging this?

@Al-Pragliola
Copy link
Contributor

/lgtm

@tarilabs
Copy link
Member

Any concerns @Al-Pragliola @tarilabs in moving forward and merging this?

we should account for unresolved comment and ensure the "edge cases" in the backend do not spawn further

same goes for general comment that doing feature this way is not efficient

that said we already discussed 1. Release, 2. merge this, 3. merge MLMD codepath removal PR, did you had another plan of step in mind @rareddy ? 🤔

@syntaxsdev
Copy link
Contributor

/lgtm

reviewed delta changes since last approval

Copy link
Member

@tarilabs tarilabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again many thanks to @dhirajsb for leading this impressive work and the exceptional review efforts by all involved

per discussions
/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tarilabs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 655a9d5 into kubeflow:main Aug 12, 2025
23 checks passed
@rareddy
Copy link
Contributor

rareddy commented Aug 12, 2025

Great work by all the folks in seeing this through!

@dhirajsb dhirajsb deleted the feat/experiment-tracking-api branch August 12, 2025 14:25
@dhirajsb
Copy link
Contributor Author

Thanks for the detailed reviews on a big PR everyone. Great team work for a huge feature improvement in the next phase of this project.

syntaxsdev pushed a commit to syntaxsdev/model-registry that referenced this pull request Aug 21, 2025
…#1224 (kubeflow#1318)

* feat: initial version of experiments and runs API

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: experiments and runs initial implementation (wip)

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fixed failing unit tests for experiments and runs

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: added experiment and experimentrun tests

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: added DataSet, Metric, and Parameter types

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: added implementatio of DataSet, Metric, and Param, including service tests

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: replace int properties for timestamps with string because mlmd type properties only support int32, not int64

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: add support for artifactType query param to filter artifact types in artifact queries

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: add metrics history endpoint and metric history storage for experiment run metrics

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fix artifactType query param type in generated service

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fix go lint error in unit test

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: filter out metric history from artifacts endpoints

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fix metric history name to use last update time to avoid name conflicts

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: add filterQuery param on all context types to search by properties and custom properties

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: initial version of experiment tracking implemented on embedmd, rebased on main

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: add support for filterQuery parameter for all ListResponse endpoints for embedmd datastore

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: add support for stepIds query parameter in embedmd datastore

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: refactor embedmd db service to use generic repository implementation to reduce code duplication

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: add support for artifactType query parameter for embedmd datastore

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: use mysql 8.3 in unit tests

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: refactor name mapping and default name handling in embedmd datastore

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: support updating metrics and parameters by name, fix ignoring metric history when retrieving all artifacts for runs and versions

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: add missing generated openapi python client files for PR github action check

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fix failing shared db tests

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: add support for metric and parameter description, add missing type property migraiton

Signed-off-by: Dhiraj Bokde <[email protected]>

* chore: update files from main

Signed-off-by: Alessio Pragliola <[email protected]>

* fix: added missing godoc comments in pkg/api/api.go

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: replace ambiguous ArtifactListReponse return type from GetExperimentRunMetricHistory with MetricListResponse

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fixed incorrect artifactType in dataset response, added tests to verify all artifact types

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: add validation for endTimeSinceEpoch property on experiment run updates

Signed-off-by: Dhiraj Bokde <[email protected]>

* Replace value type validation map with a switch in query_translator.go

Co-authored-by: Paul Boyd <[email protected]>
Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: add service e2e tests for filterQuery, fix name query param handling, fix DB tests that didn't use parent id prefix

Signed-off-by: Dhiraj Bokde <[email protected]>

* chore: code cleanup, replace interface{} with any, added vetting for internal/db/filter

Signed-off-by: Dhiraj Bokde <[email protected]>

* chore: added flag vF for fixed string grep exclude

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: copied orderby and parameters back to registry and catalog to have different values

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fixed mlmd query translator handling of escaped backslashes

Signed-off-by: Dhiraj Bokde <[email protected]>

* chore: add test to verify parseCustomPropertyField won't panic with a property name ending in dot

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: sync generated python client code

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: readiness probe tests and new types

Signed-off-by: Alessio Pragliola <[email protected]>

* chore: refactor readiness_test

Signed-off-by: Alessio Pragliola <[email protected]>

* fix: ensure parentResourceId is used to filter resource lookup by params, add unit tests for duplicate child resource lookups

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: throw an error if a metric value is missing, add test to validate

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fix http status error code for invalid ids

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: more id validation, fixed filterQuery passing to DB layer

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fix failing unit test

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: validate experiment id when listing runs

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fix failing validation test after fixing http status codes

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: avoid duplicate key errors if externalid is set in metric when creating metric history entries

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: add fuzzer tests for experiment runs and new artifact types

Signed-off-by: Dhiraj Bokde <[email protected]>

* chore: code cleanup and format fuzzer tests

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: log error in fuzzer test

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: handle null artifact names correctly on create

Signed-off-by: Dhiraj Bokde <[email protected]>

---------

Signed-off-by: Dhiraj Bokde <[email protected]>
Signed-off-by: Alessio Pragliola <[email protected]>
Signed-off-by: Alessio Pragliola <[email protected]>
Co-authored-by: Alessio Pragliola <[email protected]>
Co-authored-by: Alessio Pragliola <[email protected]>
Co-authored-by: Paul Boyd <[email protected]>
Signed-off-by: syntaxsdev <[email protected]>
syntaxsdev pushed a commit to syntaxsdev/model-registry that referenced this pull request Aug 21, 2025
…#1224 (kubeflow#1318)

* feat: initial version of experiments and runs API

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: experiments and runs initial implementation (wip)

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fixed failing unit tests for experiments and runs

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: added experiment and experimentrun tests

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: added DataSet, Metric, and Parameter types

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: added implementatio of DataSet, Metric, and Param, including service tests

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: replace int properties for timestamps with string because mlmd type properties only support int32, not int64

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: add support for artifactType query param to filter artifact types in artifact queries

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: add metrics history endpoint and metric history storage for experiment run metrics

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fix artifactType query param type in generated service

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fix go lint error in unit test

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: filter out metric history from artifacts endpoints

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fix metric history name to use last update time to avoid name conflicts

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: add filterQuery param on all context types to search by properties and custom properties

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: initial version of experiment tracking implemented on embedmd, rebased on main

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: add support for filterQuery parameter for all ListResponse endpoints for embedmd datastore

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: add support for stepIds query parameter in embedmd datastore

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: refactor embedmd db service to use generic repository implementation to reduce code duplication

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: add support for artifactType query parameter for embedmd datastore

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: use mysql 8.3 in unit tests

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: refactor name mapping and default name handling in embedmd datastore

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: support updating metrics and parameters by name, fix ignoring metric history when retrieving all artifacts for runs and versions

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: add missing generated openapi python client files for PR github action check

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fix failing shared db tests

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: add support for metric and parameter description, add missing type property migraiton

Signed-off-by: Dhiraj Bokde <[email protected]>

* chore: update files from main

Signed-off-by: Alessio Pragliola <[email protected]>

* fix: added missing godoc comments in pkg/api/api.go

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: replace ambiguous ArtifactListReponse return type from GetExperimentRunMetricHistory with MetricListResponse

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fixed incorrect artifactType in dataset response, added tests to verify all artifact types

Signed-off-by: Dhiraj Bokde <[email protected]>

* feat: add validation for endTimeSinceEpoch property on experiment run updates

Signed-off-by: Dhiraj Bokde <[email protected]>

* Replace value type validation map with a switch in query_translator.go

Co-authored-by: Paul Boyd <[email protected]>
Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: add service e2e tests for filterQuery, fix name query param handling, fix DB tests that didn't use parent id prefix

Signed-off-by: Dhiraj Bokde <[email protected]>

* chore: code cleanup, replace interface{} with any, added vetting for internal/db/filter

Signed-off-by: Dhiraj Bokde <[email protected]>

* chore: added flag vF for fixed string grep exclude

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: copied orderby and parameters back to registry and catalog to have different values

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fixed mlmd query translator handling of escaped backslashes

Signed-off-by: Dhiraj Bokde <[email protected]>

* chore: add test to verify parseCustomPropertyField won't panic with a property name ending in dot

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: sync generated python client code

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: readiness probe tests and new types

Signed-off-by: Alessio Pragliola <[email protected]>

* chore: refactor readiness_test

Signed-off-by: Alessio Pragliola <[email protected]>

* fix: ensure parentResourceId is used to filter resource lookup by params, add unit tests for duplicate child resource lookups

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: throw an error if a metric value is missing, add test to validate

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fix http status error code for invalid ids

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: more id validation, fixed filterQuery passing to DB layer

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fix failing unit test

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: validate experiment id when listing runs

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: fix failing validation test after fixing http status codes

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: avoid duplicate key errors if externalid is set in metric when creating metric history entries

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: add fuzzer tests for experiment runs and new artifact types

Signed-off-by: Dhiraj Bokde <[email protected]>

* chore: code cleanup and format fuzzer tests

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: log error in fuzzer test

Signed-off-by: Dhiraj Bokde <[email protected]>

* fix: handle null artifact names correctly on create

Signed-off-by: Dhiraj Bokde <[email protected]>

---------

Signed-off-by: Dhiraj Bokde <[email protected]>
Signed-off-by: Alessio Pragliola <[email protected]>
Signed-off-by: Alessio Pragliola <[email protected]>
Co-authored-by: Alessio Pragliola <[email protected]>
Co-authored-by: Alessio Pragliola <[email protected]>
Co-authored-by: Paul Boyd <[email protected]>
Signed-off-by: syntaxsdev <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants