WIP: Docs draft for integration with DVC #323
base: main
Changes from 11 commits
@@ -0,0 +1,176 @@
# Get Started DVC

To leverage concepts of Model and Data Registries in a more explicit way, you
can denote the `type` of each output. This will let you browse models and data
separately, address them by `name` in `dvc get`, and eventually, see them in DVC
Studio.

Let's start with marking an artifact as data or model.

Suggested change:
- Let's start with marking an artifact as data or model.
+ Let's start with marking a tracked artifact (file) as a `model`.
Personally, I don't think that "data" is a valid example for a type
Why? I assumed Data Registry would show type: data once we implement it.
It carries 0 information though, right?
"data of type data", similar to "artifact of type artifact" means the same as not defining type at all. it's the most general thing there is (data even more than artifact maybe?), just super abstract. doubt users will use it that way
Good point. Maybe it's dataset instead of data then? Or anyways, if after subtracting plots, metrics and models everything that's left (among DVC PL inputs and outputs) is dataset, then I guess it's redundant?
My 2cs. Data is not abstract for me (it's different vs model in my perception). But in DVC it's not needed. Any out w/o a specified type can be considered data.
I would personally try to simplify all of this - no multiple types initially. only models. We are prematurely generalizing this I think.
I think of the terms Data Registry, Model Registry, and Artifact Registry. I like having the type track to those names, so I like data.
Would love to keep the idea of data registry support around, haven't thought a lot about using gto for that, but certainly have thought about models and binaries.
I would love to use/try gto as an Artifact Registry for build artifacts - specifically compiled binaries. It might also be interesting to use as a Container Registry - there are lots of solutions in this space like Jfrog Artifactory, cloudsmith, GCP. If you wanted to be really fancy you could support some of those offerings as backends.
With all of that said I think prioritizing the Model Registry use case makes a lot of sense, a Data Registry would be my next priority. The solution seems like it may be general enough to support an Artifact Registry and Container Registry, might be worth doing some thinking about it and if it does make sense and there are advantages to keeping that in gto then keep that use case in mind while making the Model Registry experience awesome.
Trying to think of advantages of having gto also be a Container Registry and remembered that mlem can deploy docker containers, it would be nice to have the versioning of those container artifacts in gto and be able to reference with gto syntax.
If I might chip in as a user - I absolutely would like GTO to be used also to build a data(set) registry (together with DVC) and possibly a combined data(set) and model registry.
In fact, we intend to use GTO and DVC to build a dataset registry for one (fairly large) client in the coming weeks. By the way, while stuff like MLFlow is a viable alternative to the GTO+MLEM-based model registry (it does not have all of its features but has some others), I don't know any open source alternative to a GTO+DVC-based dataset registry (and in general I've yet to see data versioning done better than with DVC), which means it is a very good selling point IMO.
I like the idea of viewing GTO as a tool to build pretty much any artifact registry, though models and datasets are the most likely use-cases.
Extra for now, but was requested by users
Love this! I would love to do this with plots too.
Now I think there is a use case this change may not support that well. One of our prospects asked to allow a single file (let's say `mymodel.pkl`) to be referenced as several GTO models (e.g. `model1` and `model2` - these are names). Since moving to DVC makes path essential (instead of name), I don't see how that feature would fit here. 🤔
The motivation is to be able to promote model1 and model2 to different stages at different moments of time separately. To clarify, let's assume there are two populations mymodel.pkl should be applied for. You can create stages like populationA-prod, populationA-staging and populationB-prod, populationB-staging, if you have many populations, this would make things cumbersome. The solution was to introduce model1 (for populationA) and model2 (for populationB). That required this feature.
The only workaround I see now is to create a "mirror file" with cp mymodel.pkl mymodel-for-populationB.pkl in some DVC PL stage. Or keep this name:path mapping outside of DVC somehow. Do you see any other solutions guys? WDYT?
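To make the name-to-path question concrete, here is a minimal Python sketch of a mapping kept outside DVC, where two registry names point at the same file and each carries its own stage. Everything in it (`ARTIFACTS`, `promote`, `names_for_path`, the stage labels) is hypothetical and only illustrates the idea; it is not GTO's or DVC's actual data model.

```python
# Hypothetical name -> path mapping: several registry names may share one file.
ARTIFACTS = {
    "model1": {"path": "mymodel.pkl", "stage": None},  # intended for populationA
    "model2": {"path": "mymodel.pkl", "stage": None},  # intended for populationB
}


def promote(artifacts, name, stage):
    """Assign a stage to one named artifact without touching its siblings."""
    if name not in artifacts:
        raise KeyError(f"unknown artifact: {name}")
    artifacts[name]["stage"] = stage


def names_for_path(artifacts, path):
    """Return every registry name that points at the given file, sorted."""
    return sorted(n for n, meta in artifacts.items() if meta["path"] == path)


# Each name is promoted independently, even though the file is the same.
promote(ARTIFACTS, "model1", "prod")
promote(ARTIFACTS, "model2", "staging")
```

Keying by name rather than by path is what lets `model1` and `model2` move through stages at different times without copying `mymodel.pkl`.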
Agree with @omesser that it's related to the discussion above. Take a look at the top-level plots schema, where plots may be identified by either path or an arbitrary name. Feels like following a similar syntax may be best here.
Suggested change:
- This will make them appear in DVC Model Registry:
+ This will make them appear in [Studio Model Registry](https://dvc.org/doc/studio/user-guide/model-registry/what-is-a-model-registry):

(For now, it's still called Studio and not DVC.Cloud or similar.)
Thanks! Overall, let's keep the review scope to the level of ideas and user experience. We don't even know if this will be a separate page in DVC docs, or maybe we integrate it with some other page.
yeah this is a nitpick 😉 but was hard for me to pass the opportunity and suggest
Does Studio need this, or is it solely to provide a CLI option to view the registry? I don't think the latter needs to be high priority unless I'm missing some use case where you need to access it from the CLI.
Not sure whether it can fit into dvc ls since the output is quite different (and potentially so are the arguments like --type). Need to think about whether we need this and where it can fit.
By the way, do we define an artifact as any output that has a type?
> some use case where you need to access it from the CLI

If a user would like to download a model locally but isn't quite sure which one, they might want to see this. E.g. I don't remember the model name, but know its labels or remember its description. This OFC can be solved via Studio, and if we want to push users towards that, that's also a decision.
Another use case would be if you're investigating a repo that's not familiar to you (let's say your team has a few repos or you look at another team's repo). Again, if we want to make people go to Studio every time for this, it's a valid workflow, but IMO it makes you leave the CLI and do extra things, which can be inconvenient.

> By the way, do we define an artifact as any output that has a `type`?

Either that, or any input/output file DVC keeps track of can be an artifact (without a `type` if it's not defined). I think the latter is simpler and easier to convey. We can call it "file" instead of "artifact" I guess (if we're not going to introduce "compound" artifacts as we discussed before, which I don't think is the case). Let's probably not discuss this though; it's unrelated to this PR and not necessary at all now, I believe.
Sounds good. Not sure I see enough to make it a p1 yet. WDYT?
> By the way, do we define an artifact as any output that has a `type`?
>
> Either that, or any input/output file DVC keeps track of can be an artifact (without a `type` if it's not defined). I think the latter is simpler and easier to convey.
It might also depend on the schema discussion below. If we have a model/registry/artifacts section of dvc.yaml, I guess we will only include what's specified there.
I think here we need 1 (concise) example setting all relevant fields
e.g. `dvc add models/mymodel.pkl --name def-detector --type model --description "glass defect image classifier" --label "algo=cnn" --label "owner=aguschin" --label "project=prod-qual-002"`
I thought the stage tags always had some number like `mymodel#prod#1`?
It can be `mymodel#prod` as well. It's called the "simple" Git tag format and it's not the default one. https://mlem.ai/doc/gto/user-guide/#git-tags-format
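As a rough illustration of the tag shapes mentioned in this thread, the Python sketch below tells apart a version tag (`mymodel@v0.0.1`), a "simple" stage tag (`mymodel#prod`), and a numbered stage tag (`mymodel#prod#1`). The regular expressions and the `parse_tag` helper are assumptions written for this example; they cover only these three forms and are not GTO's own parser.

```python
import re

# Only the forms discussed in this thread (assumed patterns, not GTO's own):
#   name@v0.0.1   -> version registration
#   name#stage    -> "simple" stage assignment
#   name#stage#N  -> stage assignment with a trailing number
VERSION_RE = re.compile(r"^(?P<name>[^@#]+)@(?P<version>v\d+\.\d+\.\d+)$")
STAGE_RE = re.compile(r"^(?P<name>[^@#]+)#(?P<stage>[^@#]+)(?:#(?P<number>\d+))?$")


def parse_tag(tag):
    """Classify a Git tag name; return None if it matches neither form."""
    m = VERSION_RE.match(tag)
    if m:
        return {"kind": "version", "name": m["name"], "version": m["version"]}
    m = STAGE_RE.match(tag)
    if m:
        number = int(m["number"]) if m["number"] else None
        return {"kind": "stage", "name": m["name"], "stage": m["stage"], "number": number}
    return None
```

With this sketch, the simple format is just the stage pattern with the optional trailing number left out, so both stage forms go through the same branch.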
Can you give an example? Will it download all artifacts in the repo?
This should download the artifact, e.g. it will run `dvc get . mymodel --rev $GITHUB_REF` for a Git tag `mymodel@v0.0.1`.
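A minimal Python sketch of how a CI step might derive that command from the ref it was triggered by. It assumes `GITHUB_REF` has the usual `refs/tags/<tag>` shape and that addressing an artifact by name in `dvc get` works as proposed in this PR; the helper `dvc_get_command` is hypothetical, and it only builds the argument list without running anything.

```python
def dvc_get_command(github_ref, repo="."):
    """Build the `dvc get` argument list for a ref like refs/tags/mymodel@v0.0.1.

    Assumes name-based addressing in `dvc get`, as proposed in this PR.
    """
    prefix = "refs/tags/"
    if not github_ref.startswith(prefix):
        raise ValueError(f"not a tag ref: {github_ref}")
    tag = github_ref[len(prefix):]
    # The artifact name is everything before the "@" in a version tag.
    name, sep, _version = tag.partition("@")
    if not sep or not name:
        raise ValueError(f"not a version tag: {tag}")
    # The full ref is passed as the revision, mirroring `--rev $GITHUB_REF`.
    return ["dvc", "get", repo, name, "--rev", github_ref]
```

Only the single artifact named in the tag is requested, which is the behavior described in the reply above.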
Updated docs to explain this.
Suggested change:
- ## Seeing new model versions pushed with DVC experiments
+ ## Models and Experiments

Also, a question - is this an implicit behavior for artifacts with `type: model` specifically? Or will there be similar side effects for any artifact with "type" defined?
This only updated MDP, so this is only for type: model for now. How did you assume this should work with other types?
I didn't. I'm a bit concerned about implicit behaviors; we should probably find a way to give the user control over what to do with which artifacts on `exp push`. wdyt?
What example of implicit behavior do you have in mind? Like pushing a model that can be a few GB in size? I don't quite have specific examples in mind.
yes, auto-pushing, exactly. maybe auto-versioning as well in the future (could be useful if running in the CI/CD pipeline as part of a release: generate the model, push it, and assign a version using GTO).
Do you think we should allow registration of unmerged experiments? Or maybe restrict what actions are available for them?
Good question. Don't have a strong opinion.
First, it's possible to do, so why not. We can also allow clicking "register", but then say something like "We advise merging the experiment first", with buttons like "create a PR in GH" (default) and "register anyway".
We can prohibit registering (again, I don't see a reason, except to avoid polluting the repo with dangling refs in a workflow that requires users to merge experiments first).
We can delay answering this question for now I think.
Btw, do we have an understanding of what the `dvc exp push` flow should look like on Studio's side? If that's still WIP, I guess we need to implement that first.
HL:
Suggest to add sub-headers / sections
And structure for this page will be something like this: