Description
I have been advised by @daavoo to create this issue for better tracking of my challenges regarding the usage of DVC. I've put it under Feature Request, as what we are looking for may not be possible with current DVC.
Here is the original post I created in the DVC forum detailing our needs.
Basically, our current pipeline is becoming quite big, with 30 models and 140 DVC stage instances. We have 3 different stages: `create_dataset`, `train_model`, and `compute_metrics`. Consequently, we use `foreach` definitions of the 3 stages inside the `dvc.yaml` file to reduce code duplication. Still, the `params.yaml` file is 1300 lines long, which is hard to work with.
Also, all the stage instances are named "stage_name@number" (e.g., "train_model@0"). These names carry no useful information, which makes selective `dvc repro -s` (which we use a lot) hard to work with. For instance, a common command would be `dvc repro -s create_dataset@0 create_dataset@1 create_dataset@2 train_model@0 compute_metrics@0` to repro all the stage instances of a given model. To know which stage instances belong to the model we want to repro, we need to look at the `dvc.lock` file, which is super tedious (it is about 3k lines long).
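To illustrate where the numbered names come from, here is a simplified sketch of a list-based `foreach` setup. The key `train_model_list` and its fields are made up for illustration and are not taken from our actual files.

```yaml
# params.yaml (hypothetical excerpt; key and field names are assumptions)
train_model_list:
  - script: train_model.py
    dataset: trainset.h5
    output: model_1.h5
  - script: train_model.py
    dataset: trainset_2.h5
    output: model_2.h5
```

```yaml
# dvc.yaml (sketch; the command-line arguments of the script are assumptions)
stages:
  train_model:
    foreach: ${train_model_list}   # iterating over a list
    do:
      cmd: python ${item.script} ${item.dataset} ${item.output}
      deps:
        - ${item.script}
        - ${item.dataset}
      outs:
        - ${item.output}
```

Because the entries of a list can only be told apart by position, the generated instances are `train_model@0` and `train_model@1`, and nothing in the name says which model each one belongs to.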
Ideally, we are looking for a way to:
- split the `params.yaml` into smaller ones, each belonging to a given model
- have better stage instance names, to better tell them apart
I think point 2 is doable by declaring the stage instances as follows in the `params.yaml` file:
create_dataset_list:
  model_1_trainset:
    script: create_dataset.py
    dataset_yaml: trainset.yaml
    folder_images: trainset_images
    params: trainset_params.py
    output: trainset.h5
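A matching `dvc.yaml` stage could then look roughly like the sketch below. It assumes the `create_dataset_list` mapping above lives in `params.yaml`; how the script receives its arguments is an assumption made only for illustration.

```yaml
# dvc.yaml (sketch; argument order of the script is hypothetical)
stages:
  create_dataset:
    foreach: ${create_dataset_list}   # iterating over a mapping instead of a list
    do:
      cmd: >-
        python ${item.script} ${item.dataset_yaml} ${item.folder_images}
        ${item.params} ${item.output}
      deps:
        - ${item.script}
        - ${item.dataset_yaml}
        - ${item.folder_images}
        - ${item.params}
      outs:
        - ${item.output}
```

When `foreach` iterates over a mapping, each instance is named after its key, so the entry above would become `create_dataset@model_1_trainset` and could be targeted with `dvc repro -s create_dataset@model_1_trainset`.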
As for point 1, however, I have no idea how to do it, since importing YAML files into other ones is not possible AFAIK.
I have attached a minimal example to better show how our project is organized around DVC.