140-stage DVC pipeline hard to work with #9795

Open
@dotXem

@daavoo advised me to open this issue to better track the challenges we are facing with DVC. I've filed it as a Feature Request because what we are looking for may not be possible with the current version of DVC.

Here is the original post I created in the DVC forum detailing our needs.

Basically, our current pipeline is becoming quite big, with 30 models and 140 DVC stage instances. We have 3 different stages: create_dataset, train_model, and compute_metrics. We use foreach definitions of these 3 stages inside the dvc.yaml file to reduce code duplication. Still, the params.yaml file is 1300 lines long, which is hard to work with.

Also, all the stage instances are named "stage_name@number" (e.g., "train_model@0"). These names carry no useful information, which makes selective reproduction with dvc repro -s (which we use a lot) hard to work with. For instance, a common command would be dvc repro -s create_dataset@0 create_dataset@1 create_dataset@2 train_model@0 compute_metrics@0 to repro all the stage instances of a given model. To find out which stage instances belong to the model we want to repro, we need to look through the dvc.lock file, which is super tedious (it is about 3k lines long).

Ideally we are looking for a way to:

  1. split the params.yaml into smaller ones, each belonging to a given model
  2. give stage instances more descriptive names, so they are easier to tell apart

I think point 2 is doable by declaring the stage instances as named entries in the params.yaml file, as follows:

create_dataset_list:
  model_1_trainset:
    script: create_dataset.py
    dataset_yaml: trainset.yaml
    folder_images: trainset_images
    params: trainset_params.py
    output: trainset.h5
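
To illustrate how the declaration above could be consumed, here is a minimal dvc.yaml sketch. It relies on DVC's documented behaviour that a foreach over a dictionary names each stage instance after the dictionary key. The command-line flags passed to create_dataset.py are hypothetical and only stand in for however the real script takes its arguments.

```yaml
# dvc.yaml (sketch): foreach over a dictionary instead of a list
stages:
  create_dataset:
    foreach: ${create_dataset_list}   # the dictionary declared in params.yaml
    do:
      # The flags below are hypothetical; adapt them to the real script.
      cmd: >-
        python ${item.script}
        --dataset-yaml ${item.dataset_yaml}
        --images ${item.folder_images}
        --params ${item.params}
        --output ${item.output}
      deps:
        - ${item.script}
        - ${item.dataset_yaml}
        - ${item.folder_images}
        - ${item.params}
      outs:
        - ${item.output}
```

With this layout the instance would be addressed as create_dataset@model_1_trainset rather than create_dataset@0, so a selective run becomes dvc repro -s create_dataset@model_1_trainset.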

However, as for point 1, I don't have any ideas, since importing YAML files into other ones is not possible AFAIK.
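
One avenue that may be worth testing for point 1 is the top-level vars section of dvc.yaml, which can list additional parameter files to load for templating. The sketch below assumes one file per model under a hypothetical params/ folder, each defining its own entries under create_dataset_list. Whether DVC merges same-named dictionaries coming from several files without raising a conflict should be verified on a small example first.

```yaml
# dvc.yaml (sketch): load one parameter file per model through `vars`
vars:
  - params/model_1.yaml   # hypothetical path, defines create_dataset_list.model_1_trainset
  - params/model_2.yaml   # hypothetical path, defines create_dataset_list.model_2_trainset

stages:
  create_dataset:
    foreach: ${create_dataset_list}   # assumed to be the merged dictionary
    do:
      cmd: python ${item.script}
      outs:
        - ${item.output}
```

Entries that reuse the same key in two files are expected to be rejected rather than silently overwritten, so keys such as model_1_trainset would need to stay unique across files.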

I have attached a minimal example to better show how our project is organized around DVC.

minimal_dvc.zip

Labels: A: pipelines, discussion, feature request
