
stage environment #9843

Open
@janrito


Passing dependencies and outputs to a DVC process requires quite a bit of boilerplate at the moment.

stages:
  train:
    # `>` folds these lines into one command, so no trailing backslashes are needed
    cmd: >
      python -m stages.train
      --training-data=data/training.json
      --validation-data=data/validation.json
      --model model/model.joblib
      --metrics=metrics/training.json
    deps:
      - data/training.json
      - data/validation.json
    outs:
      - model/model.joblib
    metrics:
      - metrics/training.json
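
For context, the command above implies matching boilerplate on the Python side as well. The following is only a sketch of what `stages/train.py` might contain; the `parse_args` helper is hypothetical and the flag names are simply copied from the `cmd` above.

```python
import argparse


def parse_args(argv=None):
    """Parse the four paths that the `train` stage passes on the command line."""
    parser = argparse.ArgumentParser(prog="stages.train")
    parser.add_argument("--training-data", required=True)
    parser.add_argument("--validation-data", required=True)
    parser.add_argument("--model", required=True)
    parser.add_argument("--metrics", required=True)
    return parser.parse_args(argv)


# Same arguments as the stage's `cmd`, passed explicitly for illustration.
args = parse_args([
    "--training-data=data/training.json",
    "--validation-data=data/validation.json",
    "--model", "model/model.joblib",
    "--metrics=metrics/training.json",
])
print(args.training_data, args.model)
```

Every path therefore appears three times: in `cmd`, in `deps`/`outs`/`metrics`, and in the argument parser.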

The example repos are inconsistent about when, and which, paths are passed as parameters or arguments. Hard-coding them like this is also awkward and error-prone, so we can use variables to make it more robust:

# `vars` in dvc.yaml is a list, so the variables go in a single list item
vars:
  - TRAINING_DATA_PATH: data/training.json
    VALIDATION_DATA_PATH: data/validation.json
    MODEL_PATH: model/model.joblib
    TRAINING_METRICS_PATH: metrics/training.json
stages:
  train:
    # `>` folds these lines into one command, so no trailing backslashes are needed
    cmd: >
      python -m stages.train
      --training-data=${TRAINING_DATA_PATH}
      --validation-data=${VALIDATION_DATA_PATH}
      --model ${MODEL_PATH}
      --metrics=${TRAINING_METRICS_PATH}
    deps:
      - ${TRAINING_DATA_PATH}
      - ${VALIDATION_DATA_PATH}
    outs:
      - ${MODEL_PATH}
    metrics:
      - ${TRAINING_METRICS_PATH}

This is more robust, but it starts getting very "boilerplatey".

My suggestion is to provide an environment to the stage, which passes variables to the stage command.

vars:
  - TRAINING_DATA_PATH: data/training.json
    VALIDATION_DATA_PATH: data/validation.json
    MODEL_PATH: model/model.joblib
    TRAINING_METRICS_PATH: metrics/training.json
stages:
  train:
    cmd: python -m stages.train
    env:
      training_data: ${TRAINING_DATA_PATH}
      validation_data: ${VALIDATION_DATA_PATH}
      model: ${MODEL_PATH}
      metrics: ${TRAINING_METRICS_PATH}
    outs:
      - ${MODEL_PATH}
    metrics:
      - ${TRAINING_METRICS_PATH}

This could make any parameters or variables available to the stage without parameter passing:

import os

# Paths come from the proposed stage `env` instead of command-line arguments
model = train(os.environ['training_data'])
model.save(os.environ['model'])
...

DVC could then assume that any path in the environment that is not listed as an output or metric is a stage dependency.
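
To make the proposal concrete, here is a minimal sketch of how a runner could handle such an `env` block. It is not DVC code: `infer_deps` and `run_stage` are hypothetical helpers, and the sketch assumes every `env` value is a path.

```python
import os
import subprocess
import sys


def infer_deps(env, outs, metrics):
    """Treat every env path that is not an output or metric as a dependency."""
    produced = set(outs) | set(metrics)
    return [path for path in env.values() if path not in produced]


def run_stage(cmd, env):
    """Run the stage command with the stage env layered over the parent env."""
    return subprocess.run(
        cmd, env={**os.environ, **env}, capture_output=True, text=True, check=True
    )


# The `env`, `outs` and `metrics` of the proposed `train` stage, after templating.
stage_env = {
    "training_data": "data/training.json",
    "validation_data": "data/validation.json",
    "model": "model/model.joblib",
    "metrics": "metrics/training.json",
}
deps = infer_deps(
    stage_env, outs=["model/model.joblib"], metrics=["metrics/training.json"]
)

# Stand-in for `python -m stages.train`: the child just echoes one variable.
result = run_stage(
    [sys.executable, "-c", "import os; print(os.environ['training_data'])"],
    stage_env,
)
print(deps, result.stdout.strip())
```

One open question this sketch glosses over is how to tell paths apart from plain parameters passed through `env`; a real design would need some rule for that.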

A different option could be to avoid all parameter passing and have DVC add some magic environment variables:

vars:
  - TRAINING_DATA_PATH: data/training.json
    VALIDATION_DATA_PATH: data/validation.json
    MODEL_PATH: model/model.joblib
    TRAINING_METRICS_PATH: metrics/training.json
stages:
  train:
    cmd: python -m stages.train
    deps:
      - ${TRAINING_DATA_PATH}
      - ${VALIDATION_DATA_PATH}
    outs:
      - ${MODEL_PATH}
    metrics:
      - ${TRAINING_METRICS_PATH}

which could be used by the stage:

import os

# Variable names are the proposed section prefix plus the `vars` key
model = train(os.environ['DEPS_TRAINING_DATA_PATH'])
model.save(os.environ['OUTS_MODEL_PATH'])
...
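
One way the magic variable names could be derived is sketched below. This is an assumption about the naming scheme, not existing DVC behaviour: `build_magic_env` is hypothetical, the `DEPS_` and `OUTS_` prefixes come from the snippet above, and the `METRICS_` prefix is extrapolated from them.

```python
def build_magic_env(variables, stage):
    """Map each templated path in deps/outs/metrics to `<SECTION>_<VAR_NAME>`."""
    # Reverse lookup: which variable produced a given path?
    name_by_path = {path: name for name, path in variables.items()}
    env = {}
    for section in ("deps", "outs", "metrics"):
        for path in stage.get(section, []):
            name = name_by_path.get(path)
            if name is not None:  # paths without a variable get no entry here
                env[f"{section.upper()}_{name}"] = path
    return env


variables = {
    "TRAINING_DATA_PATH": "data/training.json",
    "VALIDATION_DATA_PATH": "data/validation.json",
    "MODEL_PATH": "model/model.joblib",
    "TRAINING_METRICS_PATH": "metrics/training.json",
}
# The `train` stage after `${...}` templating has been resolved.
stage = {
    "deps": ["data/training.json", "data/validation.json"],
    "outs": ["model/model.joblib"],
    "metrics": ["metrics/training.json"],
}
magic_env = build_magic_env(variables, stage)
print(magic_env["DEPS_TRAINING_DATA_PATH"], magic_env["OUTS_MODEL_PATH"])
```

A limitation of this scheme is that paths written literally, without a variable, would have no obvious name and are skipped in the sketch.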

Labels: A: templating (Related to the templating feature); discussion (requires active participation to reach a conclusion); feature request (Requesting a new feature)