Description
Passing dependencies and outputs to a dvc process currently requires quite a bit of boilerplate:
stages:
  train:
    cmd: >
      python -m stages.train \
        --training-data=data/training.json \
        --validation-data=data/validation.json \
        --model model/model.joblib \
        --metrics=metrics/training.json
    deps:
      - data/training.json
      - data/validation.json
    outs:
      - model/model.joblib
    metrics:
      - metrics/training.json
The example repos are inconsistent about when, and which, paths to pass as parameters or arguments. Repeating
each path is also awkward and error-prone, so we can use variables to make it more robust:
vars:
  - TRAINING_DATA_PATH: data/training.json
  - VALIDATION_DATA_PATH: data/validation.json
  - MODEL_PATH: model/model.joblib
  - TRAINING_METRICS_PATH: metrics/training.json
stages:
  train:
    cmd: >
      python -m stages.train \
        --training-data=${TRAINING_DATA_PATH} \
        --validation-data=${VALIDATION_DATA_PATH} \
        --model ${MODEL_PATH} \
        --metrics=${TRAINING_METRICS_PATH}
    deps:
      - ${TRAINING_DATA_PATH}
      - ${VALIDATION_DATA_PATH}
    outs:
      - ${MODEL_PATH}
    metrics:
      - ${TRAINING_METRICS_PATH}
This is more robust, but it starts getting very "boilerplatey".
My suggestion is to provide an environment to the stage, which passes variables to the stage command:
vars:
  - TRAINING_DATA_PATH: data/training.json
  - VALIDATION_DATA_PATH: data/validation.json
  - MODEL_PATH: model/model.joblib
  - TRAINING_METRICS_PATH: metrics/training.json
stages:
  train:
    cmd: python -m stages.train
    env:
      training_data: ${TRAINING_DATA_PATH}
      validation_data: ${VALIDATION_DATA_PATH}
      model: ${MODEL_PATH}
      metrics: ${TRAINING_METRICS_PATH}
    outs:
      - ${MODEL_PATH}
    metrics:
      - ${TRAINING_METRICS_PATH}
This would make any parameters or variables available to the stage without parameter passing:
import os

# The proposed `env` entries would arrive as environment variables.
model = train(os.environ['training_data'])
model.save(os.environ['model'])
...
DVC could then assume that any path in the environment that is not listed as an output or metric is a stage dependency.
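That inference could look roughly like the sketch below. It assumes a hypothetical stage definition that has already been parsed into a dict with `env`, `outs`, and `metrics` keys, with variables already interpolated. `infer_deps` is an illustrative name, not an existing DVC function.

```python
from __future__ import annotations


def infer_deps(stage: dict) -> list[str]:
    """Return the env values that are not declared as outs or metrics.

    Hypothetical helper: `stage` mirrors the proposed YAML layout after
    variable interpolation. This is not part of DVC.
    """
    produced = set(stage.get("outs", [])) | set(stage.get("metrics", []))
    # Iterate in declaration order so the result is deterministic.
    return [path for path in stage.get("env", {}).values() if path not in produced]


stage = {
    "cmd": "python -m stages.train",
    "env": {
        "training_data": "data/training.json",
        "validation_data": "data/validation.json",
        "model": "model/model.joblib",
        "metrics": "metrics/training.json",
    },
    "outs": ["model/model.joblib"],
    "metrics": ["metrics/training.json"],
}

deps = infer_deps(stage)
print(deps)
```

One consequence of this rule is that every `env` value is treated as a path, so a non-path value (a hyperparameter, say) would also end up as a dependency unless it were excluded some other way.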
A different option could be avoiding all parameter passing and adding some magic DVC environment:
vars:
  - TRAINING_DATA_PATH: data/training.json
  - VALIDATION_DATA_PATH: data/validation.json
  - MODEL_PATH: model/model.joblib
  - TRAINING_METRICS_PATH: metrics/training.json
stages:
  train:
    cmd: python -m stages.train
    deps:
      - ${TRAINING_DATA_PATH}
      - ${VALIDATION_DATA_PATH}
    outs:
      - ${MODEL_PATH}
    metrics:
      - ${TRAINING_METRICS_PATH}
which could be used by the stage:
import os

# DEPS_* and OUTS_* would be set by DVC from the stage's deps and outs.
model = train(os.environ['DEPS_TRAINING_DATA_PATH'])
model.save(os.environ['OUTS_MODEL_PATH'])
...
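To make the naming scheme concrete, here is a minimal sketch of how such an environment might be derived. It assumes the names come from the `${...}` variable references in each section, prefixed with the upper-cased section name. `build_magic_env` and the `METRICS_` prefix are assumptions for illustration; only `DEPS_` and `OUTS_` appear in the example above, and none of this is existing DVC behaviour.

```python
from __future__ import annotations

import re

# Matches an entry that consists solely of one variable reference, e.g. "${MODEL_PATH}".
VAR_REF = re.compile(r"\$\{(\w+)\}")


def build_magic_env(variables: dict[str, str], stage: dict) -> dict[str, str]:
    """Map '<SECTION>_<VAR_NAME>' to the resolved path for each referenced variable.

    Hypothetical helper: `stage` holds the un-interpolated stage definition.
    """
    env: dict[str, str] = {}
    for section in ("deps", "outs", "metrics"):
        for entry in stage.get(section, []):
            match = VAR_REF.fullmatch(entry)
            if match is None:
                # Literal paths carry no variable name to derive a key from.
                continue
            name = match.group(1)
            # Raises KeyError if the stage references an undefined variable.
            env[f"{section.upper()}_{name}"] = variables[name]
    return env


variables = {
    "TRAINING_DATA_PATH": "data/training.json",
    "VALIDATION_DATA_PATH": "data/validation.json",
    "MODEL_PATH": "model/model.joblib",
    "TRAINING_METRICS_PATH": "metrics/training.json",
}
stage = {
    "cmd": "python -m stages.train",
    "deps": ["${TRAINING_DATA_PATH}", "${VALIDATION_DATA_PATH}"],
    "outs": ["${MODEL_PATH}"],
    "metrics": ["${TRAINING_METRICS_PATH}"],
}

env = build_magic_env(variables, stage)
for key, value in env.items():
    print(f"{key}={value}")
```

Entries written as literal paths are skipped in this sketch, since there is no variable name to build a key from; a real design would need to decide how to expose those.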