Description
In the `PipelineData` class, there is a `get_env_variable_name()` method that returns the name of the environment variable for this dataset (e.g. `"$AZUREML_DATAREFERENCE_my_pipelinedata"`). This is also the `__str__` implementation for `PipelineData`, so an instance can easily be used in string formatting to pass it as an argument to a pipeline step, even with a custom argument format (such as the one used by hydra.cc, as also mentioned in https://github.com/MicrosoftDocs/azure-docs/issues/66599):
```python
my_pipelinedata = PipelineData("my_pipelinedata", datastore=datastore, is_directory=True)

train_step = PythonScriptStep(
    script_name="train.py",
    arguments=[
        f"dataset.path={my_pipelinedata}"
    ],
    # ...
)
```
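For reference, the pattern `PipelineData` follows can be sketched in plain Python. This is a toy stand-in, not the actual azureml-sdk source; the `AZUREML_DATAREFERENCE_` prefix is assumed from the example value above:

```python
# Toy stand-in illustrating the PipelineData pattern: __str__ delegates to
# get_env_variable_name(), so instances drop straight into f-strings.
class ToyPipelineData:
    # Prefix assumed from the "$AZUREML_DATAREFERENCE_my_pipelinedata" example.
    ENV_PREFIX = "AZUREML_DATAREFERENCE_"

    def __init__(self, name):
        self.name = name

    def get_env_variable_name(self):
        return f"${self.ENV_PREFIX}{self.name}"

    # Reusing the method as __str__ is what makes string formatting "just work".
    __str__ = get_env_variable_name


data = ToyPipelineData("my_pipelinedata")
print(f"dataset.path={data}")  # prints: dataset.path=$AZUREML_DATAREFERENCE_my_pipelinedata
```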
Unfortunately, this is not the case if you want to consume a `Dataset`. The `DatasetConsumptionConfig` class provides neither a `get_env_variable_name()` method nor a custom `__str__()` implementation. So, to use it in string formatting for arguments, you have to construct the name of the environment variable manually, which is only a little extra code but inconsistent with how it works for `PipelineData`:
```python
def as_env_variable(dataset):
    return f"${dataset.name}"


my_dataset = (
    Dataset.get_by_name(workspace, name="my_dataset")
    .as_named_input("my_dataset")
    .as_mount()
)

train_step = PythonScriptStep(
    script_name="train.py",
    arguments=[
        f"dataset1.path={as_env_variable(my_dataset)}",
        f"dataset2.path={my_pipelinedata}"
    ],
    # ...
)
```
So adding a `__str__()` implementation to the `DatasetConsumptionConfig` class, or at least a `get_env_variable_name()` method, would make such code more consistent.
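Until such a method exists, one possible stopgap is to patch a `__str__` onto the class at pipeline-definition time. Sketched below on a stub class rather than the real `DatasetConsumptionConfig`, and assuming the environment variable is simply `$<input name>`, as in the `as_env_variable()` helper above:

```python
# Stub standing in for azureml's DatasetConsumptionConfig, which exposes the
# named input's name via a `name` attribute.
class StubConsumptionConfig:
    def __init__(self, name):
        self.name = name


# Monkey-patch: mirror PipelineData's behaviour by making str() return the
# environment-variable reference ("$<name>" is an assumption, see above).
StubConsumptionConfig.__str__ = lambda self: f"${self.name}"

cfg = StubConsumptionConfig("my_dataset")
print(f"dataset1.path={cfg}")  # prints: dataset1.path=$my_dataset
```

With the real class patched the same way, the `dataset1.path` and `dataset2.path` arguments in the example above could be formatted identically.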