AML Pipelines: add __str__() implementation or get_env_variable_name() to Dataset #1531

Open
@johan12345

The PipelineData class has a get_env_variable_name() method that returns the name of the environment variable for that data reference (e.g. "$AZUREML_DATAREFERENCE_my_pipelinedata"). PipelineData's __str__ implementation returns the same string, so a PipelineData object can be used directly in string formatting to pass it as an argument to a pipeline step, even with a custom argument format (such as the one used by hydra.cc, as also mentioned in https://github.com/MicrosoftDocs/azure-docs/issues/66599):

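from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

# `datastore` is an existing Datastore object, obtained elsewhere
my_pipelinedata = PipelineData("my_pipelinedata", datastore=datastore, is_directory=True)
train_step = PythonScriptStep(
    script_name="train.py",
    arguments=[
        # str(my_pipelinedata) yields the environment variable name
        f"dataset.path={my_pipelinedata}"
    ]
    # ...
)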

Unfortunately, this does not work when you want to consume a Dataset. The DatasetConsumptionConfig class provides neither a get_env_variable_name() method nor a custom __str__() implementation. To use it in string formatting for arguments, you have to construct the name of the environment variable manually. That takes only a little more code, but it is inconsistent with how PipelineData works:

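from azureml.core import Dataset

def as_env_variable(dataset):
    # Manual workaround: build the environment variable name from the input name
    return f"${dataset.name}"

# `workspace` is an existing Workspace object, obtained elsewhere
my_dataset = (
    Dataset.get_by_name(workspace, name="my_dataset")
    .as_named_input("my_dataset")
    .as_mount()
)
train_step = PythonScriptStep(
    script_name="train.py",
    arguments=[
        f"dataset1.path={as_env_variable(my_dataset)}",  # Dataset: needs the helper
        f"dataset2.path={my_pipelinedata}"  # PipelineData: formats directly
    ]
    # ...
)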

Adding a __str__() implementation to the DatasetConsumptionConfig class, or at least a get_env_variable_name() method, would make such code more consistent.
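
To illustrate the requested behavior, here is a minimal sketch of how it could look from the user's side. It does not use the Azure ML SDK: NamedInputStub is a hypothetical stand-in for a DatasetConsumptionConfig (only its name attribute is used), and EnvVarInput is a hypothetical wrapper, not an existing SDK class. The "$" + name convention is the same one as in the as_env_variable() workaround above.

```python
from dataclasses import dataclass


@dataclass
class NamedInputStub:
    """Hypothetical stand-in for a DatasetConsumptionConfig; only `name` is used."""
    name: str


class EnvVarInput:
    """Hypothetical wrapper that formats as the input's environment variable name."""

    def __init__(self, consumption_config):
        self._config = consumption_config

    def get_env_variable_name(self) -> str:
        # Same convention as the as_env_variable() workaround: "$" + input name
        return f"${self._config.name}"

    def __str__(self) -> str:
        # Mirrors the PipelineData behavior described in this issue
        return self.get_env_variable_name()


my_dataset = EnvVarInput(NamedInputStub(name="my_dataset"))
argument = f"dataset1.path={my_dataset}"
print(argument)
```

With something like this built into DatasetConsumptionConfig itself, both kinds of input could be interpolated into the arguments list in the same way, without a separate helper function.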
