Add an option to not follow symbolic links when hashing directory, or an option to ignore broken symlinks #9971

Open
@jnareb

Justification for the feature

I am working on a machine learning project in the Mining Software Repositories (MSR) field. One of the stages clones a repository. With DVC, I have tried to add the cloned repository (or rather, a repository with a set of cloned repositories) as an external output (not cached). The contents of the cloned repository, that is, its checked-out files, are not under the control of the cloning stage.

With this setup, when I run dvc repro or dvc stage, DVC fails with the following error:

ERROR: unexpected error - [Errno 2] No such file or directory:
 '/mnt/data/repositories/freeradius-server/src/tests/salt-test-server/salt/ldap/freeradius.ldif'

The error message is a bit misleading: the file mentioned in it does exist, but it is a broken symbolic link. The link itself exists; the file it references does not.
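To make the distinction concrete, here is a minimal Python sketch of how a broken symbolic link looks to the standard library. The helper name `describe_path` and the file names are hypothetical, chosen only for illustration; this is not DVC code.

```python
import os
import tempfile


def describe_path(path):
    """Classify a path without being fooled by dangling symlinks (hypothetical helper)."""
    # lexists() does not follow symlinks, so it is True even for a broken link.
    if not os.path.lexists(path):
        return "missing"
    if os.path.islink(path):
        # exists() follows the link, so it is False when the target is gone.
        return "symlink" if os.path.exists(path) else "broken symlink"
    return "not a symlink"


with tempfile.TemporaryDirectory() as tmp:
    link = os.path.join(tmp, "freeradius.ldif")
    # Create a link whose target does not exist.
    os.symlink(os.path.join(tmp, "no-such-target"), link)
    print(describe_path(link))
    try:
        open(link, "rb")
    except FileNotFoundError as exc:
        # The exception reports the link's own path, not the missing target.
        print(exc.errno, exc.filename == link)
```

The `open()` call fails with `ENOENT` and names the link path, which matches the shape of the error shown above.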

Option to not follow symbolic links

I propose that DVC have an option to not follow symbolic links when hashing directories or files. With such an option (whether it is named --no-follow-symlinks, --no-dereference, or something else), DVC would hash the contents of the symbolic link itself, that is, the path it stores, and not the file it points to.

By the way, as far as I understand it, this is the default behavior for 'tar' and similar tools.

If this option / feature is enabled, DVC would have to check whether the file is a symbolic link with os.path.islink() or Path.is_symlink(), and then, instead of using open() to read its contents (or rather the contents of the file it points to), use os.readlink() or Path.readlink().

This could be encapsulated in a custom open function and/or context manager.
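A minimal sketch of such a custom open function, assuming plain Python and the standard library only. The names `open_no_follow` and `hash_file_no_follow` are hypothetical and do not exist in DVC, and SHA-256 is used here only as an arbitrary example hash.

```python
import hashlib
import io
import os


def open_no_follow(path):
    """Open `path` for binary reading without following a symlink (hypothetical helper).

    For a symbolic link, the returned stream holds the link's stored target
    path as bytes; for anything else it is the ordinary file object.
    Both return values can be used as context managers.
    """
    if os.path.islink(path):
        return io.BytesIO(os.fsencode(os.readlink(path)))
    return open(path, "rb")


def hash_file_no_follow(path, chunk_size=1024 * 1024):
    """Hash the link text for symlinks, the file contents otherwise."""
    digest = hashlib.sha256()  # arbitrary choice for this sketch
    with open_no_follow(path) as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the link is never dereferenced, a broken symbolic link hashes just like a valid one, so the error from the report would not occur on this code path.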

Option to ignore broken symbolic links

If this option / feature is enabled, DVC would catch an exception that occurs when trying to open a broken symbolic link, extract the path to the file that is being opened or remember it, and if the file (the symbolic link) exists, it would ignore the error.

I think this is the more backward-compatible solution, and it might be easier to implement.
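The error-catching approach could look roughly like the following sketch. The function name `read_bytes_skip_broken` and the choice to return `None` for a skipped link are assumptions made for illustration, not existing DVC behavior.

```python
import os


def read_bytes_skip_broken(path):
    """Return the file contents, or None if `path` is a broken symlink (hypothetical helper)."""
    try:
        with open(path, "rb") as stream:
            return stream.read()
    except FileNotFoundError:
        # islink() inspects the link itself, so it is True for a dangling link.
        if os.path.islink(path):
            return None  # the link exists but its target does not: ignore it
        raise  # the path really is missing: keep the original error
```

A path that does not exist at all still raises, so only the broken-link case is silenced.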

Metadata

    Labels

    A: data-management (Related to dvc add/checkout/commit/move/remove)
    feature request (Requesting a new feature)
    p3-nice-to-have (It should be done this or next sprint)
