
fix ssh fsspec: make put atomic #10498

Open
@SebastianRiechert2


I am having problems with an SSH remote. We use dvc only for dataset versioning.
We have little control over the remote's configuration, and when I push from the local cache to the remote, files often arrive corrupted. Running dvc push again does not fix the issue.
dvc status -c does not notice the corruption, even though the remote contains corrupted files and the local cache contains intact ones.
The problem goes unnoticed when working alone: dvc does not detect the corruption on the remote, considers the remote and the local cache in sync, and dvc pull takes the files from the local cache.
It only surfaces when a third party tries to pull the dataset:

  • When verification is turned off (in .dvc/config), dvc happily pulls the corrupted files.
  • When it is turned on, dvc fetch reports that the corrupted files are being fetched to the local cache, but they actually are not, as subsequent runs of dvc status -c reveal (all corrupted files on the remote appear as deleted). dvc checkout (and dvc pull) subsequently fail, I think because of the mismatch between the remote and the local cache.

We use dvc repro instead of adding files manually, so as far as I understand, fixing the problem by manually untracking and re-adding the files is off the table.

Shouldn't there be an option to check for corruption when pushing files to the remote? Something like a --verify option for dvc push.
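
To illustrate the kind of check I mean, here is a minimal sketch, not how dvc does it. It assumes the remote directory can be read as a local path (for example on the remote host itself or through an sshfs mount), and that objects are stored content-addressed as `<first 2 hex chars of md5>/<remaining 30 chars>`, with an optional `.dir` suffix for directory listings. The function names `md5_of` and `find_corrupted` are made up for this example.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large dataset files are not loaded into memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupted(store_root: Path) -> list[Path]:
    """Return files whose content hash does not match the hash encoded in their path.

    Assumed layout (not taken from dvc's source): .../<2 hex chars>/<30 hex chars>[.dir]
    """
    corrupted = []
    for path in sorted(store_root.rglob("*")):
        if not path.is_file():
            continue
        prefix = path.parent.name
        # Strip the assumed ".dir" suffix used for directory listings.
        expected = prefix + path.name.removesuffix(".dir")
        if len(prefix) != 2 or len(expected) != 32:
            continue  # not an object file under the assumed layout
        if md5_of(path) != expected:
            corrupted.append(path)
    return corrupted
```

Running `find_corrupted(Path("/mnt/dvc-remote"))` (a hypothetical mount point) after a push would list the objects that need to be re-uploaded. Whether the hash in the path always equals the plain md5 of the stored bytes may depend on the dvc version, so treat a reported mismatch as a hint rather than proof.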

Right now, verification happens only when pulling, at which point the corrupted files can no longer be fixed automatically by re-pulling or re-pushing. Also, the error we get when pulling from the corrupted remote (without an intact local cache) with verification turned on is not very helpful: it just asks whether the cache is up to date. The cache is up to date; the remote is what is causing the problems.
We are currently testing a workaround: running rsync after each push. Since there are no dvc hooks (like there are git hooks), I do not see an elegant way to automate this.
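
For reference, the workaround can be wrapped in a small script along these lines. This is only a sketch under assumptions: the remote directory is assumed to mirror the layout of the local cache, the paths in the usage note are placeholders, and the function name `push_then_rsync` is made up. `rsync --checksum` compares file contents instead of size and modification time, so corrupted remote copies get re-sent.

```python
import subprocess
from typing import Callable


def push_then_rsync(
    cache_dir: str,
    remote_target: str,
    run: Callable[..., object] = subprocess.run,
) -> list[list[str]]:
    """Run `dvc push`, then re-sync the local cache to the remote by checksum.

    `run` is injectable so the command sequence can be inspected without
    executing anything.
    """
    commands = [
        ["dvc", "push"],
        # The trailing slash makes rsync copy the directory contents, not the directory itself.
        ["rsync", "-a", "--checksum", cache_dir.rstrip("/") + "/", remote_target],
    ]
    for cmd in commands:
        run(cmd, check=True)  # with subprocess.run, stop on the first failing command
    return commands
```

Usage would look like `push_then_rsync(".dvc/cache", "user@host:/srv/dvc-remote/")`. Note that this syncs the whole local cache, not just the objects the current workspace references, and it has to be invoked by hand in place of dvc push.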

Labels: A: data-sync (related to dvc get/fetch/import/pull/push), feature request (requesting a new feature), help wanted, p1-important (important, aka current backlog of things to do)