Description
I am facing some problems working with an SSH remote. We are only using dvc for dataset versioning.
We have little control over the remote's configuration, and often when I push from the local cache to the remote, files arrive on the remote corrupted. Multiple executions of dvc push do not fix the issue.
dvc status -c does not notice the corruption on the remote, even though the remote contains corrupted files and the local cache contains intact ones.
The problem goes unnoticed when working alone: since dvc does not detect the corruption on the remote, it thinks the remote and the local cache are in sync, and dvc pull takes the files from the local cache.
The problem arises when a third party tries to pull the dataset.
- When verification is turned off (in .dvc/config), dvc happily pulls the corrupted files.
- If it is turned on, dvc fetch tells me the corrupted files are being fetched to the local cache, but they actually are not, as subsequent runs of dvc status -c reveal (all corrupt files from the remote appear as deleted). dvc checkout (and dvc pull) subsequently fails, I think because of the mismatch between the remote and the local cache.
We are using dvc repro instead of manually adding files, so fixing the problem by manually untracking and re-adding the files is off the table as far as I understand.
Shouldn't there be an option to check for corruption when pushing files to the remote? Something like a --verify option for dvc push.
Right now, verification only happens during pulling, at which point the corrupted files can no longer be fixed automatically by re-pulling or re-pushing. Also, the error we get when trying to pull from the corrupted remote (without an intact local cache) with verification turned on is not very helpful. It just asks whether the cache is up to date. The cache is up to date; the remote is the thing causing problems.
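To illustrate the kind of check I have in mind, here is a minimal sketch that could be run against the remote's storage directory (on the remote host, or on a mounted copy of it). It is not part of dvc. It assumes the content-addressed layout where each object lives at <first 2 hex chars of the md5>/<remaining 30 hex chars>, optionally with a .dir suffix, and that the name equals the MD5 of the raw file content. Newer dvc versions may nest this under a files/md5 subdirectory, and older versions may normalize line endings of text files before hashing, so both points are assumptions to verify for a given setup. The function names are hypothetical.

```python
import hashlib
import re
from pathlib import Path
from typing import List

# Assumed layout: <root>/<2 hex chars>/<30 hex chars>[.dir]
_PREFIX = re.compile(r"^[0-9a-f]{2}$")
_REST = re.compile(r"^[0-9a-f]{30}(\.dir)?$")


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupt_objects(root: Path) -> List[Path]:
    """List objects whose content hash does not match their path."""
    corrupt = []
    for prefix_dir in sorted(root.iterdir()):
        if not (prefix_dir.is_dir() and _PREFIX.match(prefix_dir.name)):
            continue
        for obj in sorted(prefix_dir.iterdir()):
            if not (obj.is_file() and _REST.match(obj.name)):
                continue
            # Expected hash = directory name + first 30 chars of the file name
            expected = prefix_dir.name + obj.name[:30]
            if file_md5(obj) != expected:
                corrupt.append(obj)
    return corrupt
```

Running find_corrupt_objects(Path("/path/to/remote/storage")) after a push would return the objects that need to be re-uploaded; an empty list means every object matched its name under the assumptions above.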
We are currently testing a workaround, running rsync after each push. Since there are no dvc hooks (like there are git hooks), I do not see an elegant way of automating this.
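For reference, the workaround currently amounts to something like the wrapper below, a rough sketch rather than a recommended solution. It runs dvc push and then rsync with --checksum, so that files are compared by content rather than by size and modification time. The cache path .dvc/cache, the placeholder target user@host:/path/to/dvc-remote/, and the function names are assumptions for illustration, as is the premise that the remote's directory layout mirrors the local cache.

```python
import subprocess
from typing import List


def build_commands(cache_dir: str, remote_target: str) -> List[List[str]]:
    """Build the two commands: the normal push, then a content-checked sync."""
    return [
        ["dvc", "push"],
        # --checksum compares file contents instead of size/mtime, so
        # objects that were corrupted in transfer get re-sent.
        ["rsync", "--archive", "--checksum",
         cache_dir.rstrip("/") + "/", remote_target],
    ]


def push_and_repair(
    cache_dir: str = ".dvc/cache",
    remote_target: str = "user@host:/path/to/dvc-remote/",  # placeholder
) -> None:
    """Run both commands in order; stop at the first failure."""
    for command in build_commands(cache_dir, remote_target):
        subprocess.run(command, check=True)
```

Everyone on the team has to remember to call this wrapper instead of dvc push, which is exactly the part a hook or a built-in verify option would remove.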