Description
We currently support a so-called "external outputs" scenario, which is based on creating a separate external cache location for each type of output (e.g. s3, ssh, etc.). This scenario is unpolished and even outright broken in some cases, and it is also constantly misused with the intention of importing files from remote locations into the local cache/workspace or a remote. People often don't realise that this effectively extends your workspace outside of your local repo and needs proper isolation, so that you don't run into a conflict with a colleague running dvc checkout while you are working on a file/dir in the external workspace.
This makes me think that we need to introduce proper terminology and an abstraction that clearly states what this is all about and makes this powerful feature usable. The solution is to introduce the concept of "workspaces". It could possibly look something like this:
*** DRAFT ***
- Define the external workspace you want to attach:
dvc workspace add myssh ssh://example.com/home/efiop
Now, unless explicitly configured otherwise, dvc will assume that you want to use ssh://example.com/home/efiop/.dvc/cache as the default cache for artifacts in that workspace, similar to your local repo.
- Use your workspace:
dvc add ssh://example.com/home/efiop/data
dvc run -o ssh://example.com/home/efiop/model.pkl ...
or with a special workspace-notation (similar to a so-called remote-notation that we currently have: remote://myremote/path)
dvc add ws://myssh/data
dvc run -o ws://myssh/model.pkl
This notation is nice because it allows you to redefine your workspaces (e.g. for each coworker to use his own home directory on the server) pretty easily (plus the config options can be set once in the config section for that workspace).
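To make the proposed notation concrete, here is a minimal Python sketch of how a ws:// path could be expanded against a workspace definition, and how the default cache location could be derived from it. This is only an illustration of the draft above, not existing dvc code: the names WORKSPACES, resolve_ws and default_cache are hypothetical, and so is the ssh://example.com/home/coworker URL in the usage example.

```python
from urllib.parse import urlsplit

# Hypothetical registry, standing in for whatever
# `dvc workspace add myssh ssh://example.com/home/efiop` would record.
WORKSPACES = {"myssh": "ssh://example.com/home/efiop"}


def resolve_ws(url, workspaces=WORKSPACES):
    """Expand ws://<name>/<path> into a full URL; return other URLs unchanged."""
    parts = urlsplit(url)
    if parts.scheme != "ws":
        return url
    base = workspaces[parts.netloc]  # raises KeyError for an undefined workspace
    return base.rstrip("/") + "/" + parts.path.lstrip("/")


def default_cache(name, workspaces=WORKSPACES):
    """Assumed default from the draft: <workspace url>/.dvc/cache."""
    return workspaces[name].rstrip("/") + "/.dvc/cache"


if __name__ == "__main__":
    print(resolve_ws("ws://myssh/data"))
    print(default_cache("myssh"))
    # A coworker redefines the same workspace name to point at their own directory.
    print(resolve_ws("ws://myssh/data", {"myssh": "ssh://example.com/home/coworker"}))
```

Because the stage files would only store ws://myssh/..., each user could point myssh somewhere else without touching them, which is the redefinition benefit described above.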
The current cache.local/ssh/s3/gs/etc sections will be removed, because it is wrong that we operate based on the cache scheme rather than on the workspace we are working in.
CC @PeterFogh , would appreciate your thoughts on this 🙂