This is a braindump on how to turn this tool into a library.
Goals:
- Users should be able to control the policy of the transfer (like Globus's "sync level" https://docs.globus.org/cli/reference/transfer/).
- Users should be able to inject arbitrary steps into the transfer DAG (e.g., archiving or processing steps, or even something like pull-process-push).
It's not too hard to imagine how the first goal might work: generate a list of "sync levels", triggered by flags passed through the initial command line invocation.
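For instance, a minimal sketch of what that could look like; the `SyncLevel` names below mirror Globus's exists/size/mtime/checksum levels and are purely hypothetical, not an existing interface in this tool:

```python
import argparse
import enum

class SyncLevel(enum.Enum):
    """Hypothetical sync policies, modeled on Globus's --sync-level choices."""
    EXISTS = "exists"      # copy only files missing at the destination
    SIZE = "size"          # also copy files whose sizes differ
    MTIME = "mtime"        # also copy files whose source is newer
    CHECKSUM = "checksum"  # copy whenever checksums differ

parser = argparse.ArgumentParser()
parser.add_argument(
    "--sync-level",
    type=SyncLevel,            # argparse converts the flag string to the enum by value
    choices=list(SyncLevel),
    default=SyncLevel.CHECKSUM,
)
args = parser.parse_args()
# args.sync_level would then steer which files the diff step includes.
```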
The second is much trickier, primarily because it is architecturally difficult to have programmatic control over the "inner" DAG. Currently, the file transfer tool is implemented as an "outer" DAG that
- Generates the file diff between local and remote
- Writes the inner DAG based on that diff
- Runs the inner DAG
- Verifies the integrity of the transfer manifest
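For orientation, here is a rough sketch of that outer structure expressed with the htcondor.dags Python bindings; the node names and executables are placeholders, not the tool's actual internals:

```python
from pathlib import Path

import htcondor
from htcondor import dags

# Illustrative only: each node below runs as its own process under DAGMan,
# far removed from the process the user originally invoked.
outer = dags.DAG()

diff = outer.layer(
    name="generate_diff",
    submit_description=htcondor.Submit({"executable": "generate_diff.py"}),
)
write_inner = diff.child_layer(
    name="write_inner_dag",
    submit_description=htcondor.Submit({"executable": "write_inner_dag.py"}),
)
# The inner DAG written by the previous node would then be attached as a
# SUBDAG node, followed by a manifest-verification node.

dags.write_dag(outer, Path("."))  # emits the outer .dag file for DAGMan to execute
```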
The important thing to note is that none of these steps happen in the process that the user originally invoked, and none of them even happen in the same process as each other. This makes it difficult to exert programmatic control directly. Worse, we will constantly be fighting against the natural control path of the system, which is that DAGMan is running everything, and components pass information to each other in bespoke ways.
A plugin architecture may therefore be easier to implement than a callable library in terms of executing the entire transfer process; we might still be able to expose some of the internals for re-use by other software, but it will be difficult to expose the entire workflow to programmatic control. Even easier is a "menu of options" provided by the tool itself, with neither plugins nor an external programmatic API, but that may make some future use cases harder. I would encourage hard thinking about whether other tools really, really need to be integrated with the transfer tool, or whether they can run separately. rsync doesn't have plugins...
Here are some options for moving ahead with librarification, not necessarily mutually exclusive, and obviously not exhaustive:
Option 1: Pickle Everything
The initial process takes function references for certain internals, pickles them, and re-uses those pickles later. For example, the user could pass a list of functions that mutate the inner DAG, and the inner-DAG-writer will then call them in order. This keeps overall control inside the library but exposes a plugin-point for users to inject new behavior.
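Something like the following sketch; the `DAGMutator` alias, function names, and file layout are all made up for illustration, and note that the stdlib pickle can only serialize module-level functions by reference, so closures or lambdas would need something like cloudpickle:

```python
import pickle
from pathlib import Path
from typing import Callable, List

# A hypothetical hook: a callable that mutates the inner DAG in place.
# "DAG" is whatever object the inner-DAG-writer builds (e.g. htcondor.dags.DAG).
DAGMutator = Callable[["DAG"], None]

def save_mutators(mutators: List[DAGMutator], path: Path) -> None:
    """Runs in the user's original process: pickle the plugin callables."""
    path.write_bytes(pickle.dumps(mutators))

def apply_mutators(dag: "DAG", path: Path) -> "DAG":
    """Runs later, inside the inner-DAG-writer node: unpickle and call each hook in order."""
    for mutate in pickle.loads(path.read_bytes()):
        mutate(dag)
    return dag
```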
The obvious problem is that we need a generic, powerful interface for the plugins to be useful. I'm not sure what that interface would be.
Option 2: Expose Internals
We can expose the individual layers of the inner DAG as re-usable components that can be inserted into other DAGs. Likely, we would define the layers up to some vars. Signature might be something like
```python
def add_transfer_layer(dag: DAG, remote_to_local: Dict[Path, Path], direction: TransferDirection) -> NodeLayer: ...
```

which would add a layer to the given DAG that transfers files according to the mapping and the transfer direction (push or pull). We could verify in the postscript, or have a separate `add_verify_layer` function, and expose extra options like retries. This function would presumably be fed the results of the file manifest diff, which is another thing we could expose.
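As a concrete (but hypothetical) sketch of how such a layer-builder might work with the htcondor.dags bindings, creating one node per file pair parameterized via node-level VARS; the worker executable and submit details are placeholders:

```python
import enum
from pathlib import Path
from typing import Dict

import htcondor
from htcondor import dags

class TransferDirection(enum.Enum):
    PULL = "pull"
    PUSH = "push"

def add_transfer_layer(
    dag: dags.DAG,
    remote_to_local: Dict[Path, Path],
    direction: TransferDirection,
) -> dags.NodeLayer:
    """Add one transfer node per (remote, local) pair to the given DAG."""
    submit = htcondor.Submit({
        "executable": "transfer_worker.py",  # placeholder worker script
        "arguments": f"--{direction.value} $(remote) $(local)",
        "log": "transfer.log",
        "output": "transfer_$(node_id).out",
        "error": "transfer_$(node_id).err",
    })
    return dag.layer(
        name=f"transfer_{direction.value}",
        submit_description=submit,
        vars=[
            {"remote": str(remote), "local": str(local), "node_id": str(i)}
            for i, (remote, local) in enumerate(remote_to_local.items())
        ],
    )
```

A caller could then hang a verification layer or retry policy off the returned `NodeLayer` in their own DAG.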
The problem with this option is that it would be very hard to run the original workflow - you would need to write your own workflow wrapper around the raw DAG the library helps you build. The current workflow is very tightly coupled to the structure of the script, particularly in how it is "all-singing, all-dancing" - entirely self-contained in a single script without any real external dependencies.