Creating & updating dataset mirrors of Dandisets and their Zarrs for the
dandisets and dandizarrs organizations is done with the backups2datalad
command in this repository.
Before running backups2datalad, the following setup must be performed:
- backups2datalad must be installed in a Python environment using either pip install . (run from a clone of this repository) or pip install git+https://github.com/dandi/backups2datalad. At least Python 3.10 is required.
- git-annex must be installed. At least version 10.20240430 is required, though you should endeavor to obtain the latest version.
- An API token needs to be obtained for the DANDI instance that is being mirrored. When invoking backups2datalad, the environment variable DANDI_API_KEY must be set to the token.
- A configuration file should be written. This is a YAML file containing a mapping with the following keys:
  - dandi_instance: The name of the DANDI instance whose Dandisets should be mirrored. Defaults to "dandi".
  - s3bucket: The name of the S3 bucket on which the assets for the DANDI instance are stored. Currently, only buckets in the us-east-1 region are supported. Defaults to "dandiarchive".
    - When dandi_instance is "dandi", this should be "dandiarchive".
    - When dandi_instance is "dandi-staging", this should be "dandi-api-staging-dandisets".
  - s3endpoint: The base endpoint URL of the S3 instance on which the bucket is located. If this is set, the base bucket URL will be calculated as {s3endpoint}/{s3bucket}; otherwise, it will be https://{s3bucket}.s3.amazonaws.com. This option is intended primarily for use in testing.
  - content_url_regex: A regular expression used to identify which of an asset's contentUrl values is its S3 URL. Defaults to "amazonaws.com/.*blobs/".
  - dandisets: A mapping containing configuration specific to the mirroring of Dandisets. If not given, it will default to a mapping in which path is set to "dandisets" and all other fields are unset.
    - path (required): The path to the local directory in which dataset mirrors of Dandisets will be placed, relative to backup_root. The directory need not already exist.
      - This directory will be made into a DataLad dataset.
    - github_org: The name of the GitHub organization (which must already exist) to which the mirror repositories will be pushed. If not set, mirrors will not be pushed to GitHub. dandisets.github_org and zarrs.github_org must be either both set or both unset.
    - remote: Description of a git-annex special remote to create in new mirror repositories and for the populate subcommand to copy data to. If not set, populate cannot be run. When present, remote is a mapping with the following keys:
      - name (required): The name of the remote
      - type (required): The type of the remote
      - options (required): A string-valued mapping specifying parameters to pass to git-annex initremote
  - zarrs: A mapping containing configuration specific to the mirroring of Zarrs. If not given, backups2datalad will error upon trying to back up a Dandiset containing a Zarr. The mapping has the same schema as for dandisets.
    - zarrs.path will not be made into a DataLad dataset.
    - dandisets.github_org and zarrs.github_org must be either both set or both unset.
    - zarrs.remote is a prerequisite for the populate-zarrs subcommand.
  - backup_root: The path to the local directory in which the Dandiset and Zarr mirror directories will be placed. Defaults to the current directory.
    - This option can also be set via the --backup-root global CLI option, which overrides any value given in the configuration file.
  - asset_filter: A regular expression; if given, only assets whose paths match the regex will be processed.
    - This option can also be set via the --asset-filter option of the update-from-backup and release subcommands, which overrides any value given in the configuration file.
  - jobs (integer): The number of parallel git-annex jobs to use when downloading & pushing assets. Defaults to 10.
    - This option can also be set via the --jobs global CLI option, which overrides any value given in the configuration file.
  - workers (integer): The number of asynchronous worker tasks to run concurrently. Defaults to 5.
    - This option can also be set via the --workers option of the update-from-backup, backup-zarrs, populate, and populate-zarrs subcommands, which overrides any value given in the configuration file.
  - force: If set to "assets-update", all assets are forcibly updated, even those whose metadata hasn't changed.
    - This option can also be set via the --force option of the update-from-backup and release subcommands, which overrides any value given in the configuration file.
  - enable_tags (boolean): Whether to enable creation of tags for releases. Defaults to true.
    - This option can also be set via the --tags/--no-tags options of the update-from-backup subcommand, which override any value given in the configuration file.
  - gc_assets (boolean): If set and assets.json contains any assets neither on the server nor in the backup, delete the extra assets instead of erroring. Defaults to false.
    - This option can also be set via the --gc-assets option of the update-from-backup subcommand, which overrides any value given in the configuration file.
  - mode: Specify how to decide whether to back up a Dandiset. Possible values are:
    - "timestamp" (default): only back up if the timestamp of the last backup is older than the "modified" timestamp on the server
    - "force": always back up
    - "verify": always back up, but error if there are any changes without a change to the "modified" timestamp

    This option can also be set via the --mode option of the update-from-backup subcommand, which overrides any value given in the configuration file.
  - zarr_mode: Specify how to decide whether to back up a Zarr. Possible values are:
    - "timestamp" (default): only back up if the timestamp of the last backup is older than some Zarr entry in S3
    - "checksum": only back up if the Zarr checksum is out of date or doesn't match the expected value
    - "asset_checksum": only back up if the Zarr asset's "modified" timestamp is later than that in assets.json and the checksum is out of date or doesn't match the expected value
    - "force": always back up

    This option can also be set via the --zarr-mode option of the update-from-backup subcommand, which overrides any value given in the configuration file.
- If pushing mirror repositories to GitHub, a GitHub access token with appropriate permissions must be provided via one of the following methods (in order of precedence):
  - Set the GITHUB_TOKEN environment variable
  - Store the token in the hub.oauthtoken key of your ~/.gitconfig

  Additionally, an SSH key that has been registered with a GitHub account must be in use as well.
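As a rough illustration of the configuration rules above, the following Python sketch writes a configuration as the mapping the YAML file would load to and checks a few of the documented constraints. The helper names (base_bucket_url, github_orgs_consistent, pick_s3_url), the backup_root and zarrs.path values, and the choice of a substring search for content_url_regex are assumptions made for this example; they are not taken from the backups2datalad source.

```python
import re

# Illustrative configuration, written as the Python mapping that the YAML
# file would load to. All values are placeholders, not recommendations.
config = {
    "dandi_instance": "dandi",
    "s3bucket": "dandiarchive",
    "content_url_regex": "amazonaws.com/.*blobs/",
    "backup_root": "/data/backups",  # hypothetical path
    "dandisets": {"path": "dandisets", "github_org": "dandisets"},
    "zarrs": {"path": "dandizarrs", "github_org": "dandizarrs"},  # hypothetical path
    "jobs": 10,
    "workers": 5,
}


def base_bucket_url(cfg):
    """Base bucket URL, following the s3endpoint and s3bucket descriptions."""
    bucket = cfg.get("s3bucket", "dandiarchive")
    endpoint = cfg.get("s3endpoint")
    if endpoint is not None:
        return f"{endpoint}/{bucket}"
    return f"https://{bucket}.s3.amazonaws.com"


def github_orgs_consistent(cfg):
    """True if dandisets.github_org and zarrs.github_org are both set or both unset."""
    dandisets_org = cfg.get("dandisets", {}).get("github_org")
    zarrs_org = cfg.get("zarrs", {}).get("github_org")
    return (dandisets_org is None) == (zarrs_org is None)


def pick_s3_url(content_urls, cfg):
    """Return the first contentUrl that the regex finds a match in, or None.

    Using re.search (match anywhere in the URL) is an assumption of this sketch.
    """
    pattern = re.compile(cfg.get("content_url_regex", "amazonaws.com/.*blobs/"))
    for url in content_urls:
        if pattern.search(url):
            return url
    return None
```

With the default bucket and no s3endpoint, base_bucket_url(config) yields https://dandiarchive.s3.amazonaws.com, matching the rule stated for s3endpoint.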
Run backups2datalad with:
    backups2datalad --config path/to/config/file <subcommand> ...
The environment variable DANDI_API_KEY must be set to an API token for the
DANDI instance being mirrored.
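When the command is driven from a script, the invocation pattern above could be wrapped roughly as follows. This is a minimal sketch: build_command and run_backups2datalad are hypothetical helper names, and only the --config option and the subcommand position shown above are used.

```python
import os
import subprocess


def build_command(config_path, subcommand, *args):
    """Assemble the argv list for one backups2datalad invocation."""
    return ["backups2datalad", "--config", str(config_path), subcommand, *args]


def run_backups2datalad(config_path, subcommand, *args):
    """Run a subcommand, refusing to start if DANDI_API_KEY is not set."""
    if not os.environ.get("DANDI_API_KEY"):
        raise RuntimeError("DANDI_API_KEY must be set to a DANDI API token")
    cmd = build_command(config_path, subcommand, *args)
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(cmd, check=True)
```

Checking the environment variable before starting the process gives an early, explicit error instead of a failure partway through a run.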
Run backups2datalad --help for details on the global options and summaries of
the subcommands.
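Several configuration keys can also be set on the command line, and the command-line value overrides the one in the configuration file. The sketch below illustrates that precedence order (documented default, then configuration file, then CLI). The function name effective_settings and the convention of passing None for options not given on the command line are assumptions for the example, not the tool's actual implementation.

```python
# Defaults as stated in the configuration key descriptions.
DEFAULTS = {
    "jobs": 10,
    "workers": 5,
    "mode": "timestamp",
    "zarr_mode": "timestamp",
    "enable_tags": True,
    "gc_assets": False,
}


def effective_settings(file_config, cli_options):
    """Merge settings: defaults < configuration file < command line.

    cli_options maps key names to values, with None meaning
    "not given on the command line" (an assumption of this sketch).
    """
    settings = dict(DEFAULTS)
    # Values from the configuration file replace the defaults.
    settings.update({k: v for k, v in file_config.items() if k in DEFAULTS})
    # Values actually passed on the command line replace both.
    settings.update(
        {k: v for k, v in cli_options.items() if k in DEFAULTS and v is not None}
    )
    return settings
```

For example, a file setting of jobs: 4 combined with --jobs 2 on the command line results in 2 jobs, while an unset workers value falls back to 5.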
backups2datalad subcommands:
- update-from-backup: Create & update local mirrors of Dandisets and the Zarrs within them
- backup-zarrs: Create (but do not update) local mirrors of Zarrs for a single Dandiset
- update-github-metadata: Update homepages and descriptions for mirrors pushed to GitHub
- release: Create a tag (and a GitHub release, if pushing to GitHub) in a Dandiset mirror for a given published version
- populate: Copy assets from local Dandiset mirrors to the git-annex special remote
- populate-zarrs: Copy assets from local Zarr mirrors to the git-annex special remote
- zarr-checksum: Compute the Zarr checksum for a given Zarr mirror
- register-s3urls: Ensure that all blob assets in the backup of the given Dandiset have their S3 URLs registered with git-annex
Run backups2datalad <subcommand> --help for further details on each
subcommand.
The primary mirroring subcommands are update-from-backup, populate, and
populate-zarrs; the other subcommands are for minor/maintenance tasks and
usually do not need to be run.
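Which of the primary subcommands can be used depends on the configuration: populate requires dandisets.remote and populate-zarrs requires zarrs.remote. A small sketch of that check follows. The function name and the example remote (a git-annex directory special remote with placeholder name and path) are assumptions chosen for illustration.

```python
def available_primary_subcommands(cfg):
    """List the primary subcommands that the given configuration allows."""
    commands = ["update-from-backup"]
    # populate needs a special remote configured under dandisets.remote.
    if cfg.get("dandisets", {}).get("remote") is not None:
        commands.append("populate")
    # populate-zarrs needs a special remote configured under zarrs.remote.
    if cfg.get("zarrs", {}).get("remote") is not None:
        commands.append("populate-zarrs")
    return commands


# Placeholder remote description with the three required keys.
example_remote = {
    "name": "backup-remote",  # hypothetical remote name
    "type": "directory",
    "options": {"directory": "/mnt/annex-backup", "encryption": "none"},
}

example_config = {
    "dandisets": {"path": "dandisets", "remote": example_remote},
    "zarrs": {"path": "dandizarrs"},  # no remote, so populate-zarrs is unavailable
}
```

Here available_primary_subcommands(example_config) lists update-from-backup and populate but not populate-zarrs, because zarrs.remote is unset.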