Command-line tool for downloading datasets published by CZ Biohub. Resolves a collection ID to its constituent datasets and downloads files from S3 and HTTP, with progress bars, size estimates, and dry-run accounting.
To install the OPS data CLI, run:

    pip install biohub-data-cli

See what a collection contains without downloading:

    ops-data download collection <collection-id> --dry-run

Download a collection to the current directory:

    ops-data download collection <collection-id>

Download multiple collections to a specific directory, skipping the prompt:

    ops-data download collection <id-a> <id-b> -o ./data -y

Download only specific datasets from a collection:

    ops-data download collection <collection-id> --dataset dataset-1,dataset-2

Files land under `<outdir>/<collection-slug>/<dataset-slug>/`.
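For post-download scripts, the output layout can be mirrored with a small path helper. This is a minimal sketch, not part of the CLI: the function name `dataset_dir` and the slugs used below are hypothetical and only illustrate the `<outdir>/<collection-slug>/<dataset-slug>/` convention.

```python
from pathlib import Path


def dataset_dir(outdir: str, collection_slug: str, dataset_slug: str) -> Path:
    """Return the directory where a dataset's files are expected to land."""
    return Path(outdir) / collection_slug / dataset_slug


# Hypothetical slugs, for illustration only.
target = dataset_dir("./data", "example-collection", "dataset-1")
print(target.as_posix())
```

The real slugs for a collection can be read from the `--dry-run` output.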
Download one or more collections by ID.

| Option | Description |
|---|---|
| `-o, --outdir PATH` | Output directory. Defaults to `.` (the current directory). |
| `-y, --yes` | Skip the size-estimate confirmation prompt. |
| `--dataset SLUGS` | Comma-separated dataset slugs to download a subset of the collection. Only valid with a single collection. |
| `--dry-run` | Print per-dataset size statistics without downloading. Mutually exclusive with `-y`. |
| `--no-resume` | Ignore cached listing state and re-list/re-download from scratch. |

A dry run resolves every S3 URI (listing prefixes and issuing HEAD requests for individual objects) to report exact byte totals per dataset. HTTP URLs are not sized during a dry run and surface as a warning in the summary.
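The accounting described above can be pictured as a simple aggregation over already-resolved entries. The sketch below is an illustration under assumptions, not the tool's implementation: `summarize_sizes` is a hypothetical name, the entries are made-up sample data, and no network calls are made.

```python
from urllib.parse import urlparse


def summarize_sizes(entries):
    """Sum known S3 object sizes and collect HTTP URLs that were not sized.

    entries: iterable of (uri, size_in_bytes or None).
    Returns (total_bytes, unsized_http_urls).
    """
    total = 0
    unsized = []
    for uri, size in entries:
        scheme = urlparse(uri).scheme
        if scheme == "s3":
            total += size
        elif scheme in ("http", "https"):
            # HTTP URLs are not sized in a dry run; report them separately.
            unsized.append(uri)
        else:
            raise ValueError(f"Unsupported URI scheme: {uri}")
    return total, unsized


# Made-up sample entries, for illustration only.
total, unsized = summarize_sizes([
    ("s3://example-bucket/a.zarr", 100),
    ("s3://example-bucket/b.zarr", 250),
    ("https://example.org/c.zip", None),
])
print(f"{total} bytes from S3; {len(unsized)} HTTP URL(s) not sized")
```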
Filtering datasets with `--dataset` downloads only the named datasets from a collection instead of all of them, e.g. `--dataset dataset-1,dataset-2`. Slugs are downloaded in the order given, duplicates are ignored, and an unknown slug fails with the list of available slugs. Run with `--dry-run` first to see the available slugs. Filtering applies to a single collection, so it can't be combined with multiple IDs.
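The filtering rules (order preserved, duplicates ignored, unknown slugs rejected with the available list) can be sketched as follows. This is a hypothetical helper written for illustration; `select_datasets` and the error wording are assumptions, not the CLI's actual code or messages.

```python
def select_datasets(available, requested):
    """Return the requested slugs in the order given, without duplicates.

    available: list of slugs in the collection.
    requested: comma-separated string, as passed to --dataset.
    Raises ValueError for a slug that is not in the collection.
    """
    selected = []
    for slug in (part.strip() for part in requested.split(",")):
        if not slug:
            continue  # tolerate stray commas
        if slug not in available:
            raise ValueError(
                f"Unknown dataset slug {slug!r}. Available: {', '.join(available)}"
            )
        if slug not in selected:
            selected.append(slug)
    return selected


available = ["dataset-1", "dataset-2", "dataset-3"]
print(select_datasets(available, "dataset-2,dataset-1,dataset-2"))
```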
The confirmation prompt shows the aggregate size estimate before any bytes move. Pass `-y` to skip it in scripts.
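For scripted use, a thin wrapper can assemble the command line and check the documented option constraints before invoking the CLI. This is a sketch under assumptions: `build_download_command` is a hypothetical helper, and it uses only the flags listed in the option table above.

```python
import shlex


def build_download_command(collection_ids, outdir=None, yes=False,
                           datasets=None, dry_run=False):
    """Build an argv list for `ops-data download collection`."""
    if not collection_ids:
        raise ValueError("at least one collection ID is required")
    if dry_run and yes:
        raise ValueError("--dry-run and -y are mutually exclusive")
    if datasets and len(collection_ids) != 1:
        raise ValueError("--dataset is only valid with a single collection")

    cmd = ["ops-data", "download", "collection", *collection_ids]
    if outdir:
        cmd += ["-o", outdir]
    if datasets:
        cmd += ["--dataset", ",".join(datasets)]
    if dry_run:
        cmd.append("--dry-run")
    if yes:
        cmd.append("-y")
    return cmd


# Placeholder IDs, for illustration only.
cmd = build_download_command(["id-a", "id-b"], outdir="./data", yes=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which raises `CalledProcessError` when the CLI exits non-zero.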
Failures are collected and reported at the end. Other downloads continue after a failure, so one bad URL won't abort the run, but the process exits non-zero if any download failed.
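The collect-then-report behavior follows a common pattern, sketched below. This is an illustrative sketch only: `run_all`, `exit_code`, and the stand-in tasks are hypothetical and do not reflect how the CLI is implemented internally.

```python
import sys


def run_all(tasks):
    """Run every (name, callable) task; return a list of (name, exception)."""
    failures = []
    for name, task in tasks:
        try:
            task()
        except Exception as exc:
            # Record the failure and keep going with the remaining tasks.
            failures.append((name, exc))
    return failures


def exit_code(failures):
    """Report failures on stderr and return a process exit status."""
    for name, exc in failures:
        print(f"FAILED {name}: {exc}", file=sys.stderr)
    return 1 if failures else 0


def failing_download():
    # Stand-in for a download that hits a bad URL.
    raise OSError("bad URL")


tasks = [
    ("first", lambda: None),
    ("second", failing_download),
    ("third", lambda: None),
]
status = exit_code(run_all(tasks))
print("exit status:", status)
```

A real entry point would finish with `sys.exit(status)` so that calling scripts can detect partial failure.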
This project uses uv for dependency management.
Install dependencies (including dev extras):

    uv sync

Run tests:

    uv run pytest

Run tests with coverage report:

    uv run pytest --cov=biohub_data_cli --cov-report=term-missing

Run the CLI from a checkout:

    uv run ops-data --help

Tests marked `integration` hit real S3 buckets / HTTP servers and are deselected by default. Run them explicitly:

    uv run pytest -m integration

This project adheres to the Contributor Covenant code of conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to opensource@chanzuckerberg.com.
If you believe you have found a security issue, please responsibly disclose by contacting us at security@chanzuckerberg.com.