chanzuckerberg/biohub-data-cli

data-cli

Command-line tool for downloading datasets published by CZ Biohub. Resolves a collection ID to its constituent datasets and downloads files from S3 and HTTP, with progress bars, size estimates, and dry-run accounting.

Installation

To install the OPS data CLI, run:

pip install biohub-data-cli

Quick start

See what a collection contains without downloading:

ops-data download collection <collection-id> --dry-run

Download a collection to the current directory:

ops-data download collection <collection-id>

Download multiple collections to a specific directory, skipping the prompt:

ops-data download collection <id-a> <id-b> -o ./data -y

Download only specific datasets from a collection:

ops-data download collection <collection-id> --dataset dataset-1,dataset-2

Files land under <outdir>/<collection-slug>/<dataset-slug>/.
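As a rough illustration of that layout, the sketch below builds the destination directory for one dataset. The function name `dataset_dir` and the example slugs are hypothetical and are not part of the CLI.

```python
from pathlib import Path


def dataset_dir(outdir: str, collection_slug: str, dataset_slug: str) -> Path:
    """Return <outdir>/<collection-slug>/<dataset-slug> as a Path.

    Hypothetical helper: it only mirrors the layout described above.
    """
    return Path(outdir) / collection_slug / dataset_slug


# Example slugs are made up for illustration.
print(dataset_dir("./data", "example-collection", "dataset-1"))
```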

Commands

ops-data download collection IDS...

Download one or more collections by ID.

| Option | Description |
| --- | --- |
| -o, --outdir PATH | Output directory. Defaults to the current directory (.). |
| -y, --yes | Skip the size-estimate confirmation prompt. |
| --dataset SLUGS | Comma-separated dataset slugs to download a subset of the collection. Only valid with a single collection. |
| --dry-run | Print per-dataset size statistics without downloading. Mutually exclusive with -y. |
| --no-resume | Ignore cached listing state and re-list/re-download from scratch. |

Dry run resolves every S3 URI (listing prefixes and issuing HEAD requests for objects) to report exact byte totals per dataset. HTTP URLs are not sized during dry run and surface as a warning in the summary.
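The accounting described above can be sketched as follows. This is not the tool's implementation: `dry_run_summary` and the injected `s3_size` callable are hypothetical, and the real S3 listing and HEAD logic is replaced by a lookup so the example runs offline.

```python
from typing import Callable, Iterable
from urllib.parse import urlsplit


def dry_run_summary(
    uris: Iterable[str], s3_size: Callable[[str], int]
) -> tuple[int, list[str]]:
    """Sum byte sizes for S3 URIs and collect HTTP URLs that were not sized.

    `s3_size` stands in for whatever resolves an S3 URI to a byte count.
    """
    total = 0
    unsized: list[str] = []
    for uri in uris:
        scheme = urlsplit(uri).scheme
        if scheme == "s3":
            total += s3_size(uri)
        elif scheme in ("http", "https"):
            # Not sized during a dry run; reported as a warning instead.
            unsized.append(uri)
        else:
            raise ValueError(f"unsupported URI scheme: {uri}")
    return total, unsized


# Made-up URIs and sizes, for illustration only.
fake_sizes = {"s3://example-bucket/a": 1024, "s3://example-bucket/b": 2048}
total, unsized = dry_run_summary(
    [*fake_sizes, "https://example.org/file.zip"], fake_sizes.__getitem__
)
print(f"{total} bytes from S3; {len(unsized)} HTTP URL(s) not sized")
```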

Filtering with --dataset downloads only the named datasets from a collection instead of all of them, e.g. --dataset dataset-1,dataset-2. Slugs are downloaded in the order given, duplicates are ignored, and an unknown slug fails with the list of available slugs. Run with --dry-run first to see the available slugs. Filtering applies to a single collection, so it can't be combined with multiple IDs.
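The selection rules above (keep the given order, ignore duplicates, reject unknown slugs) can be sketched as below. The function `select_datasets` is hypothetical, and skipping empty entries such as a trailing comma is an assumption rather than documented behavior.

```python
def select_datasets(requested: str, available: list[str]) -> list[str]:
    """Parse a comma-separated slug list against the available slugs.

    Keeps first-seen order, drops duplicates, and raises on unknown slugs.
    """
    selected: list[str] = []
    for slug in (part.strip() for part in requested.split(",")):
        if not slug or slug in selected:
            # Assumption: empty entries are skipped; duplicates are ignored.
            continue
        if slug not in available:
            raise ValueError(
                f"unknown dataset slug {slug!r}; available: {', '.join(available)}"
            )
        selected.append(slug)
    return selected


available = ["dataset-1", "dataset-2", "dataset-3"]
print(select_datasets("dataset-2,dataset-1,dataset-2", available))
```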

The confirmation prompt shows the aggregate size estimate before any bytes are transferred. Pass -y to skip it in scripts.
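For scripted use, a wrapper might assemble the argument list and check the documented constraints before invoking the CLI. The helper `build_download_args` below is hypothetical; it uses only the flags listed in the options table and does not run the command itself.

```python
from typing import Optional, Sequence


def build_download_args(
    ids: Sequence[str],
    outdir: Optional[str] = None,
    yes: bool = False,
    datasets: Optional[Sequence[str]] = None,
    dry_run: bool = False,
    no_resume: bool = False,
) -> list[str]:
    """Build an `ops-data download collection` argument list."""
    if not ids:
        raise ValueError("at least one collection ID is required")
    if dry_run and yes:
        raise ValueError("--dry-run and -y are mutually exclusive")
    if datasets and len(ids) != 1:
        raise ValueError("--dataset is only valid with a single collection")

    args = ["ops-data", "download", "collection", *ids]
    if outdir is not None:
        args += ["-o", outdir]
    if yes:
        args.append("-y")
    if datasets:
        args += ["--dataset", ",".join(datasets)]
    if dry_run:
        args.append("--dry-run")
    if no_resume:
        args.append("--no-resume")
    return args


# Placeholder IDs, as in the quick-start examples.
print(build_download_args(["<id-a>", "<id-b>"], outdir="./data", yes=True))
```

The resulting list can be passed to `subprocess.run(args, check=True)`, which raises `CalledProcessError` when the command exits non-zero.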

Failures are collected and reported at the end. The process exits non-zero if any download failed, but the remaining downloads still run: one bad URL won't abort the run.
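A minimal sketch of this collect-then-report pattern, assuming a hypothetical `fetch` callable that raises on failure. The names `download_all` and `exit_code` are illustrative, and the demo uses a fake fetcher so nothing touches the network.

```python
from typing import Callable, Iterable


def download_all(
    urls: Iterable[str], fetch: Callable[[str], None]
) -> list[tuple[str, Exception]]:
    """Attempt every URL and return the (url, error) pairs that failed."""
    failures: list[tuple[str, Exception]] = []
    for url in urls:
        try:
            fetch(url)
        except Exception as exc:
            # Record the failure and keep going with the remaining URLs.
            failures.append((url, exc))
    return failures


def exit_code(failures: list[tuple[str, Exception]]) -> int:
    """Non-zero when at least one download failed."""
    return 1 if failures else 0


# Demo with a fake fetcher that fails for one made-up URL.
attempted: list[str] = []


def fake_fetch(url: str) -> None:
    attempted.append(url)
    if "bad" in url:
        raise OSError("simulated failure")


failures = download_all(
    ["https://example.org/a", "https://example.org/bad", "https://example.org/c"],
    fake_fetch,
)
for url, exc in failures:
    print(f"FAILED {url}: {exc}")
print("exit code:", exit_code(failures))
```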

Development

This project uses uv for dependency management.

Install dependencies (including dev extras):

uv sync

Run tests:

uv run pytest

Run tests with coverage report:

uv run pytest --cov=biohub_data_cli --cov-report=term-missing

Run the CLI from a checkout:

uv run ops-data --help

Integration tests

Tests marked integration hit real S3 buckets and HTTP servers and are deselected by default. Run them explicitly:

uv run pytest -m integration

Code of Conduct

This project adheres to the Contributor Covenant code of conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to opensource@chanzuckerberg.com.

Reporting Security Issues

If you believe you have found a security issue, please responsibly disclose by contacting us at security@chanzuckerberg.com.
