
Commit d4f256f

Authored by selmanozleyen, pre-commit-ci[bot], and flying-sheep

Adding a registry to have the hashes of datasets (restructured for aws s3) (#1076)
* init
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* linter errors
* readthedocs fix
* extension bug fix
* use cache dir
* all downloads cache to the squidpy default. Don't use the scanpy default since it's relative. It's fine if set globally
* format
* add docs
* PathLike refactor
* redirect notebooks to the correct module
* update script
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* since we have the hash of downloaded files we don't need to update for each new script
* update script
* format
* remove agent spoofing
* remove fallbacks
* if path is not None
* more structured logic
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* resolve comments
* fix logging import
* replace all "from scanpy import logging as logg" with spatialdata loggers
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* clarify comments
* use sc.settings
* fix datapath
* spatialdata logger doesn't accept time; will create an issue about logging
* revert logging thing to put it in a separate issue
* fix blunder
* fix path comparison
* remove fallback test
* added comment about visium_hne_sdata and increased timeout
* clarify docstring
* remove fallback urls - I thought I already did :(
* completely remove fallback from codebase
* make registry thing clearer
* update test_downloader
* fix small mistake with file entry
* remove unused properties: is_single_file, is_adata_with_image, has_hires_image
* rename to visium_10x for the format
* raise an ExceptionGroup
* explicit emptiness check
* apply @flying-sheep's suggestion
* apply previous suggestion to other places
* apply Traversable suggestion
* get rid of the first_entry thing
* add cache test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Update src/squidpy/datasets/_registry.py (co-authored by Philipp A.)
* Update src/squidpy/datasets/_registry.py (co-authored by Philipp A.)
* import pathlike
* add cache to notebook CIs as well

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Philipp A. <[email protected]>
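The headline change is a registry of dataset hashes, so each cached file can be checked against a known digest before it is trusted (with pooch backing the actual downloads). A minimal, stdlib-only sketch of the verification idea — the registry entry and file name below are made up for illustration; the real registry lives in src/squidpy/datasets/_registry.py and also tracks URLs:

```python
import hashlib
from pathlib import Path

# Hypothetical registry mapping file name -> expected SHA-256 digest.
# Real entries are defined in squidpy's registry, not here.
REGISTRY = {
    "example_dataset.h5ad": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}


def verify(path: Path, expected_sha256: str) -> bool:
    """Return True if the file's SHA-256 digest matches the registry entry."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest == expected_sha256
```

With a digest on record, a cached file that fails verification can simply be re-downloaded, which is why the corruption-retry fallback logic could be removed from the download script.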
1 parent c653810 · commit d4f256f

21 files changed: +1944 −662 lines

.github/workflows/test-notebooks.yaml

Lines changed: 42 additions & 0 deletions
@@ -11,7 +11,37 @@ concurrency:
   cancel-in-progress: true
 
 jobs:
+  ensure-data-is-cached:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v5
+        with:
+          filter: blob:none
+          fetch-depth: 0
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v7
+        with:
+          enable-cache: false
+          python-version: "3.13"
+
+      - name: Restore data cache
+        id: data-cache
+        uses: actions/cache@v4
+        with:
+          path: data # IMPORTANT: this will fail if scanpy.settings.datasetdir default changes
+          key: data-${{ hashFiles('**/download_data.py') }}
+          restore-keys: |
+            data-
+          enableCrossOsArchive: true
+
+      - name: Download datasets
+        # Always run to ensure any missing files are downloaded
+        # (restore-keys may provide partial cache)
+        run: uvx hatch run data:download
+
   test:
+    needs: [ensure-data-is-cached]
     runs-on: ${{ matrix.os }}
     strategy:
       fail-fast: false
@@ -33,8 +63,20 @@ jobs:
           enable-cache: true
           python-version: ${{ matrix.python }}
           cache-dependency-glob: pyproject.toml
+
+      - name: Restore data cache
+        id: data-cache
+        uses: actions/cache@v4
+        with:
+          path: data # IMPORTANT: this will fail if scanpy.settings.datasetdir default changes
+          key: data-${{ hashFiles('**/download_data.py') }}
+          restore-keys: |
+            data-
+          enableCrossOsArchive: true
 
       - name: Create notebooks environment
         run: uvx hatch -v env create notebooks
+
       - name: Test notebooks
         env:
           MPLBACKEND: agg

.github/workflows/test.yaml

Lines changed: 11 additions & 10 deletions
@@ -42,14 +42,15 @@ jobs:
         id: data-cache
         uses: actions/cache@v4
         with:
-          path: |
-            ~/.cache/squidpy/*.h5ad
-            ~/.cache/squidpy/*.zarr
+          path: data # IMPORTANT: this will fail if scanpy.settings.datasetdir default changes
           key: data-${{ hashFiles('**/download_data.py') }}
+          restore-keys: |
+            data-
           enableCrossOsArchive: true
 
       - name: Download datasets
-        if: steps.data-cache.outputs.cache-hit != 'true'
+        # Always run to ensure any missing files are downloaded
+        # (restore-keys may provide partial cache)
         run: uvx hatch run data:download
 
       # Get the test environment from hatch as defined in pyproject.toml.
@@ -122,10 +123,10 @@ jobs:
         id: data-cache
         uses: actions/cache@v4
         with:
-          path: |
-            ~/.cache/squidpy/*.h5ad
-            ~/.cache/squidpy/*.zarr
+          path: data # IMPORTANT: this will fail if scanpy.settings.datasetdir default changes
           key: data-${{ hashFiles('**/download_data.py') }}
+          restore-keys: |
+            data-
           enableCrossOsArchive: true
 
       - name: System dependencies (Linux)
@@ -181,10 +182,10 @@ jobs:
         id: coverage-data-cache
         uses: actions/cache@v4
         with:
-          path: |
-            ~/.cache/squidpy/*.h5ad
-            ~/.cache/squidpy/*.zarr
+          path: data # IMPORTANT: this will fail if scanpy.settings.datasetdir default changes
           key: data-${{ hashFiles('**/download_data.py') }}
+          restore-keys: |
+            data-
           enableCrossOsArchive: true
 
       - name: System dependencies (Linux)
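Both workflows now key the cache on hashFiles('**/download_data.py') with a restore-keys fallback of data-, so even a stale cache is restored and the download step (which always runs) only fetches whatever is missing. Roughly, that key computation works like the following sketch — illustrative only; actions/cache's real hashFiles() implementation differs in detail:

```python
import hashlib
from pathlib import Path


def cache_key(root: Path, pattern: str = "**/download_data.py", prefix: str = "data-") -> str:
    """Build a content-addressed cache key: hash all matching files in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(root.glob(pattern)):
        digest.update(path.read_bytes())
    return prefix + digest.hexdigest()
```

Because restore-keys matches on the data- prefix, editing download_data.py changes the exact key but still restores the most recent data-* cache as a starting point.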

.scripts/ci/download_data.py

Lines changed: 42 additions & 49 deletions
@@ -1,77 +1,70 @@
 #!/usr/bin/env python3
+"""Download datasets to populate CI cache.
+
+This script downloads all datasets that tests might need.
+The downloader handles caching to scanpy.settings.datasetdir.
+"""
+
 from __future__ import annotations
 
 import argparse
-from pathlib import Path
-from typing import Any
 
-from squidpy.datasets import visium_hne_sdata
+from scanpy import settings
+from spatialdata._logging import logger
 
 _CNT = 0  # increment this when you want to rebuild the CI cache
-_ROOT = Path.home() / ".cache" / "squidpy"
-
-
-def _print_message(func_name: str, path: Path, *, dry_run: bool = False) -> None:
-    prefix = "[DRY RUN]" if dry_run else ""
-    if path.is_file():
-        print(f"{prefix}[Loading] {func_name:>25} <- {str(path):>25}")
-    else:
-        print(f"{prefix}[Downloading] {func_name:>25} -> {str(path):>25}")
-
-
-def _maybe_download_data(func_name: str, path: Path) -> Any:
-    import squidpy as sq
-
-    try:
-        return getattr(sq.datasets, func_name)(path=path)
-    except Exception as e:  # noqa: BLE001
-        print(f"File {str(path):>25} seems to be corrupted: {e}. Removing and retrying")
-        path.unlink()
-
-        return getattr(sq.datasets, func_name)(path=path)
 
 
 def main(args: argparse.Namespace) -> None:
     from anndata import AnnData
 
     import squidpy as sq
+    from squidpy.datasets._downloader import get_downloader
 
-    all_datasets = sq.datasets._dataset.__all__ + sq.datasets._image.__all__
-    all_extensions = ["h5ad"] * len(sq.datasets._dataset.__all__) + ["tiff"] * len(sq.datasets._image.__all__)
+    downloader = get_downloader()
+    registry = downloader.registry
+
+    # Visium samples tested in CI
+    visium_samples_to_cache = [
+        "V1_Mouse_Kidney",
+        "Targeted_Visium_Human_SpinalCord_Neuroscience",
+        "Visium_FFPE_Human_Breast_Cancer",
+    ]
 
     if args.dry_run:
-        for func_name, ext in zip(all_datasets, all_extensions):
-            if func_name == "visium_hne_sdata":
-                ext = "zarr"
-            path = _ROOT / f"{func_name}.{ext}"
-            _print_message(func_name, path, dry_run=True)
+        logger.info("Cache: %s", settings.datasetdir)
+        logger.info(
+            "Would download: %d AnnData, %d images, %d SpatialData, %d Visium",
+            len(registry.anndata_datasets),
+            len(registry.image_datasets),
+            len(registry.spatialdata_datasets),
+            len(visium_samples_to_cache),
+        )
         return
 
-    # could be parallelized, but on CI it largely does not matter (usually limited to 2 cores + bandwidth limit)
-    for func_name, ext in zip(all_datasets, all_extensions):
-        if func_name == "visium_hne_sdata":
-            ext = "zarr"
-            path = _ROOT / f"{func_name}.{ext}"
-
-            _print_message(func_name, path)
-            obj = visium_hne_sdata(_ROOT)
+    # Download all datasets - the downloader handles caching
+    for name in registry.anndata_datasets:
+        obj = getattr(sq.datasets, name)()
+        assert isinstance(obj, AnnData)
 
-            assert path.is_dir(), f"Expected a .zarr folder at {path}"
-            continue
+    for name in registry.image_datasets:
+        obj = getattr(sq.datasets, name)()
+        assert isinstance(obj, sq.im.ImageContainer)
 
-        path = _ROOT / f"{func_name}.{ext}"
-        _print_message(func_name, path)
-        obj = _maybe_download_data(func_name, path)
+    for name in registry.spatialdata_datasets:
+        getattr(sq.datasets, name)()
 
-        # we could do without the AnnData check as well (1 less req. in tox.ini), but it's better to be safe
-        assert isinstance(obj, AnnData | sq.im.ImageContainer), type(obj)
-        assert path.is_file(), path
+    for sample in visium_samples_to_cache:
+        obj = sq.datasets.visium(sample, include_hires_tiff=True)
+        assert isinstance(obj, AnnData)
 
 
 if __name__ == "__main__":
-    parser = argparse.ArgumentParser(description="Download data used for tutorials/examples.")
+    parser = argparse.ArgumentParser(description="Download datasets to populate CI cache.")
     parser.add_argument(
-        "--dry-run", action="store_true", help="Do not download any data, just print what would be downloaded."
+        "--dry-run",
+        action="store_true",
+        help="Do not download, just print what would be downloaded.",
     )
 
     main(parser.parse_args())
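The rewritten script leans on a registry object that exposes per-type name lists (registry.anndata_datasets, registry.image_datasets, registry.spatialdata_datasets). A hypothetical sketch of that shape — the real class in src/squidpy/datasets/_registry.py also carries URLs and hashes, and the example names here may not match the actual entries:

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRegistry:
    """Hypothetical sketch of the registry interface the CI script relies on.

    Only the name lists used by download_data.py are modeled; the real
    registry also maps each name to its download URL and SHA-256 digest.
    """

    anndata_datasets: list[str] = field(default_factory=list)
    image_datasets: list[str] = field(default_factory=list)
    spatialdata_datasets: list[str] = field(default_factory=list)


# Example instance; names are illustrative, not an exhaustive listing.
registry = DatasetRegistry(
    anndata_datasets=["imc", "seqfish"],
    image_datasets=["visium_fluo_image_crop"],
    spatialdata_datasets=["visium_hne_sdata"],
)
```

Grouping names by return type is what lets the script assert the right type per category (AnnData, ImageContainer, or a SpatialData object) without per-dataset special cases.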

pyproject.toml

Lines changed: 2 additions & 0 deletions
@@ -60,6 +60,8 @@ dependencies = [
     "omnipath>=1.0.7",
     "pandas>=2.1",
     "pillow>=8",
+    "pooch>=1.6",
+    "pyyaml>=6",
     "scanpy>=1.9.3",
     "scikit-image>=0.25",
     # due to https://github.com/scikit-image/scikit-image/issues/6850 breaks rescale ufunc
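The two new runtime dependencies fit together: pooch downloads files and checks them against known hashes, while pyyaml parses the registry they are listed in. A registry entry might look something like this — the name, URL, and digest below are entirely hypothetical; the real contents are defined in the squidpy source:

```yaml
# Hypothetical example entry; real dataset names, URLs, and digests
# live in squidpy's registry, not here.
imc:
  url: https://squidpy-datasets.s3.amazonaws.com/imc.h5ad
  sha256: 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
```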
