Skip to content

Use pooch for dataset downloads (#244)#251

Merged
eroell merged 8 commits into
mainfrom
feature/use-pooch-for-downloads
Jun 4, 2026
Merged

Use pooch for dataset downloads (#244)#251
eroell merged 8 commits into
mainfrom
feature/use-pooch-for-downloads

Conversation

@eroell

@eroell eroell commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Closes #244.

Switches dataset downloads to pooch, aligning with the scverse ecosystem (cf. scverse/pertpy#980).

Changes

  • _dataloader.py: replaced the custom requests-based downloader with pooch.retrieve + pooch.HTTPDownloader.
  • Progress is rendered by pooch via tqdm.

eroell and others added 8 commits June 3, 2026 21:34
Replace the custom requests-based downloader with pooch, aligning with
the scverse ecosystem (cf. scverse/pertpy#980). pooch handles the
transfer and on-disk caching; the existing FileLock guard, extracted-
folder cache check, and multi-format archive extraction are preserved.
The download progress bar is rendered by pooch via tqdm instead of rich,
which is more robust in notebooks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve conflicts in favor of the pooch-based downloader (#244):
- _dataloader.py: keep pooch.retrieve + tqdm; drop the requests retry
  loop and env-var retry defaults that #250 added (retries are regressed
  away, pooch owns the transfer). max_retries/retry_delay remain in the
  signature for backwards compatibility but are unused.
- CHANGELOG: keep both Unreleased entries; trim #250's entry to the CI
  dataset caching, which still applies (pooch short-circuits on existing
  files), since the retry/env-var behavior it described is gone.
- Keep #250's CI cache workflows and conftest fixture change unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
These configured the requests retry loop that pooch replaced; they are
now no-ops. Remove them from the test and notebook CI workflows so pooch
just handles the download. The dataset caching steps are unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Get closer to pertpy's pooch downloader (scverse/pertpy#980):
- Remove the unused max_retries/retry_delay params, the `del` statement,
  and their docstring lines from _download; pooch owns the transfer.
- Drop the FileLock guard (and the filelock dependency); pertpy no longer
  uses it. pooch downloads to a unique temp file and moves atomically.
- Remove the now-pointless CI retry kwargs from the diabetes_130 loaders.
- Bump the CI dataset cache keys (v1 -> v2) so this PR actually downloads
  the datasets fresh through pooch instead of relying on the old cache.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With filelock gone and the cache cold (v2), tests that download the same
dataset to the default path could extract concurrently under `-n auto`.
Mark the colliding pairs with @pytest.mark.xdist_group and run CI with
--dist loadgroup so they land on the same worker:
- mimic-iv: test_mimic_iv_omop + the omop_connection_mimic_iv fixture test
- physionet2012: test_physionet2012 + test_physionet2012_arguments
- physionet2019: test_physionet2019 + test_physionet2019_arguments

Register the xdist_group marker so runs without pytest-xdist don't warn.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
From testing the dataset loaders in a notebook:
- Silence the one-time "IProgress not found" tqdm warning that pooch
  triggers by importing tqdm.auto at module load in a Jupyter kernel
  without ipywidgets; we render a plain-text bar, so the warning is noise.
- Drop the "Moving files ..." info log (and the now-unused logger import)
  from the OMOP dataset setup; it was just noise on every fresh extract.
- Fix ed.dt.physionet2019() on the full dataset: persons whose dynamic
  measurements all fall outside the observation window were dropped from
  the long->xarray pivot, leaving the time-series tensor with fewer rows
  than obs (40333 vs 40336) and raising a shape mismatch. Reindex the
  pivot onto all obs RecordIDs (in obs order) so missing persons are
  padded with NaN and the layer stays aligned with obs.

Verified: full physionet2019 -> (40336, 35, 48); physionet2012/2019
(incl. subsampled) and gibleed regression tests pass; a notebook fresh
download shows a clean tqdm bar with no ipywidgets warning.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@eroell eroell merged commit 2d1d15d into main Jun 4, 2026
14 checks passed
@eroell eroell deleted the feature/use-pooch-for-downloads branch June 4, 2026 07:35
Comment thread pyproject.toml
"requests",
"pooch",
"rich",
"tqdm",

@Zethson Zethson Jun 4, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need tqdm as dependency because rich itself can already render the progressbar.

https://github.com/scverse/pertpy/blob/main/pertpy/data/_dataloader.py#L12

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opted for a simpler progressbar with tqdm, and not rich

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like this we need a new dependency tho and it's not necessary

from ehrdata._logger import logger

with warnings.catch_warnings():
warnings.filterwarnings("ignore", message="IProgress not found")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we really do this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an ugly warning is raised during import without this, since right now I use tqdm, instead of rich+ipywidgets

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use pooch for dataset downloads

2 participants