Use pooch for dataset downloads (#244)#251
Merged
Merged
Conversation
Replace the custom requests-based downloader with pooch, aligning with the scverse ecosystem (cf. scverse/pertpy#980). pooch handles the transfer and on-disk caching; the existing FileLock guard, extracted- folder cache check, and multi-format archive extraction are preserved. The download progress bar is rendered by pooch via tqdm instead of rich, which is more robust in notebooks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve conflicts in favor of the pooch-based downloader (#244): - _dataloader.py: keep pooch.retrieve + tqdm; drop the requests retry loop and env-var retry defaults that #250 added (retries are regressed away, pooch owns the transfer). max_retries/retry_delay remain in the signature for backwards compatibility but are unused. - CHANGELOG: keep both Unreleased entries; trim #250's entry to the CI dataset caching, which still applies (pooch short-circuits on existing files), since the retry/env-var behavior it described is gone. - Keep #250's CI cache workflows and conftest fixture change unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
These configured the requests retry loop that pooch replaced; they are now no-ops. Remove them from the test and notebook CI workflows so pooch just handles the download. The dataset caching steps are unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Get closer to pertpy's pooch downloader (scverse/pertpy#980): - Remove the unused max_retries/retry_delay params, the `del` statement, and their docstring lines from _download; pooch owns the transfer. - Drop the FileLock guard (and the filelock dependency); pertpy no longer uses it. pooch downloads to a unique temp file and moves atomically. - Remove the now-pointless CI retry kwargs from the diabetes_130 loaders. - Bump the CI dataset cache keys (v1 -> v2) so this PR actually downloads the datasets fresh through pooch instead of relying on the old cache. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With filelock gone and the cache cold (v2), tests that download the same dataset to the default path could extract concurrently under `-n auto`. Mark the colliding pairs with @pytest.mark.xdist_group and run CI with --dist loadgroup so they land on the same worker: - mimic-iv: test_mimic_iv_omop + the omop_connection_mimic_iv fixture test - physionet2012: test_physionet2012 + test_physionet2012_arguments - physionet2019: test_physionet2019 + test_physionet2019_arguments Register the xdist_group marker so runs without pytest-xdist don't warn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
From testing the dataset loaders in a notebook: - Silence the one-time "IProgress not found" tqdm warning that pooch triggers by importing tqdm.auto at module load in a Jupyter kernel without ipywidgets; we render a plain-text bar, so the warning is noise. - Drop the "Moving files ..." info log (and the now-unused logger import) from the OMOP dataset setup; it was just noise on every fresh extract. - Fix ed.dt.physionet2019() on the full dataset: persons whose dynamic measurements all fall outside the observation window were dropped from the long->xarray pivot, leaving the time-series tensor with fewer rows than obs (40333 vs 40336) and raising a shape mismatch. Reindex the pivot onto all obs RecordIDs (in obs order) so missing persons are padded with NaN and the layer stays aligned with obs. Verified: full physionet2019 -> (40336, 35, 48); physionet2012/2019 (incl. subsampled) and gibleed regression tests pass; a notebook fresh download shows a clean tqdm bar with no ipywidgets warning. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Zethson
reviewed
Jun 4, 2026
| "requests", | ||
| "pooch", | ||
| "rich", | ||
| "tqdm", |
Member
There was a problem hiding this comment.
We don't need tqdm as dependency because rich itself can already render the progressbar.
https://github.com/scverse/pertpy/blob/main/pertpy/data/_dataloader.py#L12
Collaborator
Author
There was a problem hiding this comment.
Opted for a simpler progressbar with tqdm, and not rich
Member
There was a problem hiding this comment.
Like this we need a new dependency tho and it's not necessary
Zethson
reviewed
Jun 4, 2026
| from ehrdata._logger import logger | ||
|
|
||
| with warnings.catch_warnings(): | ||
| warnings.filterwarnings("ignore", message="IProgress not found") |
Collaborator
Author
There was a problem hiding this comment.
an ugly warning is raised during import without this, since right now I use tqdm, instead of rich+ipywidgets
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #244.
Switches dataset downloads to pooch, aligning with the scverse ecosystem (cf. scverse/pertpy#980).
Changes
_dataloader.py: replaced the customrequests-based downloader withpooch.retrieve+pooch.HTTPDownloader.