Skip to content

Fix download logic#230

Merged
jemrobinson merged 23 commits intomainfrom
229-fix-download-logic
Feb 27, 2026
Merged

Fix download logic#230
jemrobinson merged 23 commits intomainfrom
229-fix-download-logic

Conversation

@jemrobinson
Copy link
Copy Markdown
Member

Updated create logic:

  1. if overwrite then delete existing dataset and skip to (4)
  2. if the path exists and download is in progress then end here
  3. if the path exists and inspect fails then delete existing dataset
  4. run download

Updated download logic:

  1. initialise if the path does not exist
  2. load, which should automatically check for pre-existing chunks
  3. if status is good, finalise

Removed:

  • custom lock file tracking of downloaded parts
  • manual init/load/finalise logic
  • unused tests

Closes #229

@jemrobinson jemrobinson requested a review from IFenton February 26, 2026 11:36
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Feb 26, 2026

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  icenet_mp/cli
  datasets.py 43-46
  icenet_mp/data_processors
  data_downloader.py 55-92, 96-107, 115-121, 125-144, 154-166, 170, 179-198
  icenet_mp/data_processors/preprocessors
  ipreprocessor.py
  icenet_mp/types
  __init__.py
  simple_datatypes.py
Project Total  

This report was generated by python-coverage-comment-action

@jemrobinson jemrobinson force-pushed the 229-fix-download-logic branch from 2c868b7 to bdfb08b Compare February 26, 2026 16:51
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR simplifies the dataset download logic by removing custom file-locking and progress tracking in favor of delegating to the anemoi-datasets library's built-in chunked loading capabilities. The changes address issue #229 which reported errors when using load_in_parts for downloading datasets.

Changes:

  • Simplified download logic to use anemoi-datasets' built-in chunk handling and progress tracking
  • Removed custom file-locking implementation and manual part tracking
  • Removed CLI commands init, load, and load_in_parts in favor of a unified create command
  • Upgraded anemoi-datasets dependency to >=0.5.27 to get necessary fixes

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
icenet_mp/data_processors/data_downloader.py Replaced manual load_in_parts logic with simplified create/download flow using anemoi-datasets' built-in capabilities; removed file-locking and manual progress tracking
icenet_mp/cli/datasets.py Removed init, load, and load_in_parts CLI commands; simplified inspect command
icenet_mp/types/simple_datatypes.py Removed AnemoiCreateArgs dataclass; reordered remaining types alphabetically
icenet_mp/types/init.py Removed AnemoiCreateArgs export
tests/conftest.py Added MockAnemoiCreateArgs class to replace removed AnemoiCreateArgs in test fixtures
tests/test_cli.py Updated CLI help test expectations to remove defunct commands
tests/data_processors/test_data_downloader.py Removed all tests for the old load_in_parts implementation (435+ lines removed)
pyproject.toml Removed filelock dependency; upgraded anemoi-datasets to >=0.5.27
uv.lock Reflected dependency changes from pyproject.toml
Comments suppressed due to low confidence (1)

tests/data_processors/test_data_downloader.py:67

  • All tests for the download functionality have been removed, leaving the new create(), download(), initialise(), status(), and load_in_chunks() methods completely untested. The fixtures downloader_with_file_dataset and downloader_with_directory_dataset are defined but unused. Consider adding tests for the new download logic to verify correct behavior in various scenarios: successful download, resuming partial downloads, handling corrupted datasets, overwrite functionality, and the status checking logic.
from pathlib import Path

import pytest
from omegaconf import DictConfig, OmegaConf

from icenet_mp.data_processors.data_downloader import DataDownloader
from icenet_mp.data_processors.preprocessors.ipreprocessor import IPreprocessor


class DummyPreprocessor(IPreprocessor):
    """A dummy preprocessor for testing."""

    def download(self, preprocessor_path: Path) -> None:  # pragma: no cover - unused
        preprocessor_path.mkdir(parents=True, exist_ok=True)


def _build_downloader(tmp_path: Path, dataset_cfg: dict) -> DataDownloader:
    """Helper to create a DataDownloader with a dummy preprocessor."""
    full_cfg: DictConfig = OmegaConf.create(
        {
            "base_path": str(tmp_path),
            "data": {
                "datasets": {
                    "test": {
                        "name": "test",
                        "preprocessor": {"type": "dummy"},
                        **dataset_cfg,
                    }
                },
            },
        }
    )
    return DataDownloader("test", full_cfg, DummyPreprocessor)


@pytest.fixture
def downloader_with_file_dataset(tmp_path: Path) -> DataDownloader:
    """Fixture that creates a downloader with a file-based dataset path."""
    downloader = _build_downloader(
        tmp_path,
        {
            "start": "2020-01-01",
            "end": "2020-01-31",
            "group_by": "monthly",
        },
    )
    downloader.path_dataset = tmp_path / "test.zarr"
    # Create the file to ensure it exists
    downloader.path_dataset.touch()
    return downloader


@pytest.fixture
def downloader_with_directory_dataset(tmp_path: Path) -> DataDownloader:
    """Fixture that creates a downloader with a directory-based dataset path."""
    downloader = _build_downloader(
        tmp_path,
        {
            "start": "2020-01-01",
            "end": "2020-01-31",
            "group_by": "monthly",
        },
    )
    downloader.path_dataset = tmp_path / "test_dir.zarr"
    downloader.path_dataset.mkdir(parents=True, exist_ok=True)
    return downloader


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread icenet_mp/data_processors/data_downloader.py Outdated
Comment thread icenet_mp/data_processors/data_downloader.py
Comment thread tests/test_cli.py
Comment thread icenet_mp/data_processors/data_downloader.py
Comment thread icenet_mp/data_processors/data_downloader.py Outdated
@jemrobinson
Copy link
Copy Markdown
Member Author

@IFenton : are you happy with this after our coworking yesterday?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

tests/data_processors/test_data_downloader.py:67

  • tests/data_processors/test_data_downloader.py no longer contains any test cases—only fixtures/helpers—and nothing else in tests/ references these fixtures. Consider deleting this file entirely, or add focused tests for the new DataDownloader flow (e.g., create() behavior around status()/finalise()), so changes to download logic remain covered.
from pathlib import Path

import pytest
from omegaconf import DictConfig, OmegaConf

from icenet_mp.data_processors.data_downloader import DataDownloader
from icenet_mp.data_processors.preprocessors.ipreprocessor import IPreprocessor


class DummyPreprocessor(IPreprocessor):
    """A dummy preprocessor for testing."""

    def download(self, preprocessor_path: Path) -> None:  # pragma: no cover - unused
        preprocessor_path.mkdir(parents=True, exist_ok=True)


def _build_downloader(tmp_path: Path, dataset_cfg: dict) -> DataDownloader:
    """Helper to create a DataDownloader with a dummy preprocessor."""
    full_cfg: DictConfig = OmegaConf.create(
        {
            "base_path": str(tmp_path),
            "data": {
                "datasets": {
                    "test": {
                        "name": "test",
                        "preprocessor": {"type": "dummy"},
                        **dataset_cfg,
                    }
                },
            },
        }
    )
    return DataDownloader("test", full_cfg, DummyPreprocessor)


@pytest.fixture
def downloader_with_file_dataset(tmp_path: Path) -> DataDownloader:
    """Fixture that creates a downloader with a file-based dataset path."""
    downloader = _build_downloader(
        tmp_path,
        {
            "start": "2020-01-01",
            "end": "2020-01-31",
            "group_by": "monthly",
        },
    )
    downloader.path_dataset = tmp_path / "test.zarr"
    # Create the file to ensure it exists
    downloader.path_dataset.touch()
    return downloader


@pytest.fixture
def downloader_with_directory_dataset(tmp_path: Path) -> DataDownloader:
    """Fixture that creates a downloader with a directory-based dataset path."""
    downloader = _build_downloader(
        tmp_path,
        {
            "start": "2020-01-01",
            "end": "2020-01-31",
            "group_by": "monthly",
        },
    )
    downloader.path_dataset = tmp_path / "test_dir.zarr"
    downloader.path_dataset.mkdir(parents=True, exist_ok=True)
    return downloader


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread icenet_mp/data_processors/data_downloader.py
Copy link
Copy Markdown
Contributor

@IFenton IFenton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That all seems to work. Happy to merge, though it would be good to update the README

@jemrobinson jemrobinson merged commit 7bc80a7 into main Feb 27, 2026
7 checks passed
@jemrobinson jemrobinson deleted the 229-fix-download-logic branch February 27, 2026 13:42
IFenton pushed a commit that referenced this pull request Mar 2, 2026
✏️ Update the README for the changes made in #230
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix bug in load_in_parts

3 participants