
Conversation

@jleinonen (Collaborator)

PhysicsNeMo Pull Request

Description

Adds a generic Xarray-based dataloader (XarrayDataset) for CorrDiff. It is meant for users with simple use cases who do not need or want to write their own dataloaders, and it can also serve as a baseline for more complex dataloaders.

The feature list is intentionally kept compact to reduce clutter, supporting a few common use cases and optimizations:

  • Support for any file type readable by xarray.open_dataset, so e.g. NetCDF4 and Zarr will work.
  • One or more data files can be used.
  • Data can optionally be pre-loaded to RAM, speeding up reads for small datasets.
  • Time ranges can be specified (e.g. for splitting into training and validation sets), and sample times can be excluded (e.g. to filter out bad data).
  • Data can be sharded so that each process uses only a slice of the dataset; this allows better caching and larger datasets to be used with load_to_memory == True.
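The sharding idea in the last bullet can be illustrated with a small sketch (the function name and rank/world-size parameters are hypothetical, not the dataloader's actual API):

```python
def shard_indices(num_samples: int, rank: int, world_size: int) -> list[int]:
    """Assign each process a disjoint, strided slice of sample indices,
    so load_to_memory only has to hold roughly 1/world_size of the data."""
    return list(range(rank, num_samples, world_size))

# Four processes sharing ten samples: every index is covered exactly once.
shards = [shard_indices(10, r, 4) for r in range(4)]
assert sorted(i for s in shards for i in s) == list(range(10))
```

Striding (rather than contiguous chunks) keeps each shard's time coverage representative of the whole dataset, which matters when the shards feed training and validation splits.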

The xarray_generic.py module also supplies a function, create_sample_dataset, that generates a mock data file to serve as a template for real data files.

A YAML configuration file for the dataset is included.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

@jleinonen jleinonen self-assigned this Oct 22, 2025
Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR adds a generic Xarray-based dataloader (XarrayDataset) to the CorrDiff weather example, providing an out-of-the-box solution for users with simple datasets. The implementation includes a flexible configuration system, support for multiple file formats (NetCDF4, Zarr), optional data preloading, time filtering, and worker sharding. The PR also introduces a dynamic dataset registration system (register_dataset) that allows users to plug in custom dataset classes from external files without modifying core code. The new dataloader fits into the existing Hydra configuration structure used by CorrDiff, accessible via - dataset: xarray in the defaults section.

PR Description Notes: CHANGELOG.md is marked incomplete in the checklist and should be updated before merging. No linked issue provided.

Critical Issues

  1. Mutable default argument bug (xarray_generic.py:114): The parameter open_dataset_kwargs: dict[str, Any] = {} uses a mutable default that will be shared across all instances, potentially causing cross-contamination of dataset configurations. Change to None and initialize as open_dataset_kwargs = open_dataset_kwargs or {}.

  2. Logic error with empty invariant variables (xarray_generic.py:147-154, 235-236): When invariant_variables=[] is explicitly provided, the condition if invariant_variables or (invariant_variables is None) evaluates to False, setting self.invariants = [] and self.invariant_variables = []. Line 235's check if self.invariant_variables: is then also False, so concatenation is skipped. The underlying problem is that the current logic treats [] the same as "no invariants requested" even when the empty list is intentional. Use if invariant_variables is not None: so all three cases are handled correctly (None = default, [] = explicitly none, ["var1"] = specific variables).

  3. Silent registration failure (dataset.py:64-65): The early return when class_name in known_datasets silently prevents registration of datasets with the same class name from different files. This could cause confusing behavior where a user expects their custom dataset to be used but the system uses a previously registered one instead. Should log a warning or raise an error to alert users of the conflict.

  4. Inconsistent registration key (dataset.py:87): The dataset is registered using the full dataset_spec (e.g., "path/to/file.py::MyDataset") as the key, but users would more naturally reference it by class name (e.g., type: MyDataset). This creates a usability issue where the registration syntax doesn't match the usage syntax. Consider registering by class name or documenting this behavior clearly.

Potential Issues

  1. Datetime deprecation (xarray_generic.py:312): Uses datetime.datetime.utcfromtimestamp(), which is deprecated in Python 3.12+. Replace with datetime.datetime.fromtimestamp(t.tolist() / 1e9, tz=datetime.timezone.utc).

  2. Typo in config (xarray.yaml:40): "daatloader" should be "dataloader" in the comment "Use a separate subset of data for each daatloader worker".

  3. Missing auto-registration (dataset.py:150-154): When an unknown dataset type is encountered, the code will raise a KeyError instead of attempting to auto-register from the dataset_spec. This means users must manually call register_dataset() before init_dataset_from_config(), which isn't obvious from the API design.

  4. Type hints incomplete (xarray_generic.py:114): The function signature uses dict[str, Any] which is correct, but consider using dict[str, Any] | None for clarity once the mutable default is fixed.
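The datetime replacement in potential issue 1 can be checked in isolation; here the nanosecond timestamp is a hypothetical integer standing in for the value a numpy datetime64 yields via t.tolist():

```python
import datetime

# Hypothetical nanosecond-precision epoch timestamp, as t.tolist() would
# return for a numpy datetime64[ns] value.
t_ns = 1_761_134_400_000_000_000

# datetime.datetime.utcfromtimestamp() is deprecated since Python 3.12;
# the timezone-aware replacement suggested by the review:
dt = datetime.datetime.fromtimestamp(t_ns / 1e9, tz=datetime.timezone.utc)

assert dt.tzinfo is datetime.timezone.utc
assert (dt.year, dt.month, dt.day) == (2025, 10, 22)
```

Unlike utcfromtimestamp, the result carries an explicit UTC tzinfo, so downstream comparisons with other aware datetimes behave correctly.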

Confidence Score: 3 / 5

The core functionality is well-designed and addresses a real need, but the critical bugs (mutable default, logic error, silent failures) could cause runtime errors or unexpected behavior. These issues should be resolved before merging.

Additional Comments (1)

  1. examples/weather/corrdiff/datasets/dataset.py, line 64-65 (link)

    logic: If class_name is already registered, the function returns early. This means if two different files define classes with the same name, only the first will be registered. Is this the intended behavior? Should there be a check to prevent registering different classes with the same name from different files, or a warning when this happens?
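One way to address the silent-conflict concern is to warn (or raise) when a different class arrives under an already-registered name; this is a sketch with hypothetical names, not the dataset.py implementation:

```python
import warnings

known_datasets: dict[str, type] = {}

def register_dataset(class_name: str, cls: type) -> None:
    """Register a dataset class, warning on name conflicts instead of
    silently keeping the first registration."""
    existing = known_datasets.get(class_name)
    if existing is not None and existing is not cls:
        warnings.warn(
            f"Dataset {class_name!r} is already registered to a different "
            "class; keeping the first registration."
        )
        return
    known_datasets[class_name] = cls
```

Raising an error instead of warning would be stricter; either way the user learns that their class was not the one registered.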

4 files reviewed, 3 comments


output_variables: list[str] | None = None,
invariant_variables: list[str] | None = None,
load_to_memory: bool = False,
open_dataset_kwargs: dict[str, Any] = {},

logic: Mutable default argument {} can lead to shared state across instances. Use None and initialize inside.

Comment on lines +147 to +154
if invariant_variables or (invariant_variables is None):
(self.invariants, self.invariant_variables) = _load_data(
self.datasets[0], "invariant", invariant_variables
)
self.invariants = self.invariants.values
else:
self.invariants = []
self.invariant_variables = []

logic: An explicitly empty invariant_variables == [] causes a problem at line 236: if [] evaluates to False, so the concatenation that expects an array is skipped. Consider checking len(self.invariant_variables) > 0 at line 236.

@CharlelieLrt CharlelieLrt self-assigned this Oct 22, 2025
