
Conversation

@jleinonen (Collaborator)

PhysicsNeMo Pull Request

Description

Adds a generic Xarray-based dataloader (XarrayDataset) for CorrDiff. It is meant for users with simple use cases who do not need or want to write their own dataloaders, and it can also serve as a baseline for more complex dataloaders.

The feature list is intentionally kept compact to reduce clutter, supporting a few common use cases and optimizations:

  • Support for any file type readable by xarray.open_dataset, so e.g. NetCDF4 and Zarr will work.
  • One or more data files can be used.
  • Data can optionally be pre-loaded to RAM, speeding up reads for small datasets.
  • Time ranges can be specified (e.g. for splitting into training and validation sets), and sample times can be excluded (e.g. to filter out bad data).
  • Data can be sharded so that each process uses only a slice of the dataset; this allows better caching and larger datasets to be used with load_to_memory == True.
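The sharding idea in the last bullet can be illustrated with a small sketch (the function name and rank/world-size parameters are hypothetical, not the dataloader's actual API):

```python
def shard_indices(num_samples: int, rank: int, world_size: int) -> list[int]:
    """Assign each process a disjoint, strided slice of sample indices,
    so load_to_memory only has to hold roughly 1/world_size of the data."""
    return list(range(rank, num_samples, world_size))

# Four processes sharing ten samples: every index is covered exactly once.
shards = [shard_indices(10, r, 4) for r in range(4)]
assert sorted(i for s in shards for i in s) == list(range(10))
```

Striding (rather than contiguous chunks) keeps each shard's time coverage representative of the whole dataset, which matters when the shards feed training and validation splits.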

The xarray_generic.py module also supplies a function, create_sample_dataset, that generates a mock data file to serve as a template for real data files.

A YAML configuration file for the dataset is included.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

@jleinonen jleinonen self-assigned this Oct 22, 2025
Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR adds a generic Xarray-based dataloader (XarrayDataset) to the CorrDiff weather example, providing an out-of-the-box solution for users with simple datasets. The implementation includes a flexible configuration system, support for multiple file formats (NetCDF4, Zarr), optional data preloading, time filtering, and worker sharding. The PR also introduces a dynamic dataset registration system (register_dataset) that allows users to plug in custom dataset classes from external files without modifying core code. The new dataloader fits into the existing Hydra configuration structure used by CorrDiff, accessible via - dataset: xarray in the defaults section.

PR Description Notes: CHANGELOG.md is marked incomplete in the checklist and should be updated before merging. No linked issue provided.

Critical Issues

  1. Mutable default argument bug (xarray_generic.py:114): The parameter open_dataset_kwargs: dict[str, Any] = {} uses a mutable default that will be shared across all instances, potentially causing cross-contamination of dataset configurations. Change to None and initialize as open_dataset_kwargs = open_dataset_kwargs or {}.

  2. Logic error with empty invariant variables (xarray_generic.py:147-154, 235-236): When invariant_variables=[] is explicitly provided, the condition if invariant_variables or (invariant_variables is None) evaluates to False, setting self.invariants = [] and self.invariant_variables = []. Line 235's check if self.invariant_variables: is then also False, so concatenation is skipped. The underlying problem is that the current logic treats [] the same as "no invariants requested" even when the empty list is intentional. Use if invariant_variables is not None: so all three cases are handled correctly (None = default, [] = explicitly none, ["var1"] = specific variables).

  3. Silent registration failure (dataset.py:64-65): The early return when class_name in known_datasets silently prevents registration of datasets with the same class name from different files. This could cause confusing behavior where a user expects their custom dataset to be used but the system uses a previously registered one instead. Should log a warning or raise an error to alert users of the conflict.

  4. Inconsistent registration key (dataset.py:87): The dataset is registered using the full dataset_spec (e.g., "path/to/file.py::MyDataset") as the key, but users would more naturally reference it by class name (e.g., type: MyDataset). This creates a usability issue where the registration syntax doesn't match the usage syntax. Consider registering by class name or documenting this behavior clearly.

Potential Issues

  1. Datetime deprecation (xarray_generic.py:312): Uses datetime.datetime.utcfromtimestamp(), which is deprecated in Python 3.12+. Replace with datetime.datetime.fromtimestamp(t.tolist() / 1e9, tz=datetime.timezone.utc).

  2. Typo in config (xarray.yaml:40): "daatloader" should be "dataloader" in the comment "Use a separate subset of data for each daatloader worker".

  3. Missing auto-registration (dataset.py:150-154): When an unknown dataset type is encountered, the code will raise a KeyError instead of attempting to auto-register from the dataset_spec. This means users must manually call register_dataset() before init_dataset_from_config(), which isn't obvious from the API design.

  4. Type hints incomplete (xarray_generic.py:114): The function signature uses dict[str, Any] which is correct, but consider using dict[str, Any] | None for clarity once the mutable default is fixed.
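The datetime replacement in potential issue 1 can be checked in isolation; here the nanosecond timestamp is a hypothetical integer standing in for the value a numpy datetime64 yields via t.tolist():

```python
import datetime

# Hypothetical nanosecond-precision epoch timestamp, as t.tolist() would
# return for a numpy datetime64[ns] value.
t_ns = 1_761_134_400_000_000_000

# datetime.datetime.utcfromtimestamp() is deprecated since Python 3.12;
# the timezone-aware replacement suggested by the review:
dt = datetime.datetime.fromtimestamp(t_ns / 1e9, tz=datetime.timezone.utc)

assert dt.tzinfo is datetime.timezone.utc
assert (dt.year, dt.month, dt.day) == (2025, 10, 22)
```

Unlike utcfromtimestamp, the result carries an explicit UTC tzinfo, so downstream comparisons with other aware datetimes behave correctly.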

Confidence Score: 3 / 5

The core functionality is well-designed and addresses a real need, but the critical bugs (mutable default, logic error, silent failures) could cause runtime errors or unexpected behavior. These issues should be resolved before merging.

Additional Comments (1)

  1. examples/weather/corrdiff/datasets/dataset.py, line 64-65 (link)

    logic: If class_name is already registered, the function returns early. This means if two different files define classes with the same name, only the first will be registered. Is this the intended behavior? Should there be a check to prevent registering different classes with the same name from different files, or a warning when this happens?
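One way to address the silent-conflict concern is to warn (or raise) when a different class arrives under an already-registered name; this is a sketch with hypothetical names, not the dataset.py implementation:

```python
import warnings

known_datasets: dict[str, type] = {}

def register_dataset(class_name: str, cls: type) -> None:
    """Register a dataset class, warning on name conflicts instead of
    silently keeping the first registration."""
    existing = known_datasets.get(class_name)
    if existing is not None and existing is not cls:
        warnings.warn(
            f"Dataset {class_name!r} is already registered to a different "
            "class; keeping the first registration."
        )
        return
    known_datasets[class_name] = cls
```

Raising an error instead of warning would be stricter; either way the user learns that their class was not the one registered.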

4 files reviewed, 3 comments


output_variables: list[str] | None = None,
invariant_variables: list[str] | None = None,
load_to_memory: bool = False,
open_dataset_kwargs: dict[str, Any] = {},

logic: Mutable default argument {} can lead to shared state across instances. Use None and initialize inside.

Comment on lines +147 to +154
if invariant_variables or (invariant_variables is None):
(self.invariants, self.invariant_variables) = _load_data(
self.datasets[0], "invariant", invariant_variables
)
self.invariants = self.invariants.values
else:
self.invariants = []
self.invariant_variables = []

logic: An explicitly empty invariant_variables == [] causes a problem at line 236: if [] evaluates to False, so the concatenation that expects an array is skipped. Consider checking len(self.invariant_variables) > 0 at line 236.

@CharlelieLrt CharlelieLrt self-assigned this Oct 22, 2025
