Corrdiff generic Xarray dataloader #1167
base: main
Greptile Overview
Greptile Summary
This PR adds a generic Xarray-based dataloader (`XarrayDataset`) to the CorrDiff weather example, providing an out-of-the-box solution for users with simple datasets. The implementation includes a flexible configuration system, support for multiple file formats (NetCDF4, Zarr), optional data preloading, time filtering, and worker sharding. The PR also introduces a dynamic dataset registration system (`register_dataset`) that allows users to plug in custom dataset classes from external files without modifying core code. The new dataloader fits into the existing Hydra configuration structure used by CorrDiff and is selected with `- dataset: xarray` in the `defaults` section.
PR Description Notes: CHANGELOG.md is marked incomplete in the checklist and should be updated before merging. No linked issue provided.
Critical Issues
- Mutable default argument bug (`xarray_generic.py:114`): The parameter `open_dataset_kwargs: dict[str, Any] = {}` uses a mutable default that will be shared across all instances, potentially causing cross-contamination of dataset configurations. Change to `None` and initialize as `open_dataset_kwargs = open_dataset_kwargs or {}`.
- Logic error with empty invariant variables (`xarray_generic.py:147-154, 235-236`): When `invariant_variables=[]` (empty list explicitly provided), the condition `if invariant_variables or (invariant_variables is None)` evaluates to `False`, setting `self.invariants = []` and `self.invariant_variables = []`. However, line 235 checks `if self.invariant_variables:` which will be `False`, skipping concatenation. But the real issue is that the current logic treats `[]` as "no invariants" when it might be intentional. The condition should be `if invariant_variables is not None:` to handle all three cases correctly (`None` = default, `[]` = explicitly none, `["var1"]` = specific variables).
- Silent registration failure (`dataset.py:64-65`): The early return when `class_name in known_datasets` silently prevents registration of datasets with the same class name from different files. This could cause confusing behavior where a user expects their custom dataset to be used but the system uses a previously registered one instead. Should log a warning or raise an error to alert users of the conflict.
- Inconsistent registration key (`dataset.py:87`): The dataset is registered using the full `dataset_spec` (e.g., `"path/to/file.py::MyDataset"`) as the key, but users would more naturally reference it by class name (e.g., `type: MyDataset`). This creates a usability issue where the registration syntax doesn't match the usage syntax. Consider registering by class name or documenting this behavior clearly.
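One way to address the registration-key mismatch is to split the spec string and register under the class name. The sketch below is illustrative only: `split_dataset_spec` and `registry` are hypothetical names, not part of the PR, and the only assumption taken from the review is that a spec has the form `"path/to/file.py::MyDataset"`.

```python
from __future__ import annotations


def split_dataset_spec(dataset_spec: str) -> tuple[str | None, str]:
    """Split 'path/to/file.py::ClassName' into (path, class_name).

    A bare 'ClassName' (no '::' separator) yields (None, 'ClassName').
    """
    path, sep, class_name = dataset_spec.rpartition("::")
    return (path if sep else None), class_name


# Hypothetical registry keyed by class name instead of the full spec string.
registry: dict[str, str] = {}

spec = "path/to/file.py::MyDataset"
path, class_name = split_dataset_spec(spec)
registry[class_name] = spec  # the lookup key now matches `type: MyDataset`
```

Keying by class name makes the config syntax and the registration syntax agree, at the cost of making name collisions between files more likely, so it pairs naturally with an explicit conflict check.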
Potential Issues
- Datetime deprecation (`xarray_generic.py:312`): Uses `datetime.datetime.utcfromtimestamp()`, which is deprecated in Python 3.12+. Replace with `datetime.datetime.fromtimestamp(t.tolist() / 1e9, tz=datetime.timezone.utc)`.
- Typo in config (`xarray.yaml:40`): "daatloader" should be "dataloader" in the comment "Use a separate subset of data for each daatloader worker".
- Missing auto-registration (`dataset.py:150-154`): When an unknown dataset type is encountered, the code will raise a `KeyError` instead of attempting to auto-register from the `dataset_spec`. This means users must manually call `register_dataset()` before `init_dataset_from_config()`, which isn't obvious from the API design.
- Type hints incomplete (`xarray_generic.py:114`): The function signature uses `dict[str, Any]` which is correct, but consider using `dict[str, Any] | None` for clarity once the mutable default is fixed.
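The datetime replacement suggested above can be sketched with the standard library alone. This assumes the timestamp is available as integer nanoseconds since the Unix epoch (which is what the `t.tolist()` call in the review is expected to yield for a nanosecond-resolution value); `ns_to_utc_datetime` is a hypothetical helper name.

```python
import datetime


def ns_to_utc_datetime(ns_since_epoch: int) -> datetime.datetime:
    """Convert integer nanoseconds since the Unix epoch to an aware UTC datetime."""
    return datetime.datetime.fromtimestamp(
        ns_since_epoch / 1e9, tz=datetime.timezone.utc
    )


# 2021-01-01T00:00:00Z expressed in nanoseconds since the epoch.
stamp = ns_to_utc_datetime(1_609_459_200_000_000_000)
print(stamp.isoformat())
```

Unlike `utcfromtimestamp()`, which returns a naive datetime, this returns a timezone-aware one. Code that compares the result with naive datetimes would need adjusting, for example via `stamp.replace(tzinfo=None)`.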
Confidence Score: 3 / 5
The core functionality is well-designed and addresses a real need, but the critical bugs (mutable default, logic error, silent failures) could cause runtime errors or unexpected behavior. These issues should be resolved before merging.
Additional Comments (1)
- `examples/weather/corrdiff/datasets/dataset.py`, lines 64-65. logic: If `class_name` is already registered, the function returns early. This means if two different files define classes with the same name, only the first will be registered. Is this the intended behavior? Should there be a check to prevent registering different classes with the same name from different files, or a warning when this happens?
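A minimal sketch of the warning behavior asked about here, assuming a plain dictionary registry. The names `known_datasets` and `register` mirror the review's wording, but the body is hypothetical and not the PR's code.

```python
import warnings

# Hypothetical registry mapping a dataset name to its class.
known_datasets: dict[str, type] = {}


def register(name: str, cls: type) -> None:
    """Register cls under name, warning if a different class already holds it."""
    existing = known_datasets.get(name)
    if existing is None:
        known_datasets[name] = cls
    elif existing is not cls:
        # Same name, different class: keep the first one but tell the user.
        warnings.warn(
            f"Dataset name {name!r} is already registered to "
            f"{existing.__module__}.{existing.__qualname__}; "
            "keeping the existing class.",
            stacklevel=2,
        )
    # Re-registering the identical class is a silent no-op.


class FirstDataset:
    pass


class SecondDataset:
    pass


register("MyDataset", FirstDataset)
register("MyDataset", SecondDataset)  # emits a UserWarning, keeps FirstDataset
```

Raising an error instead of warning is the stricter alternative. A warning keeps existing scripts running while still surfacing the conflict.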
4 files reviewed, 3 comments
    output_variables: list[str] | None = None,
    invariant_variables: list[str] | None = None,
    load_to_memory: bool = False,
    open_dataset_kwargs: dict[str, Any] = {},
logic: Mutable default argument {} can lead to shared state across instances. Use None and initialize inside.
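The pitfall can be reproduced in isolation. In the sketch below, `SharedDefault` and `FreshDefault` are hypothetical stand-ins for the dataset class; only the parameter name `open_dataset_kwargs` comes from the PR.

```python
from __future__ import annotations

from typing import Any


class SharedDefault:
    """Uses a mutable default: every instance gets the same dict object."""

    def __init__(self, open_dataset_kwargs: dict[str, Any] = {}):
        self.open_dataset_kwargs = open_dataset_kwargs


class FreshDefault:
    """Uses None as the default and builds a new dict per instance."""

    def __init__(self, open_dataset_kwargs: dict[str, Any] | None = None):
        # Copying also protects a caller-supplied dict from later mutation.
        self.open_dataset_kwargs = dict(open_dataset_kwargs or {})


a, b = SharedDefault(), SharedDefault()
a.open_dataset_kwargs["engine"] = "zarr"
shared_leak = "engine" in b.open_dataset_kwargs  # True: b sees a's change

c, d = FreshDefault(), FreshDefault()
c.open_dataset_kwargs["engine"] = "zarr"
fresh_leak = "engine" in d.open_dataset_kwargs  # False: d has its own dict
```

The shared default only causes trouble if the dictionary is mutated after construction, but the `None` pattern removes the risk regardless of how the value is used later.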
    if invariant_variables or (invariant_variables is None):
        (self.invariants, self.invariant_variables) = _load_data(
            self.datasets[0], "invariant", invariant_variables
        )
        self.invariants = self.invariants.values
    else:
        self.invariants = []
        self.invariant_variables = []
logic: An empty list `[]` will fail on line 236 when `invariant_variables == []`, since `if []` is `False` but the concatenation expects an array. Consider checking `len(self.invariant_variables) > 0` at line 236.
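The three cases under discussion (`None`, `[]`, and an explicit list) can be separated explicitly. The sketch below is a hypothetical helper, not the PR's `_load_data`. It assumes that `None` means "use every available invariant", which the review describes only as "default".

```python
from __future__ import annotations


def resolve_invariants(
    requested: list[str] | None, available: list[str]
) -> list[str]:
    """Return the invariant variable names to load.

    Assumed semantics (not confirmed by the PR):
      None         -> all available invariants
      []           -> no invariants
      ["a", "b"]   -> exactly those, which must all exist
    """
    if requested is None:
        return list(available)
    missing = [name for name in requested if name not in available]
    if missing:
        raise KeyError(f"Unknown invariant variables: {missing}")
    return list(requested)


available = ["orography", "land_sea_mask"]
all_names = resolve_invariants(None, available)
no_names = resolve_invariants([], available)
some_names = resolve_invariants(["orography"], available)
```

With the names resolved up front, the later concatenation step can be guarded by a single `if names:` check, so the empty case never reaches array code.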
PhysicsNeMo Pull Request
Description
Adds a generic Xarray-based dataloader (`XarrayDataset`) for CorrDiff. It is meant for users who have simple use cases and don't need or want to write their own dataloaders. It can also be used as a baseline for more complex dataloaders.

The feature list is intentionally kept compact to reduce clutter, supporting a few common use cases and optimizations:

- [...] `xarray.open_dataset`, so e.g. NetCDF4 and Zarr will work.
- [...] `load_to_memory == True`.

A function `create_sample_dataset` is supplied in the `xarray_generic.py` module that can be used to generate a mock data file to be used as a template for real data files.

A YAML configuration file for the dataset is included.
Checklist
Dependencies