Skip to content

Fix multimet dataloader to support dynamic variable product names with underscores#238

Open
joshsturtevant wants to merge 5 commits intogoogle-research:mainfrom
joshsturtevant:js-dev
Open

Fix multimet dataloader to support dynamic variable product names with underscores#238
joshsturtevant wants to merge 5 commits intogoogle-research:mainfrom
joshsturtevant:js-dev

Conversation

@joshsturtevant
Copy link

Summary

  • Fixed bug where CHIRPS_GEFS would not parse correctly
  • Makes dynamic variable product name parsing in multimet.py more robust by normalizing product keys and handling known exceptions (beyond just ERA5_LAND)
  • Keeps existing config compatibility, but makes the loader more robust

Key Changes

  • Normalizes product names (case-insensitive, hyphens/underscores handled)
  • Adds explicit mapping for products that previously broke parsing (e.g., underscores)
  • Handles config keys like chirps_gefs, chirpsgefs, CHIRPS_GEFS, and CHIRPSGEFS
  • Adds logging when loading Zarr stores for both hindcast and forecast products

Motivation

  • Previously, the CHIRPS_GEFS product name would not parse correctly due to underscore
  • In contrast, the ERA5_LAND product name had a hardwired fix; this PR generalizes the solution

Testing

  • Ran with configs using various CHIRPS_GEFS and ERA5_LAND naming configurations (w and w/o underscores, etc.)
  • Loader correctly resolves to the correct product name and loads data as expected

@google-cla
Copy link

google-cla bot commented Feb 12, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@amitmarkel amitmarkel removed their request for review February 16, 2026 08:43
Copy link
Collaborator

@grey-nearing grey-nearing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Josh,

Thank you SO MUCH for being the first public contributor to GoogleHydrology!!! And thank you for your patience while we worked on the backend to make sure we had a system in place for handling public PRs!

This is a great fix for CHIRPS. Would you mind taking a look at my questions? It's possible that I've missed something important.

Thanks,
Grey

def _get_products_and_bands_from_feature_strings(
features: Iterable[str],
) -> dict[str, list[str]]:
def _get_products_and_bands_from_feature_dict(feature_dict):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The config input properties are not always dicts. They can be either dicts or lists, depending on whether you want to use a model with vs. without feature groups. This is why the cfg.hindcast_inputs and cfg.forecast_inputs are flattened in lines 182 and 183.

However, this caused me to notice that the typehints in ~/googlehydrology/utils/config.py are wrong -- the typehints only show lists, not dicts, and it should be a Union. I'm fixing this presently.

LOGGER.debug(f'Sample index size: {sizeof(indices) / 1024**2} MB')

# TODO(future) :: Move above to the scalar compute block
LOGGER.debug('scaler check zero scale')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could I ask you to pull and merge the most recent changes from main? This replication of scaler saving was removed in a recent PR.

MULTIMET_MINIMUM_LEAD_TIME = 1

# Caravan Multimet products that are available in the GCS zarr store
KNOWN_GCS_PRODUCTS = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the basic strategy here of generalizing the removal of underscores in variable names, but only to known products. Is it necessary to keep a list of known products? It looks like most of these are not used for anything functional, and instead the PRODUCT_ALIASES is doing the heavy lifting. Is there a need to keep this known products list?


# If it's a known product, use canonical name
# (e.g., user says 'era5land', but dataset is 'ERA5_LAND')
for known in KNOWN_GCS_PRODUCTS:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this loop doing something after the mapping call in line 1111?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants