Fix multimet dataloader to support dynamic variable product names with underscores#238
Fix multimet dataloader to support dynamic variable product names with underscores#238joshsturtevant wants to merge 5 commits intogoogle-research:mainfrom
Conversation
…duct name normalization/robustness
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
grey-nearing
left a comment
There was a problem hiding this comment.
Hi Josh,
Thank you SO MUCH for being the first public contributor to GoogleHydrology!!! And thank you for your patience while we worked on the backend to make sure we had a system in place for handling public PRs!
This is a great fix for CHIRPS. Would you mind taking a look at my questions? It's possible that I've missed something important.
Thanks,
Grey
| def _get_products_and_bands_from_feature_strings( | ||
| features: Iterable[str], | ||
| ) -> dict[str, list[str]]: | ||
| def _get_products_and_bands_from_feature_dict(feature_dict): |
There was a problem hiding this comment.
The config input properties are not always dicts. They can be either dicts or lists, depending on whether you want to use a model with vs. without feature groups. This is why the cfg.hindcast_inputs and cfg.forecast_inputs are flattened in lines 182 and 183.
However, this caused me to notice that the typehints in ~/googlehydrology/utils/config.py are wrong -- the typehints only show lists, not dicts, and it should be a Union. I'm fixing this presently.
| LOGGER.debug(f'Sample index size: {sizeof(indices) / 1024**2} MB') | ||
|
|
||
| # TODO(future) :: Move above to the scalar compute block | ||
| LOGGER.debug('scaler check zero scale') |
There was a problem hiding this comment.
Could I ask you to pull and merge the most recent changes from main? This replication of scaler saving was removed in a recent PR.
| MULTIMET_MINIMUM_LEAD_TIME = 1 | ||
|
|
||
| # Caravan Multimet products that are available in the GCS zarr store | ||
| KNOWN_GCS_PRODUCTS = { |
There was a problem hiding this comment.
I like the basic strategy here of generalizing the removal of underscores in variable names, but only to known products. Is it necessary to keep a list of known products? It looks like most of these are not used for anything functional, and instead the PRODUCT_ALIASES is doing the heavy lifting. Is there a need to keep this known products list?
|
|
||
| # If it's a known product, use canonical name | ||
| # (e.g., user says 'era5land', but dataset is 'ERA5_LAND') | ||
| for known in KNOWN_GCS_PRODUCTS: |
There was a problem hiding this comment.
Is this loop doing something after the mapping call in line 1111?
Summary
CHIRPS_GEFSwould not parse correctlymultimet.pymore robust by normalizing product keys and handling known exceptions (beyond justERA5_LAND)Key Changes
chirps_gefs,chirpsgefs,CHIRPS_GEFS, andCHIRPSGEFSMotivation
Testing