Skip to content

Conversation

@chengzhuzhang
Copy link
Contributor

@chengzhuzhang chengzhuzhang commented May 2, 2025

Description

This PR is to address the regridding difference identified here:#964
lat_lon_HadISST_CL-SST-metrics-diff.ipynb:

  • Shows differences in SST (Sea Surface Temperature) metrics (difference only shown in metrics table)
  • v2 to v3 reference mean changes (18.698°C vs 18.777°C) due to regridding differences
  • In v3, explicit masks are needed to prevent missing data from affecting results
  • Provides possible solutions for handling masks with the xcdat regridder

This PR explicitly create a mask variable before regriding to prevent missing data affecting regridding results.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have added tests that prove my fix is effective or that my feature works
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@chengzhuzhang
Copy link
Contributor Author

Re-run a full set run. The SST metrics are now closer to what we have in v2:
New:
Variables Unit Test_mean Ref._mean Mean_Bias Test_STD Ref._STD RMSE Correlation
SST global HadISST_CL degC 20.256 18.674 1.583 8.178 9.558 1.054 0.992
SST global HadISST_PI degC 20.256 18.964 1.292 8.178 8.952 1.232 0.991
SST global HadISST_PD degC 20.256 18.787 1.469 8.178 9.561 1.082 0.992

V2:
Variables Unit Test_mean Ref._mean Mean_Bias Test_STD Ref._STD RMSE Correlation
SST global HadISST_CL degC 20.256 18.698 1.559 8.178 9.536 1.054 0.992
SST global HadISST_PI degC 20.256 18.978 1.279 8.178 8.933 1.232 0.991
SST global HadISST_PD degC 20.256 18.807 1.45 8.178 9.543 1.082 0.992

V3:
Variables Unit Test_mean Ref._mean Mean_Bias Test_STD Ref._STD RMSE Correlation
SST global HadISST_CL degC 20.256 18.777 1.48 8.178 9.464 1.055 0.992
SST global HadISST_PI degC 20.256 19.058 1.199 8.178 8.853 1.233 0.991
SST global HadISST_PD degC 20.256 18.885 1.372 8.178 9.47 1.082 0.992

@chengzhuzhang chengzhuzhang marked this pull request as ready for review May 8, 2025 20:45
@chengzhuzhang
Copy link
Contributor Author

@tomvothecoder I think this PR is in good shape, please review when get a chance, thanks!

Copy link
Collaborator

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @chengzhuzhang, I left some questions and suggestions. This PR is coming together and almost done. Thanks.

Comment on lines 436 to 437

if tool == "regrid2" or not any(dim in dims_to_check for dim in var_dims):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if tool == "regrid2" or not any(dim in dims_to_check for dim in var_dims):
# Create a mask for xESMF if the variable is 2D (does not include specified dimensions)
if tool == "xesmf" and not any(dim in dims_to_check for dim in var_dims):

Are we supposed to add the mask variable whenever the tool is xesmf, rather than regrid2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the logic is right without changing, assumming regrid2 can take 3d var as input for regriding?

- Add unit tests for `_add_mask`
Copy link
Collaborator

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Jill, I pushed some final suggestions. I think this PR is good to go if you agree with the changes.

Let me know what you think.

Comment on lines 410 to 415
def _add_mask(ds: xr.Dataset, var_key: str, tool: str) -> xr.Dataset:
"""Add a mask variable to the dataset.
This function creates a mask variable for the specified variable key in
the dataset if the tool is "regrid2" (which supports 3D variables) or if
the variable is 2D with only spatial dimensions (e.g., "X" and "Y").
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed _add_mask_for_2d to _add_mask since it supports 3D variables with regrid2.

I also added more descriptions to the docstring and removed dims_to_check parameter.

Comment on lines 431 to 452
ds_new = ds.copy()
var = ds_new[var_key]
var_dims = var.dims

if tool == "regrid2" or not any(dim in dims_to_check for dim in var_dims):
spatial_dims = {"x", "y", "lon", "lat", "longitude", "latitude"}
is_spatial_var = len(var_dims) == 2 and all(
str(dim).lower() in spatial_dims for dim in var_dims
)

if tool == "regrid2" or is_spatial_var:
logger.debug(f"Creating mask for {var_key} with dimensions {var_dims}")
ds["mask"] = xr.where(~np.isnan(ds[var_key]), 1, 0)

if "mask" in ds_new:
logger.warning("Overwriting existing 'mask' variable in the dataset.")

ds_new["mask"] = xr.where(~np.isnan(var), 1, 0)
else:
logger.debug(
f"Skipping mask creation for variable {var_key} with dimensions {var_dims}"
)

return ds
return ds_new
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the logic of _add_mask to make it easier to understand. Now, it checks if the variable is a 2D spatial variable (if it is, it allows regridding using xESMF).

This is simpler than the old approach, which checked whether any of the variable’s dimensions weren't in a list of vertical and time dimensions. That method was harder to follow and skipped regridding with xESMF if any non-spatial dimension was present.

Copy link
Collaborator

@tomvothecoder tomvothecoder May 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it raises a logger.warning if an existing mask variable will be overwritten

np.testing.assert_array_equal(result["mask"].values, np.array([[1, 0], [1, 1]]))

# Ensure a warning is logged
assert "Overwriting existing 'mask' variable in the dataset" in caplog.text
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added unit tests to cover various cases with _add_mask().

It is easier to test _add_mask() directly because align_grids_to_lower_res() removes the mask variable by the end of regridding, which makes it harder to detect if a mask was used unless we write more comprehensive test to check for np.nan.

var_dims = var.dims

spatial_dims = {"x", "y", "lon", "lat", "longitude", "latitude"}
is_spatial_var = len(var_dims) == 2 and all(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel this is a little risky than the old logic, for cases when excessive dims present? I'm not sure if those were completely dropped at some point. I can test a full run again to see if all variables are produced.

Copy link
Collaborator

@tomvothecoder tomvothecoder May 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally understood the logic for xESMF regridding as: ignore any dimensions that are Z (vertical) or T (time), which effectively means we expect to regrid a 2D variable with just X and Y (horizontal spatial dimensions).

As far as I can recall, I haven't encountered any variables with more than these four dimensions (X, Y, Z, T) but there might be some that I don't remember. If such variables do exist and we want to proceed with regridding them anyway, this logic would end up ignoring those variables entirely which is problematic.

Also, based on what I'm reading, xESMF does support regridding variables with more than two dimensions by broadcasting. It applies the 2D regridding independently across each slice along the non-spatial dimensions. Do we want to continue regridding variables that have >2 dimensions minus Z and T? If so, then we'll need to revert the logic back to the old one.

Copy link
Contributor Author

@chengzhuzhang chengzhuzhang May 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are cosp related dimensions, but not sure if those are dropped in the course of data ingection

xESMF does support 3Dvar as our original implmentation, but doesn't support 3d masks, as I understand. Here is the code line that throw errors when passing in 3d mask: https://github.com/pangeo-data/xESMF/blob/30e3ecb094d39a0c6f1edb22b60d8d608a19f7d9/xesmf/backend.py#L138. Before xESMF relaxing this constraint, we can only have more accurate regridding for 2d vars, at this point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do like the refactored and refined code from your change. Could you only revert the logic that handles exclusion of dims? Thanks!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sounds good!

Comment on lines 437 to 440
dims_to_skip = {"lev", "plev", "z", "time", "t"}
has_skipped_dims = any(str(dim).lower() in dims_to_skip for dim in var_dims)

if tool == "regrid2" or is_spatial_var:
if tool == "regrid2" or not has_skipped_dims:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert the logic for checking dimensions back to the old logic in 7bd8f29 (#974)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@chengzhuzhang chengzhuzhang changed the title Creates a mask for datasets before regridding Creates a mask for 2d datasets before regridding May 14, 2025
@chengzhuzhang chengzhuzhang merged commit 93c067f into main May 14, 2025
6 checks passed
@chengzhuzhang chengzhuzhang deleted the fix_mask_in_regridding branch May 14, 2025 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants