Skip to content

Conversation

@forsyth2
Copy link
Collaborator

Summary

Objectives:

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • a new feature: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

Small Change

  • To merge, I will use "Squash and merge". That is, this change should be a single commit.
  • Logic: I have visually inspected the entire pull request myself.
  • Pre-commit checks: All the pre-commits checks have passed.

@forsyth2 forsyth2 self-assigned this Sep 29, 2025
@forsyth2 forsyth2 added the semver: bug Bug fix (will increment patch version) label Sep 29, 2025
@forsyth2
Copy link
Collaborator Author

forsyth2 commented Sep 29, 2025

I added these debugging lines and then ran zppy. I got:

RuntimeError: Encountered errors: ["KeyError: var=LAISHA does not have rgn dimension, data_array=<xarray.DataArray 'LAISHA' (time: 10, lndgrid: 21600)> Size: 2MB\n\
dask.array<truediv, shape=(10, 21600), dtype=float64, chunksize=(1, 21600), chunktype=numpy.ndarray>\nCoordinates:\n  * time     (time) object 80B 1980-01-01 00:00\
:00 ... 1989-01-01 00:00:00\nDimensions without coordinates: lndgrid\nAttributes:\n    long_name:     shaded projected leaf area index\n    units:         1\n    c\
ell_methods:  time: mean\n    operation:     temporal_avg\n    mode:          group_average\n    freq:          year\n    weighted:      True\n    min_weight:    0\
.0", "KeyError: var=LAISUN does not have rgn dimension, data_array=<xarray.DataArray 'LAISUN' (time: 10, lndgrid: 21600)> Size: 2MB\ndask.array<truediv, shape=(10,\
 21600), dtype=float64, chunksize=(1, 21600), chunktype=numpy.ndarray>\nCoordinates:\n  * time     (time) object 80B 1980-01-01 00:00:00 ... 1989-01-01 00:00:00\nD\
imensions without coordinates: lndgrid\nAttributes:\n    long_name:     sunlit projected leaf area index\n    units:         1\n    cell_methods:  time: mean\n    \
operation:     temporal_avg\n    mode:          group_average\n    freq:          year\n    weighted:      True\n    min_weight:    0.0"]

I then ran zppy with all the same commits except for the merging of E3SM-Project/zppy#717 and it worked fine. That is, that PR appears to have introduced this issue.

@forsyth2
Copy link
Collaborator Author

@czender Is there anything in the latest NCO (i.e., the one used in E3SM-Project/zppy#717) that would cause rgn to be missing from data? Interestingly, this only happens on my v2.LR.historical_0201 test and not on my 3.LR.historical_0051 test.

@chengzhuzhang FYI ^^^

@chengzhuzhang
Copy link
Collaborator

chengzhuzhang commented Sep 29, 2025

@forsyth2 I'm looking at the global mean time-series file: /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v2_output/weekly_test_20250926/v2.LR.historical_0201/post/lnd/glb/ts/monthly/5yr/LAISHA_198001_198412.nc. I don't think global time series operation was applied successfully:

float LAISHA(time, lndgrid) ;
	LAISHA:long_name = "shaded projected leaf area index" ;
	LAISHA:units = "1" ;
	LAISHA:cell_methods = "time: mean" ;
	LAISHA:_FillValue = 1.e+36f ;
	LAISHA:missing_value = 1.e+36f ;

I'm also attaching the log file for ncclimo for troubleshooting:

Climatology operations invoked with command:
/home/ac.zender/bin_chrysalis/ncclimo --npo -c v2.LR.historical_0201 -v LAISHA,LAISUN --split --yr_srt=1980 --yr_end=1984 --ypf=5 -o output --rgn_avg --area=area --prc_typ=elm
Started climatology splitting at Fri Sep 26 16:42:14 CDT 2025
Running climatology script ncclimo from directory /gpfs/fs1/home/ac.zender/bin_chrysalis
NCO binaries version 5.3.5 from directory /gpfs/fs1/home/ac.zender/bin_chrysalis
Parallelism mode for files = background
Timeseries will be created for each of 2 variables
Background parallelism processing variables in var_nbr/job_nbr = 2/2 = 1 sequential batches each concurrently processing job_nbr = 2 jobs (1 per variable), then remaining 0 jobs/variables simultaneously
Will split data for each variable into one timeseries of length 5 years and 0 months
Splitting climatology from list of 60 raw input files piped to stdin
Each input file assumed to contain statistics for one month
Hemispherically and globally averaged timeseries files to be saved to directory output
Split files will not be regridded
Fri Sep 26 16:42:18 CDT 2025: Generated output/LAISHA_198001_198412.nc
Fri Sep 26 16:42:18 CDT 2025: Generated output/LAISUN_198001_198412.nc
ncclimo: WARNING Splitter is skipping regional statistics for variable LAISHA which is not a purely horizontal timeseries
ncclimo: WARNING Splitter is skipping regional statistics for variable LAISUN which is not a purely horizontal timeseries
Quick plots of last timeseries segment of last variable split:
ncvis output/LAISUN_198001_198412.nc &
Completed 5-year climatology operations for dataset with caseid = v2.LR.historical_0201 at Fri Sep 26 16:42:18 CDT 2025
Elapsed time 0m4s
==============================================
Elapsed time: 8 seconds
==============================================

@czender
Copy link

czender commented Sep 29, 2025

I'm looking into this now...

@czender
Copy link

czender commented Sep 29, 2025

@forsyth2 and @chengzhuzhang Thanks for reporting this. I can reproduce the error on v2 ELM data. There's apparently a bug in the new automated 3D variable rejector that only manifests on bigrid simulations, i.e., (time,lndgrid) instead of (time,lat,lon). I will try to fix this by tomorrow and I'll ping when you can re-test on Chrysalis.

@forsyth2
Copy link
Collaborator Author

Great, thanks @czender!

@czender
Copy link

czender commented Sep 29, 2025

@forsyth2 Here's a demonstration that the 3D checker works for EAMv2,v3 and ELMv3 but fails for ELMv2:

zender@maluhia:~$ ncks --is_hrz LAISUN ~/data/bm/elmv3_r05l15.nc
Yes
zender@maluhia:~$ ncks --is_hrz LAISUN ~/data/bm/elmv2_ne30pg2l15.nc
No
zender@maluhia:~$ ncks --is_hrz FSNT /Users/zender/data/bm/eamv2_ne30pg2l72.nc 
Yes
zender@maluhia:~$ ncks --is_hrz FSNT /Users/zender/data/bm/eamv3_ne30pg2l80.nc 
Yes

It also works on MPAS...
Ahah: I forgot to count the dimensionlndgrid as a horizontal dimension.
That's now fixed and the offline test passes:

zender@maluhia:~$ ncks --is_hrz LAISUN ~/data/bm/elmv2_ne30pg2l15.nc
Yes

Code is now rebuilt on Chrysalis, Perlmutter, and Andes.
OK to re-test with zppy using ncclimo --npo.
LMK if there are still problems.

@forsyth2
Copy link
Collaborator Author

@chengzhuzhang It sounds like @czender already has a fix, but if you still want to take a look at the logs:

Here's the relevant directories:

Testing on main Testing on main minus E3SM-Project/zppy#717
v2 test /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v2_output/rgn_error_test_20250929/v2.LR.historical_0201/post/scripts (error) /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v2_output/rgn_error_no717_test_20250929/v2.LR.historical_0201/post/scripts (no error)
v3 test /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v3_output/rgn_error_test_20250929/v3.LR.historical_0051/post/scripts (no error) Didn't run this

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Sep 29, 2025

Code is now rebuilt on Chrysalis, Perlmutter, and Andes.

Thanks @czender! It seems promising; I was just testing a PR (E3SM-Project/zppy#722) rebased off main and the error was gone in the v2 run. I'll know for certain when I run the full integration test suite on main again.

@forsyth2
Copy link
Collaborator Author

I was just testing a PR (E3SM-Project/zppy#722) rebased off main and the error was gone in the v2 run.

Never mind on that; that turns out to be because of different error handling in #26. (Although, that does mean this error-handling PR won't have to merge).

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Sep 30, 2025

Testing process
cd ~/ez/zppy-interfaces
git status
# Check for uncommitted changes
git checkout fix-gts-rgn
# We won't actually merge this branch. 
# But let's keep the debugging lines since 
# we don't have the error handling from the global time series split PR yet.
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n zi-rgn-error-20250929-try2
conda activate zi-rgn-error-20250929-try2
pre-commit run --all-files
python -m pip install .
pytest tests/unit/global_time_series/test_*.py
# 10 passed in 25.80s

cd ~/ez/zppy
git status
# Check for uncommitted changes
git fetch upstream main
git checkout -b fix-gts-rgn-try2 upstream/main
rm -rf build
conda clean --all --y # 16:10
conda env create -f conda/dev.yml -n zppy-rgn-error-20250929-try2
conda activate zppy-rgn-error-20250929-try2
pytest tests/test_*.py
# 25 passed in 0.53s

# Edit tests/integration/utils.py:
# TEST_SPECIFICS: Dict[str, Any] = {
#     "diags_environment_commands": "source /gpfs/fs1/home/ac.forsyth2/miniforge3/etc/profile.d/conda.sh; conda activate e3sm-diags-main-20250926",
#     "global_time_series_environment_commands": "source /gpfs/fs1/home/ac.forsyth2/miniforge3/etc/profile.d/conda.sh; conda activate zi-rgn-error-20250929-try2",
#     "cfgs_to_run": [
#         "weekly_comprehensive_v2",
#         "weekly_comprehensive_v3",
#     ],
#     "tasks_to_run": ["global_time_series"],
#     "unique_id": "rgn_error_test_20250929_try2",
# }
python tests/integration/utils.py
pre-commit run --all-files
python -m pip install .
zppy -c tests/integration/generated/test_weekly_comprehensive_v2_chrysalis.cfg
# ~40 min to run

cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v2_output/rgn_error_test_20250929_try2/v2.LR.historical_0201/post/scripts
grep -v "OK" *status
# No errors

cd ~/ez/zppy
pytest tests/integration/test_images.py
# pytest tests/integration/test_images.py
cat test_images_summary.md

Diff subdir is where to find the lists of missing/mismatched images, the image diff grid, and the individual diffs.
Note image diff grids can not yet be constructed automatically.

Test name Total images Correct images Missing images Mismatched images Diff subdir
comprehensive_v2_global_time_series 6 6 0 0 /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_weekly_comprehensive_v2_www/rgn_error_test_20250929_try2/v2.LR.historical_0201/image_check_failures_comprehensive_v2/global_time_series
comprehensive_v3_global_time_series 15 0 15 0 /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_weekly_comprehensive_v3_www/rgn_error_test_20250929_try2/v3.LR.historical_0051/image_check_failures_comprehensive_v3/global_time_series

v3 just failed because I only ran v2, so it works fine!

@czender Thanks, this works in my short v2-only test!

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Sep 30, 2025

@czender The tests now pass on main as well (E3SM-Project/zppy#634 (comment)), so we're good here!

I'm going to close this PR since #26 will add something similar.

@forsyth2 forsyth2 closed this Sep 30, 2025
@forsyth2 forsyth2 deleted the fix-gts-rgn branch October 20, 2025 17:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

semver: bug Bug fix (will increment patch version)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants