@forsyth2
Collaborator

@forsyth2 forsyth2 commented May 21, 2025

Summary

Objectives:

  • Remove the vars_exclude parameter: label it as deprecated and add it to the deprecated parameters test (i.e., test that it has no effect)
  • Use latest NCO (directly use @czender's NCO until that version is included in Unified, or a dev environment of NCO we could use)

Issue resolution:

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • a new feature: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

Please fill out either the "Small Change" or "Big Change" section (the latter includes the numbered subsections), and delete the other.

Small Change

  • To merge, I will use "Squash and merge". That is, this change should be a single commit.
  • Logic: I have visually inspected the entire pull request myself.
  • Pre-commit checks: All the pre-commit checks have passed.

@forsyth2 forsyth2 self-assigned this May 21, 2025
@forsyth2 forsyth2 added the semver: small improvement Small improvement (will increment patch version) label May 21, 2025
@forsyth2
Collaborator Author

python tests/integration/utils.py
pip install .
# This cfg tests vars="" on the ts lnd task.
zppy -c tests/integration/generated/test_min_case_global_time_series_viewers_original_atm_plus_land_chrysalis.cfg 
cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_global_time_series_viewers_original_atm_plus_land_output/test_704/v3.LR.historical_0051/post/scripts
grep -v "OK" *status
# The ts lnd tasks hit the time limit

The new NCO seems to be hitting the time limit; will need to check if that happens on main.
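A minimal sketch of the status check above, assuming each task writes a one-line `*.status` file (as the generated scripts in this thread do). `list_unfinished` is a hypothetical helper, not part of zppy; it reports every status file whose content is anything other than `OK`.

```bash
#!/usr/bin/env bash
# Sketch only: list status files whose content is not exactly "OK".
# list_unfinished is a hypothetical name, not a zppy command.
list_unfinished() {
  local dir="${1:-.}"
  local f content
  for f in "${dir}"/*.status; do
    # Skip the literal pattern when the directory has no status files.
    [[ -e "${f}" ]] || continue
    content="$(cat "${f}")"
    if [[ "${content}" != "OK" ]]; then
      printf '%s: %s\n' "$(basename "${f}")" "${content}"
    fi
  done
}
```

Calling `list_unfinished /path/to/post/scripts` prints one `name: status` line per task that has not finished cleanly, which is roughly what `grep -v "OK" *status` shows.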

@czender

czender commented May 23, 2025

Please keep me posted. Nothing I changed in NCO should be taking longer. If it turns out that NCO is hanging or taking longer on the ts lnd task, then it would be helpful to receive a sample command that demonstrates this.

@forsyth2
Collaborator Author

I ran the same block as above, but on main's code. That completed in just a few minutes.

So what's different here? Not much -- the relevant diff is pretty minimal:
[Screenshot of the relevant diff, 2025-05-28 12:45:16; image not reproduced]

What about how it's called? In both cases, I'm running test_min_case_global_time_series_viewers_original_atm_plus_land_chrysalis.cfg , which has the ts task set up as follows:

[ts]
active = True
walltime = "00:30:00"
years = "1985:1995:5",

  [[ atm_monthly_glb ]]
  # Note global average won't work for 3D variables.
  frequency = "monthly"
  input_files = "eam.h0"
  input_subdir = "archive/atm/hist"
  mapping_file = "glb"

  [[ lnd_monthly_glb ]]
  frequency = "monthly"
  input_files = "elm.h0"
  input_subdir = "archive/lnd/hist"
  mapping_file = "glb"
  vars = "" # Get all available variables

That is, we're asking for all available land variables. On main, we do that via an exclude-list and on this branch, we do it via the new version of NCO.
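As a rough illustration of the branch's approach, the sketch below passes `-v` to ncclimo only when a variable list was requested, so that an empty `vars` leaves variable selection to NCO. `build_ncclimo_args` is a hypothetical helper written for this example; the flags are copied from the ncclimo commands that appear in this thread, and the real template may be structured differently.

```bash
#!/usr/bin/env bash
# Sketch only: build the ncclimo argument list, adding -v only when
# a variable list was requested. build_ncclimo_args is hypothetical.
build_ncclimo_args() {
  local vars="$1"
  local -a args=(-c v3.LR.historical_0051)
  if [[ -n "${vars}" ]]; then
    args+=(-v "${vars}")
  fi
  args+=(--split --yr_srt=1985 --yr_end=1989 --ypf=5
         -o output --rgn_avg --area=area --prc_typ=elm)
  # Print one argument per line so the result is easy to inspect.
  printf '%s\n' "${args[@]}"
}

build_ncclimo_args ""                    # no -v: NCO chooses the variables
build_ncclimo_args "TBOT,QBOT,W_SCALAR"  # explicit list is passed through
```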

@forsyth2
Collaborator Author

Re-running on this branch and it does seem to be taking longer, so unfortunately I don't think it was an anomaly the other day.

@czender

czender commented May 28, 2025

Just to confirm, have you verified that using ncclimo --npo runs and completes as expected for a limited number of variables? For example, for -v TBOT,QBOT,W_SCALAR? If that's the case then I will check whether the new feature also hangs for me when requesting all variables from a default ELM dataset (I've only checked thus far on ELM datasets that contain about 10 2D vars).

@forsyth2
Collaborator Author

@czender yeah, that's a good idea to check, thanks. I'm not going to have time to check today, but hopefully tomorrow.

@forsyth2
Collaborator Author

@czender I just ran with vars = "TBOT,QBOT,W_SCALAR". Now I get a non-zero exit code pretty quickly.

> cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_global_time_series_viewers_original_atm_plus_land_output/test_704_20250529_minimal_vars/v3.LR.historical_0051/post/scripts

> grep -v "OK" *status
global_time_series_1985-1995.status:WAITING 752785
ts_lnd_monthly_glb_1985-1989-0005.status:ERROR (2)
ts_lnd_monthly_glb_1990-1994-0005.status:ERROR (2)

> cat ts_lnd_monthly_glb_1985-1989-0005.o752783 
Climatology operations invoked with command:
/home/ac.zender/bin_chrysalis/ncclimo --npo -c v3.LR.historical_0051 -v TBOT,QBOT,W_SCALAR --split --yr_srt=1985 --yr_end=1989 --ypf=5 -o output --rgn_avg --area=area --prc_typ=elm
Started climatology splitting at Thu May 29 14:02:36 CDT 2025
Running climatology script ncclimo from directory /gpfs/fs1/home/ac.zender/bin_chrysalis
NCO binaries version 5.3.4-alpha05 from directory /gpfs/fs1/home/ac.zender/bin_chrysalis
Parallelism mode = background
Timeseries will be created for each of 3 variables
Background parallelism processing variables in var_nbr/job_nbr = 3/3 = 1 sequential batches each concurrently processing job_nbr = 3 jobs (1 per variable), then remaining 0 jobs/variables simultaneously
Will split data for each variable into one timeseries of length 5 years and 0 months
Splitting climatology from list of 60 raw input files piped to stdin
Each input file assumed to contain statistics for one month
Hemispherically and globally averaged timeseries files to be saved to directory output
Split files will not be regridded
Thu May 29 14:02:45 CDT 2025: Generated output/TBOT_198501_198912.nc
Thu May 29 14:02:45 CDT 2025: Generated output/QBOT_198501_198912.nc
Thu May 29 14:02:45 CDT 2025: Generated output/W_SCALAR_198501_198912.nc
ncap2: ERROR 1 dimensions of W_SCALAR_tmp belong to template Internally generated template but 1 dimensions do not
Thu May 29 14:02:49 CDT 2025: Global and regional statistics output/TBOT_198501_198912.nc
Thu May 29 14:02:49 CDT 2025: Global and regional statistics output/QBOT_198501_198912.nc
ncclimo: ERROR Failed in global and regional statistics. cmd_stt[2] failed. Debug this:
 OMP_PROC_BIND=false ncap2 -h -O -s '*rgn_nbr=3;defdim("rgn",rgn_nbr);*W_SCALAR_tmp=0.0f*W_SCALAR.avg($lat,$lon);*W_SCALAR_rgn[time,rgn]=W_SCALAR_tmp;W_SCALAR_rgn@coordinates="region_name";*lat_area=lat+0.0*area;*msk_sth=0*lat_area.int();delete_miss(msk_sth);*msk_nrt=0*lat_area.int();delete_miss(msk_nrt);*idx_glb=0;*idx_nrt=1;*idx_sth=2;*rgn_len=19;defdim("rgn_len",rgn_len);region_name[rgn,rgn_len]=" ";region_name(idx_glb,0:5)="Global";region_name(idx_nrt,:)="Northern Hemisphere";region_name(idx_sth,:)="Southern Hemisphere";if(0) region_name@long_name="W_SCALAR timeseries array contains area-weighted sums over these regions"; else region_name@long_name="W_SCALAR timeseries array contains area-weighted averages over these regions";where(lat_area < 0.0) msk_sth=1; elsewhere msk_nrt=1;W_SCALAR_rgn(:,idx_glb)=((W_SCALAR*area*landfrac).avg($lat,$lon)/(area*landfrac).avg($lat,$lon)).float();W_SCALAR_rgn(:,idx_nrt)=((W_SCALAR*area*landfrac*msk_nrt).avg($lat,$lon)/(area*landfrac*msk_nrt).avg($lat,$lon)).float();W_SCALAR_rgn(:,idx_sth)=((W_SCALAR*area*landfrac*msk_sth).avg($lat,$lon)/(area*landfrac*msk_sth).avg($lat,$lon)).float();if(0){W_SCALAR_rgn(:,idx_glb)=W_SCALAR_rgn(:,idx_glb)*(area*landfrac).total($lat,$lon).float()*1.0f;W_SCALAR_rgn(:,idx_nrt)=W_SCALAR_rgn(:,idx_nrt)*(area*landfrac*msk_nrt).total($lat,$lon).float()*1.0f;W_SCALAR_rgn(:,idx_sth)=W_SCALAR_rgn(:,idx_sth)*(area*landfrac*msk_sth).total($lat,$lon).float()*1.0f;}W_SCALAR=W_SCALAR_rgn;push(&W_SCALAR@cell_methods," area: mean");if(exists(time_bnds)) time_bnds=time_bnds;if(exists(time_bounds)) time_bounds=time_bounds;valid_area_per_gridcell=area*landfrac;' output/W_SCALAR_198501_198912.nc output/W_SCALAR_198501_198912.nc

@czender

czender commented May 29, 2025

Thanks for running the short test. I'll look at this output tomorrow and see if I can reproduce and then fix this behavior.

@czender

czender commented May 29, 2025

Please send me the full absolute path to the land input files used in this test.

@forsyth2
Collaborator Author

@czender

input:

> ls /lcrc/group/e3sm2/ac.wlin//E3SMv3/v3.LR.historical_0051/archive/lnd/hist | head
v3.LR.historical_0051.elm.h0.1850-01.nc
v3.LR.historical_0051.elm.h0.1850-02.nc
v3.LR.historical_0051.elm.h0.1850-03.nc
v3.LR.historical_0051.elm.h0.1850-04.nc
v3.LR.historical_0051.elm.h0.1850-05.nc
v3.LR.historical_0051.elm.h0.1850-06.nc
v3.LR.historical_0051.elm.h0.1850-07.nc
v3.LR.historical_0051.elm.h0.1850-08.nc
v3.LR.historical_0051.elm.h0.1850-09.nc
v3.LR.historical_0051.elm.h0.1850-10.nc

relevant excerpt of /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_global_time_series_viewers_original_atm_plus_land_output/test_704_20250529_minimal_vars/v3.LR.historical_0051/post/scripts/ts_lnd_monthly_glb_1985-1989-0005.bash:

# Create symbolic links to input files
input=/lcrc/group/e3sm2/ac.wlin//E3SMv3/v3.LR.historical_0051/archive/lnd/hist
for (( year=1985; year<=1989; year++ ))
do
  YYYY=`printf "%04d" ${year}`
  for file in ${input}/v3.LR.historical_0051.elm.h0.${YYYY}-*.nc
  do
    ln -s ${file} .
  done
done

vars=TBOT,QBOT,W_SCALAR
# https://unix.stackexchange.com/questions/237297/the-fastest-way-to-remove-a-string-in-a-variable
# https://stackoverflow.com/questions/26457052/remove-a-substring-from-a-bash-variable
# Remove U, since it is a 3D variable and thus will not work with rgn_avg
vars=${vars//,U}

ls v3.LR.historical_0051.elm.h0.????-*.nc > input.txt
if grep -q "*" input.txt; then
  cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_global_time_series_viewers_original_atm_plus_land_output/test_704_20250529_minimal_vars/v3.LR.historical_0051/post/scripts
  echo 'Missing input files'
  echo 'ERROR (1)' > ts_lnd_monthly_glb_1985-1989-0005.status
  exit 1
fi
# Generate time series files
# If the user-defined parameter "vars" is "", then ${vars}, defined above, will be too.
cat input.txt | /home/ac.zender/bin_chrysalis/ncclimo --npo \
-c v3.LR.historical_0051 \
-v ${vars} \
--split \
--yr_srt=1985 \
--yr_end=1989 \
--ypf=5 \
-o output \
--rgn_avg \
--area=area \
--prc_typ=elm

output:

> cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_global_time_series_viewers_original_atm_plus_land_output/test_704_20250529_minimal_vars/v3.LR.historical_0051/post/scripts/

> grep -v "OK" *status
global_time_series_1985-1995.status:WAITING 752785
ts_lnd_monthly_glb_1985-1989-0005.status:ERROR (2)
ts_lnd_monthly_glb_1990-1994-0005.status:ERROR (2)

> cat ts_lnd_monthly_glb_1985-1989-0005.o752783 
Climatology operations invoked with command:
/home/ac.zender/bin_chrysalis/ncclimo --npo -c v3.LR.historical_0051 -v TBOT,QBOT,W_SCALAR --split --yr_srt=1985 --yr_end=1989 --ypf=5 -o output --rgn_avg --area=area --prc_typ=elm
Started climatology splitting at Thu May 29 14:02:36 CDT 2025
Running climatology script ncclimo from directory /gpfs/fs1/home/ac.zender/bin_chrysalis
NCO binaries version 5.3.4-alpha05 from directory /gpfs/fs1/home/ac.zender/bin_chrysalis
Parallelism mode = background
Timeseries will be created for each of 3 variables
Background parallelism processing variables in var_nbr/job_nbr = 3/3 = 1 sequential batches each concurrently processing job_nbr = 3 jobs (1 per variable), then remaining 0 jobs/variables simultaneously
Will split data for each variable into one timeseries of length 5 years and 0 months
Splitting climatology from list of 60 raw input files piped to stdin
Each input file assumed to contain statistics for one month
Hemispherically and globally averaged timeseries files to be saved to directory output
Split files will not be regridded
Thu May 29 14:02:45 CDT 2025: Generated output/TBOT_198501_198912.nc
Thu May 29 14:02:45 CDT 2025: Generated output/QBOT_198501_198912.nc
Thu May 29 14:02:45 CDT 2025: Generated output/W_SCALAR_198501_198912.nc
ncap2: ERROR 1 dimensions of W_SCALAR_tmp belong to template Internally generated template but 1 dimensions do not
Thu May 29 14:02:49 CDT 2025: Global and regional statistics output/TBOT_198501_198912.nc
Thu May 29 14:02:49 CDT 2025: Global and regional statistics output/QBOT_198501_198912.nc
ncclimo: ERROR Failed in global and regional statistics. cmd_stt[2] failed. Debug this:
 OMP_PROC_BIND=false ncap2 -h -O -s '*rgn_nbr=3;defdim("rgn",rgn_nbr);*W_SCALAR_tmp=0.0f*W_SCALAR.avg($lat,$lon);*W_SCALAR_rgn[time,rgn]=W_SCALAR_tmp;W_SCALAR_rgn@coordinates="region_name";*lat_area=lat+0.0*area;*msk_sth=0*lat_area.int();delete_miss(msk_sth);*msk_nrt=0*lat_area.int();delete_miss(msk_nrt);*idx_glb=0;*idx_nrt=1;*idx_sth=2;*rgn_len=19;defdim("rgn_len",rgn_len);region_name[rgn,rgn_len]=" ";region_name(idx_glb,0:5)="Global";region_name(idx_nrt,:)="Northern Hemisphere";region_name(idx_sth,:)="Southern Hemisphere";if(0) region_name@long_name="W_SCALAR timeseries array contains area-weighted sums over these regions"; else region_name@long_name="W_SCALAR timeseries array contains area-weighted averages over these regions";where(lat_area < 0.0) msk_sth=1; elsewhere msk_nrt=1;W_SCALAR_rgn(:,idx_glb)=((W_SCALAR*area*landfrac).avg($lat,$lon)/(area*landfrac).avg($lat,$lon)).float();W_SCALAR_rgn(:,idx_nrt)=((W_SCALAR*area*landfrac*msk_nrt).avg($lat,$lon)/(area*landfrac*msk_nrt).avg($lat,$lon)).float();W_SCALAR_rgn(:,idx_sth)=((W_SCALAR*area*landfrac*msk_sth).avg($lat,$lon)/(area*landfrac*msk_sth).avg($lat,$lon)).float();if(0){W_SCALAR_rgn(:,idx_glb)=W_SCALAR_rgn(:,idx_glb)*(area*landfrac).total($lat,$lon).float()*1.0f;W_SCALAR_rgn(:,idx_nrt)=W_SCALAR_rgn(:,idx_nrt)*(area*landfrac*msk_nrt).total($lat,$lon).float()*1.0f;W_SCALAR_rgn(:,idx_sth)=W_SCALAR_rgn(:,idx_sth)*(area*landfrac*msk_sth).total($lat,$lon).float()*1.0f;}W_SCALAR=W_SCALAR_rgn;push(&W_SCALAR@cell_methods," area: mean");if(exists(time_bnds)) time_bnds=time_bnds;if(exists(time_bounds)) time_bounds=time_bounds;valid_area_per_gridcell=area*landfrac;' output/W_SCALAR_198501_198912.nc output/W_SCALAR_198501_198912.nc

cfg excerpt:

input = /lcrc/group/e3sm2/ac.wlin//E3SMv3/v3.LR.historical_0051
input_subdir = archive/lnd/hist
output = "/lcrc/group/e3sm/ac.forsyth2/zppy_min_case_global_time_series_viewers_original_atm_plus_land_output/test_704_20250529_minimal_vars/v3.LR.historical_0051"
www = "/lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_min_case_global_time_series_viewers_original_atm_plus_land_www/test_704_20250529_minimal_vars"
Full cfg
[default]
case = "v3.LR.historical_0051"
constraint = ""
dry_run = "False"
environment_commands = ""
input = /lcrc/group/e3sm2/ac.wlin//E3SMv3/v3.LR.historical_0051
input_subdir = archive/atm/hist
mapping_file = "map_ne30pg2_to_cmip6_180x360_aave.20200201.nc"
output = "/lcrc/group/e3sm/ac.forsyth2/zppy_min_case_global_time_series_viewers_original_atm_plus_land_output/test_704_20250529_minimal_vars/v3.LR.historical_0051"
partition = "debug"
qos = "regular"
www = "/lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_min_case_global_time_series_viewers_original_atm_plus_land_www/test_704_20250529_minimal_vars"

[ts]
active = True
walltime = "00:30:00"
years = "1985:1995:5",

  [[ atm_monthly_glb ]]
  # Note global average won't work for 3D variables.
  frequency = "monthly"
  input_files = "eam.h0"
  input_subdir = "archive/atm/hist"
  mapping_file = "glb"

  [[ lnd_monthly_glb ]]
  frequency = "monthly"
  input_files = "elm.h0"
  input_subdir = "archive/lnd/hist"
  mapping_file = "glb"
  #vars = "" # Get all available variables
  vars = "TBOT,QBOT,W_SCALAR"

[global_time_series]
active = True
environment_commands = "source <INSERT PATH TO CONDA>/conda.sh; conda activate <INSERT ENV NAME>"
experiment_name = "v3.LR.historical_0051"
figstr = "v3.LR.historical_0051"
make_viewer = True
# We have to set plots_original to the 5 plots that don't require ocean.
plots_original="net_toa_flux_restom,global_surface_air_temperature,toa_radiation,net_atm_energy_imbalance,net_atm_water_imbalance"
plots_atm = "TREFHT"
plots_lnd = "FSH,RH2M,LAISHA,LAISUN"
ts_num_years = 5
walltime = "00:30:00"
years = "1985-1995",

@czender

czender commented Jun 6, 2025

Sorry for the delay. I can reproduce this behavior. It looks like the 3D variables are not being discarded as I thought they would be. Definitely an issue on my end. Hope to fix by early next week...

@czender

czender commented Jun 6, 2025

@forsyth2 I spoke too soon. The code is expected to fail when the user explicitly requests a global mean timeseries of a 3D variable. This is in keeping with the general NCO philosophy of failing when asked to do what cannot be done. So it fails when W_SCALAR is explicitly requested. However, ncclimo should definitely work, and it does work for me when no variable list is given. In that case ncclimo generates a list of all the 2D variables and successfully computes timeseries of them (at least for me). This is the mode that your zppy test actually tests. Creating the 1985-1994 five-year timeseries of all 2D variables in the historical simulation you pointed me to (just like your zppy test attempts) takes me about 3m10s on a Chrysalis login node. It should not take a vastly longer amount of time in the Chrysalis queue, yet you said the time limit was exceeded. What was the time limit? (I think your original directory has vanished).

@czender

czender commented Jun 6, 2025

ac.zender@chrlogin1:~$ drc_in=/lcrc/group/e3sm2/ac.wlin//E3SMv3/v3.LR.historical_0051/archive/lnd/hist
ac.zender@chrlogin1:~$ drc_out=${HOME}/ryan
ac.zender@chrlogin1:~$ cd ${drc_in};ls ${caseid}.elm.h0.198[5-9]-??.nc ${caseid}.elm.h0.199[0-5]-??.nc | /home/ac.zender/bin_chrysalis/ncclimo --npo -c v3.LR.historical_0051 --split --yr_srt=1985 --yr_end=1995 --ypf=5 -o ${drc_out} --rgn_avg --area=area --prc_typ=elm
ncclimo: WARNING Splitter mode without explicitly specified variable list (i.e., -v var_lst) splits all variables of rank >= 2 into separate files, thus doubling the on-disk data amount
Climatology operations invoked with command:
/home/ac.zender/bin_chrysalis/ncclimo --npo -c v3.LR.historical_0051 --split --yr_srt=1985 --yr_end=1995 --ypf=5 -o /home/ac.zender/ryan --rgn_avg --area=area --prc_typ=elm
Started climatology splitting at Fri Jun  6 15:46:45 PDT 2025
Running climatology script ncclimo from directory /gpfs/fs1/home/ac.zender/bin_chrysalis
NCO binaries version 5.3.4-alpha06 from directory /gpfs/fs1/home/ac.zender/bin_chrysalis
Parallelism mode = background
Timeseries will be created for each of 353 variables
Background parallelism processing variables in var_nbr/job_nbr = 353/353 = 1 sequential batches each concurrently processing job_nbr = 353 jobs (1 per variable), then remaining 0 jobs/variables simultaneously
WARNING: Requested number of simultaneous jobs = job_nbr = 353 exceeds threshold number = job_nbr_wrn = 150. This command will start an unusually (and possibly inadvertently) large number of splitter tasks for most computers. Consequences may include insufficient RAM that leads to swapping, slow performance due to I/O contention when reading/writing data.
HINT: If undesirable performance occurs, use the --job_nbr option to reduce the number of simultaneous jobs, e.g., ncclimo --job_nbr=100 ...
All this occurs within an outer loop of (yr_sbs/ypf_max) + (remainder, if any) = 11/5 + 1 = 3 time segments
Will split data for each variable into 2 timeseries segment(s) of length 5 years and 1 segment of length 1 year(s)
Splitting climatology from list of 132 raw input files piped to stdin
Each input file assumed to contain statistics for one month
Hemispherically and globally averaged timeseries files to be saved to directory /home/ac.zender/ryan
Split files will not be regridded
Fri Jun  6 15:47:35 PDT 2025: Generated /home/ac.zender/ryan/ACTUAL_IMMOB_198501_198912.nc
Fri Jun  6 15:47:35 PDT 2025: Generated /home/ac.zender/ryan/ACTUAL_IMMOB_P_198501_198912.nc
...
Fri Jun  6 15:49:38 PDT 2025: Global and regional statistics /home/ac.zender/ryan/ZWT_199501_199512.nc
Fri Jun  6 15:49:38 PDT 2025: Global and regional statistics /home/ac.zender/ryan/ZWT_CH4_UNSAT_199501_199512.nc
Fri Jun  6 15:49:38 PDT 2025: Global and regional statistics /home/ac.zender/ryan/ZWT_PERCH_199501_199512.nc
Quick plots of last timeseries segment of last variable split:
ncvis /home/ac.zender/ryan/ZWT_PERCH_199501_199512.nc &
Completed 11-year climatology operations for dataset with caseid = v3.LR.historical_0051 at Fri Jun  6 15:49:38 PDT 2025
Elapsed time 2m53s
ac.zender@chrlogin1:/lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.historical_0051/archive/lnd/hist$ 

@czender

czender commented Jun 19, 2025

FYI I released NCO 5.3.4 which is what the --npo option will currently point to. It includes changes that absolutely prevent it from trying to do regional timeseries on 3D variables, though that was not a problem with your ELM tests that were timing out. 5.3.4 will actually take about a minute longer to complete the ELM test above due to this change. This should still be acceptable. The problematic issue is that it times out for you with zppy. To me this seems to indicate an environment issue related to running the code in zppy, or via slurm and that's what we need to get to the bottom of. Thoughts?
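For scale, the cfg shown earlier in this thread sets `walltime = "00:30:00"` for the ts task, while the login-node run above reported `Elapsed time 2m53s`. The sketch below converts a Slurm-style walltime to seconds and compares the two numbers; `walltime_to_seconds` is a hypothetical helper, and the comparison is only indicative because the login-node run and the queued jobs cover different year ranges.

```bash
#!/usr/bin/env bash
# Sketch only: convert an HH:MM:SS walltime string to seconds.
# walltime_to_seconds is a hypothetical helper name.
walltime_to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  # 10# forces base 10 so fields such as "08" are not parsed as octal.
  echo $(( 10#${h} * 3600 + 10#${m} * 60 + 10#${s} ))
}

limit=$(walltime_to_seconds "00:30:00")  # walltime from the [ts] section
elapsed=$(( 2 * 60 + 53 ))               # "Elapsed time 2m53s" on the login node
echo "limit=${limit}s elapsed=${elapsed}s headroom=$(( limit - elapsed ))s"
# prints: limit=1800s elapsed=173s headroom=1627s
```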

@forsyth2
Collaborator Author

@czender Thanks for the continued work here! Unfortunately, I have several other priorities at the moment, so it may be a bit before I can get back to this. Will give an update when I can.

@forsyth2
Collaborator Author

@czender Sorry, I finally had some time to get back to this, but now I can't seem to get your custom NCO to work at all.

Command line, no conda, `bin_chrysalis/ncclimo`
# https://github.com/E3SM-Project/zppy/pull/717#issuecomment-2951125003
test_num=2
caseid=v3.LR.historical_0051
drc_in=/lcrc/group/e3sm2/ac.wlin//E3SMv3/v3.LR.historical_0051/archive/lnd/hist
drc_out=/lcrc/group/e3sm/ac.forsyth2/issue_704_test${test_num}
cd ${drc_in}
ls ${caseid}.elm.h0.198[5-9]-??.nc ${caseid}.elm.h0.199[0-5]-??.nc | /home/ac.zender/bin_chrysalis/ncclimo --npo -c v3.LR.historical_0051 --split --yr_srt=1985 --yr_end=1995 --ypf=5 -o ${drc_out} --rgn_avg --area=area --prc_typ=elm

gives:

ncclimo: ERROR /home/ac.zender/bin_chrysalis/ncks dies with error message on next line:
ncks: error while loading shared libraries: libudunits2.so.0: cannot open shared object file: No such file or directory
Command line, Unified, `bin_chrysalis/ncclimo`

Same script as above gives many iterations of

/home/ac.zender/bin_chrysalis/ncclimo: eval: line 3017: syntax error near unexpected token `('

That seems to me to be an error inside ncclimo, since I'm not seeing any ( in my command, but I could be mistaken.

Running zppy, Unified, `bin_chrysalis/ncclimo`

Cfg:

#!/bin/bash

# Running on chrysalis

#SBATCH  --job-name=ts_lnd_monthly_glb_1985-1989-0005
#SBATCH  --account=e3sm
#SBATCH  --nodes=1
#SBATCH  --output=/lcrc/group/e3sm/ac.forsyth2/zppy_min_case_global_time_series_viewers_original_atm_plus_land_output/test_pr717_20250723/v3.LR.historical_0051/post/scripts/ts_lnd_monthly_glb_1985-1989-0005.o%j
#SBATCH  --exclusive
#SBATCH  --time=00:30:00


#SBATCH  --partition=debug


# Turn on debug output if needed
debug=False
if [[ "${debug,,}" == "true" ]]; then
  set -x
fi

# Script dir
cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_global_time_series_viewers_original_atm_plus_land_output/test_pr717_20250723/v3.LR.historical_0051/post/scripts

# Get jobid
id=${SLURM_JOBID}

# Update status file
STARTTIME=$(date +%s)
echo "RUNNING ${id}" > ts_lnd_monthly_glb_1985-1989-0005.status
source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh

# Create temporary workdir
hash=`mktemp --dry-run -d XXXX`
workdir=tmp.ts_lnd_monthly_glb_1985-1989-0005.${id}.${hash}
mkdir ${workdir}
cd ${workdir}

# Create symbolic links to input files
input=/lcrc/group/e3sm2/ac.wlin//E3SMv3/v3.LR.historical_0051/archive/lnd/hist
for (( year=1985; year<=1989; year++ ))
do
  YYYY=`printf "%04d" ${year}`
  for file in ${input}/v3.LR.historical_0051.elm.h0.${YYYY}-*.nc
  do
    ln -s ${file} .
  done
done

vars=
# https://unix.stackexchange.com/questions/237297/the-fastest-way-to-remove-a-string-in-a-variable
# https://stackoverflow.com/questions/26457052/remove-a-substring-from-a-bash-variable
# Remove U, since it is a 3D variable and thus will not work with rgn_avg
vars=${vars//,U}

ls v3.LR.historical_0051.elm.h0.????-*.nc > input.txt
if grep -q "*" input.txt; then
  cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_global_time_series_viewers_original_atm_plus_land_output/test_pr717_20250723/v3.LR.historical_0051/post/scripts
  echo 'Missing input files'
  echo 'ERROR (1)' > ts_lnd_monthly_glb_1985-1989-0005.status
  exit 1
fi
# Generate time series files
# If the user-defined parameter "vars" is "", then ${vars}, defined above, will be too.
cat input.txt | /home/ac.zender/bin_chrysalis/ncclimo --npo \
-c v3.LR.historical_0051 \
--split \
--yr_srt=1985 \
--yr_end=1989 \
--ypf=5 \
-o output \
--rgn_avg \
--area=area \
--prc_typ=elm



if [ $? != 0 ]; then
  cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_global_time_series_viewers_original_atm_plus_land_output/test_pr717_20250723/v3.LR.historical_0051/post/scripts
  echo 'ERROR (2)' > ts_lnd_monthly_glb_1985-1989-0005.status
  exit 2
fi

# Move output ts files to final destination
{
  dest=/lcrc/group/e3sm/ac.forsyth2/zppy_min_case_global_time_series_viewers_original_atm_plus_land_output/test_pr717_20250723/v3.LR.historical_0051/post/lnd/glb/ts/monthly/5yr
  mkdir -p ${dest}
  mv output/*.nc ${dest}
}
if [ $? != 0 ]; then
  cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_global_time_series_viewers_original_atm_plus_land_output/test_pr717_20250723/v3.LR.historical_0051/post/scripts
  echo 'ERROR (3)' > ts_lnd_monthly_glb_1985-1989-0005.status
  exit 3
fi

# Delete temporary workdir
cd ..
if [[ "${debug,,}" != "true" ]]; then
  rm -rf ${workdir}
fi

# Update status file and exit

ENDTIME=$(date +%s)
ELAPSEDTIME=$(($ENDTIME - $STARTTIME))

echo ==============================================
echo "Elapsed time: $ELAPSEDTIME seconds"
echo ==============================================
rm -f ts_lnd_monthly_glb_1985-1989-0005.status
echo 'OK' > ts_lnd_monthly_glb_1985-1989-0005.status
zppy -c tests/integration/generated/test_min_case_global_time_series_viewers_original_atm_plus_land_chrysalis.cfg

gives:

ncclimo: ERROR /home/ac.zender/bin_chrysalis/ncks dies with error message on next line:
ncks: error while loading shared libraries: libopenblas.so.0: cannot open shared object file: No such file or directory

I've seen this shared object file error before but I don't think I've ever been able to deduce what causes it, besides attributing it to parallelism, e.g., #180
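One generic way to narrow down a "cannot open shared object file" error is to ask `ldd` which libraries the loader fails to resolve in the current environment. The sketch below only filters `ldd`-style output; `missing_libs` is a hypothetical helper, and running it in each environment (no conda, Unified, inside the Slurm job) is a suggestion rather than something that was done in this thread.

```bash
#!/usr/bin/env bash
# Sketch only: print the names of shared libraries that ldd reports
# as "not found". missing_libs is a hypothetical helper; it reads
# ldd-style output on stdin so it can be tried without the real binary.
missing_libs() {
  awk '/=> not found/ { print $1 }'
}

# Example, using the binary named in the error message above:
#   ldd /home/ac.zender/bin_chrysalis/ncks | missing_libs
```

An empty result would suggest that every library resolves in that environment, so comparing the output across environments may show which setup step makes a library visible.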

@czender

czender commented Jul 23, 2025

Thanks for the info, @forsyth2. You are running the bleeding edge NCO, and it appears you've found an ncclimo syntax error. I'll look into this soon and let you know when I've made the necessary updates.

@forsyth2 forsyth2 added the priority: high High priority task (for next release) label Aug 9, 2025
@chengzhuzhang
Collaborator

@czender I'm helping @thorntonpe create global time series plots for all 2D land variables, and it looks like I need to use your latest snapshot to generate global metrics without manually removing 3D vars. Does the latest NCO image on LCRC include this feature?

@forsyth2
Collaborator Author

forsyth2 commented Sep 2, 2025

it appears you've found an ncclimo syntax error. I'll look into this soon and let you know when I've made the necessary updates.

Hey @czender, just checking in if you've had a chance to look into this yet. Thanks!

@czender

czender commented Sep 3, 2025

@forsyth2 I'm looking into it now. I have an idea as to what may be causing the issues you report. I'll try it tomorrow.

@czender

czender commented Sep 4, 2025

I'm making progress on this, and will have a new implementation you can try tomorrow or Friday.

@forsyth2
Collaborator Author

forsyth2 commented Sep 4, 2025

Great, thanks @czender!

@czender

czender commented Sep 4, 2025

@forsyth2 I re-factored the --npo functionality that I think caused the problems you experienced. Would you please retry the tests mentioned above on July 23? I have high hopes that tests 1 and 3 will now pass.

@forsyth2
Collaborator Author

forsyth2 commented Sep 4, 2025

I have high hopes that tests 1 and 3 will now pass.

Test 3 (ncclimo called via zppy) does indeed appear to work now! Interestingly, test 1 (ncclimo called directly) still gives ncks: error while loading shared libraries: libgsl.so.28: cannot open shared object file: No such file or directory

cd ~/ez/zppy
lcrc_conda # Function to activate conda
conda clean --all --y
conda env create -f conda/dev.yml -n zppy-704-20250904
conda activate zppy-704-20250904

# Call ncclimo directly, with Unified environment
# 2nd test case from 7/23 run
unified # Alias to set up unified environment
test_num=1
caseid=v3.LR.historical_0051
drc_in=/lcrc/group/e3sm2/ac.wlin//E3SMv3/v3.LR.historical_0051/archive/lnd/hist
drc_out=/lcrc/group/e3sm/ac.forsyth2/issue_704_20250904test${test_num}
cd ${drc_in}
ls ${caseid}.elm.h0.198[5-9]-??.nc ${caseid}.elm.h0.199[0-5]-??.nc | /home/ac.zender/bin_chrysalis/ncclimo --npo -c v3.LR.historical_0051 --split --yr_srt=1985 --yr_end=1995 --ypf=5 -o ${drc_out} --rgn_avg --area=area --prc_typ=elm
# syntax error near unexpected token `)'

# Call ncclimo directly, without Unified environment
# 1st test case from 7/23 run
lcrc_conda
test_num=2
drc_out=/lcrc/group/e3sm/ac.forsyth2/issue_704_20250904test${test_num}
ls ${caseid}.elm.h0.198[5-9]-??.nc ${caseid}.elm.h0.199[0-5]-??.nc | /home/ac.zender/bin_chrysalis/ncclimo --npo -c v3.LR.historical_0051 --split --yr_srt=1985 --yr_end=1995 --ypf=5 -o ${drc_out} --rgn_avg --area=area --prc_typ=elm
# ncks: error while loading shared libraries: libgsl.so.28: cannot open shared object file: No such file or directory

# Call ncclimo via zppy
# 3rd test case from 7/23 run
cd ~/ez/zppy
conda activate zppy-704-20250904
git log
# Commits:
# Testing
# Updates
# Latest NCO removes 3D vars (only commit currently pushed to GitHub)
# Merge pull request #723 from E3SM-Project/fix-pm-tc-analysis
# Edit tests/integration/utils.py
# UNIQUE_ID = "test_pr717_20250904"
#        "diags_environment_commands": "source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh",
#        "global_time_series_environment_commands": "source /gpfs/fs1/home/ac.forsyth2/miniforge3/etc/profile.d/conda.sh; conda activate zi-main-20250822",
python tests/integration/utils.py
python -m pip install .
zppy -c tests/integration/generated/test_min_case_global_time_series_viewers_original_atm_plus_land_chrysalis.cfg
cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_global_time_series_viewers_original_atm_plus_land_output/test_pr717_20250904/v3.LR.historical_0051/post/scripts
grep -v "OK" *status
# No errors
cd ~/ez/zppy
emacs zppy/templates/ts.bash 
# Uses latest NCO
# /home/ac.zender/bin_chrysalis/ncclimo --npo

@forsyth2
Collaborator Author

forsyth2 commented Sep 4, 2025

@czender It's possible I'm just not calling ncclimo correctly as a stand-alone call (test case 1), but it's good that the main use case works.

Also, I suppose we'll again have the situation of needing to remove the /home/ac.zender/bin_chrysalis/ and --npo during the Unified testing period (i.e., once the latest NCO is in the test Unified)

@czender

czender commented Sep 5, 2025

Thanks for re-testing. That's good progress. Unfortunately the nature of the --npo option is difficult for me to debug because it's an option intended for use by others. It attempts to run my executables using my environment. When I run with that option it always works because my environment agrees with my environment. I will give further thought to why your runs do not find the correct libudunits...
And yes, the final invocation in zppy will need to remove the /home/ac.zender/bin_chrysalis/ path and the --npo option.

@czender

czender commented Sep 5, 2025

@forsyth2 I just noticed that the failure in your test #1 changed from not finding the right libudunits (before) to not finding the right libgsl (now). It appears the libudunits link issue was fixed. One possibility is that you have a different version of libgsl that pre-empts the one in my directory. To test this please run and post the results of gsl-config --prefix. Also I just uploaded a new ncclimo snapshot to output more debugging info to help. Please post the output from re-running Test #1 with it.
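
Editor's note: a minimal sketch of one way to see which shared libraries the dynamic loader fails to resolve, assuming a Linux system where `ldd` is available. The helper name `missing_libs` is hypothetical and is not part of NCO or zppy; it only filters `ldd` output for the "not found" lines that correspond to errors like the `libgsl.so.28` one above.

```shell
# Hypothetical helper: read `ldd` output on stdin and print the name of
# every shared library that the loader reports as "not found".
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Only inspect ncks (the NCO binary named in the error) if it is on PATH.
if command -v ncks >/dev/null 2>&1; then
    ldd "$(command -v ncks)" | missing_libs
fi

# Show the directories the loader is asked to search first, one per line.
printf '%s\n' "${LD_LIBRARY_PATH:-}" | tr ':' '\n'
```

Running this in each of the three environments (plain conda, Unified, and the zppy dev environment) would show whether they differ in which libraries resolve.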

@forsyth2
Collaborator Author

forsyth2 commented Sep 5, 2025

@czender Ah that fixes it, looks like both (1) and (3) work now.

# 1st test case from 7/23 run
lcrc_conda
gsl-config --prefix
# -bash: gsl-config: command not found

# 2nd test case from 7/23 run
unified
gsl-config --prefix
# /lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.11.1_login

# 3rd test case from 7/23 run
lcrc_conda
conda activate zppy-704-20250904
gsl-config --prefix
# -bash: gsl-config: command not found

and

# Call ncclimo directly, without Unified environment
# 1st test case from 7/23 run
lcrc_conda
test_num=2
caseid=v3.LR.historical_0051
drc_in=/lcrc/group/e3sm2/ac.wlin//E3SMv3/v3.LR.historical_0051/archive/lnd/hist
drc_out=/lcrc/group/e3sm/ac.forsyth2/issue_704_20250905test${test_num}
cd ${drc_in}
ls ${caseid}.elm.h0.198[5-9]-??.nc ${caseid}.elm.h0.199[0-5]-??.nc | /home/ac.zender/bin_chrysalis/ncclimo --npo -c v3.LR.historical_0051 --split --yr_srt=1985 --yr_end=1995 --ypf=5 -o ${drc_out} --rgn_avg --area=area --prc_typ=elm
# 12m41s runtime, success!

# Call ncclimo via zppy
# 3rd test case from 7/23 run
cd ~/ez/zppy
git checkout issue-704
conda activate zppy-704-20250904
# Edit tests/integration/utils.py
# UNIQUE_ID = "test_pr717_20250905"
#        "diags_environment_commands": "source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh",
#        "global_time_series_environment_commands": "source /gpfs/fs1/home/ac.forsyth2/miniforge3/etc/profile.d/conda.sh; conda activate zi-main-20250822",
python tests/integration/utils.py
python -m pip install .
zppy -c tests/integration/generated/test_min_case_global_time_series_viewers_original_atm_plus_land_chrysalis.cfg
cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_global_time_series_viewers_original_atm_plus_land_output/test_pr717_20250905/v3.LR.historical_0051/post/scripts
grep -v "OK" *status
# No errors, success!
cd ~/ez/zppy
emacs zppy/templates/ts.bash 
# Uses latest NCO
# /home/ac.zender/bin_chrysalis/ncclimo --npo
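
Editor's note: the `grep -v "OK" *status` check above prints every status line that is not `OK`. A minimal sketch of the same check as a reusable function follows. The function name `non_ok_status` is hypothetical, and the sketch assumes only what the session shows: files ending in `status` that contain `OK` on success.

```shell
# Hypothetical helper: list the status files in directory $1 that contain
# at least one line without "OK". Prints nothing when every file is OK.
non_ok_status() {
    # -v selects non-matching lines; -l prints only the names of files
    # that have such a line. Errors from an unmatched glob are silenced.
    grep -lv "OK" "$1"/*status 2>/dev/null
}

# Example (commented out): check the scripts directory of a zppy run.
# non_ok_status /path/to/post/scripts
```

An empty result corresponds to the "No errors" outcome recorded in the session.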

@czender

czender commented Sep 5, 2025

Good, though odd. I did not change the feature since yesterday. Today I just added new output to help debug. I'm not sure why Test #1 succeeds today, given that it failed yesterday. But I also thought #1 and #3 would both succeed yesterday...

@forsyth2
Collaborator Author

forsyth2 commented Sep 5, 2025

Hmm, that is interesting. I've seen "error while loading shared libraries" failures before; they usually involve parallelism and are difficult to reproduce consistently, so I'm wondering if that's what happened here.

@czender

czender commented Sep 5, 2025

My understanding of the problem is that #1 and #3 were not searching the same directory that contains the shared libraries that my binaries need to access. Nothing related to parallelism that I can see. Maybe run tests #1 and #3 a few more times if you can to see if any errors arise.
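
Editor's note: re-running a test several times to look for an intermittent failure can be sketched as below. The function name `count_failures` and the script name in the usage comment are hypothetical. The function runs any command a fixed number of times and prints how many runs exited non-zero.

```shell
# Hypothetical helper: run a command N times and print the number of
# runs that exited with a non-zero status.
# Usage: count_failures N command [args...]
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the command's output; only its exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Example (commented out): wrap test case 1 in a script and run it 3 times.
# count_failures 3 ./run_test1.sh
```

A result of 0 across several runs would be consistent with the library search path, rather than a race, having caused the earlier failure, though it would not prove it.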

@forsyth2
Collaborator Author

forsyth2 commented Sep 5, 2025

Hmm ran test case 1 again, still ran just fine.

@forsyth2
Collaborator Author

@czender Can I merge this, or are you concerned about test 1? Test 3 is the important use case here, since it actually uses zppy.

@forsyth2 forsyth2 marked this pull request as ready for review September 16, 2025 00:45
@czender

czender commented Sep 16, 2025

I would say run the test one more time. If the glitch with #1 does not repeat then merge it. However, know that there have been other changes to NCO in the interim, so there could be a failure due to unrelated causes.

@forsyth2
Collaborator Author

Looks like both test cases are still working:

# Call ncclimo directly, without Unified environment
# 1st test case from 7/23 run
lcrc_conda
test_num=3
caseid=v3.LR.historical_0051
drc_in=/lcrc/group/e3sm2/ac.wlin//E3SMv3/v3.LR.historical_0051/archive/lnd/hist
drc_out=/lcrc/group/e3sm/ac.forsyth2/issue_704_20250905test${test_num}
cd ${drc_in}
ls ${caseid}.elm.h0.198[5-9]-??.nc ${caseid}.elm.h0.199[0-5]-??.nc | /home/ac.zender/bin_chrysalis/ncclimo --npo -c v3.LR.historical_0051 --split --yr_srt=1985 --yr_end=1995 --ypf=5 -o ${drc_out} --rgn_avg --area=area --prc_typ=elm
# Elapsed time 14m36s
# Success

# Call ncclimo via zppy
# 3rd test case from 7/23 run
cd ~/ez/zppy
git status
# Check for uncommitted changes
git checkout issue-704
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n zstash-nco-20250916
conda activate zstash-nco-20250916
git fetch upstream
git rebase upstream/main
git log # Includes latest changes on `main` branch
# Edit tests/integration/utils.py
# TEST_SPECIFICS: Dict[str, Any] = {
#     "diags_environment_commands": "source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh",
#     "global_time_series_environment_commands": "source /gpfs/fs1/home/ac.forsyth2/miniforge3/etc/profile.d/conda.sh; conda activate zi-main-20250822",
#     "cfgs_to_run": [
#         "min_case_global_time_series_viewers_original_atm_plus_land",
#     ],
#     "tasks_to_run": ["e3sm_diags", "mpas_analysis", "global_time_series", "ilamb"],
#     "unique_id": "test_pr717_20250916",
# }
python tests/integration/utils.py
python -m pip install .
zppy -c tests/integration/generated/test_min_case_global_time_series_viewers_original_atm_plus_land_chrysalis.cfg
cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_global_time_series_viewers_original_atm_plus_land_output/test_pr717_20250916/v3.LR.historical_0051/post/scripts
grep -v "OK" *status
# No errors, success!
cd ~/ez/zppy
emacs zppy/templates/ts.bash 
# Good, uses latest NCO:
# /home/ac.zender/bin_chrysalis/ncclimo --npo

@forsyth2
Collaborator Author

Based on the above testing, I'm going to merge.

@forsyth2 forsyth2 merged commit cabe0cd into main Sep 16, 2025
4 checks passed
@forsyth2 forsyth2 deleted the issue-704 branch September 16, 2025 20:05
@forsyth2 forsyth2 mentioned this pull request Oct 27, 2025
7 tasks
Labels

priority: high: High priority task (for next release)
semver: small improvement: Small improvement (will increment patch version)

Development

Successfully merging this pull request may close these issues.

Use NCO to automatically exclude 3D variables

4 participants