zppy debugging guide for developers #573
forsyth2
announced in
Announcements
It might be a good idea to include some debugging examples on this discussion page. With that in mind, I'm starting with this example:
Debugging example 2025-10-27
Techniques used
Details
Debugging steps
On Perlmutter:
# Note: I have the following aliases set up:
# alias sqa='squeue -o "%8u %.7a %.4D %.9P %8i %.2t %.10r %.10M %.10l %j" --sort=P,-t,-p'
# alias sq='sqa -u forsyth'
cd ~/zppy_support/
sq
# No jobs in the queue
emacs debug_zppy_calandrini.cfg
# Copy the user's cfg
cp debug_zppy_calandrini.cfg debug_zppy_calandrini_v2.cfg
# Make a copy that we can edit
emacs debug_zppy_calandrini_v2.cfg
# Make edits:
# output = /pscratch/sd/f/forsyth/debug_zppy_calandrini/20251016.FcaseNudged101.F20TR.ne30pg2_r05_IcoswISC30E3r5
# www = /global/cfs/cdirs/e3sm/www/forsyth
# Debugging round 1a ##########################################################
source /global/common/software/e3sm/anaconda_envs/load_latest_e3sm_unified_pm-cpu.sh
zppy -c debug_zppy_calandrini_v2.cfg
# RuntimeError: Problem submitting script /pscratch/sd/f/forsyth/debug_zppy_calandrini/20251016.FcaseNudged101.F20TR.ne30pg2_r05_IcoswISC30E3r5/post/scripts/climo_atm_monthly_90x180_aave_2012-2014.bash
# sbatch: error: No available node hour balance information for user forsyth
# No available node hour balance
# That implies I'm either out of node hours or something is wrong with the account
# Solution: comment out `account = m4310`
sq
# No jobs in the queue
# Debugging round 1b ##########################################################
zppy -c debug_zppy_calandrini_v2.cfg
# zppy.utils.ParameterNotProvidedError: ts_subsection is required because the sets {'qbo', 'enso_diags'} were requested.
# ts_subsection is required
# Ok, let's set it.
# Solution: add `ts_subsection="atm_monthly_90x180_aave"`
sq
# 3 jobs in the queue
scancel -u forsyth
# Let's cancel those since we're going to re-run
# Debugging round 2 ###########################################################
# Update paths:
# output = /pscratch/sd/f/forsyth/debug_zppy_calandrini_try2/20251016.FcaseNudged101.F20TR.ne30pg2_r05_IcoswISC30E3r5
# www = /global/cfs/cdirs/e3sm/www/forsyth/debug_zppy_calandrini_try2/
zppy -c debug_zppy_calandrini_v2.cfg
# e3sm_diags_atm_monthly_90x180_aave_mvm_model_vs_model_2012-2014_vs_2012-2014
# ...skipping because of dependency status file missing
# /pscratch/sd/f/forsyth/debug_zppy_calandrini_try2/20251016.FcaseNudged101.F20TR.ne30pg2_r05_IcoswISC30E3r5/post/scripts/climo_atm_monthly_90x180_aave_mvm_2012-2014.status
# environment_commands=source /global/common/software/e3sm/anaconda_envs/load_latest_e3sm_unified_pm-cpu.sh
# URL: https://portal.nersc.gov/cfs/e3sm/forsyth/debug_zppy_calandrini_try2//20251016.FcaseNudged101.F20TR.ne30pg2_r05_IcoswISC30E3r5/e3sm_diags
# global_time_series_2012-2014
# ...skipping because of dependency status file missing
# /pscratch/sd/f/forsyth/debug_zppy_calandrini_try2/20251016.FcaseNudged101.F20TR.ne30pg2_r05_IcoswISC30E3r5/post/scripts/mpas_analysis_ts_2012-2014_climo_2012-2014.status
# environment_commands=source /global/common/software/e3sm/anaconda_envs/load_latest_e3sm_unified_pm-cpu.sh
# URL: https://portal.nersc.gov/cfs/e3sm/forsyth/debug_zppy_calandrini_try2//20251016.FcaseNudged101.F20TR.ne30pg2_r05_IcoswISC30E3r5/global_time_series
# Ok, we're skipping jobs because some dependencies are missing
sq
# 3 jobs in the queue
scancel -u forsyth
# Let's cancel those since we're going to re-run
# Let's check the .settings files:
cd /pscratch/sd/f/forsyth/debug_zppy_calandrini_try2/20251016.FcaseNudged101.F20TR.ne30pg2_r05_IcoswISC30E3r5/post/scripts
emacs e3sm_diags_atm_monthly_90x180_aave_mvm_model_vs_model_2012-2014_vs_2012-2014.settings
# 'dependencies': [ '/pscratch/sd/f/forsyth/debug_zppy_calandrini_try2/20251016.FcaseNudged101.F20TR.ne30pg2_r05_IcoswISC30E3r5/post/scripts/climo_atm_monthly_90x180_aave_mvm_2012-2014.status',
# '/pscratch/sd/f/forsyth/debug_zppy_calandrini_try2/20251016.FcaseNudged101.F20TR.ne30pg2_r05_IcoswISC30E3r5/post/scripts/ts_atm_monthly_90x180_aave_2012-2014-0003.status'],
emacs global_time_series_2012-2014.settings
# 'dependencies': [ '/pscratch/sd/f/forsyth/debug_zppy_calandrini_try2/20251016.FcaseNudged101.F20TR.ne30pg2_r05_IcoswISC30E3r5/post/scripts/ts_atm_monthly_glb_2012-2014-0003.status',
# '/pscratch/sd/f/forsyth/debug_zppy_calandrini_try2/20251016.FcaseNudged101.F20TR.ne30pg2_r05_IcoswISC30E3r5/post/scripts/mpas_analysis_ts_2012-2014_climo_2012-2014.status'],
# What .status files do we actually have?
ls *.status
# climo_atm_monthly_90x180_aave_2012-2014.status ts_atm_monthly_90x180_aave_2012-2014-0003.status ts_atm_monthly_glb_2012-2014-0003.status
# Solution: add `climo_subsection="atm_monthly_90x180_aave"` and `plots_original=""`
# Debugging round 3 ###########################################################
cd ~/zppy_support/
emacs debug_zppy_calandrini_v2.cfg
# Update paths:
# output = /pscratch/sd/f/forsyth/debug_zppy_calandrini_try3/20251016.FcaseNudged101.F20TR.ne30pg2_r05_IcoswISC30E3r5
# www = /global/cfs/cdirs/e3sm/www/forsyth/debug_zppy_calandrini_try3/
zppy -c debug_zppy_calandrini_v2.cfg
# No failures on launch
sq
# 5 jobs in the queue
# Interestingly, `global_time_series_2012-2014` is listed with `Priority` rather than `Dependency`
cd /pscratch/sd/f/forsyth/debug_zppy_calandrini_try3/20251016.FcaseNudged101.F20TR.ne30pg2_r05_IcoswISC30E3r5/post/scripts
cat global_time_series_2012-2014.settings
# 'dependencies': [],
# As predicted, global time series doesn't have the proper dependencies set up.
sq
# Still 5 jobs in the queue
scancel -u forsyth
# Cancel, because we need to fix the dependency issue
# `[global_time_series]` didn't actually have any plots_<component> specified,
# so we accidentally asked it to plot nothing,
# which of course triggered no dependencies.
# Looking at `zppy/defaults/default.ini`, we see:
# Remove the 3 ocean plots (change_ohc,max_moc,change_sea_level) if you don't have ocean data.
#plots_original = string(default="net_toa_flux_restom,global_surface_air_temperature,toa_radiation,net_atm_energy_imbalance,change_ohc,max_moc,change_sea_level,net_atm_water_imbalance")
# Solution: plots_original="net_toa_flux_restom,global_surface_air_temperature,toa_radiation,net_atm_energy_imbalance,net_atm_water_imbalance"
# Debugging round 4 ###########################################################
cd ~/zppy_support/
emacs debug_zppy_calandrini_v2.cfg
# Update paths:
# output = /pscratch/sd/f/forsyth/debug_zppy_calandrini_try4/20251016.FcaseNudged101.F20TR.ne30pg2_r05_IcoswISC30E3r5
# www = /global/cfs/cdirs/e3sm/www/forsyth/debug_zppy_calandrini_try4/
zppy -c debug_zppy_calandrini_v2.cfg
# No failures on launch
sq
# 5 jobs in the queue
# And now `global_time_series_2012-2014` is correctly listed as requiring a dependency to complete first.
# Now, we have to wait to get Perlmutter nodes
The cfg
`zppy` has many complex inter-connected pieces. It can therefore be a challenging package to develop and debug (and debugging `zppy` could well mean debugging a package it calls).
This guide aims to be a starting point for developers trying to debug `zppy`.
Check if the issue has come up before
If you have a specific line of code to search for (e.g. `Error: ...`), you can use the GitHub search bar in the upper right hand corner of the `zppy` repo to search for that line. It may have appeared in previous issues/PRs/discussions.
For more complicated questions, you can also look through the discussions page.
`zppy` skipped a job it shouldn't have
`zppy` will report what dependency is missing when it skips a job. Look at your cfg to determine why `zppy` is looking for that dependency. See #544 (comment) for an in-depth look at zppy dependency handling.
A job may also be skipped because its status file says "RUNNING", "WAITING", or "OK". Usually that means this job really shouldn't re-run anyway. However, it may be the case that you found a bug for which `zppy` doesn't exit with an error code. In these cases, simply delete the status file and re-run `zppy`.
Identifying errors in `.o` files
From #291 & https://e3sm-project.github.io/zppy/_build/html/main/tutorial.html#debugging-failures:
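The checks described above (a skipped job, a leftover status file, an error buried in a `.o` file) can be sketched in Python. This is a rough triage helper, not part of `zppy`, and the function names are made up for illustration. It assumes that a finished task's `.status` file contains exactly `OK`, that each `.settings` file is a printed Python dict with a `'dependencies': [...]` entry (as in the example session quoted in this discussion), and that job logs sit next to the status files as `<task>.o<job id>`.

```python
import re
from pathlib import Path


def unfinished_tasks(scripts_dir):
    """Return {task name: status text} for every status file that is not 'OK'."""
    statuses = {}
    for status_file in sorted(Path(scripts_dir).glob("*.status")):
        status = status_file.read_text().strip()
        if status != "OK":
            statuses[status_file.stem] = status
    return statuses


def missing_dependencies(settings_file):
    """List dependency status files named in a .settings file that do not exist."""
    text = Path(settings_file).read_text()
    # Assumption: the file has an entry like 'dependencies': ['a.status', 'b.status']
    match = re.search(r"'dependencies':\s*\[(.*?)\]", text, flags=re.DOTALL)
    if match is None:
        return []
    dependencies = re.findall(r"'([^']+)'", match.group(1))
    return [dep for dep in dependencies if not Path(dep).exists()]


def suspicious_log_lines(scripts_dir, task, keywords=("Error", "Traceback")):
    """Return (log name, line number, text) for log lines containing a keyword."""
    hits = []
    # Assumption: logs are named <task>.o<job id> and live in scripts_dir.
    for log_file in sorted(Path(scripts_dir).glob(f"{task}.o*")):
        lines = log_file.read_text(errors="replace").splitlines()
        for number, line in enumerate(lines, start=1):
            if any(keyword in line for keyword in keywords):
                hits.append((log_file.name, number, line.strip()))
    return hits
```

For instance, `unfinished_tasks(".")` run from the `post/scripts` directory lists every task whose status is anything other than `OK`, and `missing_dependencies("global_time_series_2012-2014.settings")` shows which `.status` files that job is still waiting for. An empty list can also mean the job has no dependencies at all, which was the actual problem in round 3 of the debugging example.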
The error is in a prior task
For example, if you realize your error in `global_time_series` is really because of an error in `ts`, then you'll need to fix the `ts` task and then re-run `zppy`. It's recommended to either delete the old `output` and `www` directories or set them to a new path so you know you aren't re-using old output.
Reduce the number of jobs you have `zppy` launch by deleting or commenting out everything in your `cfg` that's not involved with the debugging. (E.g., if you're debugging `global_time_series`, you may need to re-run the `ts` task dependencies, but you don't need the `climo` task to re-run).
The error is in a package zppy calls
#570 provides a chart of which tasks use which packages. If the bug is ultimately in another package, then that package needs to be updated. Then, you can use `environment_commands` (or `e3sm_to_cmip_environment_commands`) to set a different environment so you can test zppy with the fixed version of the underlying package. Directions on how to do this can also be found in #570.
Could the problem be environment, data, or machine (rather than `zppy` itself)?
Does this problem resolve when...
`zppy` tasks in the wrong environment by forgetting to set `environment_commands` accordingly or forgetting to run `pip install .` to apply the latest changes. So, double check your `cfg` to confirm environments are set correctly and/or try creating a new dev environment:
EDIT 2025-10-27: We're now using conda rather than mamba
`input`? Perhaps the data, not the code, is faulty. Or if one dataset works and another doesn't, that may tell us something about where the code may be broken.
Learning more about the data you have as input
You can run `ncdump -h <file-name>` to get a summary of data in files underneath your `input` directory.
`ncdump -h <file-name> | grep float` will show you the `float` variables defined in the file.
Use `ncdump -h <file-name> | grep -E "float (var1|var2|...|varN)\("` to find specific `float` variables defined in the file.
It may be the case that the variables you're trying to process aren't even defined in your input file. In that case, the problem is with the data you're using -- either you need to find simulation output with the required variables or you need to remove the variables from your processing list (e.g., `vars`, `plots_atm`).
Creating a MCVE
It can be helpful to reduce a problem to the smallest possible example size -- a minimal complete verifiable example (MCVE). This is helpful both to you as a debugger and to others you show the problem to.
For example, from the `zstash` Bug Report template (https://github.com/E3SM-Project/zstash/issues/new/choose):
This can be a real challenge in `zppy` since often the bug arises out of the many inter-connected pieces (which is why the MCVE question isn't even on the `zppy` bug report template). Sometimes though, it is possible. In these cases, creating a MCVE can be quite helpful.
For example, when debugging `global_time_series`, if you've identified that the problem is in `coupled_global.py`, you don't need to re-run `zppy` or even `global_time_series.bash` -- just look at `global_time_series.bash` to identify what parameters were used in the call to `coupled_global.py` and run `coupled_global.py` with those parameters yourself.
In rare cases it may even be possible to reduce the problem to a few lines of Python, in which case you can debug the problem in an interactive Python interpreter.
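To find the call to copy, one option is to pull the relevant command out of the generated script instead of reading it by eye. The helper below is a small sketch (the name `find_invocations` is hypothetical, not a `zppy` function): it joins backslash-continued lines and returns every non-comment command that mentions a given script name. Shell variables in the result still have to be resolved by reading the rest of the script.

```python
import re


def find_invocations(bash_text, script_name):
    """Return each non-comment command in bash_text that mentions script_name."""
    # Join lines ending in a backslash so a multi-line command becomes one line.
    joined = re.sub(r"\\\n", " ", bash_text)
    commands = []
    for line in joined.splitlines():
        # Collapse runs of whitespace; this is for reading, not for re-execution.
        command = " ".join(line.split())
        if script_name in command and not command.startswith("#"):
            commands.append(command)
    return commands


if __name__ == "__main__":
    # Illustrative script text, not copied from a real zppy-generated file.
    example = (
        "# Run coupled_global.py\n"
        "python coupled_global.py ${case_dir} \\\n"
        "    ${start_yr} ${end_yr}\n"
        "echo done\n"
    )
    for command in find_invocations(example, "coupled_global.py"):
        print(command)
```

With a real run, you would read the contents of `global_time_series.bash` from the `post/scripts` directory and pass them in as `bash_text`.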
In most cases, the simplest way to make a MCVE is to create a minimal `cfg`: run on as few years as possible, run as few tasks as possible, run on as few variables as possible -- what specific parameter combination causes the problem?
Write a test
Once you find a bug, think if there's a test you can write that would catch this bug in the future. E.g., what combination of parameters or type of data causes this bug to appear? If we can get a test into the test suite for this bug, then it prevents future users from running into it too.
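As an illustration, round 3 of the debugging example in this discussion came down to a `[global_time_series]` section that requested no plots at all. A regression test for that kind of bug might look like the sketch below. `requested_plots` and `check_plots_requested` are hypothetical helpers written for this example, not existing `zppy` functions, and the section is modeled as a plain dict of `plots_*` strings. A real test would exercise whatever code in `zppy` actually reads those parameters.

```python
def requested_plots(section):
    """Collect the plot names from every plots_* parameter of a cfg section."""
    plots = []
    for key, value in section.items():
        if key.startswith("plots_"):
            plots.extend(name.strip() for name in value.split(",") if name.strip())
    return plots


def check_plots_requested(section):
    """Fail loudly instead of silently plotting nothing."""
    if not requested_plots(section):
        raise ValueError("no plots requested: every plots_* parameter is empty")


def test_empty_plot_lists_are_rejected():
    # Hypothetical section mirroring the misconfiguration from the example.
    section = {"plots_original": "", "plots_atm": ""}
    try:
        check_plots_requested(section)
    except ValueError:
        return
    raise AssertionError("expected ValueError for a section with no plots")


def test_plot_names_are_parsed():
    section = {"plots_original": "net_toa_flux_restom, toa_radiation", "plots_atm": ""}
    assert requested_plots(section) == ["net_toa_flux_restom", "toa_radiation"]
    check_plots_requested(section)  # should not raise
```

The test functions use plain `assert` statements, so they can be collected by a test runner or called directly.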