
Conversation

@tomvothecoder (Collaborator) commented Mar 18, 2025

Description

This PR centralizes all output files for an e3sm_to_cmip run into a single directory and improves logging to eliminate duplicate messages while adding useful metadata.

Key Changes

1. Centralized Output Directory (output_path)

📌 Note: If --output_path is not provided, a default directory is created in the current working directory with a timestamped name, e.g., e3sm_to_cmip_run_20250319_184835_367503.
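
For illustration, a minimal sketch of how such a timestamped default could be derived (hypothetical helper names, not the actual e3sm_to_cmip internals):

```python
import os
from datetime import datetime

def default_output_path(cwd: str = ".") -> str:
    # %f yields 6-digit microseconds, matching names like
    # e3sm_to_cmip_run_20250319_184835_367503.
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    path = os.path.join(cwd, f"e3sm_to_cmip_run_{timestamp}")
    os.makedirs(path, exist_ok=True)
    return path
```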

2. Improved Logging System

💡 Caveat: Warnings raised during import (e.g., esmpy version warnings) will still appear in the console but won’t be captured in log files. This trade-off enables dynamic log file paths.
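
The reason for the caveat, sketched with the standard logging module (hypothetical setup code, assuming a console StreamHandler on the root logger): the FileHandler can only be attached once --output_path is resolved, so anything warned before that point reaches the console but not the file.

```python
import logging

logging.captureWarnings(True)             # route warnings.warn() through logging
root = logging.getLogger()
root.addHandler(logging.StreamHandler())  # console output, live from import time

# Warnings emitted here (e.g., at import) hit only the StreamHandler.

def attach_file_handler(log_path: str) -> None:
    # Called later, once the output directory is known; only records logged
    # after this point are written to the log file.
    root.addHandler(logging.FileHandler(log_path))
```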

3. Replaces print_message() and print_debug() with Logger

  • Ensures all messages are properly captured in log files.
  • Removes print() statements (which don’t get logged).
  • Colors and text formatting are no longer used in log files.
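
As a sketch of what this replacement looks like (illustrative names, assuming child loggers propagate to a root logger configured elsewhere):

```python
import logging

logger = logging.getLogger(__name__)  # one child logger per module

def report_progress(var_name: str, done: int, total: int) -> None:
    # Instead of print_message()/print(): the record reaches both the console
    # and the log file, and no ANSI color codes end up in the file.
    logger.info("Finished %s, %d/%d jobs complete", var_name, done, total)
```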

Examples

Serial mode

2025-03-19 14:16:01,645 [INFO]: __main__.py(__init__:160) >> --------------------------------------
2025-03-19 14:16:01,646 [INFO]: __main__.py(__init__:161) >> | E3SM to CMIP Configuration
2025-03-19 14:16:01,647 [INFO]: __main__.py(__init__:162) >> --------------------------------------
2025-03-19 14:16:01,656 [INFO]: __main__.py(__init__:185) >>     * Timestamp: 20250319_191551_378453
2025-03-19 14:16:01,656 [INFO]: __main__.py(__init__:185) >>     * Version Info: branch feature/274-redesign-logger with commit b44e78d314741f0a948f966a78b2ad558ce193d4
2025-03-19 14:16:01,657 [INFO]: __main__.py(__init__:185) >>     * Mode: Serial
2025-03-19 14:16:01,657 [INFO]: __main__.py(__init__:185) >>     * Variable List: ['pfull', 'phalf', 'tas']
2025-03-19 14:16:01,658 [INFO]: __main__.py(__init__:185) >>     * Input Path: /lcrc/group/e3sm/e3sm_to_cmip/test-cases/atm-unified-eam/input-regridded
2025-03-19 14:16:01,658 [INFO]: __main__.py(__init__:185) >>     * Output Path: /lcrc/group/e3sm/public_html/e3sm_to_cmip/feature-274-redesign-logger-serial
2025-03-19 14:16:01,658 [INFO]: __main__.py(__init__:185) >>     * Precheck Path: None
2025-03-19 14:16:01,659 [INFO]: __main__.py(__init__:185) >>     * Log Path: /lcrc/group/e3sm/public_html/e3sm_to_cmip/feature-274-redesign-logger-serial/20250319_191551_378453.log
2025-03-19 14:16:01,659 [INFO]: __main__.py(__init__:185) >>     * CMOR Log Path: /lcrc/group/e3sm/public_html/e3sm_to_cmip/feature-274-redesign-logger-serial/cmor_logs
2025-03-19 14:16:01,660 [INFO]: __main__.py(__init__:185) >>     * Frequency: mon
2025-03-19 14:16:01,660 [INFO]: __main__.py(__init__:185) >>     * Realm: atm
2025-03-19 14:16:10,783 [INFO]: __main__.py(_get_handlers:290) >> --------------------------------------
2025-03-19 14:16:10,784 [INFO]: __main__.py(_get_handlers:291) >> | Derived CMIP6 Variable Handlers
2025-03-19 14:16:10,785 [INFO]: __main__.py(_get_handlers:292) >> --------------------------------------
2025-03-19 14:16:10,785 [INFO]: __main__.py(_get_handlers:294) >>     * 'pfull' -> ['hyai', 'hybi', 'hyam', 'hybm', 'PS']
2025-03-19 14:16:10,785 [INFO]: __main__.py(_get_handlers:294) >>     * 'phalf' -> ['hyai', 'hybi', 'hyam', 'hybm', 'PS']
2025-03-19 14:16:10,786 [INFO]: __main__.py(_get_handlers:294) >>     * 'tas' -> ['TREFHT']
2025-03-19 14:16:10,787 [INFO]: __main__.py(_run_serial:899) >> Trying to CMORize with handler: {'name': 'pfull', 'units': 'Pa', 'raw_variables': ['hyai', 'hybi', 'hyam', 'hybm', 'PS'], 'table': 'CMIP6_Amon.json', 'unit_conversion': None, 'formula': 'hyam * p0 + hybm * ps', 'formula_method': <function pfull at 0x1550a27785e0>, 'positive': None, 'levels': {'name': 'standard_hybrid_sigma', 'units': '1', 'e3sm_axis_name': 'lev', 'e3sm_axis_bnds': 'ilev', 'time_name': 'time2'}, 'output_data': None, 'method': <bound method VarHandler.cmorize of <e3sm_to_cmip.cmor_handlers.handler.VarHandler object at 0x1550a2da27b0>>}
2025-03-19 14:16:10,787 [INFO]: handler.py(cmorize:208) >> pfull: Starting CMORizing
2025-03-19 14:16:11,042 [INFO]: handler.py(_setup_cmor_module:334) >> pfull: CMOR setup complete
2025-03-19 14:16:11,043 [INFO]: handler.py(cmorize:238) >> pfull: loading E3SM variables dict_keys(['hyai', 'hybi', 'hyam', 'hybm', 'PS'])
2025-03-19 14:16:16,193 [INFO]: handler.py(cmorize:246) >> pfull: creating CMOR variable with CMOR axis objects.
2025-03-19 14:16:19,332 [INFO]: handler.py(_cmor_write_with_time:679) >> pfull: time span 54750.0 - 60225.0
2025-03-19 14:16:19,332 [INFO]: handler.py(_cmor_write_with_time:683) >> pfull: Writing variable to file...
2025-03-19 14:17:34,764 [INFO]: handler.py(_cmor_write_with_time:696) >> pfull: Writing IPS variable to file...
2025-03-19 14:17:38,075 [INFO]: __main__.py(_run_serial:924) >> Finished pfull, 1/3 jobs complete
2025-03-19 14:17:38,094 [INFO]: __main__.py(_run_serial:899) >> Trying to CMORize with handler: {'name': 'phalf', 'units': 'Pa', 'raw_variables': ['hyai', 'hybi', 'hyam', 'hybm', 'PS'], 'table': 'CMIP6_Amon.json', 'unit_conversion': None, 'formula': 'hyai * p0 + hybi * ps', 'formula_method': <function phalf at 0x1550a2779580>, 'positive': None, 'levels': {'name': 'standard_hybrid_sigma_half', 'units': '1', 'e3sm_axis_name': 'lev', 'e3sm_axis_bnds': 'ilev', 'time_name': 'time2'}, 'output_data': None, 'method': <bound method VarHandler.cmorize of <e3sm_to_cmip.cmor_handlers.handler.VarHandler object at 0x1550a2da3f70>>}
2025-03-19 14:17:38,097 [INFO]: handler.py(cmorize:208) >> phalf: Starting CMORizing
2025-03-19 14:17:38,306 [INFO]: handler.py(_setup_cmor_module:334) >> phalf: CMOR setup complete
2025-03-19 14:17:38,307 [INFO]: handler.py(cmorize:238) >> phalf: loading E3SM variables dict_keys(['hyai', 'hybi', 'hyam', 'hybm', 'PS'])
2025-03-19 14:17:41,981 [INFO]: handler.py(cmorize:246) >> phalf: creating CMOR variable with CMOR axis objects.
2025-03-19 14:17:45,384 [INFO]: handler.py(_cmor_write_with_time:679) >> phalf: time span 54750.0 - 60225.0
2025-03-19 14:17:45,385 [INFO]: handler.py(_cmor_write_with_time:683) >> phalf: Writing variable to file...
2025-03-19 14:17:45,400 [WARNING]: warnings.py(_showwarnmsg:110) >> /home/ac.tvo/miniforge3/envs/e3sm_diags_dev_274/lib/python3.13/site-packages/cmor/pywrapper.py:759: UserWarning: Error: your data shape ((180, 73, 180, 360)) does not match the expected variable shape ([0, 72, 180, 360])
Check your variable dimensions before caling cmor_write
  warnings.warn(msg)

2025-03-19 14:19:00,403 [INFO]: handler.py(_cmor_write_with_time:696) >> phalf: Writing IPS variable to file...
2025-03-19 14:19:03,765 [INFO]: __main__.py(_run_serial:924) >> Finished phalf, 2/3 jobs complete
2025-03-19 14:19:03,784 [INFO]: __main__.py(_run_serial:899) >> Trying to CMORize with handler: {'name': 'tas', 'units': 'K', 'raw_variables': ['TREFHT'], 'table': 'CMIP6_Amon.json', 'unit_conversion': None, 'formula': None, 'positive': None, 'levels': None, 'output_data': None, 'method': <bound method VarHandler.cmorize of <e3sm_to_cmip.cmor_handlers.handler.VarHandler object at 0x1550a3005f90>>}
2025-03-19 14:19:03,785 [INFO]: handler.py(cmorize:208) >> tas: Starting CMORizing
2025-03-19 14:19:03,979 [INFO]: handler.py(_setup_cmor_module:334) >> tas: CMOR setup complete
2025-03-19 14:19:03,980 [INFO]: handler.py(cmorize:238) >> tas: loading E3SM variables dict_keys(['TREFHT'])
2025-03-19 14:19:04,064 [INFO]: handler.py(cmorize:246) >> tas: creating CMOR variable with CMOR axis objects.
2025-03-19 14:19:04,593 [INFO]: handler.py(_cmor_write_with_time:679) >> tas: time span 54750.0 - 60225.0
2025-03-19 14:19:04,594 [INFO]: handler.py(_cmor_write_with_time:683) >> tas: Writing variable to file...
2025-03-19 14:19:06,007 [INFO]: __main__.py(_run_serial:924) >> Finished tas, 3/3 jobs complete
2025-03-19 14:19:06,008 [INFO]: __main__.py(_run_serial:939) >> 3 of 3 handlers complete

Parallel mode

2025-03-19 15:36:15,890 [INFO]: __main__.py(__init__:160) >> --------------------------------------
2025-03-19 15:36:15,891 [INFO]: __main__.py(__init__:161) >> | E3SM to CMIP Configuration
2025-03-19 15:36:15,892 [INFO]: __main__.py(__init__:162) >> --------------------------------------
2025-03-19 15:36:15,909 [INFO]: __main__.py(__init__:185) >>     * Timestamp: 20250319_203602_803037
2025-03-19 15:36:15,910 [INFO]: __main__.py(__init__:185) >>     * Version Info: branch feature/274-redesign-logger with commit b44e78d314741f0a948f966a78b2ad558ce193d4
2025-03-19 15:36:15,911 [INFO]: __main__.py(__init__:185) >>     * Mode: Parallel
2025-03-19 15:36:15,911 [INFO]: __main__.py(__init__:185) >>     * Variable List: ['pfull', 'phalf', 'tas']
2025-03-19 15:36:15,911 [INFO]: __main__.py(__init__:185) >>     * Input Path: /lcrc/group/e3sm/e3sm_to_cmip/test-cases/atm-unified-eam/input-regridded
2025-03-19 15:36:15,911 [INFO]: __main__.py(__init__:185) >>     * Output Path: /lcrc/group/e3sm/public_html/e3sm_to_cmip/feature-274-redesign-logger-parallel
2025-03-19 15:36:15,912 [INFO]: __main__.py(__init__:185) >>     * Precheck Path: None
2025-03-19 15:36:15,912 [INFO]: __main__.py(__init__:185) >>     * Log Path: /lcrc/group/e3sm/public_html/e3sm_to_cmip/feature-274-redesign-logger-parallel/20250319_203602_803037.log
2025-03-19 15:36:15,912 [INFO]: __main__.py(__init__:185) >>     * CMOR Log Path: /lcrc/group/e3sm/public_html/e3sm_to_cmip/feature-274-redesign-logger-parallel/cmor_logs
2025-03-19 15:36:15,913 [INFO]: __main__.py(__init__:185) >>     * Frequency: mon
2025-03-19 15:36:15,913 [INFO]: __main__.py(__init__:185) >>     * Realm: atm
2025-03-19 15:36:24,746 [INFO]: __main__.py(_get_handlers:290) >> --------------------------------------
2025-03-19 15:36:24,746 [INFO]: __main__.py(_get_handlers:291) >> | Derived CMIP6 Variable Handlers
2025-03-19 15:36:24,747 [INFO]: __main__.py(_get_handlers:292) >> --------------------------------------
2025-03-19 15:36:24,748 [INFO]: __main__.py(_get_handlers:294) >>     * 'pfull' -> ['hyai', 'hybi', 'hyam', 'hybm', 'PS']
2025-03-19 15:36:24,748 [INFO]: __main__.py(_get_handlers:294) >>     * 'phalf' -> ['hyai', 'hybi', 'hyam', 'hybm', 'PS']
2025-03-19 15:36:24,749 [INFO]: __main__.py(_get_handlers:294) >>     * 'tas' -> ['TREFHT']
2025-03-19 15:36:24,846 [INFO]: handler.py(cmorize:208) >> pfull: Starting CMORizing
2025-03-19 15:36:24,847 [INFO]: handler.py(cmorize:208) >> phalf: Starting CMORizing
2025-03-19 15:36:24,848 [INFO]: handler.py(cmorize:208) >> tas: Starting CMORizing
2025-03-19 15:36:25,163 [INFO]: handler.py(_setup_cmor_module:334) >> phalf: CMOR setup complete
2025-03-19 15:36:25,164 [INFO]: handler.py(_setup_cmor_module:334) >> pfull: CMOR setup complete
2025-03-19 15:36:25,163 [INFO]: handler.py(_setup_cmor_module:334) >> tas: CMOR setup complete
2025-03-19 15:36:25,166 [INFO]: handler.py(cmorize:238) >> pfull: loading E3SM variables dict_keys(['hyai', 'hybi', 'hyam', 'hybm', 'PS'])
2025-03-19 15:36:25,166 [INFO]: handler.py(cmorize:238) >> tas: loading E3SM variables dict_keys(['TREFHT'])
2025-03-19 15:36:25,166 [INFO]: handler.py(cmorize:238) >> phalf: loading E3SM variables dict_keys(['hyai', 'hybi', 'hyam', 'hybm', 'PS'])
2025-03-19 15:36:25,210 [INFO]: handler.py(cmorize:246) >> tas: creating CMOR variable with CMOR axis objects.
2025-03-19 15:36:28,160 [INFO]: handler.py(_cmor_write_with_time:679) >> tas: time span 54750.0 - 60225.0
2025-03-19 15:36:28,161 [INFO]: handler.py(_cmor_write_with_time:683) >> tas: Writing variable to file...
2025-03-19 15:36:31,075 [INFO]: handler.py(cmorize:246) >> pfull: creating CMOR variable with CMOR axis objects.
2025-03-19 15:36:31,076 [INFO]: handler.py(cmorize:246) >> phalf: creating CMOR variable with CMOR axis objects.
2025-03-19 15:36:34,779 [INFO]: handler.py(_cmor_write_with_time:679) >> phalf: time span 54750.0 - 60225.0
2025-03-19 15:36:34,781 [INFO]: handler.py(_cmor_write_with_time:683) >> phalf: Writing variable to file...
2025-03-19 15:36:34,787 [INFO]: handler.py(_cmor_write_with_time:679) >> pfull: time span 54750.0 - 60225.0
2025-03-19 15:36:34,789 [INFO]: handler.py(_cmor_write_with_time:683) >> pfull: Writing variable to file...
2025-03-19 15:36:34,792 [WARNING]: warnings.py(_showwarnmsg:110) >> /home/ac.tvo/miniforge3/envs/e3sm_diags_dev_274/lib/python3.13/site-packages/cmor/pywrapper.py:759: UserWarning: Error: your data shape ((180, 73, 180, 360)) does not match the expected variable shape ([0, 72, 180, 360])
Check your variable dimensions before caling cmor_write
  warnings.warn(msg)

2025-03-19 15:37:49,262 [INFO]: handler.py(_cmor_write_with_time:696) >> phalf: Writing IPS variable to file...
2025-03-19 15:37:49,940 [INFO]: handler.py(_cmor_write_with_time:696) >> pfull: Writing IPS variable to file...
2025-03-19 15:37:53,328 [INFO]: __main__.py(_run_parallel:1024) >> Finished pfull, 1/3 jobs complete
2025-03-19 15:37:53,330 [INFO]: __main__.py(_run_parallel:1024) >> Finished phalf, 2/3 jobs complete
2025-03-19 15:37:53,330 [INFO]: __main__.py(_run_parallel:1024) >> Finished tas, 3/3 jobs complete
2025-03-19 15:37:53,456 [INFO]: __main__.py(_run_parallel:1033) >> 3 of 3 handlers complete

Info mode

2025-03-19 15:48:58,430 [INFO]: __main__.py(__init__:160) >> --------------------------------------
2025-03-19 15:48:58,431 [INFO]: __main__.py(__init__:161) >> | E3SM to CMIP Configuration
2025-03-19 15:48:58,432 [INFO]: __main__.py(__init__:162) >> --------------------------------------
2025-03-19 15:48:58,450 [INFO]: __main__.py(__init__:185) >>     * Timestamp: 20250319_204848_345105
2025-03-19 15:48:58,450 [INFO]: __main__.py(__init__:185) >>     * Version Info: branch feature/274-redesign-logger with commit b44e78d314741f0a948f966a78b2ad558ce193d4
2025-03-19 15:48:58,451 [INFO]: __main__.py(__init__:185) >>     * Mode: Info
2025-03-19 15:48:58,452 [INFO]: __main__.py(__init__:185) >>     * Variable List: ['pfull', 'phalf', 'tas']
2025-03-19 15:48:58,452 [INFO]: __main__.py(__init__:185) >>     * Input Path: /lcrc/group/e3sm/e3sm_to_cmip/test-cases/atm-unified-eam/input-regridded
2025-03-19 15:48:58,452 [INFO]: __main__.py(__init__:185) >>     * Output Path: /lcrc/group/e3sm/public_html/e3sm_to_cmip/feature-274-redesign-logger-info
2025-03-19 15:48:58,453 [INFO]: __main__.py(__init__:185) >>     * Precheck Path: None
2025-03-19 15:48:58,453 [INFO]: __main__.py(__init__:185) >>     * Log Path: /lcrc/group/e3sm/public_html/e3sm_to_cmip/feature-274-redesign-logger-info/20250319_204848_345105.log
2025-03-19 15:48:58,453 [INFO]: __main__.py(__init__:185) >>     * CMOR Log Path: /lcrc/group/e3sm/public_html/e3sm_to_cmip/feature-274-redesign-logger-info/cmor_logs
2025-03-19 15:48:58,454 [INFO]: __main__.py(__init__:185) >>     * Frequency: mon
2025-03-19 15:48:58,454 [INFO]: __main__.py(__init__:185) >>     * Realm: atm
2025-03-19 15:48:58,852 [ERROR]: __main__.py(_run_info_mode:785) >> Variable pfull is not present in the input dataset
2025-03-19 15:48:58,853 [ERROR]: __main__.py(_run_info_mode:785) >> Variable phalf is not present in the input dataset
2025-03-19 15:48:58,854 [ERROR]: __main__.py(_run_info_mode:785) >> Variable tas is not present in the input dataset
2025-03-19 15:48:58,861 [ERROR]: __main__.py(_run_info_mode:785) >> Variable tas is not present in the input dataset
2025-03-19 15:48:58,877 [WARNING]: warnings.py(_showwarnmsg:110) >> /home/ac.tvo/miniforge3/envs/e3sm_diags_dev_274/lib/python3.13/site-packages/IPython/core/interactiveshell.py:3557: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)

TODO

  • Instantiate root logger once
    • Should fix duplicate log messages
  • Store log file under output_path
  • Store CMOR log dir under output_path
    • Fix duplicate log file being created -- just consolidate them after the output_path directory is created
  • Fix capturing of warnings -- only affects the esmpy warning initially; it gets written to the first log file, so we just move that log file over to output_path and append new messages to it
  • Fix bug with info_mode when output_path is not specified as a YAML file.
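
A sketch of the log-file consolidation idea from the TODO above (hypothetical helper, assuming a single FileHandler on the root logger): close the initial log file once output_path exists, move it over, and reopen it in append mode so early messages such as the esmpy warning are kept.

```python
import logging
import shutil

def relocate_log_file(old_path: str, new_path: str) -> None:
    root = logging.getLogger()
    for handler in list(root.handlers):
        if isinstance(handler, logging.FileHandler):
            handler.close()
            root.removeHandler(handler)
    shutil.move(old_path, new_path)                           # preserve early messages
    root.addHandler(logging.FileHandler(new_path, mode="a"))  # append new ones
```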

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have added tests that prove my fix is effective or that my feature works
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@tomvothecoder tomvothecoder changed the title from "Revamp logger module for additional features and bug fixes" to "Centralize results directories and files and revamp logging" on Mar 18, 2025
- Rename `_setup_logger()` to `_setup_child_logger()`
- Replace `_update_root_logger_filepath()` with `_add_filehandler()`
- Encapsulating the root logger means no duplicate StreamHandler is added upon import, which prevents duplicate messages from appearing
- Colors from `print_message()` cannot be captured in log files
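
The duplicate-message bug these notes describe comes from handler accumulation; a minimal sketch of the guard (illustrative, not the exact e3sm_to_cmip code):

```python
import logging

def get_root_logger() -> logging.Logger:
    root = logging.getLogger()
    # If every imported module added its own StreamHandler, the root logger
    # would hold N handlers and emit each record N times. Adding the console
    # handler exactly once prevents that.
    if not root.handlers:
        root.addHandler(logging.StreamHandler())
    return root
```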
@tomvothecoder tomvothecoder marked this pull request as ready for review March 19, 2025 20:54
@tomvothecoder (Collaborator, Author) left a comment

Hey @TonyB9000 and @chengzhuzhang, I decided to tackle e3sm_to_cmip logging because it's been on the back burner for a while and I finally had some time.

Please refer to the PR description for more information.

Comment on lines 509 to +516

 optional.add_argument(
     "--logdir",
     type=str,
-    default="./cmor_logs",
-    help="Where to put the logging output from CMOR.",
+    default="cmor_logs",
+    help=(
+        "The sub-directory that stores the CMOR logs. This sub-directory will "
+        "be stored under --output-path."
+    ),
@tomvothecoder (Collaborator, Author):

Should we deprecate the --logdir parameter? It seems unnecessary for the user to specify the name of the directory since it should be stored under --output-path.

@TonyB9000 (Contributor) commented Mar 19, 2025:

@tomvothecoder Honestly, I am uncertain (at the moment) what is represented by "the output path". When I run e2c as a function of a supporting application (dsm_generate_CMIP), I force the creation of (CWD)/tmp/<case_id>/[subdirs], and these are:

    caselogs/       [timestamped logs for each E2C invocation]
    metadata/
    native_data/    [symlinks to native source files]
    native_out/     [produced by pre-E2C NCO stuff]
    product/        [FINAL Cmorized output from E2C]
    rgr/            [produced by pre-E2C NCO stuff]
    rgr_fixed_vars/ [produced by pre-E2C NCO stuff]
    rgr_vert/       [produced by pre-E2C NCO stuff]
    scripts/        [dataset_specific call scripts to E2C]

I believe I set "--output-path" to "product/", with the intent of moving these to STAGING_DATA (the warehouse, pre-publication). The logs of my calling scripts (in "scripts/") are directed to "caselogs", but I think the cmor-logs may be directed to something like (CWD)/tmp/cmor_logs/. These logs are only named by timestamp (I believe), and not by any useful "job-name" or ID, and their - um - "colorful" and flashy format (remember the early HTML <BLINK>text</BLINK> that probably induced seizures in some) cannot be usefully combined with other logs, until we can negotiate with the "cmor_setup()" devs to reformat that stuff.

Comment on lines 661 to 666
# NOTE: Any warnings that appear before the log filehandler is
# instantiated will not be captured (e.g., esmpy VersionWarning).
# However, they will still be captured by the console via a
# StreamHandler.
self.log_path = os.path.join(self.output_path, LOG_FILENAME) # type: ignore
_add_filehandler(self.log_path)
@tomvothecoder (Collaborator, Author):
Just an FYI, trade-off mentioned in the PR description.

For example:

- Some contain legacy `handle_simple()` functions that have since been refactored into a single `handle_simple()` function
- `phalf.py` and `pfull.py` still use `cdms2` and `cdutil`
@tomvothecoder (Collaborator, Author):
These variables have been refactored.

Comment on lines 52 to +53
def print_debug(e):
# TODO: Deprecate this function. We use Python logger now.
@tomvothecoder (Collaborator, Author):
FYI

# TODO: Deprecate this function. We use Python logger now. Colors can't
# be captured in log files.
@tomvothecoder (Collaborator, Author):
FYI deprecate print_message()

@tomvothecoder tomvothecoder added this to the FY25 Development milestone Mar 19, 2025
@tomvothecoder (Collaborator, Author) commented Apr 14, 2025

Hey @TonyB9000, I addressed your comment about extending the timestamp to microsecond precision (6 digits). This PR is now ready for your review.

Here is an example:

2025-04-14 16:32:39.229310 [INFO]: __main__.py(__init__:160) >> --------------------------------------
2025-04-14 16:32:39.230547 [INFO]: __main__.py(__init__:161) >> | E3SM to CMIP Configuration
2025-04-14 16:32:39.231215 [INFO]: __main__.py(__init__:162) >> --------------------------------------
2025-04-14 16:32:39.243060 [INFO]: __main__.py(__init__:185) >>     * Timestamp: 20250414_212755_699215
2025-04-14 16:32:39.243937 [INFO]: __main__.py(__init__:185) >>     * Version Info: branch feature/274-redesign-logger with commit 477a3cbfe859ab5044093c9161c9895813bd6fb8
2025-04-14 16:32:39.244475 [INFO]: __main__.py(__init__:185) >>     * Mode: Serial
2025-04-14 16:32:39.244945 [INFO]: __main__.py(__init__:185) >>     * Variable List: ['pfull', 'phalf', 'tas']
2025-04-14 16:32:39.245412 [INFO]: __main__.py(__init__:185) >>     * Input Path: /lcrc/group/e3sm/e3sm_to_cmip/test-cases/atm-unified-eam/input-regridded
2025-04-14 16:32:39.245861 [INFO]: __main__.py(__init__:185) >>     * Output Path: /lcrc/group/e3sm/public_html/e3sm_to_cmip/feature-274-redesign-logger-info-25-04-14
2025-04-14 16:32:39.246311 [INFO]: __main__.py(__init__:185) >>     * Precheck Path: None
2025-04-14 16:32:39.246695 [INFO]: __main__.py(__init__:185) >>     * Log Path: /lcrc/group/e3sm/public_html/e3sm_to_cmip/feature-274-redesign-logger-info-25-04-14/20250414_212755_699215.log
2025-04-14 16:32:39.247211 [INFO]: __main__.py(__init__:185) >>     * CMOR Log Path: /lcrc/group/e3sm/public_html/e3sm_to_cmip/feature-274-redesign-logger-info-25-04-14/cmor_logs
2025-04-14 16:32:39.264351 [INFO]: __main__.py(__init__:185) >>     * Frequency: mon
2025-04-14 16:32:39.265105 [INFO]: __main__.py(__init__:185) >>     * Realm: atm
2025-04-14 16:32:47.079968 [INFO]: __main__.py(_get_handlers:290) >> --------------------------------------
2025-04-14 16:32:47.081259 [INFO]: __main__.py(_get_handlers:291) >> | Derived CMIP6 Variable Handlers
2025-04-14 16:32:47.081790 [INFO]: __main__.py(_get_handlers:292) >> --------------------------------------
2025-04-14 16:32:47.082319 [INFO]: __main__.py(_get_handlers:294) >>     * 'pfull' -> ['hyai', 'hybi', 'hyam', 'hybm', 'PS']
2025-04-14 16:32:47.082807 [INFO]: __main__.py(_get_handlers:294) >>     * 'phalf' -> ['hyai', 'hybi', 'hyam', 'hybm', 'PS']
2025-04-14 16:32:47.083299 [INFO]: __main__.py(_get_handlers:294) >>     * 'tas' -> ['TREFHT']
2025-04-14 16:32:47.084244 [INFO]: __main__.py(_run_serial:899) >> Trying to CMORize with handler: {'name': 'pfull', 'units': 'Pa', 'raw_variables': ['hyai', 'hybi', 'hyam', 'hybm', 'PS'], 'table': 'CMIP6_Amon.json', 'unit_conversion': None, 'formula': 'hyam * p0 + hybm * ps', 'formula_method': <function pfull at 0x1550a29a2f20>, 'positive': None, 'levels': {'name': 'standard_hybrid_sigma', 'units': '1', 'e3sm_axis_name': 'lev', 'e3sm_axis_bnds': 'ilev', 'time_name': 'time2'}, 'output_data': None, 'method': <bound method VarHandler.cmorize of <e3sm_to_cmip.cmor_handlers.handler.VarHandler object at 0x1550a24e4ff0>>}
2025-04-14 16:32:47.084742 [INFO]: handler.py(cmorize:208) >> pfull: Starting CMORizing
2025-04-14 16:32:47.222432 [INFO]: handler.py(_setup_cmor_module:334) >> pfull: CMOR setup complete
2025-04-14 16:32:47.224274 [INFO]: handler.py(cmorize:238) >> pfull: loading E3SM variables dict_keys(['hyai', 'hybi', 'hyam', 'hybm', 'PS'])

My final TODOs are:

  • Delete the /tmp folder at the end of each run because it is empty anyway (fixed by 0b69b1a)
  • Make sure log files are distinct between runs (fixed by 0b69b1a)

@tomvothecoder tomvothecoder self-assigned this Apr 14, 2025
@tomvothecoder tomvothecoder requested a review from TonyB9000 April 14, 2025 21:33
- Add `_cleanup_temp_dir()` to remove temp dir at the end of a successful run
@TonyB9000 (Contributor):
@tomvothecoder I like the apparent results. I would like to run it using my e2c driver (dsm_generate_CMIP6.py) that auto-configures all runs. I don't expect any real issues. I will create a test environment, install this branch, and then run the same test-jobs I have run earlier, and see how it goes (what goes where, etc). I should have significant results tomorrow afternoon.

@TonyB9000 (Contributor) commented Apr 15, 2025

@tomvothecoder I am getting an error with the "--info" mode. My command line is:

 ['e3sm_to_cmip', '--info', '-v', 'hfsifrazil', '--freq', 'mon', '--realm', 'mpaso', '-t', '/lcrc/group/e3sm2/DSM/Staging/Resource/cmor/cmip6-cmor-tables/Tables', '--map', 'no_map', '--info-out', '/lcrc/group/e3sm2/DSM/Ops/DSM_Manager/tmp/info_yaml/Omon_hfsifrazil.yaml']

The error thrown is:

  File "/home/ac.bartoletti1/anaconda3/envs/dsm_test_e2c/lib/python3.12/site-packages/e3sm_to_cmip/util.py", line 397, in copy_user_metadata
    fin = open(input_path, "r")
          ^^^^^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not NoneType

Previously, I believe no "input path" was required for "--info" mode, which only needs the variable and the cmor-tables.

@tomvothecoder (Collaborator, Author):
> I am getting an error with the "--info" mode... (quoting the full error report from the previous comment)

Thanks for testing and reporting this issue. I'll take a look.

@TonyB9000 (Contributor):

@tomvothecoder Just a note: I recall this problem was one of a few that occur because e2c tries to post-process the args before accepting them all, without first realizing it need not use all the "required" args. I believe all supplied args should be read/recorded before processing any one of them.

- Make timestamp unique per run by moving timestamp initialization to `app.__init__`
@tomvothecoder (Collaborator, Author):

> Just a note: I recall this problem was one of a few that occur because e2c tries to post-process the args... (quoted from the previous comment)

@TonyB9000 I found the root cause of this issue: I removed the conditional if not self.info_mode in the code block shown below.

# ======================================================================
if not self.info_mode:
    self._setup_dirs_with_paths()

This causes copy_user_metadata() to run, which requires self.user_metadata. However, self.user_metadata is not a required arg with info mode.

# Copy the user's metadata json file with the updated output directory
if not self.simple_mode:
    copy_user_metadata(self.user_metadata, self.output_path)

This results in:

  File "/home/ac.bartoletti1/anaconda3/envs/dsm_test_e2c/lib/python3.12/site-packages/e3sm_to_cmip/util.py", line 397, in copy_user_metadata
    fin = open(input_path, "r")
          ^^^^^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not NoneType
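
For illustration only (the actual change in a4b9f45 may differ), the traceback suggests a guard along these lines:

```python
# Restore the mode check and be defensive about the optional argument,
# since --user-metadata is not required in info mode.
if not self.info_mode and not self.simple_mode and self.user_metadata is not None:
    copy_user_metadata(self.user_metadata, self.output_path)
```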

This issue should now be fixed with a4b9f45 (#289).

Please test again whenever you can.

@TonyB9000 (Contributor):

@tomvothecoder OK! It appears to be running fine now. My derived (auto-produced) script launched 17 separate e2c jobs for the variable "hfsifrazil" (one for each decade in 1850 ... 2010), and after the first minute, 9 were RUNNING and 8 were PENDING. After 15 minutes, 10 were COMPLETED and 7 were RUNNING. The mean elapsed time of the 10 completed jobs was 866 seconds, and the remaining 7 had each accumulated about 360 seconds (the job status loop checks every 300 seconds). If any job is still RUNNING after about 5*866 seconds, it should be killed (via scancel) automatically - I have yet to test that...

Overall, I think this (e2c) is clearly running very well! SUCCESS!

@TonyB9000 (Contributor):

@tomvothecoder A late-developing issue. Every previous time I've run E2C (as 17 parallel jobs), the outputs would accumulate in the output directory "product/CMIP6/CMIP/ etc etc /hfsifrazil/gr/v202504xx" and be collected upon completion, but not this time:

WH_PATH: /lcrc/group/e3sm2/DSM/Staging/Data/CMIP6/CMIP/E3SM-Project/E3SM-2-0-NARRM/historical/r2i1p1f1/Omon/hfsifrazil/gr
      v20250404:  15 files
      v20250409:  16 files
      v20250410:  16 files
      v20250415:  4 files
PB_PATH: /lcrc/group/e3sm2/DSM/Publication/css03_data/CMIP6/CMIP/E3SM-Project/E3SM-2-0-NARRM/historical/r2i1p1f1/Omon/hfsifrazil/gr
      (empty)
(dsm_test_e2c) [ac.bartoletti1@chrlogin2 DSM_Manager]$ ll /lcrc/group/e3sm2/DSM/Staging/Data/CMIP6/CMIP/E3SM-Project/E3SM-2-0-NARRM/historical/r2i1p1f1/Omon/hfsifrazil/gr/v20250415
total 210576
-rw-rw-r--+ 1 ac.bartoletti1 E3SM2 60450707 Apr 15 16:15 hfsifrazil_Omon_E3SM-2-0-NARRM_historical_r2i1p1f1_gr_186001-186912.nc
-rw-rw-r--+ 1 ac.bartoletti1 E3SM2 59975748 Apr 15 16:15 hfsifrazil_Omon_E3SM-2-0-NARRM_historical_r2i1p1f1_gr_187001-187912.nc
-rw-rw-r--+ 1 ac.bartoletti1 E3SM2 60601747 Apr 15 16:15 hfsifrazil_Omon_E3SM-2-0-NARRM_historical_r2i1p1f1_gr_190001-190912.nc
-rw-rw-r--+ 1 ac.bartoletti1 E3SM2 34580480 Apr 15 16:23 hfsifrazil_Omon_E3SM-2-0-NARRM_historical_r2i1p1f1_gr_201001-201511.nc

Not sure why this changed.

@tomvothecoder (Collaborator, Author):

> A late-developing issue. Every previous time I've run E2C... (quoting the WH_PATH/PB_PATH listings from the previous comment)

Okay, from what I understand, this is the issue you mentioned in our meeting today, with only 4 files being generated in the CMIP6 directory.

Can you provide the exact e3sm_to_cmip command you ran for this task?

@TonyB9000 (Contributor):

@tomvothecoder I'll test forthwith! Should have results in 1 hour.

@TonyB9000 (Contributor) commented Apr 23, 2025

@tomvothecoder SUCCESS! All jobs completed successfully, all wrote an output file (except 1920, the "bad data" decade). Took just over 30 minutes. Great job tracking that down!

(Also - no srun job "hung", which may be related, or just coincidental.)

@tomvothecoder (Collaborator, Author):

> SUCCESS! All jobs completed successfully... (quoted from the previous comment)

Excellent! I just pushed commit bcfa40a (#289) that implements the alternative solution of making each tmp dir timestamp unique, which avoids jobs deleting a shared tmp dir.

Can you pull again and test one more time? Thanks!


 if temp_path is None:
-    temp_path = f"{self.output_path}/tmp"
+    temp_path = f"{self.output_path}/tmp_{self.timestamp}"
@tomvothecoder (Collaborator, Author):

Making the tmp dir timestamp unique per e3sm_to_cmip invocation should prevent the issue where parallel e3sm_to_cmip jobs might delete the same shared tmp folder (resulting in silent CMORizing issues).
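
A collision-proof alternative worth noting (a sketch, not what the PR does): let the OS generate the unique name, so even two jobs launched in the same microsecond cannot share a tmp dir.

```python
import tempfile

def make_temp_dir(output_path: str) -> str:
    # mkdtemp creates the directory atomically with a unique random suffix,
    # unlike a hand-built f"{output_path}/tmp_{timestamp}" path.
    return tempfile.mkdtemp(prefix="tmp_", dir=output_path)
```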

@TonyB9000 (Contributor) commented Apr 23, 2025

@tomvothecoder Test is underway.

P.S. I assume the random tmp dir is deleted at closure. It might be wise to condition that on "--debug = False". Also, the full path to the random tmp dir should be added to the table of parameters near the top of the e2c log, if not already.

This makes me realize - when I specify a custom $TMPDIR in my .bashrc, I rarely ever look to clear it out. When disk space and inodes get scarce, that is something to address (especially when running 17 jobs in place of 1). A sketch of the condition suggested above follows.
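
A minimal sketch of that condition (hypothetical function and flag names, not the actual e3sm_to_cmip code):

```python
import shutil

def cleanup_temp_dir(temp_path: str, debug: bool) -> None:
    # Keep tmp artifacts around for post-mortem inspection when --debug is set.
    if not debug:
        shutil.rmtree(temp_path, ignore_errors=True)
```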

@TonyB9000 (Contributor):

@tomvothecoder Uh-Oh. We have a problem. The failed outputs have returned.

==== 20250423_230402 ============
     JOBID PARTITION                     NAME         USER    STATE       TIME   TIME_LIMIT  NODES NODELIST(REASON)
    733442   compute  e2c_hfsifrazil_seg-1960 ac.bartolett COMPLETI      14:58   2-00:00:00      1 chr-0137
    733441   compute  e2c_hfsifrazil_seg-1950 ac.bartolett COMPLETI      15:51   2-00:00:00      1 chr-0136
    733437   compute  e2c_hfsifrazil_seg-1910 ac.bartolett COMPLETI      16:06   2-00:00:00      1 chr-0132
    733438   compute  e2c_hfsifrazil_seg-1920 ac.bartolett COMPLETI      16:05   2-00:00:00      1 chr-0133
    733439   compute  e2c_hfsifrazil_seg-1930 ac.bartolett COMPLETI      16:01   2-00:00:00      1 chr-0134
    733433   compute  e2c_hfsifrazil_seg-1880 ac.bartolett COMPLETI      16:33   2-00:00:00      1 chr-0126
    733434   compute  e2c_hfsifrazil_seg-1890 ac.bartolett COMPLETI      16:14   2-00:00:00      1 chr-0128
    733429   compute  e2c_hfsifrazil_seg-1850 ac.bartolett COMPLETI      16:44   2-00:00:00      1 chr-0068
    733430   compute  e2c_hfsifrazil_seg-1860 ac.bartolett COMPLETI      16:46   2-00:00:00      1 chr-0069
    733445   compute  e2c_hfsifrazil_seg-1990 ac.bartolett  PENDING       0:00   2-00:00:00      1 (Priority)
    733446   compute  e2c_hfsifrazil_seg-2000 ac.bartolett  PENDING       0:00   2-00:00:00      1 (Priority)
    733447   compute  e2c_hfsifrazil_seg-2010 ac.bartolett  PENDING       0:00   2-00:00:00      1 (Priority)
    733443   compute  e2c_hfsifrazil_seg-1970 ac.bartolett  RUNNING      15:21   2-00:00:00      1 chr-0138
    733444   compute  e2c_hfsifrazil_seg-1980 ac.bartolett  RUNNING      15:21   2-00:00:00      1 chr-0139
    733440   compute  e2c_hfsifrazil_seg-1940 ac.bartolett  RUNNING      15:51   2-00:00:00      1 chr-0135
    733436   compute  e2c_hfsifrazil_seg-1900 ac.bartolett  RUNNING      16:51   2-00:00:00      1 chr-0129
    733431   compute  e2c_hfsifrazil_seg-1870 ac.bartolett  RUNNING      17:21   2-00:00:00      1 chr-0080

ls -l tmp/v2.NARRM.historical_0151/product/CMIP6/CMIP/E3SM-Project/E3SM-2-0-NARRM/historical/r2i1p1f1/Omon/hfsifrazil/gr/v20250423
total 176848
-rw-rw-r--+ 1 ac.bartoletti1 E3SM2 60450707 Apr 23 18:03 hfsifrazil_Omon_E3SM-2-0-NARRM_historical_r2i1p1f1_gr_186001-186912.nc
-rw-rw-r--+ 1 ac.bartoletti1 E3SM2 60331759 Apr 23 18:03 hfsifrazil_Omon_E3SM-2-0-NARRM_historical_r2i1p1f1_gr_188001-188912.nc
-rw-rw-r--+ 1 ac.bartoletti1 E3SM2 60283169 Apr 23 18:03 hfsifrazil_Omon_E3SM-2-0-NARRM_historical_r2i1p1f1_gr_189001-189912.nc

==== 20250423_230902 ============
     JOBID PARTITION                     NAME         USER    STATE       TIME   TIME_LIMIT  NODES NODELIST(REASON)
    733445   compute  e2c_hfsifrazil_seg-1990 ac.bartolett  RUNNING       4:21   2-00:00:00      1 chr-0068
    733446   compute  e2c_hfsifrazil_seg-2000 ac.bartolett  RUNNING       4:21   2-00:00:00      1 chr-0069
    733447   compute  e2c_hfsifrazil_seg-2010 ac.bartolett  RUNNING       4:51   2-00:00:00      1 chr-0053
    733443   compute  e2c_hfsifrazil_seg-1970 ac.bartolett  RUNNING      20:21   2-00:00:00      1 chr-0138
    733440   compute  e2c_hfsifrazil_seg-1940 ac.bartolett  RUNNING      20:51   2-00:00:00      1 chr-0135
    733436   compute  e2c_hfsifrazil_seg-1900 ac.bartolett  RUNNING      21:51   2-00:00:00      1 chr-0129

ls -l tmp/v2.NARRM.historical_0151/product/CMIP6/CMIP/E3SM-Project/E3SM-2-0-NARRM/historical/r2i1p1f1/Omon/hfsifrazil/gr/v20250423
total 176848
-rw-rw-r--+ 1 ac.bartoletti1 E3SM2 60450707 Apr 23 18:03 hfsifrazil_Omon_E3SM-2-0-NARRM_historical_r2i1p1f1_gr_186001-186912.nc
-rw-rw-r--+ 1 ac.bartoletti1 E3SM2 60331759 Apr 23 18:03 hfsifrazil_Omon_E3SM-2-0-NARRM_historical_r2i1p1f1_gr_188001-188912.nc
-rw-rw-r--+ 1 ac.bartoletti1 E3SM2 60283169 Apr 23 18:03 hfsifrazil_Omon_E3SM-2-0-NARRM_historical_r2i1p1f1_gr_189001-189912.nc

@TonyB9000 (Contributor):

@tomvothecoder Is there a chance that cmor_write() returns before the write is complete, and wiping the tmp dir can collide with it? We may need to implement a "write_confirmed" (i.e., wait on write complete) before deleting tmp?
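
One hypothetical shape for such a check (illustrative only, not part of e3sm_to_cmip): poll the expected output files until they all exist and their sizes stop changing between polls, then allow the tmp dir to be removed.

```python
import os
import time

def writes_confirmed(paths: list[str], interval: float = 5.0, tries: int = 60) -> bool:
    previous: dict[str, int] = {}
    for _ in range(tries):
        sizes = {p: os.path.getsize(p) for p in paths if os.path.exists(p)}
        if len(sizes) == len(paths) and sizes == previous:
            return True  # every file present, no size change since the last poll
        previous = sizes
        time.sleep(interval)
    return False
```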

@TonyB9000 (Contributor) commented Apr 23, 2025

@tomvothecoder And on top of this, 3 of the node-jobs appear to be hung...

==== 20250423_232903 ============
     JOBID PARTITION                     NAME         USER    STATE       TIME   TIME_LIMIT  NODES NODELIST(REASON)
    733443   compute  e2c_hfsifrazil_seg-1970 ac.bartolett  RUNNING      40:22   2-00:00:00      1 chr-0138
    733440   compute  e2c_hfsifrazil_seg-1940 ac.bartolett  RUNNING      40:52   2-00:00:00      1 chr-0135
    733436   compute  e2c_hfsifrazil_seg-1900 ac.bartolett  RUNNING      41:52   2-00:00:00      1 chr-0129

ls -l tmp/v2.NARRM.historical_0151/product/CMIP6/CMIP/E3SM-Project/E3SM-2-0-NARRM/historical/r2i1p1f1/Omon/hfsifrazil/gr/v20250423
total 210624
-rw-rw-r--+ 1 ac.bartoletti1 E3SM2 60450707 Apr 23 18:03 hfsifrazil_Omon_E3SM-2-0-NARRM_historical_r2i1p1f1_gr_186001-186912.nc
-rw-rw-r--+ 1 ac.bartoletti1 E3SM2 60331759 Apr 23 18:03 hfsifrazil_Omon_E3SM-2-0-NARRM_historical_r2i1p1f1_gr_188001-188912.nc
-rw-rw-r--+ 1 ac.bartoletti1 E3SM2 60283169 Apr 23 18:03 hfsifrazil_Omon_E3SM-2-0-NARRM_historical_r2i1p1f1_gr_189001-189912.nc
-rw-rw-r--+ 1 ac.bartoletti1 E3SM2 34580480 Apr 23 18:10 hfsifrazil_Omon_E3SM-2-0-NARRM_historical_r2i1p1f1_gr_201001-201511.nc

Must be related - but how? These failures (hangs) only partially align with the failed writes...

Probably doesn't matter - but I pause 10 seconds between each of the 17 srun submissions, in an attempt to avoid unknown timing issues - but that would not matter if slurm holds them all as PENDING, and then decides to launch them all at once. I cannot imagine the "random directory/file" utility giving the same value to two different independent processes, due to a timing collision...

@tomvothecoder (Collaborator, Author):

> Is there a chance that cmor_write() returns before the write is complete... (quoted from the previous comment)

The tmp dir cleanup happens at the very end of the run, so it should not collide with cmor_write(). I think I will just revert the cleanup code for now and consider adding it later.

Side-note: I noticed the tmp dir is only used by the mpas.py module. I think it stores netCDF files specifically for MPAS variables before remapping is performed. The tmp dir is not well explained in the current code.

  • # write the dataset to a temp file
    inFileName = _get_temp_path()

  • # missing_value_mask attribute has undesired impacts in ncremap
    for varName in ds.data_vars:
        ds[varName].attrs.pop("missing_value_mask", None)

    write_netcdf(ds, inFileName, unlimited="time")

    if pcode == "mpasocean":
        remap_ocean(inFileName, outFileName, mappingFileName)
    elif pcode == "mpasseaice":
        # MPAS-Seaice is a special case because of the time-varying SGS field
        remap_seaice_sgs(inFileName, outFileName, mappingFileName)
    else:
        raise ValueError(f"pcode: {pcode} is not supported.")

  • def _get_temp_path():
        """Returns the name of a temporary NetCDF file"""
        tmpdir = tempfile.gettempdir()
        tmpfile = tempfile.NamedTemporaryFile(dir=tmpdir, delete=False)
        tmpname = tmpfile.name
        tmpfile.close()

        return tmpname

@tomvothecoder (Collaborator, Author) commented Apr 24, 2025

@TonyB9000 I just pushed commit 7564099 (#289) which reverts the timestamped tmp dir and tmp dir clean up code. It also outputs the tmp dir path, with a clear title "Temp Path for Processing MPAS Files: tmp dir here".

Can you pull the latest commit and test again to confirm things are good?

@TonyB9000 (Contributor) commented Apr 24, 2025

@tomvothecoder FWIW, I ran a test where I modified __main__.py: _cleanup_temp_dir() by adding

import time
time.sleep(120)

before calling the shutil.rmtree(), and as a result, all decades were written except 1920 (expected), and 1930, 1960, and 2010. (This time, slurm only gave me 4 nodes, so only 4 jobs at a time were running.)

According to LivChat... cmor.close() should not return until the preceding cmor.write() has flushed the write to the file. I am not trusting this, obviously.

Also - no job "hung" this time.

@tomvothecoder (Collaborator, Author):

> FWIW, I ran a test where I modified __main__.py: _cleanup_temp_dir()... (quoting the test report from the previous comment)

Thanks Tony. I think it isn't worth the trouble to get timestamped tmp dir/clean up working in this PR for now, since it seems to cause issues with CMORizing. We can consider it later in another PR.

Can you test again with the latest commit (comment above)?

@TonyB9000 (Contributor):

@tomvothecoder Sure. But if we do not remove the temp dir(s), I think it is imperative to have the temp file/dir noted in a log message, eventually.

@TonyB9000 (Contributor) commented Apr 24, 2025

@tomvothecoder We threw an error:

  File "/home/ac.bartoletti1/anaconda3/envs/dsm_test_e2c/bin/e3sm_to_cmip", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ac.bartoletti1/anaconda3/envs/dsm_test_e2c/lib/python3.12/site-packages/e3sm_to_cmip/__main__.py", line 1048, in main
    app = E3SMtoCMIP(args)
          ^^^^^^^^^^^^^^^^
  File "/home/ac.bartoletti1/anaconda3/envs/dsm_test_e2c/lib/python3.12/site-packages/e3sm_to_cmip/__main__.py", line 183, in __init__
    "Temp Path for Processing MPAS Files": self.temp_path,
                                           ^^^^^^^^^^^^^^

Either we need to define the mpas "temp_path" earlier, or (at cmor.close()) add a message "Not deleting MPAS temp_path ...".
Either way, the user can try to find it.

@tomvothecoder (Collaborator, Author):

> Sure. But if we do not remove the temp dir(s), I think it is imperative to have the temp file/dir noted in a log message, eventually.

This is done in the recent commits.

> We threw an error: ... (quoting the traceback from the previous comment)

Did you try running in info mode? I accidentally forgot to set self.temp_path = None for info mode since it isn't used.
My mistake. Just pushed 66abd3a (#289) to fix it.

Another test is needed.

@TonyB9000 (Contributor):

@tomvothecoder OK, will do. (I did not realize that error was from info mode.) The "dsm_generate_cmip6.py" configured-script generator automatically calls info mode first, to obtain the handler and CMIP variable info.

I'll let you know how it goes - I think we are at success here.

@TonyB9000 (Contributor):

@tomvothecoder Perfection! All jobs completed and files written. I think we're golden - Thanks!

Someday, we should investigate why the removal of temp files after cmor.close() ("supposedly" waiting on writes being flushed) seems not to wait, and screws things up. My test, putting a sleep(120) (2-minute) pause before the cleanup, enabled 13 of the 16 files to be written. Without the sleep, only 3-4 were written. That should not happen.

@tomvothecoder (Collaborator, Author):

> Perfection! All jobs completed and files written... (quoted from the previous comment)

Excellent, thank you for all the testing, Tony! I will open another GitHub ticket for this specific cmor issue. We can also escalate to the cmor devs if necessary.


Labels

enhancement New feature or request


Successfully merging this pull request may close these issues.

  • [Feature]: Centralize results with all log files to --output_path
  • [Feature]: Redesign logger module
