-
Notifications
You must be signed in to change notification settings - Fork 7
Centralize results directories and files and revamp logging #289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Rename `_setup_logger()` to `_setup_child_logger()` - Replace `_update_root_logger_filepath()` with `_add_filehandler()` - Encapsulating the root logger means no duplicate StreamHandler is added upon import which prevents duplicate messages from appearing
- Colors from `print_message()` cannot be captured in log files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @TonyB9000 and @chengzhuzhang, I decided to tackle e3sm_to_cmip logging because its been on the back-burner for awhile and I finally had some time.
Please refer to the PR description for more information.
| optional.add_argument( | ||
| "--logdir", | ||
| type=str, | ||
| default="./cmor_logs", | ||
| help="Where to put the logging output from CMOR.", | ||
| default="cmor_logs", | ||
| help=( | ||
| "The sub-directory that stores the CMOR logs. This sub-directory will " | ||
| "be stored under --output-path." | ||
| ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we deprecate the --logdir parameter? It seems unnecessary for the user to specify the name of the directory since it should be stored under --output-path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tomvothecoder Honestly, I am uncertain (at the moment) what is represented by "the output path". When I run e2c as a function of a supporting application (dsm_generate_CMIP), I force the creation of (CWD)/tmp/<case_id>/[subdirs], and these are:
caselogs/ [timestamped logs for each E2C invocation]
metadata/
native_data/ [symlinks to native soruce files]
native_out/ [produced by pre-E2C NCO stuff]
product/ [FINAL Cmorized output from E2C]
rgr/ [produced by pre-E2C NCO stuff]
rgr_fixed_vars/ [produced by pre-E2C NCO stuff]
rgr_vert/ [produced by pre-E2C NCO stuff]
scripts/ [dataset_specific call scripts to E2C]
I believe I set "--output-path to "product/", with the intent of moving these to "STAGING_DATA" (the warehouse, pre-publication"). The logs of my calling scripts (in "scripts/") are directed to "caselogs", but I think the cmor-logs may be directed to something like (CWD)/tmp/cmor_logs/. These logs are only named by timestamp (I believe), and not by any useful "job-name" or ID, and their - um - "colorful" and flashy format (remember the early HTML <BLINK>text</BLINK> that probably induced seizures in some) cannot be usefully combined with other logs, until we can negotiate with the "cmor_setup()" devs to reformat that stuff.
| # NOTE: Any warnings that appear before the log filehandler is | ||
| # instantiated will not be captured (e.g,. esmpy VersionWarning). | ||
| # However, they will still be captured by the console via a | ||
| # StreamHandler. | ||
| self.log_path = os.path.join(self.output_path, LOG_FILENAME) # type: ignore | ||
| _add_filehandler(self.log_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an FYI, trade-off mentioned in the PR description.
| For example: | ||
|
|
||
| - Some contain legacy `handle_simple()` functions that have since been refactored as a single `handle_simple()` function | ||
| - `phalf.py` and `pfull.py` still use `cdms2` and `cdutil` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These variables have been refactored.
| def print_debug(e): | ||
| # TODO: Deprecate this function. We use Python logger now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI
| # TODO: Deprecate this function. We use Python logger now. Colors can't | ||
| # be captured in log files. | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI deprecate print_message()
|
Hey @TonyB9000, I addressed your comment about making the timestamp up to 6 microseconds. This PR is now ready for your review. Here is example:
2025-04-14 16:32:39.229310 [INFO]: __main__.py(__init__:160) >> --------------------------------------
2025-04-14 16:32:39.230547 [INFO]: __main__.py(__init__:161) >> | E3SM to CMIP Configuration
2025-04-14 16:32:39.231215 [INFO]: __main__.py(__init__:162) >> --------------------------------------
2025-04-14 16:32:39.243060 [INFO]: __main__.py(__init__:185) >> * Timestamp: 20250414_212755_699215
2025-04-14 16:32:39.243937 [INFO]: __main__.py(__init__:185) >> * Version Info: branch feature/274-redesign-logger with commit 477a3cbfe859ab5044093c9161c9895813bd6fb8
2025-04-14 16:32:39.244475 [INFO]: __main__.py(__init__:185) >> * Mode: Serial
2025-04-14 16:32:39.244945 [INFO]: __main__.py(__init__:185) >> * Variable List: ['pfull', 'phalf', 'tas']
2025-04-14 16:32:39.245412 [INFO]: __main__.py(__init__:185) >> * Input Path: /lcrc/group/e3sm/e3sm_to_cmip/test-cases/atm-unified-eam/input-regridded
2025-04-14 16:32:39.245861 [INFO]: __main__.py(__init__:185) >> * Output Path: /lcrc/group/e3sm/public_html/e3sm_to_cmip/feature-274-redesign-logger-info-25-04-14
2025-04-14 16:32:39.246311 [INFO]: __main__.py(__init__:185) >> * Precheck Path: None
2025-04-14 16:32:39.246695 [INFO]: __main__.py(__init__:185) >> * Log Path: /lcrc/group/e3sm/public_html/e3sm_to_cmip/feature-274-redesign-logger-info-25-04-14/20250414_212755_699215.log
2025-04-14 16:32:39.247211 [INFO]: __main__.py(__init__:185) >> * CMOR Log Path: /lcrc/group/e3sm/public_html/e3sm_to_cmip/feature-274-redesign-logger-info-25-04-14/cmor_logs
2025-04-14 16:32:39.264351 [INFO]: __main__.py(__init__:185) >> * Frequency: mon
2025-04-14 16:32:39.265105 [INFO]: __main__.py(__init__:185) >> * Realm: atm
2025-04-14 16:32:47.079968 [INFO]: __main__.py(_get_handlers:290) >> --------------------------------------
2025-04-14 16:32:47.081259 [INFO]: __main__.py(_get_handlers:291) >> | Derived CMIP6 Variable Handlers
2025-04-14 16:32:47.081790 [INFO]: __main__.py(_get_handlers:292) >> --------------------------------------
2025-04-14 16:32:47.082319 [INFO]: __main__.py(_get_handlers:294) >> * 'pfull' -> ['hyai', 'hybi', 'hyam', 'hybm', 'PS']
2025-04-14 16:32:47.082807 [INFO]: __main__.py(_get_handlers:294) >> * 'phalf' -> ['hyai', 'hybi', 'hyam', 'hybm', 'PS']
2025-04-14 16:32:47.083299 [INFO]: __main__.py(_get_handlers:294) >> * 'tas' -> ['TREFHT']
2025-04-14 16:32:47.084244 [INFO]: __main__.py(_run_serial:899) >> Trying to CMORize with handler: {'name': 'pfull', 'units': 'Pa', 'raw_variables': ['hyai', 'hybi', 'hyam', 'hybm', 'PS'], 'table': 'CMIP6_Amon.json', 'unit_conversion': None, 'formula': 'hyam * p0 + hybm * ps', 'formula_method': <function pfull at 0x1550a29a2f20>, 'positive': None, 'levels': {'name': 'standard_hybrid_sigma', 'units': '1', 'e3sm_axis_name': 'lev', 'e3sm_axis_bnds': 'ilev', 'time_name': 'time2'}, 'output_data': None, 'method': <bound method VarHandler.cmorize of <e3sm_to_cmip.cmor_handlers.handler.VarHandler object at 0x1550a24e4ff0>>}
2025-04-14 16:32:47.084742 [INFO]: handler.py(cmorize:208) >> pfull: Starting CMORizing
2025-04-14 16:32:47.222432 [INFO]: handler.py(_setup_cmor_module:334) >> pfull: CMOR setup complete
2025-04-14 16:32:47.224274 [INFO]: handler.py(cmorize:238) >> pfull: loading E3SM variables dict_keys(['hyai', 'hybi', 'hyam', 'hybm', 'PS'])My final todos are |
- Add `_cleanup_temp_dir()` to remove temp dir at the end of a successful run
|
@tomvothecoder I like the apparent results. I would like to run it using my e2c driver (dsm_generate_CMIP6.py) that auto-configures all runs. I don't expect any real issues. I will create a test environment, install this branch, and then run the same test-jobs I have run earlier, and see how it goes (what goes where, etc). I should have significant results tomorrow afternoon. |
|
@tomvothecoder I am getting an error with the "--info" mode.. My command line is The error thrown is: Previously, I believe no "input path" was required for "--info" mode, which only needs the variable and the cmor-tables. |
Thanks for testing and reporting this issue. I'll take a look. |
|
@tomvothecoder Just a note: I recall this problem was one of a few that occur because e2c tries to post-process the args before accepting them all, and realizing it need not use all the "required" args. I believe all supplied args should be read/recorded before processing any one of them. |
- Make timestamp unique per run by moving timestamp initialization to `app.__init__`
@TonyB9000 I found the root cause of this issue: I removed the conditional e3sm_to_cmip/e3sm_to_cmip/__main__.py Lines 166 to 168 in ff8bb82
This causes e3sm_to_cmip/e3sm_to_cmip/__main__.py Lines 636 to 638 in ff8bb82
This results in: File "/home/ac.bartoletti1/anaconda3/envs/dsm_test_e2c/lib/python3.12/site-packages/e3sm_to_cmip/util.py", line 397, in copy_user_metadata
fin = open(input_path, "r")
^^^^^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not NoneTypeThis issue should now be fixed with
|
|
@tomvothecoder OK! It appears to be running fine now. My derived (auto-produced) script launched 17 separate e2c jobs for the variable "hfsifrazil" (one for each decade in 1850 ... 2010), and after the first minute, 9 were RUNNING and 8 were PENDING. After 15 minutes 10 were COMPLETED and 7 were RUNNING. The mean_et of the 10 completed jobs was 866 seconds, and the remaining 7 had each accumulated about 360 seconds (job status loop checks every 300 seconds. If any job is still RUNNING after about 5*866 seconds, it should be killed (scanceled) automatically - I have yet to test that... Overall, I think this (e2c) is clearly running very well! SUCCESS! |
|
@tomvothecoder A late-developing issue. Every previous time I've run E2C, (as 17 parallel jobs), the outputs would accumulate in the output directory: "product/CMIP6/CMIP/ etc etc /hfsfrazil/gr/v202504xx" and be collected upon completion; But not this time: Not sure why this changed. |
Okay from what I understand, this is the issue with only 4 files being generated in the CMIP6 directory that you mentioned in our meeting today. Can you provide the exact |
|
@tomvothecoder I'll test forthwith! Should have results in 1 hour. |
|
@tomvothecoder SUCCESS! All jobs completed successfully, all wrote an output file (except 1920, the "bad data" decade). Took just over 30 minutes. Great job tracking that down! (Also - no srun job "hung", which may be related, or just coincidental.) |
Excellent! I just pushed commit Can you pull again and test one more time? Thanks! |
e3sm_to_cmip/__main__.py
Outdated
|
|
||
| if temp_path is None: | ||
| temp_path = f"{self.output_path}/tmp" | ||
| temp_path = f"{self.output_path}/tmp_{self.timestamp}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making the tmp dir timestamp unique per e3sm_to_cmip invocation should prevent the issue where parallel e3sm_to_cmip jobs might delete the same shared tmp folder (resulting in silent CMORizing issues).
|
@tomvothecoder Test is underway. P.S. I assume the random tmp dir is deleted at closure. It might be wise to condition that on "--debug = False", Also, the full path to the random tmp dir should be added to the table of parameters near the top of the e2c log, if not already. This makes me realize - when I specify a custom $TMPDIR in my .bashrc, I rarely ever look to clear it out, When disk-space and inodes get scarce, that is something to address. (especially when running 17 jobs in place of 1). |
|
@tomvothecoder Uh-Oh. We have a problem. The failed outputs have returned. |
|
@tomvothecoder Is there a chance that the "cmor_write()" returns before write is complete, and wiping the tmp dir can collide with it? We may need to implement a "write_confirmed" (like, "wait on write complete", before deleting tmp? |
|
@tomvothecoder And on top of this, 3 of the node-jobs appear to be hung hung . . . Must be related - but how? These fails (hangs) only partially align with failed writes... Probably doesn't matter - but I pause 10 seconds between each of the 17 srun submissions, in an attempt to avoid unknown timing issues - but that would not matter if slurm holds them all as PENDING, and then decides to launch them all at once. I cannot imagine the "random directory/file" utility giving the same value to two different independent processes, due to a timing collision... |
The Side-note: I noticed the
|
|
@TonyB9000 I just pushed commit Can you pull the latest commit and test again to confirm things are good? |
|
@tomvothecoder FWIW, I ran a test where I modified main.py: cleanup_temp_dir() by adding before calling the shutil.rmtree(), and as a result, all decades were written except 1920 (expected), and 1930, 1960, and 2010. (This time, slurm only gave me 4 nodes, so only 4 jobs at a time were running.) According to LivChat . . . cmor.close() should not return until the preceding cmor.write has flushed the write to the file. I am not trusting this, obviously. Also - no job "hung" this time. |
Thanks Tony. I think it isn't worth the trouble to get timestamped Can you again with the latest commit (comment above)? |
|
@tomvothecoder Sure. But if we do not remove the temp dir(s), I think it is imperative to have the temp file/dir noted in a log message, eventually. |
|
@tomvothecoder We threw an error: Either we need to define the mpas "temp_path" earlier, or (at cmor.close()) add a message "Not deleting MPAS temp_path ...". |
This is done in the recent commits.
Did you try running on info mode? I accidentally forgot to set Another test is needed. |
|
@tomvothecoder OK, will do. (did not realize that error was info-mode). The "dsm_generate_cmip6.py" configured-script generator automatically calls "info-mode" first, to obtain the handler and CMIP variable info. I'll let you know how it goes - I think we are at success here. |
|
@tomvothecoder Perfection! All jobs completed and files written. I think we're golden - Thanks! Someday, we should investigate why the removal of temp files after cmor.close() ("supposedly" waiting on writes being flushed) seems not to, and screws things up. My test, putting a "sleep(120)" (2 minute) pausing before the cleanup enabled 13 of the 16 files to be written. Without the sleep, only 3-4 were written. That should not happen. |
Excellent, thank you for all the testing Tony! I will open another GitHub ticket for this specific |
Description
This PR centralizes all output files for an
e3sm_to_cmiprun into a single directory and improves logging to eliminate duplicate messages while adding useful metadata.Key Changes
1. Centralized Output Directory (
output_path)--output_path#290output_path.20250319_191551_378453.logfile (no longer stored in the current working directory)/CMIP6– CMORized.ncfiles/tmp– Raw.ncinput files (deleted after the run)/cmor_logs– CMOR logs (configurable via--logdir, no longer stored in the current working directory)user_metadata.jsoninfo.yaml(if--infomode, stores variable handler info)📌 Note: If
--output_pathis not provided, a default directory is created in the current working directory with a timestamped name, e.g.,e3sm_to_cmip_run_20250319_184835_367503.2. Improved Logging System
_logger.py) to manage all logging:_setup_child_logger().2025-03-19 15:48:58,430123 [INFO]: __main__.py(__init__:160))💡 Caveat: Warnings raised during import (e.g.,
esmpyversion warnings) will still appear in the console but won’t be captured in log files. This trade-off enables dynamic log file paths.3. Replaces
print_message()andprint_debug()with Loggerprint()statements (which don’t get logged).Examples
Serial mode
--output_path /lcrc/group/e3sm/public_html/e3sm_to_cmip/feature-274-redesign-logger-serialParallel mode
--output_path /lcrc/group/e3sm/public_html/e3sm_to_cmip/feature-274-redesign-logger-parallelInfo mode
--output_path /lcrc/group/e3sm/public_html/e3sm_to_cmip/feature-274-redesign-logger-infoTODO
output_pathoutput_pathoutput_pathdirectory is createdesmpywarning initally, gets written to first log file so we just move that log file over tooutput_pathand append new messages to itinfo_modeandoutput_pathspecified as not a yaml file.Checklist
If applicable: