-
Notifications
You must be signed in to change notification settings - Fork 7
Add --on-var-failure flag to handle failure behaviors
#323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
- Defaults to ignoring failures
Integration Test ChecklistGeneral Behavior
Serial Mode (
|
|
@tomvothecoder Outstanding work, both for exit handling and for elucidating the "--info" mode options. (I can't imaging testing all combinations). |
I'm happy to get this going and hope it significantly improves the efficiency of debugging publication workflows!
Thankfully in past refactoring efforts, I generated test cases with regridded data that I can re-use to test these cases. I'll post the test scripts above once I complete testing. |
|
I have v3 data where I know failures will occur (CFmon.clisccp "NaNs" issue is one, and piClim-control-iceini triggers CMOR failure due to bad user-metadata file). The former may fail in NCO phase - before e2c is called, however. Previously, when info-mode failed, I was not catching that fact, and was passing an empty var-list to ncclimo, which responds by extracting ALL variables and attempting regrid on everything - not the best default in my view. You can induce a metadata failure by introducing a term in the "activity_id" that is not in the current CV: |
|
@tomvothecoder The code looks very good! I downloaded the branch so I could follow the logic more completely. (Minor initial confusion on terms. I knew that a failed "info" mode would indicate a "handler" failure (failure to resolve a handler), but did not think of a runtime CMOR failure (perhaps bad data) as being a "failed handler", but I get it now - all failures pass through "finalize_failure_exit". I will create a dev env to install and test this behavior (both info mode, and runtime CMOR errors) |
@TonyB9000 Heads up, I'm still working on the code to ensure the implementation is complete and correct. I'll tag you again as needed. Thanks for being eager to test! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TonyB9000 In my most recent commit, 258794a (#323), I applied the behaviors of stop and fail on the initial process of deriving variable handlers.
These two cases will now end with sys.exit(1):
- Handler(s) is not defined for a variable (aka missing)
- Handler(s) is defined for a variable, but the input dataset(s) don't have the necessary raw E3SM variables.
I've highlighted the relevant code below in my review.
- Fix bug in `_get_handlers()` not instantianting `missing_handlers` after `_get_mpas_handlers()` call - Add FIXME: comments for duplicate code - Extract stop behaviors to `_stop_with_failed_handler()` and `_stop_with_failed_handler_parallel()`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds a new --on-var-failure command-line flag to provide explicit control over how e3sm_to_cmip handles variable handler failures, replacing the previous always-succeed behavior with user-configurable exit modes.
Key changes:
- Added
--on-var-failureflag with three modes:ignore(default),fail, andstop - Updated all run modes (serial, parallel, info) to respect failure semantics and exit appropriately
- Enhanced logging and error reporting to provide clearer feedback on handler failures
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| e3sm_to_cmip/argparser.py | Adds the new --on-var-failure command-line argument |
| e3sm_to_cmip/runner.py | Core logic updates for failure handling across all run modes |
| e3sm_to_cmip/util.py | Adds convenience functions for consistent exit behavior |
| e3sm_to_cmip/cmor_handlers/utils.py | Updates handler loading functions to return missing/non-derivable handlers |
| e3sm_to_cmip/cmor_handlers/handler.py | Adds type alias for handler dictionaries |
| tests/cmor_handlers/test_utils.py | Updates test assertions to handle new tuple return values |
| docs/source/usage.rst | Documents the new flag and its behavior |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @TonyB9000, this PR is ready for your code review and testing.
I've tested most of these cases successfully. The only ones that need more thorough testing are for CMOR failures during "stop" mode with parallel processing. It should gracefully stop by allowing running jobs to complete and cancels pending jobs to ensure non-partial outputs, logs. Can you try to run this case? For example, run 2-3 variables and introduce faulty values for one of the variables that causes CMOR errors.
| "Variable Failure Behavior (--on-var-failure)": self.on_var_failure, | ||
| "Variable List (--var-list)": f"{self.var_list} ({len(self.var_list)})", | ||
| "Input Path (--input-path)": self.input_path, | ||
| "Output Path (--output-path)": self.output_path, | ||
| "Precheck Path (--precheck)": self.precheck_path, | ||
| "Log Path (--logdir)": self.log_path, | ||
| "CMOR Log Path (--logdir)": self.cmor_log_dir, | ||
| "CMIP Metadata Path (--user-metadata)": self.new_metadata_path, | ||
| "Temp Path for Processing MPAS Files": self.temp_path, | ||
| "Frequency": self.freq, | ||
| "Realm": self.realm, | ||
| "Frequency (--freq)": self.freq, | ||
| "Realm (--realm)": self.realm, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improved log summary for e2c configuration, now includes the CLI arg if applicable.
| if self.info_mode: | ||
| self._run_info_mode() | ||
| sys.exit(0) | ||
| exit_success() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced sys.exit(0) and sys.exit(1) with exit_success() and exit_failure(), respectively.
| def _log_handler_summary(self): | ||
| """ | ||
| Logs a summary of the derived CMOR handlers, including any missing or | ||
| non-derivable handlers. | ||
| """ | ||
| if self.handlers: | ||
| cmip_to_e3sm_vars = { | ||
| handler["name"]: handler["raw_variables"] for handler in self.handlers | ||
| } | ||
|
|
||
| logger.info("--------------------------------------") | ||
| logger.info("| SUCCESS: Derived Variable Handlers") | ||
| logger.info("--------------------------------------") | ||
| logger.info(f" * Count: {len(self.handlers)}") | ||
| logger.info(" * Variable Mappings (CMIP to E3SM):") | ||
| for k, v in cmip_to_e3sm_vars.items(): | ||
| logger.info(f" * '{k}' -> {v}") | ||
|
|
||
| if self.missing_handlers: | ||
| logger.error("--------------------------------------") | ||
| logger.error("| NOTICE: Missing Handlers") | ||
| logger.error("---------------------------------------") | ||
| logger.error( | ||
| "Solution: Make sure handlers for these variables are defined " | ||
| "in `handlers.yaml`." | ||
| ) | ||
| logger.error(f" * Count: {len(self.missing_handlers)}") | ||
| logger.error(f" * Variables: {self.missing_handlers}") | ||
|
|
||
| if self.non_derivable_handlers: | ||
| logger.error("--------------------------------------") | ||
| logger.error("| NOTICE: Non-derivable Handlers") | ||
| logger.error("---------------------------------------") | ||
| logger.error( | ||
| "Handlers were defined for these variables, but they could not " | ||
| "be derived using the input E3SM datasets." | ||
| ) | ||
| logger.error( | ||
| "Possible Reasons: 1) No matching CMIP table was found for the " | ||
| "requested frequency or 2) The input E3SM datasets don't have " | ||
| "the required variables." | ||
| ) | ||
| logger.error(f" * Count: {len(self.non_derivable_handlers)}") | ||
| logger.error(f" * Variables: {self.non_derivable_handlers}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method logs the summary for handlers after attempting to derive them for each of the variables (--var-list).
| def _exit_due_to_handler_issues(self) -> bool: | ||
| """ | ||
| Determines if the program should exit due to missing or non-derivable | ||
| handlers based on the ``on_var_failure`` setting. | ||
| Returns | ||
| ------- | ||
| bool | ||
| True if the program should exit, False otherwise. | ||
| """ | ||
| if not self.handlers: | ||
| logger.error( | ||
| "No variable handlers are defined or derivable from the raw " | ||
| "variables found in the E3SM input datasets." | ||
| ) | ||
| return True | ||
|
|
||
| if self.missing_handlers or self.non_derivable_handlers: | ||
| if self.on_var_failure in ["stop", "fail"]: | ||
| logger.error( | ||
| "Exiting due to missing or non-derivable handlers with " | ||
| f"--on-var-failure={self.on_var_failure}." | ||
| ) | ||
|
|
||
| return True | ||
|
|
||
| return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method determines whether to exit or not based on the status of handlers post-attempt at derivation. It also depends on --on-var-failure "stop" or "fail".
| # FIXME: This check is duplicated in mode 3 below. Refactor. | ||
| # --- DUPLICATE CODE --- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments to remove duplicate code from a previous PR.
| if not is_cmor_successful: | ||
| self._stop_with_failed_handler(handler["name"]) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stop behavior for failed handler during cmorizing.
| return False | ||
|
|
||
| self._log_final_result(num_handlers, num_success, failed_handlers) | ||
| self._finalize_on_failure(failed_handlers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finalize on "fail" mode.
| if not future_result: | ||
| self._stop_with_failed_handler_parallel( | ||
| handler_name, pool, pbar, futures | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gracefully stop parallel jobs with "stop" mode.
| pbar.close() | ||
| pool.shutdown() | ||
| self._log_final_result(num_handlers, num_success, failed_handlers) | ||
| self._finalize_on_failure(failed_handlers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finalize on "fail" mode.
| def _log_final_result( | ||
| self, num_handlers: int, num_successes: int, failed_handlers: list[str] | ||
| ): | ||
| """ | ||
| Logs the final result of the CMORization process. | ||
| Parameters | ||
| ---------- | ||
| num_handlers : int | ||
| The total number of handlers that were processed. | ||
| num_successes : int | ||
| The number of handlers that completed successfully. | ||
| failed_handlers : list[str] | ||
| A list of handler names that failed during processing. | ||
| """ | ||
| logger.info("========== FINAL RUN RESULTS ==========") | ||
| logger.info(f"* {num_successes} of {num_handlers} handlers succeeded.") | ||
| logger.info("") | ||
| logger.info("=======================================") | ||
| logger.info("| FINAL RUN SUMMARY") | ||
| logger.info("---------------------------------------") | ||
| logger.info(f" * Total variables (--var-list): {len(self.var_list)}") | ||
| logger.info(f" * Total handlers successfully derived: {num_handlers}") | ||
| logger.info( | ||
| f" * Total handlers successfully cmorized: {num_successes} / {num_handlers}" | ||
| ) | ||
|
|
||
| if failed_handlers: | ||
| logger.error( | ||
| "* The following handlers failed: " | ||
| + ", ".join(str(h) for h in failed_handlers) | ||
| f" * Total handlers failed to cmorize: {len(failed_handlers)}" | ||
| ) | ||
| else: | ||
| logger.info("* All handlers completed successfully.") | ||
| logger.error(f" - Failed variables: {failed_handlers}") | ||
|
|
||
| if self.missing_handlers: | ||
| logger.error( | ||
| f" * Total handlers missing (not defined in handlers.yaml): " | ||
| f"{len(self.missing_handlers)}" | ||
| ) | ||
| logger.error(f" - Includes: {self.missing_handlers}") | ||
|
|
||
| if self.non_derivable_handlers: | ||
| logger.error( | ||
| f" * Total handlers non-derivable (defined but not derivable): " | ||
| f"{len(self.non_derivable_handlers)}" | ||
| ) | ||
| logger.error(f" - Includes: {self.non_derivable_handlers}") | ||
|
|
||
| logger.info("=======================================") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improved final run summary formatting with helpful info on missing and non-derivable handlers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
|
@tomvothecoder I did some testing earlier and the new exit (fail) mode works great (both for info-mode and run-mode). I will re-run these tests with the latest PR. I assume this gets me there: pip install, etc ... |
|
@tomvothecoder There is an issue with info-mode - maybe not directly related to the exit handling. I ran the command: the output to the console reads: The output to the file Now, the "--help" never says that (It turns out the for my "Bad_Metadata" test, the first dataset I tried was the I'll conduct a test with a different variable - but now I think we need to test all "ambiguous var-name" sets (where multiple frequencies are involved). |
|
@tomvothecoder Update. This may not be a real problem. There are only two handlers for "pr", mon.json and day,json. There is no 3hr.json. There should be for name-consistency. Currently the one labeled "Amon_day.json" is really "Amon_sub_mon.json". I will continue to investigate (look for failures). |
Hey @TonyB9000, any updates on your review for this PR? If you approve, I will do a final self-review before merging. |
Motivation
Previously,
e3sm_to_cmipalways returned an exit code 0 even if multiple handlers failed.This made it difficult for calling workflows or CI/CD pipelines to detect and respond to errors automatically.
This enhancement introduces explicit, user-controlled failure semantics while keeping the default behavior unchanged for backward compatibility.
Overview
This PR improves how
e3sm_to_cmipreports and responds to handler failures across all run modes. It introduces a new command-line flag,--on-var-failure, which allows users to control howe3sm_to_cmipbehaves when one or more variable handlers fail during CMORization or info-mode checks.The new flag replaces the implicit always-succeed behavior with three explicit modes:
ignore(default)failstopComparison of before and after this PR:
"ignore"), exit after all failures ("fail"), or stop immediately ("stop").--on-var-failure=stop--on-var-failure, logging and exiting consistently.Result: More predictable, script-friendly behavior, improved workflow integration, and safer, cleaner exits during parallel CMORization runs.
Closes #272
Details
--on-var-failure {ignore,fail,stop}_run_serial(),_run_parallel(), and_run_info_mode()to honorself.on_var_failure._handle_failed_handler()— logs, records, and optionally triggers immediate exit._finalize_failure_exit()— applies final exit logic at the end of processing.--on-var-failure=ignore(default).How Failures Are Handled with Parallel Jobs with
stopWith
--on-var-failure=stop, it gracefully cancels pending jobs but allows active ones to complete before exiting.Compared to exiting immediately like with serial and info modes, the result is: Cleaner shutdowns, fewer partial outputs, and consistent logs and progress updates during parallel CMORization runs.
Implementation Notes
self.on_var_failureis now a class attribute shared across serial, parallel, and info modes.failed_handlersand processed through_handle_failed_handler()._finalize_failure_exit()centralizes the exit decision logic._run_info_mode()now also respects--on-var-failure, treating missing variables or invalid table entries as handler failures.Backward Compatibility
--on-var-failure=ignore.Checklist
If applicable: