Skip to content

expire: fix interaction with automatic retries#7056

Merged
MetRonnie merged 2 commits intocylc:8.6.xfrom
oliver-sanders:6717
Mar 27, 2026
Merged

expire: fix interaction with automatic retries#7056
MetRonnie merged 2 commits intocylc:8.6.xfrom
oliver-sanders:6717

Conversation

@oliver-sanders
Copy link
Copy Markdown
Member

@oliver-sanders oliver-sanders commented Oct 22, 2025

(Note, this issue needs to be resolved before expire triggers can be used in anger)

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@oliver-sanders oliver-sanders self-assigned this Oct 22, 2025
@oliver-sanders oliver-sanders added the bug Something is wrong :( label Oct 22, 2025
@oliver-sanders oliver-sanders added this to the 8.5.x milestone Oct 22, 2025
@oliver-sanders
Copy link
Copy Markdown
Member Author

(rebased)

@oliver-sanders oliver-sanders force-pushed the 6717 branch 2 times, most recently from a43265f to fcf54ea Compare February 18, 2026 16:37
@oliver-sanders oliver-sanders requested a review from wxtim February 19, 2026 10:17
@oliver-sanders oliver-sanders modified the milestones: 8.6.x, 8.6.4 Mar 2, 2026
@hjoliver
Copy link
Copy Markdown
Member

hjoliver commented Mar 20, 2026

OK, on master, the forced argument in process_message() only comes from the set command:

On manually setting an output, we call process_message() just like a naturally completed output would, so that all the normal consequences of completing an output follow; but we add the forced arg to indicate that it was not natural completion, because a few of those consequences need to be different in the manual case.

@hjoliver
Copy link
Copy Markdown
Member

hjoliver commented Mar 20, 2026

Taking a closer look at the bug, the loop in #6717 is caused by this:

  • scheduler detects that the task should clock expire
  • so it sends process_message(clock-expired)
  • but the _process_message_check() helper says ignore non-forced messages from tasks waiting on a retry (+)
  • so the task does not expire
  • scheduler detects that the task should clock expire ... (repeat)

In principle this bug affects other outputs too (e.g. succeeded) but in practice it doesn't because the scheduler never tries to automatically succeed a task waiting on a retry. But you can cylc message the succeeded output to simulate natural completion (not cylc set) of such a task and you'll see the same "ignored" message in the log.

The (+) line above explains why force=True allows the task to expire (which then does abort the retry sequence as it should for any final status).

I think this fix (forced=True) is probably OK, if you change the argument documentation slightly to say it means manual command (cylc set) or automatic expiry. The "forced" terminology is arguably still justified, since clock-expiry sort of artificially forces the task to bail out of its normal life cycle.

HOWEVER there is at least one minor undesirable side effect though: the DB task_outputs table records expired: (manually completed).

And there could be others too - we'd need to check the consequences everywhere the handling of natural vs forced messages is different, and it makes me a bit nervous because the expiry is actually natural not forced.

Copy link
Copy Markdown
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given my analysis of the problem above, I think ideally forced=True should be reserved for manual output completion (i.e. cylc set) and a better fix for the bug is to tweak task_message_check() to stop it ignoring natural expired messages for tasks waiting on retries.

I've checked this diff on master works:

diff --git a/cylc/flow/task_events_mgr.py b/cylc/flow/task_events_mgr.py
index 91d0ec0c6..44e82d8b4 100644
--- a/cylc/flow/task_events_mgr.py
+++ b/cylc/flow/task_events_mgr.py
@@ -932,6 +932,7 @@ class TaskEventsManager():

         if (
             itask.state(TASK_STATUS_WAITING)
+            and message != TASK_OUTPUT_EXPIRED
             # Polling in live mode only:
             and itask.run_mode == RunMode.LIVE
             and (

The comment in this code block is:

            # Ignore messages if task has a retry lined up
            # (caused by polling overlapping with task failure)

I guess that means e.g. the task polls as running, then fails and goes to waiting on retry before the poll result comes in, in which case the poll result should be ignored.

That scenario doesn't apply to automatic expiry, which is not like a delayed poll result.

@oliver-sanders
Copy link
Copy Markdown
Member Author

oliver-sanders commented Mar 20, 2026

Ok, thanks for the analysis, will take a look...

@oliver-sanders
Copy link
Copy Markdown
Member Author

Re-worked with @hjoliver's suggestion.

Confirmed it addresses the example in #6717, also tested it with an example where :expired? is handled.

Copy link
Copy Markdown
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just expand the comment in the tweaked code block for future reference, as this stuff isn't particularly easy to follow. Something like this:

# Ignore non-expire messages if task is waiting with a retry lined up.
# Waiting tasks normally advance to a new state due to any message,
# but the retry implies late arrival after task failure (e.g. a delayed poll
# result). Task expire messages are internal, so excluded from this.

@oliver-sanders
Copy link
Copy Markdown
Member Author

(added comment)

@oliver-sanders oliver-sanders requested review from MetRonnie and hjoliver and removed request for wxtim March 25, 2026 16:17
Co-authored-by: Tim Pillinger <26465611+wxtim@users.noreply.github.com>
@hjoliver
Copy link
Copy Markdown
Member

(All tests passed; just merged the in-comment typo fix with skip-ci).

@MetRonnie MetRonnie merged commit 3b0c945 into cylc:8.6.x Mar 27, 2026
@oliver-sanders oliver-sanders deleted the 6717 branch March 30, 2026 09:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something is wrong :(

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants