Skip to content

host-select: fix compatibility with force-condemned hosts#6623

Closed
oliver-sanders wants to merge 3 commits intocylc:8.4.xfrom
oliver-sanders:host-select+force-condemned
Closed

host-select: fix compatibility with force-condemned hosts#6623
oliver-sanders wants to merge 3 commits intocylc:8.4.xfrom
oliver-sanders:host-select+force-condemned

Conversation

@oliver-sanders
Copy link
Copy Markdown
Member

@oliver-sanders oliver-sanders commented Feb 19, 2025

The "force mode" option in the auto-restart feature hasn't been used for a while.

It still works at Cylc 8, but force-condemning a host causes workflow startup to fail.

This PR:

For info on the "force mode", see the Cylc 7 docs: https://cylc.github.io/cylc-doc/7.9.3/html/running-suites.html#auto-stop-restart

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users
  • Cylc-Doc pull request opened - soon, soon ...
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@oliver-sanders oliver-sanders added this to the 8.4.2 milestone Feb 19, 2025
@oliver-sanders oliver-sanders self-assigned this Feb 19, 2025
@oliver-sanders oliver-sanders added the bug Something is wrong :( label Feb 19, 2025
blacklist = []
for host in global_config.get(['scheduler', 'run hosts', 'condemned'], []):
if host.endswith('!'):
host = host[:-1]
Copy link
Copy Markdown
Member Author

@oliver-sanders oliver-sanders Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, this line is covered by the test amended in this PR (confirm by running it on master), however, that test is not run in CI due to it's shared filesystem setup.

Copy link
Copy Markdown
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine, just some minor verbosity-reduction suggestions.

A couple of big-picture questions though:

  • It's weird that the documentation had been deliberately deleted - does that suggest there was an intention to remove this option??
  • Maybe it's bedded in for historical reasons as this point but I'm not sure "force-condemn" is a very good name for this feature. Is it because there are some workflows that auto-migrate won't touch? (long-running background tasks?). But even then, shouldn't force-condemn try to auto-migrate first, then kill any remaining workflows that failed to migrate? Given that it simply kills all the workflows, something like "kill and condemn" would be a better name, to indicate that auto-migration is not even attempted.

@oliver-sanders
Copy link
Copy Markdown
Member Author

oliver-sanders commented Feb 24, 2025

It's weird that the documentation had been deliberately deleted

I think this more slipped the net than was deliberately deleted.


Maybe it's bedded in for historical reasons as this point but I'm not sure "force-condemn" is a very good name

Happy to change, suggest an alternative.


shouldn't force-condemn try to auto-migrate first, then kill any remaining workflows that failed to migrate

Some workflows cannot auto-migrate (e.g. workflows started in --no-detach mode) or might fail to migrate (e.g. user has scrambled their SSH config since starting the workflow).

The idea was to condemn the host first (i.e. ask workflows to migrate nicely), then mop up any stragglers with the "force" mode.

Co-authored-by: Hilary James Oliver <hilary.j.oliver@gmail.com>
@hjoliver
Copy link
Copy Markdown
Member

The idea was to condemn the host first (i.e. ask workflows to migrate nicely), then mop up any stragglers with the "force" mode.

OK that's fine. Leave the decision on when to force it up to the humans.

We should document that though, as some will assume the options are migrate-all or kill-all, and not think of migrate-if-possible before changing the global config again to kill the rest.

We should also document the circumstances under which workflows will not successfully auto-migrate.

I think the best place for that is the general "pool of hosts" doc in the user guide, would you agree?

@oliver-sanders
Copy link
Copy Markdown
Member Author

Yes, a cylc-doc PR will follow (see OP), it shouldn't go here in the config docs.

@hjoliver
Copy link
Copy Markdown
Member

hjoliver commented Feb 26, 2025

I put up a cylc-doc issue, just to record that we need to document why force-mode is needed in the first place (some workflows will not auto-migrate): cylc/cylc-doc#809

@hjoliver
Copy link
Copy Markdown
Member

hjoliver commented Feb 26, 2025

Regards the name "force condemn" ...

A host is "condemned" as soon as it is listed as such, whether or not it still has workflows running on it. The word implies a host shutdown at some future time.

(C.f. a condemned man hasn't been hung by the neck just yet; and a condemned building hasn't been demolished yet.)

I think we can just say that workflows running on condemned hosts will be told to auto-migrate, unless the hostname is followed by "!" in which case they will be shut down. There's probably no need to call that forced condemnation, right?

I'll re-review and provide new code suggestions along those lines ...

Comment on lines +1 to +4
Auto restart: The "force condemn" option (that tells workflows running on a
server to shutdown as opposed to migrate) hasn't worked with the host-selection
mechanism since Cylc 8.0.0. This has now been fixed and the "force condemn"
option has been restored in the documentation.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Auto restart: The "force condemn" option (that tells workflows running on a
server to shutdown as opposed to migrate) hasn't worked with the host-selection
mechanism since Cylc 8.0.0. This has now been fixed and the "force condemn"
option has been restored in the documentation.
Restored the option to shut down rather than auto-migrate workflows on a condemned
host. This option hasn't worked with the host-selection mechanism since Cylc 8.0.0.

Copy link
Copy Markdown
Member Author

@oliver-sanders oliver-sanders Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't actually correct. The mode has not been restored, a bug in the host-selection mechanism has been fixed.

Comment on lines +835 to +836
* Workflows that are running on condemned hosts will attempt
to migrate to an available host (providing the
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Workflows that are running on condemned hosts will attempt
to migrate to an available host (providing the
* By default, workflows running on condemned hosts will attempt
to migrate to an available host (providing the

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superfluous in combo with next paragraph.

Comment on lines +844 to +847
If a hostname listed here is followed by a ``!`` character
("force mode"), workflows running on it
will shutdown rather than attempting to
migrate (providing the
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If a hostname listed here is followed by a ``!`` character
("force mode"), workflows running on it
will shutdown rather than attempting to
migrate (providing the
If a hostname listed here is followed by a ``!`` character
workflows running on it will be told to shut down
rather than migrate to another host (providing the

# however two have been taken out:
# * workflows running on "host1" will attempt to
# restart on "host3"
# * workflows running on "host2" will shutdown
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# * workflows running on "host2" will shutdown
# * workflows running on "host2" will shut down

Comment on lines +873 to +875
The force-condemn ("!") option caused issues at workflow
startup for Cylc versions between 8.0.0 and 8.4.1
inclusive.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The force-condemn ("!") option caused issues at workflow
startup for Cylc versions between 8.0.0 and 8.4.1
inclusive.
The option to shut down ("hostname!") rather than migrate
("hostname") workflows on condemned hosts caused problems
at workflow startup for Cylc versions 8.0.0 through 8.4.1.

Comment on lines +54 to +55
# ensure the workflow can start if a host is force-condemned
# see #6623
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# ensure the workflow can start if a host is force-condemned
# see #6623
# ensure the workflow can start if previously shut down on a
# condemned host. See #6623

@oliver-sanders
Copy link
Copy Markdown
Member Author

I think we can just say that workflows running on condemned hosts will be told to auto-migrate, unless the hostname is followed by "!" in which case they will be shut down. There's probably no need to call that forced condemnation, right?

I think we need a name for this. It would be clunky to cross-reference this from the docs otherwise.

Please suggest an alternative to "force". I will keep using "force" as a placeholder, happy to change, but need something to change it to.

@hjoliver
Copy link
Copy Markdown
Member

hjoliver commented Feb 27, 2025

I have suggested an alternative. It's not simply a replacement for the word "force" in "force condemned", because - given the meaning of the word "condemned" - that just doesn't make sense.

See my response on the cylc-doc issue: cylc/cylc-doc#809 (comment)

@oliver-sanders
Copy link
Copy Markdown
Member Author

Turns out this was already well documented, I just didn't look very hard.

https://cylc.github.io/cylc-doc/stable/html/plugins/main-loop/built-in/cylc.flow.main_loop.auto_restart.html#module-cylc.flow.main_loop.auto_restart

I'm going to close this PR and re-raise it with simpler docs pointing at the canonical documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something is wrong :( superseded

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants