Skip to content

Fix platform selection from subshell not re-evaluating each time#6836

Merged
oliver-sanders merged 6 commits intocylc:8.5.xfrom
wxtim:#6808.fix.platform_selection_from_subshell
Aug 18, 2025
Merged

Fix platform selection from subshell not re-evaluating each time#6836
oliver-sanders merged 6 commits intocylc:8.5.xfrom
wxtim:#6808.fix.platform_selection_from_subshell

Conversation

@wxtim
Copy link
Copy Markdown
Member

@wxtim wxtim commented Jul 3, 2025

Closes #6808 TL;DR rtconfig['platform'] was being changed to the result of platform = $(subshell commands), causing the subshell command being run once for each rtconfig and that result being fixed.

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@wxtim wxtim force-pushed the #6808.fix.platform_selection_from_subshell branch from a79430e to 83964b2 Compare July 3, 2025 08:39
@wxtim wxtim force-pushed the #6808.fix.platform_selection_from_subshell branch from 83964b2 to 005c5c0 Compare July 3, 2025 08:44
@wxtim wxtim requested review from hjoliver and oliver-sanders July 3, 2025 08:45
@@ -0,0 +1,51 @@
#!/usr/bin/env bash
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to write this as an integration test... But couldn't make it play ball.

@MetRonnie MetRonnie requested review from MetRonnie and removed request for hjoliver July 11, 2025 10:51
@MetRonnie MetRonnie added the bug Something is wrong :( label Jul 11, 2025
@MetRonnie MetRonnie added this to the 8.5.0 milestone Jul 11, 2025
@oliver-sanders
Copy link
Copy Markdown
Member

@wxtim, did you mean to raise this on 8.4.x?

@MetRonnie MetRonnie modified the milestones: 8.5.0, 8.4.4 Jul 14, 2025
@MetRonnie MetRonnie changed the base branch from master to 8.4.x July 14, 2025 14:37
@MetRonnie
Copy link
Copy Markdown
Member

I assume so as this is branched off 8.4.x. I've updated the PR

Copy link
Copy Markdown
Member

@MetRonnie MetRonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wxtim
Copy link
Copy Markdown
Member Author

wxtim commented Jul 18, 2025

@wxtim, did you mean to raise this on 8.4.x?

Yes

Comment on lines +1313 to +1316
if orig_platform_name:
rtconfig['platform'] = orig_platform_name
if orig_host_name:
rtconfig['remote']['host'] = orig_host_name
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change, we're still storing the platform in the rtconfig, we're just putting it back the way it was after.

Unfortunately, this means the potential for interaction is still there. I think all instances of the task which submit in the same batch will use the same platform?

Because of the returns above, I think eternal caching is still possible, e.g, if the shbshell returns a broken platoform:

E.g, take this example and change functional to the name of a platform:

[scheduling]
  cycling mode = integer
  initial cycle point = 1
  runahead limit = P2
  [[graph]]
    P1 = foo

[runtime]
  [[foo]]
    platform = $(python -c 'import random; import sys; print(random.choice(sys.argv[1:]))' broken1 broken2 broken3 functional)

Once you've got some submit-fails, trigger all submit-failed tasks over and over. It will keep picking the same broken platform over and over.

I think we've going to have to pass the platform as an argument and cut out all rtconfig manipulation.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reasonable. I did wonder.

Copy link
Copy Markdown
Member

@oliver-sanders oliver-sanders Jul 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I'm not sure what the intended behaviour of platform subshells is when interacting with batched submissions.

E.G, if you have 5 tasks with the same platform = $(subshell):

  • Should all 5 go to the same platform (single batched submission -> more efficient).
  • Should the subshell be evaluated 5 times (multiple submissions -> more even load).

I would be tempted to say, whatever it's doing at the moment, assume it's right and preserve that behaviour.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From our operational PoV, I think it's fine for all tasks with the same subshell to only evaluate this once if they are submitted in the same batch, as the subshell simply selects the "live HPC" platform

Copy link
Copy Markdown
Member

@oliver-sanders oliver-sanders Jul 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that opps doesn't actually need subshells since platforms are updated by broadcast when changed anyway and switching hall at an arbitrary point in a workflow won't work (unless that arbitrary point just happens to be a data sync task).

I'm guessing the only reason they do this is to provide tasks with a default platform when the workflow is first started? From there in broadcasts take over? If so, we can probs just flatten this out in the config:

{% import "os" as os %}
{% set live_hall = os.fdopen(os.open('live-hall-file', os.O_RDONLY)).read() %}

[runtime]
  [[HPC]]
    platform = {{ live_hall }}

This would save a lot of subshell calls and make submissions a tad snappier.

Copy link
Copy Markdown
Member

@MetRonnie MetRonnie Jul 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are tasks that run several hours in advance of the broadcast, to create PBS reservations on the live HPC. So unfortunately the Jinja2 example wouldn't cut it.

I've tested platform = $( echo evaluated >> ~/platform_subshell.txt ) with one of the workflows and I don't think it gets called very often, as most of the time the broadcast applies.

@oliver-sanders oliver-sanders modified the milestones: 8.4.4, 8.4.x Jul 18, 2025
@wxtim wxtim force-pushed the #6808.fix.platform_selection_from_subshell branch from d6a0d17 to 72daa89 Compare July 18, 2025 14:48
@wxtim wxtim force-pushed the #6808.fix.platform_selection_from_subshell branch from cbccf02 to 1a69422 Compare July 21, 2025 11:07
@wxtim
Copy link
Copy Markdown
Member Author

wxtim commented Jul 21, 2025

@MetRonnie - I've dismissed your earlier review since this now works rather differently

Copy link
Copy Markdown
Member

@MetRonnie MetRonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only linkcheck failing flakily (twice in a row for 2 different links)

@wxtim wxtim self-assigned this Jul 31, 2025
@wxtim
Copy link
Copy Markdown
Member Author

wxtim commented Aug 14, 2025

@oliver-sanders Poke

@MetRonnie MetRonnie modified the milestones: 8.4.x, 8.5.2 Aug 14, 2025
@oliver-sanders oliver-sanders changed the base branch from 8.4.x to 8.5.x August 18, 2025 10:47
@oliver-sanders

This comment was marked as resolved.

@oliver-sanders oliver-sanders merged commit c3e1d20 into cylc:8.5.x Aug 18, 2025
28 checks passed
@wxtim wxtim deleted the #6808.fix.platform_selection_from_subshell branch August 26, 2025 08:54
@MetRonnie MetRonnie changed the title 6808.fix Platform selection from subshell Fix platform selection from subshell not re-evaluating each time Sep 3, 2025
f"for task {itask.identity}: platform = "
f"{rtconfig['platform']} evaluated as {platform_name}"
)
rtconfig['platform'] = platform_name
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference in #6990

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something is wrong :(

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Subshell platform selection appears to only work for the first task of each name

3 participants