Fix platform selection from subshell not re-evaluating each time#6836
@@ -0,0 +1,51 @@
#!/usr/bin/env bash
I tried to write this as an integration test... But couldn't make it play ball.
@wxtim, did you mean to raise this on 8.4.x?
I assume so as this is branched off 8.4.x. I've updated the PR.
Yes
cylc/flow/task_job_mgr.py
Outdated
    if orig_platform_name:
        rtconfig['platform'] = orig_platform_name
    if orig_host_name:
        rtconfig['remote']['host'] = orig_host_name
With this change, we're still storing the platform in the rtconfig, we're just putting it back the way it was after.
Unfortunately, this means the potential for interaction is still there. I think all instances of the task which submit in the same batch will use the same platform?
Because of the returns above, I think eternal caching is still possible, e.g. if the subshell returns a broken platform.
For instance, take this example and change "functional" to the name of a platform:
    [scheduling]
        cycling mode = integer
        initial cycle point = 1
        runahead limit = P2
        [[graph]]
            P1 = foo
    [runtime]
        [[foo]]
            platform = $(python -c 'import random; import sys; print(random.choice(sys.argv[1:]))' broken1 broken2 broken3 functional)

Once you've got some submit-fails, trigger all submit-failed tasks repeatedly. It will keep picking the same broken platform over and over.
I think we're going to have to pass the platform as an argument and cut out all rtconfig manipulation.
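A minimal sketch of the difference, for illustration only. This is not the code in cylc/flow/task_job_mgr.py: `eval_subshell`, `submit_overwriting`, `submit_passing` and the `$(pick-platform)` setting are hypothetical names, and the subshell is faked with a cycling iterator so that re-evaluation is observable.

```python
import itertools

# Hypothetical stand-in for running `$(...)`: returns a different
# platform name on each call so that re-evaluation is observable.
_results = itertools.cycle(["broken1", "functional"])


def eval_subshell(raw: str) -> str:
    """Pretend to evaluate a platform setting; plain names pass through."""
    return next(_results) if raw.startswith("$(") else raw


def submit_overwriting(rtconfig: dict) -> str:
    """Problem pattern: store the evaluated result back into rtconfig."""
    rtconfig["platform"] = eval_subshell(rtconfig["platform"])
    return rtconfig["platform"]


def submit_passing(rtconfig: dict) -> str:
    """Alternative: leave rtconfig untouched and return the result."""
    return eval_subshell(rtconfig["platform"])


# Three (re)submissions sharing one rtconfig, overwriting it each time.
shared = {"platform": "$(pick-platform)"}
overwriting = [submit_overwriting(shared) for _ in range(3)]

# Three (re)submissions where the result is only passed along.
fresh = {"platform": "$(pick-platform)"}
passing = [submit_passing(fresh) for _ in range(3)]
```

In the overwriting variant the first result replaces the `$(...)` string, so every later submission reuses it, broken or not. In the passing variant the original setting survives and is evaluated again each time.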
BTW, I'm not sure what the intended behaviour of platform subshells is when interacting with batched submissions.
E.g. if you have 5 tasks with the same platform = $(subshell):
- Should all 5 go to the same platform (single batched submission -> more efficient)?
- Should the subshell be evaluated 5 times (multiple submissions -> more even load)?
I would be tempted to say, whatever it's doing at the moment, assume it's right and preserve that behaviour.
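To make the two options concrete, here is a rough sketch under stated assumptions. `group_by_platform`, the task names and the platform names `hpc_a` / `hpc_b` are all hypothetical, and the evaluator is a round-robin stand-in for the subshell; it says nothing about what Cylc currently does.

```python
import itertools
from collections import defaultdict


def group_by_platform(tasks, evaluate, per_task=False):
    """Group task names by evaluated platform.

    tasks: list of (task_name, raw_platform_setting) pairs.
    evaluate: callable that turns a raw setting into a platform name.
    per_task: if False, evaluate each distinct raw setting once per
        batch; if True, evaluate it again for every task.
    """
    cache = {}
    groups = defaultdict(list)
    for name, raw in tasks:
        if per_task or raw not in cache:
            cache[raw] = evaluate(raw)
        groups[cache[raw]].append(name)
    return dict(groups)


tasks = [(f"t{i}", "$(subshell)") for i in range(5)]

# Option 1: one evaluation per batch, so one batched submission.
rr = itertools.cycle(["hpc_a", "hpc_b"])
once = group_by_platform(tasks, lambda raw: next(rr))

# Option 2: one evaluation per task, so load is spread across platforms.
rr = itertools.cycle(["hpc_a", "hpc_b"])
each = group_by_platform(tasks, lambda raw: next(rr), per_task=True)
```

With option 1 all five tasks land in a single group; with option 2 they alternate between the two platforms, at the cost of five subshell calls and two submissions.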
From our operational PoV, I think it's fine for all tasks with the same subshell to only evaluate this once if they are submitted in the same batch, as the subshell simply selects the "live HPC" platform
I suspect that ops doesn't actually need subshells, since platforms are updated by broadcast when changed anyway, and switching hall at an arbitrary point in a workflow won't work (unless that arbitrary point just happens to be a data sync task).
I'm guessing the only reason they do this is to provide tasks with a default platform when the workflow is first started? From there on, broadcasts take over? If so, we can probs just flatten this out in the config:
    {% import "os" as os %}
    {% set live_hall = os.fdopen(os.open('live-hall-file', os.O_RDONLY)).read() %}
    [runtime]
        [[HPC]]
            platform = {{ live_hall }}

This would save a lot of subshell calls and make submissions a tad snappier.
There are tasks that run several hours in advance of the broadcast, to create PBS reservations on the live HPC. So unfortunately the Jinja2 example wouldn't cut it.
I've tested platform = $( echo evaluated >> ~/platform_subshell.txt ) with one of the workflows and I don't think it gets called very often, as most of the time the broadcast applies.
@MetRonnie - I've dismissed your earlier review since this now works rather differently.
@oliver-sanders Poke
        f"for task {itask.identity}: platform = "
        f"{rtconfig['platform']} evaluated as {platform_name}"
    )
    rtconfig['platform'] = platform_name
Closes #6808
TL;DR
rtconfig['platform'] was being changed to the result of platform = $(subshell commands), causing the subshell command to be run only once for each rtconfig, with that result then fixed.
Check List
CONTRIBUTING.md and added my name as a Code Contributor.
setup.cfg (and conda-environment.yml if present).
?.?.x branch.