Merged
1 change: 1 addition & 0 deletions changes.d/6836.fix.md
@@ -0,0 +1 @@
Fix a bug causing the results of `platform = $(subshell commands)` to be cached, and preventing re-evaluation for each task with the same config.
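To make the bug concrete, here is a toy sketch (illustrative only, not cylc code): if the evaluated result of the subshell is written back into a shared config dict and never restored, every later lookup sees the first result instead of re-running the subshell.

```python
# Toy reproduction of the caching bug described in the changelog entry.
# All names here are made up for illustration.
import itertools

counter = itertools.count(1)

def eval_subshell(expr):
    """Pretend each call to the subshell could give a new answer."""
    return f"platform-{next(counter)}"

def buggy_get_platform(rtconfig):
    if rtconfig['platform'].startswith('$('):
        # BUG: the evaluated result overwrites the expression for good
        rtconfig['platform'] = eval_subshell(rtconfig['platform'])
    return rtconfig['platform']

rtconfig = {'platform': '$(select-platform)'}
first = buggy_get_platform(rtconfig)
second = buggy_get_platform(rtconfig)
# The second submission reuses the cached result: the subshell
# effectively ran only once.
assert first == second == 'platform-1'
```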
11 changes: 10 additions & 1 deletion cylc/flow/task_job_mgr.py
@@ -1236,7 +1236,7 @@
f"\"{itask.identity}\" the following are not compatible:\n"
)

host_n, platform_name = None, None
host_n, platform_name, orig_platform_name, orig_host_name = [None] * 4

Codecov / codecov/patch warning on line 1239 in cylc/flow/task_job_mgr.py: Added line #L1239 was not covered by tests
try:
if rtconfig['remote']['host'] is not None:
host_n = self.task_remote_mgr.eval_host(
@@ -1269,6 +1269,7 @@
f"for task {itask.identity}: platform = "
f"{rtconfig['platform']} evaluated as {platform_name}"
)
orig_platform_name = rtconfig['platform']
rtconfig['platform'] = platform_name
Member Author

For reference in #6990

elif (
platform_name is None
@@ -1278,6 +1279,7 @@
f"[{itask}] host = "
f"{rtconfig['remote']['host']} evaluated as {host_n}"
)
orig_host_name = host_n
rtconfig['remote']['host'] = host_n

try:
@@ -1306,6 +1308,13 @@
# Retry delays, needed for the try_num
self._set_retry_timers(itask, rtconfig)

# Put the original platform and host names back into the
# rtconfig. Prevents storing first evaluation of shell expression.
if orig_platform_name:
rtconfig['platform'] = orig_platform_name
if orig_host_name:
rtconfig['remote']['host'] = orig_host_name
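As a standalone illustration of the save-and-restore pattern in this hunk (a sketch with made-up names, not cylc's real API): the evaluated platform is used for this submission, then the original subshell expression is put back so the next submission re-evaluates it.

```python
# Minimal sketch of the save/restore pattern. Names are illustrative.

def eval_subshell(expr):
    """Stand-in for evaluating a $(...) platform expression."""
    return "hpc-a"  # pretend this is what the subshell printed this time

def prep_submit(rtconfig):
    orig_platform_name = None
    if rtconfig['platform'].startswith('$('):
        orig_platform_name = rtconfig['platform']
        rtconfig['platform'] = eval_subshell(orig_platform_name)
    evaluated = rtconfig['platform']  # used to build the job config
    # Put the original platform name back into the rtconfig, so the
    # first evaluation of the shell expression is not stored.
    if orig_platform_name:
        rtconfig['platform'] = orig_platform_name
    return evaluated

rtconfig = {'platform': '$(select-live-hall)'}
print(prep_submit(rtconfig))   # the evaluated platform for this job
print(rtconfig['platform'])    # the subshell expression, restored
```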
Member

With this change we're still storing the platform in the rtconfig; we're just putting it back the way it was afterwards.

Unfortunately, this means the potential for interaction is still there. I think all instances of the task which submit in the same batch will use the same platform?

Because of the returns above, I think eternal caching is still possible, e.g. if the subshell returns a broken platform:

E.g., take this example and change `functional` to the name of a real platform:

[scheduling]
  cycling mode = integer
  initial cycle point = 1
  runahead limit = P2
  [[graph]]
    P1 = foo

[runtime]
  [[foo]]
    platform = $(python -c 'import random; import sys; print(random.choice(sys.argv[1:]))' broken1 broken2 broken3 functional)

Once you've got some submit-fails, re-trigger the submit-failed tasks repeatedly: they will keep picking the same broken platform every time.

I think we're going to have to pass the platform as an argument and cut out all rtconfig manipulation.
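A rough sketch of that suggestion, with hypothetical names (cylc's real signatures will differ): evaluate the platform once per submission attempt and hand it down as an explicit argument, so the shared rtconfig is never mutated and a broken result cannot get pinned across retries.

```python
# Hypothetical sketch of passing the platform as an argument instead of
# mutating rtconfig. Function names are made up for illustration.

def eval_subshell(expr, attempt):
    """Stand-in for running the $(...) command; varies per attempt."""
    platforms = ['broken1', 'broken2', 'functional']
    return platforms[attempt % len(platforms)]

def choose_platform(rtconfig, attempt):
    expr = rtconfig['platform']
    if expr.startswith('$('):
        return eval_subshell(expr, attempt)
    return expr

def prep_submit(rtconfig, attempt):
    # The platform travels as an explicit argument; rtconfig is
    # read-only here, so nothing is cached between attempts.
    platform_name = choose_platform(rtconfig, attempt)
    return {'platform': platform_name}  # stand-in for the job config

rtconfig = {'platform': '$(pick-platform)'}
jobs = [prep_submit(rtconfig, n)['platform'] for n in range(3)]
# Each attempt re-evaluates, so retries are not stuck on one result.
```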

Member Author

Reasonable. I did wonder.

Member

@oliver-sanders oliver-sanders Jul 18, 2025

BTW, I'm not sure what the intended behaviour of platform subshells is when interacting with batched submissions.

E.g., if you have 5 tasks with the same `platform = $(subshell)`:

  • Should all 5 go to the same platform (single batched submission -> more efficient).
  • Should the subshell be evaluated 5 times (multiple submissions -> more even load).

I would be tempted to say, whatever it's doing at the moment, assume it's right and preserve that behaviour.

Member

From our operational PoV, I think it's fine for all tasks with the same subshell to only evaluate this once if they are submitted in the same batch, as the subshell simply selects the "live HPC" platform.

Member

@oliver-sanders oliver-sanders Jul 18, 2025

I suspect that ops doesn't actually need subshells, since platforms are updated by broadcast when changed anyway, and switching hall at an arbitrary point in a workflow won't work (unless that arbitrary point happens to be a data sync task).

I'm guessing the only reason they do this is to provide tasks with a default platform when the workflow is first started? From there on, broadcasts take over? If so, we can probs just flatten this out in the config:

{% import "os" as os %}
{% set live_hall = os.fdopen(os.open('live-hall-file', os.O_RDONLY)).read() %}

[runtime]
  [[HPC]]
    platform = {{ live_hall }}

This would save a lot of subshell calls and make submissions a tad snappier.

Member

@MetRonnie MetRonnie Jul 18, 2025

There are tasks that run several hours in advance of the broadcast, to create PBS reservations on the live HPC. So unfortunately the Jinja2 example wouldn't cut it.

I've tested platform = $( echo evaluated >> ~/platform_subshell.txt ) with one of the workflows and I don't think it gets called very often, as most of the time the broadcast applies.


try:
job_conf = self._prep_submit_task_job_impl(
workflow,
Expand Down
51 changes: 51 additions & 0 deletions tests/functional/platforms/12-ping-pong.t
@@ -0,0 +1,51 @@
#!/usr/bin/env bash
Member Author

I tried to write this as an integration test... But couldn't make it play ball.

# THIS FILE IS PART OF THE CYLC WORKFLOW ENGINE.
# Copyright (C) NIWA & British Crown (Met Office) & Contributors.
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#-----------------------------------------------------------------------------

# If a task has a platform set using a subshell this should be evaluated
# every time the task is run.
# https://github.com/cylc/cylc-flow/issues/6808
export REQUIRE_PLATFORM='loc:remote fs:indep'

. "$(dirname "$0")/test_header"

set_test_number 3

create_test_global_config "" "
[platforms]
[[${CYLC_TEST_HOST}]]
hosts = ${CYLC_TEST_HOST}
install target = ${CYLC_TEST_HOST}
"

install_workflow "${TEST_NAME_BASE}" "${TEST_NAME_BASE}"

cat > var_file <<__HERE__
REMOTE_HOST="${CYLC_TEST_HOST}"
__HERE__

workflow_run_ok "${TEST_NAME_BASE}-run" \
cylc play "${WORKFLOW_NAME}" --debug --no-detach --set-file var_file

named_grep_ok "1/remote_task submits to ${CYLC_TEST_HOST}" \
"\[1/remote_task/01:preparing\] submitted to ${CYLC_TEST_HOST}" \
"${WORKFLOW_RUN_DIR}/log/scheduler/log"

named_grep_ok "2/remote_task submits to localhost" \
"\[2/remote_task/01:preparing\] submitted to localhost" \
"${WORKFLOW_RUN_DIR}/log/scheduler/log"

44 changes: 44 additions & 0 deletions tests/functional/platforms/12-ping-pong/flow.cylc
@@ -0,0 +1,44 @@
#!jinja2
[scheduler]
[[events]]
stall timeout = PT0S

[scheduling]
cycling mode = integer
final cycle point = 2
[[graph]]
P1 = remote_task[-P1] => toggler => remote_task

[runtime]
[[remote_task]]
platform = $(cat ${CYLC_WORKFLOW_RUN_DIR}/pretend-hall-info)

[[toggler]]
script = """
# Toggle the platform between localhost and the remote host
# by rewriting the content of ${TEST_DIR}/pretend-hall-info.

A="localhost"
B="${REMOTE_HOST}"

# Create pretend-hall-info file if it does not exist
if [[ ! -f ${TEST_DIR}/pretend-hall-info ]]; then
echo "localhost" > ${TEST_DIR}/pretend-hall-info
fi

# Read the current platform from the file and toggle it
this=$(cat ${TEST_DIR}/pretend-hall-info)
echo ${TEST_DIR} $this
echo "content of pretend-hall-info: $(cat ${TEST_DIR}/pretend-hall-info)"
if [[ $this == "localhost" ]]; then
echo ${REMOTE_HOST} > ${TEST_DIR}/pretend-hall-info
cylc message -- "WARNING:changing platform to ${REMOTE_HOST}"
else
echo "localhost" > ${TEST_DIR}/pretend-hall-info
cylc message -- "WARNING:changing platform to localhost"
fi
"""
[[[environment]]]
REMOTE_HOST = {{ CYLC_TEST_HOST }}
TEST_DIR = ${CYLC_WORKFLOW_RUN_DIR}