message: network glitches can cause custom outputs to be missed #7257
Description
The cylc message command attempts to contact the scheduler (push messaging) if configured to do so.
However, it also backs up the message in the job.status file, where it can be accessed by polling. This means that message receipt is guaranteed irrespective of any network glitches.
This guarantee holds for the started message as well as the final task outputs (expired, succeeded and failed). However, it does not hold for custom messages.
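The push-plus-backup behaviour described above can be modelled roughly as follows. This is a simplified sketch for illustration, not Cylc's actual implementation: the names record_message, poll_messages and push, and the MESSAGE= line format, are hypothetical stand-ins.

```python
from pathlib import Path
from typing import Callable, List, Optional


def record_message(
    status_file: Path,
    message: str,
    push: Optional[Callable[[str], None]] = None,
) -> bool:
    """Back the message up to disk, then try to push it.

    Returns True if the push succeeded, False otherwise.
    """
    # Always write the backup first so that polling can recover the message.
    # (Hypothetical line format, not the real job.status layout.)
    with status_file.open("a") as handle:
        handle.write(f"MESSAGE={message}\n")

    if push is None:
        # Push messaging is not configured.
        return False
    try:
        push(message)
    except OSError:
        # A network glitch: the message now survives only in the status file.
        return False
    return True


def poll_messages(status_file: Path) -> List[str]:
    """Read back every message recorded in the status file."""
    return [
        line.split("=", 1)[1]
        for line in status_file.read_text().splitlines()
        if line.startswith("MESSAGE=")
    ]
```

In this model a failed push loses nothing on disk, but the message only reaches the scheduler if something later triggers a poll, which is the gap this issue describes for custom messages.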
Reproducible example
[scheduling]
    [[graph]]
        R1 = a:x => b
[runtime]
    [[a]]
        script = """
            # wait for the started message
            cylc__job__wait_cylc_message_started
            # disable push-messaging
            mv "${CYLC_WORKFLOW_RUN_DIR}/.service/contact" contact.safe
            # send the custom message
            cylc message -- xxx
            # restore push-messaging
            mv contact.safe "${CYLC_WORKFLOW_RUN_DIR}/.service/contact"
        """
        execution polling intervals = PT30S, 10*PT1S
        [[[outputs]]]
            x = xxx
    [[b]]

Order of events:
- The started message is sent and received by the scheduler.
- Sending the xxx message fails.
- The succeeded message is sent and received by the scheduler.
The xxx message is never received by the scheduler resulting in an incomplete task.
A cylc poll command will rectify the situation, causing the custom message to be returned to the scheduler. This is a handy workaround, but polling will not happen automatically, so you have to know that a message was missed.
However, polling only works because the task is incomplete. If we modify this example like so:

    R1 = """
        a:x? => b
        a => sleep
    """

Then the task cannot be polled after completing.
Solutions
There are a few ways we could solve this, restoring guaranteed receipt for custom messages:
- Send an incrementing ID with each message:
  - The Cylc SSH/TCP client would add a metadata field for the message ID.
  - E.g., this ID could be an integer, starting at 1 and incrementing with each message (determined, say, by grepping the job.status file).
  - In this example, the started message would get the ID 1, the xxx message 2, and the succeeded message 3.
  - The scheduler would track the highest number it has received (backing this up in the database).
  - If the scheduler receives message 1 followed by 3, then it knows a message is missing and should schedule a poll.
- Use a single cylc message client for all messages:
  - ZMQ can provide guarantees over message order when multiple messages are sent via the same client.
  - So if we used a single client for all messages, we may be able to prevent this race condition from happening at the network layer.
  - This is tricky because custom messages may be sent by scripts via the CLI.
  - However, we could drop the network part of cylc message (leaving it to just write to the job.status file).
  - And have a listener subprocess of the job script which polls the job.status file and performs the network portion of the work.
  - Note, we now have a pattern for running a Python service alongside the job script courtesy of the cylc profiler.
- Other?
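
The first option (incrementing message IDs) could be sketched on the scheduler side roughly as below. This is a minimal illustration under stated assumptions, not a proposed implementation: the MessageTracker class and its receive method are hypothetical names, IDs are assumed to start at 1, and persisting the counter to the database is left out.

```python
from dataclasses import dataclass


@dataclass
class MessageTracker:
    """Track the highest message ID received for one job (hypothetical)."""

    highest: int = 0  # no messages received yet; IDs are assumed to start at 1

    def receive(self, message_id: int) -> bool:
        """Record a message ID.

        Returns True if the ID skips ahead of the next expected one, meaning
        an earlier message was missed and a poll should be scheduled.
        """
        gap = message_id > self.highest + 1
        self.highest = max(self.highest, message_id)
        return gap


# The scenario from the reproducible example:
tracker = MessageTracker()
poll_after_started = tracker.receive(1)    # started: no gap
poll_after_succeeded = tracker.receive(3)  # succeeded: xxx (ID 2) never arrived
```

Here poll_after_started is False and poll_after_succeeded is True, so the scheduler would poll the job and recover the xxx message from the job.status file. Tracking only the highest ID keeps the state small, but note that it cannot detect a missed message that was the last one sent.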