message: network glitches can cause custom outputs to be missed #7257

@oliver-sanders

Description

The cylc message command attempts to contact the scheduler (push messaging) if configured to do so.

However, it also backs up the message in the job.status file, where it can be accessed by polling. This means that message receipt is guaranteed irrespective of any network glitches.

This guarantee holds for the started message as well as the final task outputs (expired, succeeded and failed). However, it does not hold for custom messages.
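The backup mechanism described above can be illustrated with a toy model. This is only a sketch, not Cylc's implementation: `send_message`, `poll`, `broken_push` and the `MESSAGE=` line format are all hypothetical names invented for illustration, and the real job.status layout may differ.

```python
import tempfile
from pathlib import Path


def send_message(status_file: Path, message: str, push) -> bool:
    """Record a message in the status file, then try to push it.

    Returns True if the push succeeded, False if it failed.
    """
    # Back up first, so the message survives a failed push.
    # "MESSAGE=" is a made-up key for this sketch, not the real format.
    with status_file.open("a") as handle:
        handle.write(f"MESSAGE={message}\n")
    try:
        push(message)
    except ConnectionError:
        return False
    return True


def poll(status_file: Path) -> list[str]:
    """Read every recorded message back from the status file."""
    prefix = "MESSAGE="
    return [
        line[len(prefix):]
        for line in status_file.read_text().splitlines()
        if line.startswith(prefix)
    ]


def broken_push(message: str) -> None:
    """Stand-in for a push attempt made during a network glitch."""
    raise ConnectionError("scheduler unreachable")


with tempfile.TemporaryDirectory() as tmp:
    status_file = Path(tmp) / "job.status"
    pushed = send_message(status_file, "xxx", broken_push)
    recovered = poll(status_file)
```

Here the push fails, but the message is still recoverable from the file. The catch is that something has to trigger the poll, which is the gap this issue describes for custom messages.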

Reproducible example

[scheduling]
    [[graph]]
        R1 = a:x => b

[runtime]
    [[a]]
        script = """
            # wait for the started message
            cylc__job__wait_cylc_message_started

            # disable push-messaging
            mv "${CYLC_WORKFLOW_RUN_DIR}/.service/contact" contact.safe

            # send the custom message
            cylc message -- xxx

            # restore push-messaging
            mv contact.safe "${CYLC_WORKFLOW_RUN_DIR}/.service/contact"
        """
        execution polling intervals = PT30S, 10*PT1S
        [[[outputs]]]
            x = xxx
    [[b]]

Order of events:

  • started message is sent and received by the scheduler.
  • xxx message sending fails.
  • succeeded message is sent and received by the scheduler.

The xxx message is never received by the scheduler, resulting in an incomplete task.

A cylc poll command will rectify the situation, causing the custom message to be returned to the scheduler. This is a handy workaround, but polling will not happen automatically, so you have to know that a message has been missed.

However, polling only works because the task is incomplete. If we modify this example like so:

R1 = """
a:x? => b
a => sleep
"""

Then the task cannot be polled after completing.

Solutions

There are a few ways we could solve this, restoring guaranteed receipt for custom messages:

  1. Send an incrementing ID with each message:
    • The Cylc SSH/TCP client would add a metadata field for the message ID.
    • E.g., this ID could be an integer, starting at 1 and incrementing with each message (determined, say, by grepping the job.status file).
    • In this example, the started message would get the ID 1, the xxx message 2, the succeeded message 3.
    • The scheduler would track the highest number it has received (backing this up in the database).
    • If the scheduler receives message 1 followed by 3, then it knows a message is missing and should schedule a poll.
  2. Use a single cylc message client for all messages:
    • ZMQ can provide guarantees over message order when multiple messages are sent via the same client.
    • So if we used a single client for all messages, we may be able to prevent this race condition from happening at the network layer.
    • This is tricky because custom messages may be sent by scripts via the CLI.
    • However, we could drop the network part of cylc message (leaving it to just write to the job.status file).
    • A listener subprocess of the job script could then poll the job.status file and perform the network portion of the work.
    • Note, we now have a pattern for running a Python service alongside the job script courtesy of the cylc profiler.
  3. Other?
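The scheduler-side check in option 1 could look roughly like the following. This is a minimal sketch under stated assumptions: `MessageTracker` and `receive` are hypothetical names, not part of Cylc, and the highest ID is held in memory here rather than backed up in the database as the proposal suggests.

```python
class MessageTracker:
    """Track message IDs for one job and flag gaps (hypothetical helper)."""

    def __init__(self) -> None:
        # Highest message ID received so far; 0 means nothing received yet.
        self.highest_seen = 0

    def receive(self, message_id: int) -> bool:
        """Record an ID; return True if a poll should be scheduled."""
        if message_id <= self.highest_seen:
            # Duplicate or late arrival: no new gap is revealed.
            return False
        # A jump of more than one means at least one message was skipped.
        gap = message_id > self.highest_seen + 1
        self.highest_seen = message_id
        return gap


# Replay the example: started (1) arrives, xxx (2) is lost, succeeded (3) arrives.
tracker = MessageTracker()
results = [tracker.receive(message_id) for message_id in (1, 3)]
```

In this replay, `results` is `[False, True]`: the jump from 1 to 3 is what would prompt the scheduler to schedule a poll. Note that a check like this only detects a gap once a later message arrives.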

Metadata

    Labels

    bug: Something is wrong :(
