message: network glitches can cause custom outputs to be missed

The `cylc message` command attempts to contact the scheduler (push messaging) if configured to do so.

However, it also backs up the message in the `job.status` file, where it can be retrieved by polling. This means that message receipt is guaranteed irrespective of any network glitches.

This guarantee holds for the `started` message as well as the final task outputs (`expired`, `succeeded` and `failed`). However, it does not hold for custom messages.
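
To illustrate the backup mechanism, here is a minimal sketch of recovering recorded messages from the contents of a status file. It assumes, for illustration only, that the file holds `KEY=VALUE` lines and that custom messages are recorded as `CYLC_MESSAGE=<timestamp>|<severity>|<text>`. The helper name `read_recorded_messages` and the sample contents are hypothetical and not taken from the Cylc code base.

```python
def read_recorded_messages(status_text):
    """Return (severity, text) pairs for each recorded custom message.

    Assumes KEY=VALUE lines, with custom messages stored under the
    CYLC_MESSAGE key (an assumption made for this sketch).
    """
    messages = []
    for line in status_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key == "CYLC_MESSAGE":
            # Assumed layout: <timestamp>|<severity>|<message text>
            _timestamp, severity, text = value.split("|", 2)
            messages.append((severity, text))
    return messages


# Hypothetical status file contents for the example task.
example = "\n".join([
    "CYLC_JOB_INIT_TIME=2025-01-01T00:00:00Z",
    "CYLC_MESSAGE=2025-01-01T00:00:05Z|INFO|xxx",
    "CYLC_JOB_EXIT=SUCCEEDED",
])
recovered = read_recorded_messages(example)
```

Because the message is on disk, a poll can still recover it even when the push attempt never reached the scheduler.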

### Reproducible example

```cylc
[scheduling]
    [[graph]]
        R1 = a:x => b

[runtime]
    [[a]]
        script = """
            # wait for the started message
            cylc__job__wait_cylc_message_started

            # disable push-messaging
            mv "${CYLC_WORKFLOW_RUN_DIR}/.service/contact" contact.safe

            # send the custom message
            cylc message -- xxx

            # restore push-messaging
            mv contact.safe "${CYLC_WORKFLOW_RUN_DIR}/.service/contact"
        """
        execution polling intervals = PT30S, 10*PT1S
        [[[outputs]]]
            x = xxx
    [[b]]
```

Order of events:

* `started` message is sent and received by the scheduler.
* `xxx` message sending fails.
* `succeeded` message is sent and received by the scheduler.

The `xxx` message is never received by the scheduler, resulting in an incomplete task.

A `cylc poll` command will rectify the situation, causing the custom message to be returned to the scheduler. This is a handy workaround, but polling will not happen automatically, so you have to know that a message was missed.

However, polling only works because the task is incomplete. If we modify this example like so:

```cylc
R1 = """
a:x? => b
a => sleep
"""
```

Then the task cannot be polled after completing.


### Solutions

There are a few ways we could solve this, restoring guaranteed receipt for custom messages:

1. Send an incrementing ID with each message:
   * The Cylc SSH/TCP client would add a metadata field for the message ID.
   * E.g., this ID could be an integer, starting at `1` and incrementing with each message (determined, say, by grepping the `job.status` file).
   * In this example, the `started` message would get the ID `1`, the `xxx` message `2`, and the `succeeded` message `3`.
   * The scheduler would track the highest number it has received (backing this up in the database).
   * If the scheduler receives message `1` followed by `3`, then it knows a message is missing and should schedule a poll.
2. Use a single `cylc message` client for all messages:
   * ZMQ can provide guarantees over message order when multiple messages are sent via the same client.
   * So if we used a single client for all messages, we may be able to prevent this race condition from happening at the network layer.
   * This is tricky because custom messages may be sent by scripts via the CLI.
   * However, we could drop the network part of `cylc message` (leaving it to just write to the `job.status` file).
   * And have a listener subprocess of the job script which polls the `job.status` file and performs the network portion of the work.
   * Note, we now have a pattern for running a Python service alongside the job script courtesy of the `cylc profiler`.
3. Other?
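
The gap detection in option 1 could be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed implementation: the class `MessageSequenceTracker` and its `receive` method are hypothetical names, IDs are assumed to start at `1` for each job, and persisting the counter to the database is left out.

```python
class MessageSequenceTracker:
    """Track the highest message ID received for one job (hypothetical)."""

    def __init__(self):
        # No messages received yet; the first expected ID is 1.
        self.highest_seen = 0

    def receive(self, message_id):
        """Record a message ID; return True if a poll should be scheduled."""
        # An ID more than one above the highest seen means that at
        # least one earlier message never arrived.
        gap_detected = message_id > self.highest_seen + 1
        self.highest_seen = max(self.highest_seen, message_id)
        return gap_detected


# The scenario from the reproducible example: `started` (1) arrives,
# `xxx` (2) is lost, then `succeeded` (3) arrives.
tracker = MessageSequenceTracker()
poll_needed = [tracker.receive(message_id) for message_id in (1, 3)]
```

Here the second call reports a gap, which is the point at which the scheduler would schedule a poll to retrieve the missing `xxx` message from the `job.status` file.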

message: network glitches can cause custom outputs to be missed #7257