Skip to content

Improve server-client communication error handling#6578

Merged
oliver-sanders merged 7 commits intocylc:8.4.xfrom
MetRonnie:graphql-err-handling
Feb 25, 2025
Merged

Improve server-client communication error handling#6578
oliver-sanders merged 7 commits intocylc:8.4.xfrom
MetRonnie:graphql-err-handling

Conversation

@MetRonnie
Copy link
Copy Markdown
Member

@MetRonnie MetRonnie commented Jan 28, 2025

In #6499 we added a new field to the trigger GraphQL mutation, and this unfortunately broke the ability for cylc trigger at 8.4 to work on workflows running in 8.3 (I shall term this "inter-version server-client comms").

This PR does not fix that (we don't think a fix is feasible at this stage).

An additional problem is that the error handling for server-client comms is broken - that is what this PR fixes. And it attempts to ensure we have defined format for server-client comms which should reduce the chance of breaking inter-version server-client comms in future.

Before

# Workflow running at Cylc 8.3.6, CLI at Cylc 8.4.0
$ cylc trigger wflow//1/foo
Error processing command:
    AttributeError: 'list' object has no attribute 'values'

After

# Workflow running at Cylc 8.3.6, CLI at this Cylc <this PR>
$ cylc trigger wflow//1/foo
Error processing command:
    Exception: This command is no longer compatible with the version of Cylc running the workflow. Please stop & restart the workflow with Cylc 8.4.1.dev or higher.

[{'error': {'message': 'Unknown argument "onResume" on field "trigger" of type "Mutations".', 'traceback': ['graphql.error.base.GraphQLError: Unknown argument "onResume" on field "trigger" of type "Mutations".\n']}}]

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • No dependency changes
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users
  • No docs needed
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@MetRonnie MetRonnie added the bug Something is wrong :( label Jan 28, 2025
@MetRonnie MetRonnie added this to the 8.4.1 milestone Jan 28, 2025
@MetRonnie MetRonnie self-assigned this Jan 28, 2025
@MetRonnie MetRonnie force-pushed the graphql-err-handling branch from 5dd62b7 to ee3102d Compare January 28, 2025 17:19
@MetRonnie MetRonnie force-pushed the graphql-err-handling branch from ee3102d to ff80511 Compare January 29, 2025 16:24
client = WorkflowRuntimeClient(one.workflow)
with monkeypatch.context() as mp:
mp.setattr(client, 'socket', Mock(recv=mock_recv))
mp.setattr(client, 'poller', Mock())
Copy link
Copy Markdown
Member Author

@MetRonnie MetRonnie Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future/historical reference, I originally tried mocking the poll method of client.poller, i.e.

mp.setattr(client.poller, 'poll', Mock())

but this resulted in pytest hanging near the end of running all integration tests; some strange interaction due to the ZMQ poller using threading or something, dunno.

Anyway, the reason for this monkeypatching is to get client.poller.poll() to return a truthy value so the socket can receive the mock response

# receive response
if self.poller.poll(timeout):
res = await self.socket.recv()

Copy link
Copy Markdown
Member

@wxtim wxtim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made two suggestions, but neither should block this PR. :)

@MetRonnie MetRonnie requested a review from wxtim February 11, 2025 15:49
Copy link
Copy Markdown
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand what bug this fixes.

The changes I can see:

  • messages/traceback field removed
  • cylc_version field added
  • Errors now force exceptions in multi mode (was this the bug?).
  • Refactoring.
  • Nicer error for the trigger incompatibility issue.

@MetRonnie
Copy link
Copy Markdown
Member Author

MetRonnie commented Feb 20, 2025

Can you help me understand what bug this fixes

Before

# Workflow running at Cylc 8.3.6, CLI at Cylc 8.4.0
$ cylc trigger wflow//1/foo
Error processing command:
    AttributeError: 'list' object has no attribute 'values'

After

# Workflow running at Cylc 8.3.6, CLI at this Cylc <this PR>
$ cylc trigger wflow//1/foo
Error processing command:
    Exception: This command is no longer compatible with the version of Cylc running the workflow. Please stop & restart the workflow with Cylc 8.4.1.dev or higher.

[{'error': {'message': 'Unknown argument "onResume" on field "trigger" of type "Mutations".', 'traceback': ['graphql.error.base.GraphQLError: Unknown argument "onResume" on field "trigger" of type "Mutations".\n']}}]

(Added to OP)

@MetRonnie MetRonnie force-pushed the graphql-err-handling branch from 0fac2c7 to f751949 Compare February 20, 2025 16:31
@MetRonnie MetRonnie force-pushed the graphql-err-handling branch from b81932c to 9505212 Compare February 21, 2025 12:58
@MetRonnie MetRonnie force-pushed the graphql-err-handling branch from 9505212 to 43cab17 Compare February 21, 2025 15:44
@oliver-sanders oliver-sanders self-requested a review February 24, 2025 11:53
@oliver-sanders
Copy link
Copy Markdown
Member

oliver-sanders commented Feb 24, 2025

I think this PR is stripping away exception context?

I tried creating an internal error with this diff:

diff --git a/cylc/flow/network/resolvers.py b/cylc/flow/network/resolvers.py
index 79b91f97e..702b3f573 100644
--- a/cylc/flow/network/resolvers.py
+++ b/cylc/flow/network/resolvers.py
@@ -787,6 +787,7 @@ class Resolvers(BaseResolvers):
         cutoff: Any = None
     ):
         """Put or clear broadcasts."""
+        raise ValueError('Unsupported broadcast mode')
         if settings is not None:
             # Convert schema field names to workflow config setting names if
             # applicable:

Results:

# scheduler=8.4.0, client=*
ERROR: [{'error': {'message': 'Unsupported broadcast mode', 'traceback': ['graphql.error.located_error.GraphQLLocatedError: Unsupported broadcast mode\n']}}]

# scheduler=this branch, client=*
ERROR: {'broadcast': None}

@MetRonnie MetRonnie force-pushed the graphql-err-handling branch from aa12785 to 1415531 Compare February 24, 2025 15:22
@MetRonnie
Copy link
Copy Markdown
Member Author

Should be sorted in 1415531

}
but this is not 100% consistent unfortunately
"""
error: Union[Exception, str, dict]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is error an Exception or a str?

Copy link
Copy Markdown
Member Author

@MetRonnie MetRonnie Feb 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I tried to change it to Exception | str to begin with, to simplify things, but realised it breaks inter-version compatibility. So presently I don't think it ever is an Exception or str. However, with these changes we should be able to change it to be Exception | str in future without breaking inter-version compatibility with this version onwards.

@oliver-sanders
Copy link
Copy Markdown
Member

oliver-sanders commented Feb 25, 2025

Using a diff along the lines of this:

diff --git a/cylc/flow/network/resolvers.py b/cylc/flow/network/resolvers.py
index 79b91f97e..702b3f573 100644
--- a/cylc/flow/network/resolvers.py
+++ b/cylc/flow/network/resolvers.py
@@ -787,6 +787,7 @@ class Resolvers(BaseResolvers):
         cutoff: Any = None
     ):
         """Put or clear broadcasts."""
+        raise ValueError('Unsupported broadcast mode')
         if settings is not None:
             # Convert schema field names to workflow config setting names if
             # applicable:

We will now get an error something like this:

ClientError: {'message': 'Unsupported broadcast mode'}
  • We are reporting the string ClientError for an error that was received from the server?! Doesn't really make sense.
  • We are presenting the error as a dictionary (this is the same as the previous behaviour, but can be easily improved).

This patch changes the string ClientError to RequestError and uses the message field from the dictionary rather than the whole dictionary:

diff --git a/cylc/flow/exceptions.py b/cylc/flow/exceptions.py
index 802cfaaa9..a3eff7388 100644
--- a/cylc/flow/exceptions.py
+++ b/cylc/flow/exceptions.py
@@ -285,6 +285,10 @@ class ClientError(CylcError):
         return ret

+class RequestError(ClientError):
+    """Represents an error returned by the server."""
+
+
 class WorkflowStopped(ClientError):
     """The Cylc scheduler you attempted to connect to is stopped."""
diff --git a/cylc/flow/network/client.py b/cylc/flow/network/client.py
index 656aefd5c..89dac4cbb 100644
--- a/cylc/flow/network/client.py
+++ b/cylc/flow/network/client.py
@@ -40,10 +40,10 @@ from cylc.flow import (
     __version__ as CYLC_VERSION,
 )
 from cylc.flow.exceptions import (
-    ClientError,
     ClientTimeout,
     ContactFileExists,
     CylcError,
+    RequestError,
     WorkflowStopped,
 )
 from cylc.flow.hostuserutil import get_fqdn_by_host
@@ -332,17 +332,23 @@ class WorkflowRuntimeClient(  # type: ignore[misc]
             return response['data']
         except KeyError:
             error = response.get('error')
-            if not error:
-                error = (
+            error_mesage: str
+            if isinstance(error, (str, Exception)):
+                error_mesage = str(error)
+            elif error is None:
+                error_mesage = (
                     f"Received invalid response for Cylc {CYLC_VERSION}: "
                     f"{response}"
                 )
                 wflow_cylc_ver = response.get('cylc_version')
                 if wflow_cylc_ver and wflow_cylc_ver != CYLC_VERSION:
-                    error += (
+                    error_mesage += (
                         f"\n(Workflow is running in Cylc {wflow_cylc_ver})"
                     )
-            raise ClientError(str(error)) from None
+            else:
+                error_mesage = error.get('message', 'ERROR')
+
+            raise RequestError(error_mesage) from None
 
     def get_header(self) -> dict:
         """Return "header" data to attach to each request for traceability.

@oliver-sanders oliver-sanders merged commit d8c355a into cylc:8.4.x Feb 25, 2025
26 of 27 checks passed
@MetRonnie MetRonnie deleted the graphql-err-handling branch February 25, 2025 14:44
wxtim added a commit to wxtim/cylc that referenced this pull request Mar 5, 2025
…n_cylc_show

* upstream/master: (27 commits)
  tui: open log files in external application (cylc#6611)
  fix typo in stop.py - options.max_poll -> options.max_polls
  schema: add first-parent descendants (cylc#6610)
  doc: improve NamespaceIDGlob description (cylc#6637)
  Document that "ex" prefix means "exclude" [skip ci]
  tests: remove string templating in SQL statements (cylc#6631)
  Merge pull request cylc#6629 from wxtim/tests.flake8
  Remove vestiges of authorisation layer removed in cylc#3845
  Bump dev version
  Prepare release 8.4.1
  Merge pull request cylc#6578 from MetRonnie/graphql-err-handling
  Add test
  Update cylc/flow/etc/examples/expiry/index.rst
  Correct type annotations
  Fix duplicate task triggers
  Add test for duplicate task triggers
  Pytest: full verbosity for assertions
  Fix test picking up user global config
  Wrapper script: fix `PATH` override preventing selection of Cylc version in GUI under Cylc Hub
  Get poll to return task failure if job/log has been removed. (cylc#6577)
  ...
wxtim added a commit to wxtim/cylc that referenced this pull request Mar 12, 2025
* upstream/8.4.x:
  Ensure `cylc message` exceptions are printed to stderr (cylc#6647)
  fix typo in stop.py - options.max_poll -> options.max_polls
  doc: improve NamespaceIDGlob description (cylc#6637)
  Document that "ex" prefix means "exclude" [skip ci]
  tests: remove string templating in SQL statements (cylc#6631)
  Bump dev version
  Prepare release 8.4.1
  Merge pull request cylc#6578 from MetRonnie/graphql-err-handling
  Update cylc/flow/etc/examples/expiry/index.rst
  Pytest: full verbosity for assertions
  Fix test picking up user global config
  Wrapper script: fix `PATH` override preventing selection of Cylc version in GUI under Cylc Hub
  Get poll to return task failure if job/log has been removed. (cylc#6577)
  work around NFS caching issues (cylc#6603)
  allow workflow config to unset global events config (cylc#6518)
  examples: document expiry workflow design patterns
wxtim added a commit to wxtim/cylc that referenced this pull request Mar 13, 2025
* upstream/8.4.x:
  Ensure `cylc message` exceptions are printed to stderr (cylc#6647)
  fix typo in stop.py - options.max_poll -> options.max_polls
  doc: improve NamespaceIDGlob description (cylc#6637)
  Document that "ex" prefix means "exclude"
  tests: remove string templating in SQL statements (cylc#6631)
  Bump dev version
  Prepare release 8.4.1
  Merge pull request cylc#6578 from MetRonnie/graphql-err-handling
  Update cylc/flow/etc/examples/expiry/index.rst
  Pytest: full verbosity for assertions
  Fix test picking up user global config
  Wrapper script: fix `PATH` override preventing selection of Cylc version in GUI under Cylc Hub
  Get poll to return task failure if job/log has been removed. (cylc#6577)
  work around NFS caching issues (cylc#6603)
  allow workflow config to unset global events config (cylc#6518)
  examples: document expiry workflow design patterns
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something is wrong :(

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants