ES|QL: Wrap remote errors with cluster name to provide more context #123156

@pawankartik-elastic pawankartik-elastic commented Feb 21, 2025

Previously, if a remote encountered an error, the query would fail and return the stack trace to the user. However, that output does not mention the name of the remote. This PR attempts to provide this context.

Here's why I introduced a new exception:

  1. ElasticsearchException is too generic and returns a status code 500 (since its causes will not be unwrapped),
  2. Other exceptions like SearchException cover only a subset of the errors that could be thrown at this point (and SearchException is meant specifically for a search error originating within a shard), and,
  3. Existing wrapper exceptions cannot be used, because wrappers get unwrapped when the error is sent back to the user (which loses the context we've built up specifically for the user).
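
To make the first point concrete, here is a minimal, self-contained sketch of a wrapper that names the remote while taking its status from the cause. The classes `RemoteErrorSketch`, `StatusException`, and `RemoteError` are hypothetical stand-ins invented for illustration; they are not the Elasticsearch classes and not necessarily how this PR implements it.

```java
// Hypothetical sketch: none of these types are Elasticsearch classes.
class RemoteErrorSketch {

    /** Stand-in for an exception that knows which HTTP status it maps to. */
    static class StatusException extends RuntimeException {
        private final int status;

        StatusException(String message, int status) {
            super(message);
            this.status = status;
        }

        int status() {
            return status;
        }
    }

    /** Wrapper that adds the remote's name but reports the status of its cause. */
    static class RemoteError extends RuntimeException {
        RemoteError(String clusterAlias, Throwable cause) {
            super("Remote [" + clusterAlias + "] encountered an error", cause);
        }

        int status() {
            Throwable cause = getCause();
            // Fall back to 500 only when the cause carries no status of its own.
            return cause instanceof StatusException ? ((StatusException) cause).status() : 500;
        }
    }

    public static void main(String[] args) {
        RemoteError e = new RemoteError("remote1", new StatusException("no such index", 404));
        System.out.println(e.getMessage() + " (status " + e.status() + ")");
    }
}
```

With this shape the user sees which remote failed, while a 404 from the remote is still reported as a 404 rather than a generic 500.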

Action items:

  • Verify if the exception can be serialised over the wire, and,
  • Check for any concerns wrt backwards compatibility.

Assuming an exception of type <exception type> is thrown, the response without wrapping looks like:

{
    "error": {
        "root_cause": [
            {
                "type": "<exception type>",
                "reason": "<exception message>"
            }
        ],
        "type": "<exception type>",
        "reason": "<exception message>",
        "suppressed": [
            {
                // Suppressed stack trace
            }
        ]
    },
    "status": <appropriate error code that represents cause>
}

With wrapping:

{
    "error": {
        "root_cause": [
            {
                "type": "remote_exception",
                "reason": "Remote [remote1] encountered an error",
                "suppressed": [
                    {
                       // Suppressed stack trace
                    }
                ]
            }
        ],
        "type": "remote_exception",
        "reason": "Remote [remote1] encountered an error",
        "caused_by": {
            "type": "<exception type>",
            "reason": "<exception message>"
        },
        "suppressed": [
            {
                // Suppressed stack trace
            }
        ]
    },
    "status": <appropriate error code that represents cause>
}
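
The `caused_by` nesting above follows the Java cause chain of the wrapping exception. A rough sketch of that mapping, assuming a simplified renderer: `ErrorRenderSketch`, `typeName`, and `render` are hypothetical names, and the snake_case naming rule is an approximation for illustration rather than the actual Elasticsearch rendering code.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical renderer: approximates the response shape, not the real implementation.
class ErrorRenderSketch {

    /** Assumed naming rule: IllegalStateException becomes illegal_state_exception. */
    static String typeName(Throwable t) {
        return t.getClass().getSimpleName()
            .replaceAll("([a-z])([A-Z])", "$1_$2")
            .toLowerCase(Locale.ROOT);
    }

    /** Renders an exception and, recursively, its cause under "caused_by". */
    static Map<String, Object> render(Throwable t) {
        Map<String, Object> out = new LinkedHashMap<>();
        out.put("type", typeName(t));
        out.put("reason", t.getMessage());
        if (t.getCause() != null) {
            out.put("caused_by", render(t.getCause()));
        }
        return out;
    }

    public static void main(String[] args) {
        Throwable cause = new IllegalStateException("shard failed");
        Throwable wrapped = new RuntimeException("Remote [remote1] encountered an error", cause);
        System.out.println(render(wrapped));
    }
}
```

The outer map carries the remote's name in `reason`, and the original error survives one level down, which is the structure the wrapped response is meant to expose.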

@pawankartik-elastic pawankartik-elastic added the :Search Foundations/Search, >enhancement, auto-backport, Team:Search Foundations, and v9.0.1 labels Mar 3, 2025
@elasticsearchmachine
Collaborator

Hi @pawankartik-elastic, I've created a changelog YAML for you.

@pawankartik-elastic
Contributor Author

Okay, so as expected, the breakages are primarily around security exceptions, and task cancellations.

@pawankartik-elastic pawankartik-elastic marked this pull request as ready for review March 13, 2025 10:05
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@@ -300,6 +301,7 @@ public void execute(
cancelQueryOnFailure,
execInfo,
computeListener.acquireCompute()
.delegateResponse((l, ex) -> l.onFailure(new RemoteComputeException(cluster.clusterAlias(), ex)))
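
For readers unfamiliar with the pattern in the changed line, the following self-contained sketch shows the idea: responses pass through untouched, while failures are wrapped with the cluster alias before reaching the original listener. The `Listener` interface here is a simplified, hypothetical stand-in for the real listener type, and a plain `RuntimeException` stands in for `RemoteComputeException`.

```java
import java.util.function.BiConsumer;

// Hypothetical stand-ins: a simplified listener, not the Elasticsearch listener API.
class DelegateResponseSketch {

    interface Listener<T> {
        void onResponse(T value);

        void onFailure(Exception e);

        /** Forwards responses unchanged and routes failures through the handler. */
        default Listener<T> delegateResponse(BiConsumer<Listener<T>, Exception> handler) {
            Listener<T> delegate = this;
            return new Listener<T>() {
                @Override
                public void onResponse(T value) {
                    delegate.onResponse(value);
                }

                @Override
                public void onFailure(Exception e) {
                    handler.accept(delegate, e);
                }
            };
        }
    }

    /** Fails a wrapped listener and returns what the original listener observed. */
    static String describeFailure(String clusterAlias, Exception failure) {
        StringBuilder seen = new StringBuilder();
        Listener<String> base = new Listener<String>() {
            @Override
            public void onResponse(String value) {
                seen.append("ok: ").append(value);
            }

            @Override
            public void onFailure(Exception e) {
                seen.append(e.getMessage()).append(" <- ").append(e.getCause().getMessage());
            }
        };
        Listener<String> wrapped = base.delegateResponse(
            (l, ex) -> l.onFailure(new RuntimeException("Remote [" + clusterAlias + "] encountered an error", ex))
        );
        wrapped.onFailure(failure);
        return seen.toString();
    }

    public static void main(String[] args) {
        System.out.println(describeFailure("remote1", new IllegalStateException("boom")));
    }
}
```

Because the wrapping happens in the failure path of the listener, the compute code itself does not need to know which cluster it is talking to.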
Member

I am not familiar enough with the ComputeService to determine whether this exception is a local only exception, that will never be serialized through the wire. Is that the case?

Contributor Author

Yes, it's on my mind right now and I hope to get it confirmed with Nhat later today. Sounds good?

Contributor

Specifically, startComputeOnRemoteCluster is always called on the coordinating node. However, I am not sure that this means "it will never be serialized": there seem to be scenarios, e.g. with an async response, where the end result is serialized, and in that case this exception might have to be serialized too.

Member

@dnhatn dnhatn Mar 13, 2025

What Stas said is correct. This exception is local with a sync query but can be serialized with an async query.

Maybe add an async query with failures to verify this?

Member

cool, if it can be serialized then it needs to be registered as a serializable one, which makes me wonder if we can reuse an existing one instead to avoid that ceremony :)

Contributor Author

Thank you for the catch! Yes, the exception is getting serialised for asynchronous queries and has to be handled accordingly.
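
A round-trip check of the kind suggested above could look roughly like this. It is a sketch under loud assumptions: `java.io` data streams stand in for the real wire format, and `WireSketch`, `RemoteError`, `writeTo`, `readFrom`, and `roundTrip` are hypothetical names, not the Elasticsearch serialization API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch: plain java.io streams stand in for the real wire format.
class WireSketch {

    static class RemoteError extends RuntimeException {
        private final String clusterAlias;

        RemoteError(String clusterAlias, Throwable cause) {
            super("Remote [" + clusterAlias + "] encountered an error", cause);
            this.clusterAlias = clusterAlias;
        }

        /** Writes the fields needed to rebuild the exception on the other side. */
        void writeTo(DataOutputStream out) throws IOException {
            out.writeUTF(clusterAlias);
            out.writeUTF(String.valueOf(getCause().getMessage()));
        }

        /** Reads the fields in the order they were written (the cause type is not kept here). */
        static RemoteError readFrom(DataInputStream in) throws IOException {
            String alias = in.readUTF();
            String causeMessage = in.readUTF();
            return new RemoteError(alias, new RuntimeException(causeMessage));
        }
    }

    /** Serializes to bytes and back, as an async result store would force us to do. */
    static RemoteError roundTrip(RemoteError original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                original.writeTo(out);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return RemoteError.readFrom(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        RemoteError copy = roundTrip(new RemoteError("remote1", new IllegalStateException("boom")));
        System.out.println(copy.getMessage() + " / " + copy.getCause().getMessage());
    }
}
```

The point of such a test is that the remote's name and the cause both survive the trip; a sync-only test would never exercise this path.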

wonder if we can reuse an existing one instead to avoid that ceremony

To re-use an existing exception, we primarily need to fulfil 2 requirements:

  1. It should propagate the status of the cause, and,
  2. It should not implement the ES wrapper interface to prevent unwrapping when the error is sent back to the user (which discards the context we've built up, i.e. the remote's name).

I don't see any exceptions that we can reuse.
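
The second requirement can be illustrated with a small sketch of how unwrapping discards context. Everything here is hypothetical and simplified: `WrapperMarker` stands in for the ES wrapper interface mentioned above, and `unwrap` approximates the behaviour described in this thread rather than reproducing the real helper.

```java
// Hypothetical sketch of wrapper unwrapping; not the Elasticsearch helper code.
class UnwrapSketch {

    /** Stand-in for a marker interface that flags an exception as a mere wrapper. */
    interface WrapperMarker {}

    static class MarkedRemoteError extends RuntimeException implements WrapperMarker {
        MarkedRemoteError(String clusterAlias, Throwable cause) {
            super("Remote [" + clusterAlias + "] encountered an error", cause);
        }
    }

    static class UnmarkedRemoteError extends RuntimeException {
        UnmarkedRemoteError(String clusterAlias, Throwable cause) {
            super("Remote [" + clusterAlias + "] encountered an error", cause);
        }
    }

    /** Peels off marked wrappers until a non-wrapper (or a wrapper without cause) is found. */
    static Throwable unwrap(Throwable t) {
        Throwable current = t;
        while (current instanceof WrapperMarker && current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /** What the user would see as the top-level reason after unwrapping. */
    static String reportedReason(Throwable t) {
        return unwrap(t).getMessage();
    }

    public static void main(String[] args) {
        Exception cause = new IllegalStateException("boom");
        System.out.println(reportedReason(new MarkedRemoteError("remote1", cause)));
        System.out.println(reportedReason(new UnmarkedRemoteError("remote1", cause)));
    }
}
```

In this sketch the marked wrapper is stripped and only the cause's message reaches the user, while the unmarked one keeps the remote's name at the top level, which is why a reusable exception must not carry the wrapper marker.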

user and move `unwrapIfWrappedInRemoteComputeException` to `EsqlTestUtils`
@pawankartik-elastic pawankartik-elastic changed the title from "Wrap remote errors with cluster name to provide more context" to "ES|QL: Wrap remote errors with cluster name to provide more context" Apr 1, 2025
Contributor

@quux00 quux00 left a comment

LGTM. Thanks for doing this!

@pawankartik-elastic pawankartik-elastic merged commit e4fb22c into elastic:main Apr 2, 2025
17 checks passed
@elasticsearchmachine
Collaborator

elasticsearchmachine commented Apr 2, 2025

💔 Backport failed

Branch 8.x: commit could not be cherry-picked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 123156

@pawankartik-elastic
Contributor Author

💚 All backports created successfully

Branch: 8.x

pawankartik-elastic added a commit to pawankartik-elastic/elasticsearch that referenced this pull request Apr 2, 2025
…lastic#123156)

Wrap remote errors with cluster name to provide more context

Previously, if a remote encountered an error, the user would see a top-level error that gave no context about which remote ran into it. Now, such errors are wrapped in a separate remote exception whose message names the remote cluster, and the original error becomes the cause of this remote exception.

(cherry picked from commit e4fb22c)

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
#	x-pack/plugin/esql/qa/testFixtures/src/main/java/org/elasticsearch/xpack/esql/EsqlTestUtils.java
pawankartik-elastic added a commit that referenced this pull request Apr 2, 2025
…123156) (#126165)

andreidan pushed a commit to andreidan/elasticsearch that referenced this pull request Apr 9, 2025
…lastic#123156)

@pawankartik-elastic pawankartik-elastic deleted the pkar/esql-wrap-remote-errors branch June 26, 2025 14:02
Labels
auto-backport, backport pending, >enhancement, :Search Foundations/Search, Team:Search Foundations, v8.19.0, v9.1.0
6 participants