Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrap remote errors with cluster name to provide more context #123156

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

pawankartik-elastic
Copy link
Contributor

@pawankartik-elastic pawankartik-elastic commented Feb 21, 2025

Previously, if a remote encountered an error, it'd fail and provide the stacktrace to the user. However, this info does not mention name of the remote. This PR attemps to provide this context.

Here's why I introduced a new exception:

  1. ElasticsearchException is too generic and returns a status code 500 (since it's causes will not be unwrapped),
  2. Other exceptions like SearchException cover only a subset of all the errors that could be thrown at this point (and it's meant specifically for a search error originating within a shard), and,
  3. Cannot use any existing wrappers as wrappers get unwrapped when sending the error back to the user (which loses the context we've built up specifically for the user).

Action items:

  • Verify if the exception can be serialised over the wire, and,
  • Check for any concerns wrt backwards compatibility.

@pawankartik-elastic pawankartik-elastic added :Search Foundations/Search Catch all for Search Foundations >enhancement auto-backport Automatically create backport pull requests when merged Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v9.0.1 labels Mar 3, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @pawankartik-elastic, I've created a changelog YAML for you.

@pawankartik-elastic
Copy link
Contributor Author

Okay, so as expected, the breakages are primarily around security exceptions, and task cancellations.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@@ -300,6 +301,7 @@ public void execute(
cancelQueryOnFailure,
execInfo,
computeListener.acquireCompute()
.delegateResponse((l, ex) -> l.onFailure(new RemoteComputeException(cluster.clusterAlias(), ex)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar enough with the ComputeService to determine whether this exception is a local only exception, that will never be serialized through the wire. Is that the case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's on my mind right now and I hope to get it confirmed with Nhat later today. Sounds good?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically startComputeOnRemoteCluster is always called in the coordinating node. I am not sure however whether it means "it will be never serialized", as there seem to be scenarios - e.g. with async response - where the end result is serialized, and in that case this exception might have to be serialized too, I am not sure.

Copy link
Member

@dnhatn dnhatn Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What Stas said is correct. This exception is local with a sync query but can be serialized with an async query:

.

Maybe add an async query with failures to verify this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, if it can be serialized then it needs to be registered as a serializable one, which makes me wonder if we can reuse an existing one instead to avoid that ceremony :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the catch! Yes, the exception is getting serialised for asynchronous queries and has to be handled accordingly.

wonder if we can reuse an existing one instead to avoid that ceremony

To re-use an existing exception, we primarily need to fulfil 2 requirements:

  1. It should propagate the status of the cause, and,
  2. It should not implement the ES wrapper interface to prevent unwrapping when the error is sent back to the user (which discards the context we've built up, i.e. the remote's name).

I don't see any exceptions that we can reuse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged >enhancement :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v9.0.1 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants