Skip to content

Create a custom thrift protocol to ensure client/server compatibility #5398

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 41 commits into
base: main
Choose a base branch
from

Conversation

DomGarguilo
Copy link
Member

Addresses #5390

@DomGarguilo DomGarguilo added this to the 4.0.0 milestone Mar 11, 2025
@DomGarguilo DomGarguilo requested a review from ctubbsii March 11, 2025 20:06
@DomGarguilo DomGarguilo self-assigned this Mar 11, 2025
@DomGarguilo
Copy link
Member Author

Created this in draft since, for now, it only addresses the task outlined in the ticket:

create a custom protocol with the versioning information stored in a header and checks on the server side to validate it

I'm not sure if any of the other proposed tasks should be added here or as follow-on PRs.

public static class AccumuloProtocol extends TCompactProtocol {

private static final int MAGIC_NUMBER = 0x41434355; // "ACCU" in ASCII
private static final byte PROTOCOL_VERSION = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably add a comment to AccumuloDataVersion.CURRENT_VERSION that references this and mentions that when changing CURRENT_VERSION may also need to change this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering about that... whether the protocol version should be independent, so we can version the serialization format of the protocol header.

We will probably want to include the AccumuloDataVersion.CURRENT_VERSION in the protocol header directly, though. We could also include the actual Accumulo release version in here, too (major.minor.patch; e.g. 4.0.0). I think to verify, we'd probably want to ensure the same major and same minor version, if we want to be even more restrictive than the data version. We really shouldn't have Accumulo servers or clients/servers talking across major.minor versions (different patch releases are okay, and is important for rolling upgrades).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left it separate for now just in case we wanted to have multiple protocol versions able to talk to each other but am not sure about that.

The data version and protocol version definitely seem related but I'm not sure how closely tied we want to have those versions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the protocol version can be independently numbered, not connected to the data version. I think it's sufficient to check the Accumulo version (Constants.VERSION) major.minor. We don't need to use the data version in here at all. The protocol version is the version of the RPC protocol... the data version is the version of the data persisted in ZK and HDFS... and the Accumulo version is the overall version of the software. Only the RPC protocol version and the Accumulo version are relevant here.

Copy link
Member

@ctubbsii ctubbsii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we actually need to distinguish between client and server for the spans.

@DomGarguilo
Copy link
Member Author

Marking this as ready for review. I tried to address the following two suggestions from the ticket:

  1. Primarily, create a custom protocol with the versioning information stored in a header and checks on the server side to validate it, then
  2. Secondarily, simplify the thrift APIs by adding tracing info to the protocol header

There are a lot of changed files in this PR since I had to update the thrift code. There are also a lot of places where the only change is the removal of the TInfo object as the first param in a method call which was a trivial change. Hopefully that helps skip/skim over portions of the diff while reviewing.

@DomGarguilo DomGarguilo marked this pull request as ready for review March 17, 2025 20:57
@DomGarguilo
Copy link
Member Author

@kevinrr888 your suggestions should all be addressed as of 29c1234

in that commit I...

  • refactored the protocol version compatibility check method to throw an exception directly instead of return a boolean. I also updated the wording on that message to make things more clear
  • addressed the IDE warnings regarding nullness in various places
  • added javadoc to the static client and server factory methods in ThriftUtil to make it clearer which should be used

@keith-turner
Copy link
Contributor

keith-turner commented Apr 4, 2025

@DomGarguilo did not realize merging the ClientTabletCache changes would conflict w/ this. Makes sense though w/ all the TInfo changes. Would probably be good to get this merged in, as it will continue to have conflicts, the testing that was done sounds really good.

@DomGarguilo
Copy link
Member Author

@DomGarguilo did not realize merging the ClientTabletCache changes would conflict w/ this. Makes sense though w/ all the TInfo changes. Would probably be good to get this merged in, as it will continue to have conflicts, the testing that was done sounds really good.

I'm fine with merging it in or leaving it open. The merge conflicts are pretty easy to resolve here since its just a param change in most files. Also, @ctubbsii left a comment about wanting to review soon.

Comment on lines 197 to 198
Span span = Span.current(); // should be set by protocol
try {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I think that what the old code was doing when it called startServerRpcSpan was that it was always creating a new "inner span" inside the outer span that that the user might have created in the client. That was useful because it helped us distinguish the a child or inner span that was just associated with the RPC call, rather than be grouped with all the other things the client code might have been doing around the RPC call.

This code, however, seems to just grab the current span the user created, and attach an exception to it if an RPC exception occurred. I think we should do the other thing. If there's an exception, it should be attached to the inner/child span for the RPC, not the outer span the user created.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I think I fixed this in 8d742ea. Please take a look.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I may have misunderstood what you were doing before. I think you might have had it correct before. I didn't realize you were creating the server span in the protocol, prior to this invocation handler being called. The expectation, I think, was that the invocation handler just needed to get a reference to the span started by the protocol, for the purposes of attaching any exceptions to it. I think I will need to look at this more closely, but I am now thinking I misled you by my earlier comment, and you were doing it correctly before.

@ctubbsii
Copy link
Member

ctubbsii commented Apr 8, 2025

One additional thought I had was that clients should send the instanceId, and the server should verify it, so that we don't get connections from clients for a different instance.

@DomGarguilo
Copy link
Member Author

One additional thought I had was that clients should send the instanceId, and the server should verify it, so that we don't get connections from clients for a different instance.

I was trying to work this in but am not sure the best place to pull in the instanceId from. I tried to pass it in as a parameter but had to reach pretty high up the chain which changed a lot of code. Was hoping you had a better idea for where I can add this in from.

@ctubbsii
Copy link
Member

One additional thought I had was that clients should send the instanceId, and the server should verify it, so that we don't get connections from clients for a different instance.

I was trying to work this in but am not sure the best place to pull in the instanceId from. I tried to pass it in as a parameter but had to reach pretty high up the chain which changed a lot of code. Was hoping you had a better idea for where I can add this in from.

The ServerContext and/or ClientContext should be able to provide the InstanceId, and the protocol factory could be constructed with it. I'm not sure how much code that touches. I can try to take a look if you hit a wall here.

@DomGarguilo
Copy link
Member Author

One additional thought I had was that clients should send the instanceId, and the server should verify it, so that we don't get connections from clients for a different instance.

I was trying to work this in but am not sure the best place to pull in the instanceId from. I tried to pass it in as a parameter but had to reach pretty high up the chain which changed a lot of code. Was hoping you had a better idea for where I can add this in from.

The ServerContext and/or ClientContext should be able to provide the InstanceId, and the protocol factory could be constructed with it. I'm not sure how much code that touches. I can try to take a look if you hit a wall here.

Okay I added the instance ID validation in 769d38f. Had to reach pretty far up the chain to get the instanceId so there were quite a few files changed but all of those are just parameter changes.

@DomGarguilo DomGarguilo requested a review from ctubbsii April 16, 2025 19:49
@ctubbsii
Copy link
Member

Can you provide an example of the opentelemetry tracing output (screenshot or log sequences or something)? I'd be curious what they look like when a client performs a traced operation over the new RPC protocol.

@DomGarguilo
Copy link
Member Author

Can you provide an example of the opentelemetry tracing output (screenshot or log sequences or something)? I'd be curious what they look like when a client performs a traced operation over the new RPC protocol.

I'm not too sure how much this shows but here is a screenshot from the "graph view" in Jaeger capturing traces from a scan from the shell
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a custom thrift protocol to ensure client/server compatibility
4 participants