
[META] Leverage protobuf for serializing select node-to-node objects #15308

Open · finnegancarroll opened this issue Aug 19, 2024 · 6 comments

@finnegancarroll (Contributor) commented Aug 19, 2024

Please describe the end goal of this project

Provide a framework and migrate some key node-to-node requests to serialize via protobuf rather than the current native stream writing (i.e. writeTo(StreamOutput out)). Starting with node-to-node transport messages maintains API compatibility while introducing protobuf implementations, which have several benefits (a sketch contrasting the two styles follows this list):

  • Out of the box backwards compatibility/versioning
  • Speculative performance enhancements in some areas
  • Supports future gRPC work
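For illustration, here is a minimal sketch of the two serialization styles, assuming a hypothetical `SampleResult` payload; the class name, its fields, and the commented `.proto` schema are placeholders rather than actual OpenSearch types, and import paths vary between OpenSearch versions:

```java
import java.io.IOException;

import org.opensearch.core.common.io.stream.StreamInput;
import org.opensearch.core.common.io.stream.StreamOutput;
import org.opensearch.core.common.io.stream.Writeable;

// Native style: fields are hand-written to the transport stream in a fixed
// order, and the reading constructor must mirror that order exactly.
// Cross-version compatibility is handled manually.
public class SampleResult implements Writeable {
    private final long docId;
    private final float score;

    public SampleResult(StreamInput in) throws IOException {
        this.docId = in.readVLong(); // variable-length encoding is opt-in
        this.score = in.readFloat();
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeVLong(docId);
        out.writeFloat(score);
    }
}

// The protobuf version would instead be generated from a schema such as:
//
//   message SampleResult {
//     uint64 doc_id = 1; // varint on the wire, tagged with field number 1
//     float score = 2;
//   }
//
// The numbered field tags are what give protobuf out-of-the-box backwards
// compatibility: readers skip unknown fields instead of misinterpreting the
// rest of the stream.
```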

Supporting References

Issues

Related component

Search:Performance

@dbwiddis (Member) commented Aug 24, 2024

Having spent way too much time diving into the byte-level (or bit-level if you consider 7-bit Vints) details of these protocols, I want to make sure we're focusing on the correct things here.

Protobuf isn't some magical thing that brings a ~30% improvement the way the POC implemented it, and I don't think the existing POCs fully tested it against the existing capabilities of the transport protocol when used properly.

In summary:

  • I think most of the performance gain in the POC was based on reducing the number of bytes used to transmit information (see the sketch after this list)
    • The existing StreamInput/StreamOutput already contain the capability to use variable-length ints/longs; they're just underused
    • The existing transport protocol's VInt/VLong is marginally (though likely not significantly) more efficient, in bandwidth terms, than protobuf when dealing with unsigned values, and ZInt/ZLong is equivalent to protobuf's signed values.
  • Broadly speaking, protobuf can be marginally more efficient in the case where there are a lot more "null" or "empty" values (you gain a byte in each case of a missing value, but lose a byte for every present value).
    • Protobuf needs to identify every field, which takes a byte; StreamInput/StreamOutput need a boolean to implement optional.
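To make the byte counts concrete, here is a self-contained sketch of the 7-bit varint scheme both wire formats share (plain Java with no OpenSearch or protobuf dependency; the helper names are mine):

```java
// Both the transport protocol's VInt/VLong and protobuf's varints encode
// integers in 7-bit groups with a continuation bit, so an unsigned value
// costs the same number of bytes in either format; protobuf then adds a
// field tag (one byte for field numbers up to 15) per present field.
public class VarintSizes {

    // Bytes needed to encode a value as a varint, treating it as unsigned.
    static int varintSize(long value) {
        int bytes = 1;
        while ((value & ~0x7FL) != 0) {
            bytes++;
            value >>>= 7;
        }
        return bytes;
    }

    // ZigZag mapping used by ZLong and protobuf's sint64: small negative
    // values map to small unsigned values instead of 10-byte varints.
    static long zigzag(long value) {
        return (value << 1) ^ (value >> 63);
    }

    public static void main(String[] args) {
        long[] samples = { 0, 127, 128, 16_384, -1, -64 };
        for (long v : samples) {
            System.out.printf("value=%7d  plain varint=%2d bytes  zigzag varint=%2d bytes%n",
                v, varintSize(v), varintSize(zigzag(v)));
        }
    }
}
```

Running it shows, for example, that -1 costs 10 bytes as a plain varint but only 1 byte zigzag-encoded, which is why ZInt/ZLong and protobuf's sint32/sint64 come out equivalent for signed data.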

Protobuf's primary benefit is in backwards compatibility.

There are probably additional significant benefits of gRPC but this meta issue doesn't seem to be about changing the entire protocol, so I don't think those should be considered as part of the benefits proposed.

The performance impacts are very situation specific and shouldn't be assumed; it's possible a change in streaming to variable-length integers/longs (VInt/ZInt) can achieve similar gains.

@finnegancarroll (Contributor, Author) commented

Hi @dbwiddis, thanks for the feedback!

  • I think most of the performance gain in the POC was based on reducing the number of bytes used to transmit information

I am noticing the protobuf objects I implement write more bytes to the stream than the native "writeTo" serialization. Small protobuf objects are roughly 20% larger, but the difference shrinks as request content grows. So far I've been focusing on FetchSearchResult and its members. If I'm understanding correctly, the best use case for protobuf (considering only performance) would be something like a stats endpoint that contains a lot of ints/longs, particularly ones that are not UUIDs/hashes and could be squished into a single-byte representation.

As I learn more about protobuf, I'm understanding that any potential performance improvement purely from changing serialization is going to be speculative, and I'll update this issue with benchmarks as I run them. Currently the concrete expected benefits are out-of-the-box backwards compatibility as well as providing a stepping stone for gRPC support.
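For context, a wire-size comparison along these lines can be sketched as follows; `FetchSearchResultProto` is a placeholder for a generated protobuf class (not an existing OpenSearch type), and the `BytesStreamOutput`/`FetchSearchResult` import paths differ between OpenSearch versions:

```java
import java.io.IOException;

import org.opensearch.common.io.stream.BytesStreamOutput;
import org.opensearch.search.fetch.FetchSearchResult;

// Sketch of a wire-size comparison between the two serializations.
public class SizeComparison {
    static void compareSizes(FetchSearchResult nativeResult,
                             FetchSearchResultProto protoResult) throws IOException {
        // Native encoding: serialize into an in-memory stream and count bytes.
        BytesStreamOutput out = new BytesStreamOutput();
        nativeResult.writeTo(out);
        int nativeBytes = out.bytes().length();

        // Protobuf encoding: generated messages report their own wire size.
        int protoBytes = protoResult.getSerializedSize();

        System.out.printf("native=%d bytes, proto=%d bytes (%+.1f%%)%n",
            nativeBytes, protoBytes, 100.0 * (protoBytes - nativeBytes) / nativeBytes);
    }
}
```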

@finnegancarroll (Contributor, Author) commented Sep 3, 2024

Initial numbers for protobuf vs. native serializers, run with 5 data nodes (r5.xlarge) per cluster against OS 2.16. So far there is no apparent speedup; protobuf is roughly as performant as the native protocol in this case. Vector workload in progress.

OSB Big 5

2.16 min distribution

| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| 90th percentile latency | desc_sort_timestamp | 76.5407 | ms |
| 90th percentile latency | term | 69.2465 | ms |
| 90th percentile latency | multi_terms-keyword | 71.7096 | ms |
| 90th percentile latency | scroll | 503024 | ms |
| 90th percentile latency | query-string-on-message | 76.0508 | ms |

FetchSearchResults as protobuf - Branch

| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| 90th percentile latency | desc_sort_timestamp | 78.7184 | ms |
| 90th percentile latency | term | 69.6506 | ms |
| 90th percentile latency | multi_terms-keyword | 69.0668 | ms |
| 90th percentile latency | scroll | 560605 | ms |
| 90th percentile latency | query-string-on-message | 74.4521 | ms |

Notes

  • Several FetchSearchResult fields could be further decomposed into protobuf. For example, SearchHits.collapseFields is still encoded with its native writeTo implementation.
  • Initial POC work (diff viewable here) created protobuf-exclusive copies of structures within the search path. This may have benefited performance by allowing inlining or reducing vtable lookups when handling transport layer responses. An interesting read on inlining is potentially relevant to our use of writeTo in our native serialization.
  • The search results payload may not be large enough to show any performance difference in the above workloads. Workloads with larger search results may show a more substantial difference.
  • As noted above, a key area where we expect benefit from protobuf is automatic variable-length encoding for numeric values. FetchSearchResult is not particularly suited to this case, and the same functionality exists in the native protocol (although underused in some cases).
  • This implementation deserializes the message immediately. With a slightly different implementation, some message types could benefit from protobuf's lazy loading functionality (see the sketch below).
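As a sketch of what that deferred path could look like (`FetchSearchResultProto` again standing in for a generated protobuf class):

```java
import com.google.protobuf.InvalidProtocolBufferException;

// Sketch: keep the raw wire bytes from the transport layer and defer
// protobuf parsing until the result is actually consumed.
public class LazyFetchResult {
    private final byte[] wireBytes;        // bytes as read off the wire
    private FetchSearchResultProto parsed; // populated at most once

    public LazyFetchResult(byte[] wireBytes) {
        this.wireBytes = wireBytes;
    }

    public FetchSearchResultProto get() throws InvalidProtocolBufferException {
        if (parsed == null) {
            parsed = FetchSearchResultProto.parseFrom(wireBytes);
        }
        return parsed;
    }
}
```

Protobuf's Java runtime also offers a `[lazy = true]` option on message-typed fields, which similarly defers parsing of sub-messages within an already-parsed envelope.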

@dblock (Member) commented Sep 4, 2024

Let's keep in mind that while we may not see as much benefit node-to-node, switching from JSON to protobuf client-to-server should show one. There's also a significant improvement in developer experience if we can replace the native implementation with protobuf.

@finnegancarroll (Contributor, Author) commented

Some additional benchmarks for a vector search workload. Again, no noticeable difference in performance.

5 data nodes (r5.xlarge) per cluster. 10,000 queries against the sift-128-euclidean.hdf5 data set.

FetchSearchResults as Protobuf

| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| Min Throughput | prod-queries | 1.82 | ops/s |
| Mean Throughput | prod-queries | 13.69 | ops/s |
| Median Throughput | prod-queries | 13.84 | ops/s |
| Max Throughput | prod-queries | 13.93 | ops/s |
| 50th percentile latency | prod-queries | 70.1863 | ms |
| 90th percentile latency | prod-queries | 71.6104 | ms |
| 99th percentile latency | prod-queries | 73.9335 | ms |
| 99.9th percentile latency | prod-queries | 79.8549 | ms |
| 99.99th percentile latency | prod-queries | 273.26 | ms |
| 100th percentile latency | prod-queries | 546.72 | ms |
| 50th percentile service time | prod-queries | 70.1863 | ms |
| 90th percentile service time | prod-queries | 71.6104 | ms |
| 99th percentile service time | prod-queries | 73.9335 | ms |
| 99.9th percentile service time | prod-queries | 79.8549 | ms |
| 99.99th percentile service time | prod-queries | 273.26 | ms |
| 100th percentile service time | prod-queries | 546.72 | ms |
| error rate | prod-queries | 0 | % |
| Mean recall@k | prod-queries | 0.97 | |
| Mean recall@1 | prod-queries | 0.99 | |

2.16 Native serialization

| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| Min Throughput | prod-queries | 1.63 | ops/s |
| Mean Throughput | prod-queries | 13.49 | ops/s |
| Median Throughput | prod-queries | 13.73 | ops/s |
| Max Throughput | prod-queries | 13.84 | ops/s |
| 50th percentile latency | prod-queries | 70.525 | ms |
| 90th percentile latency | prod-queries | 71.8059 | ms |
| 99th percentile latency | prod-queries | 77.1621 | ms |
| 99.9th percentile latency | prod-queries | 83.521 | ms |
| 99.99th percentile latency | prod-queries | 446.145 | ms |
| 100th percentile latency | prod-queries | 610.186 | ms |
| 50th percentile service time | prod-queries | 70.525 | ms |
| 90th percentile service time | prod-queries | 71.8059 | ms |
| 99th percentile service time | prod-queries | 77.1621 | ms |
| 99.9th percentile service time | prod-queries | 83.521 | ms |
| 99.99th percentile service time | prod-queries | 446.145 | ms |
| 100th percentile service time | prod-queries | 610.186 | ms |
| error rate | prod-queries | 0 | % |
| Mean recall@k | prod-queries | 0.97 | |
| Mean recall@1 | prod-queries | 0.99 | |

@getsaurabh02 added the Roadmap:Cost/Performance/Scale, Roadmap:Modular Architecture, and v2.18.0 labels on Sep 9, 2024
@Pallavi-AWS (Member) commented

@finnegancarroll can we run the vector search workload with client-to-server switching? Thanks.

@finnegancarroll changed the title from "[META] <Leverage protobuf for serializing select node-to-node objects>" to "[META] Leverage protobuf for serializing select node-to-node objects" on Oct 1, 2024