Optimize varint decoding without intrinsics #20531
Conversation
I haven't looked much into the TailCallTable parsing mechanism, but replacing some code in the non-fast table could also improve performance. I'm not sure how feasible it is to replace the whole
Edit: I have looked, and I don't think it's particularly useful
Sorry, I had to force-push to change details about the commit author.
Could I have someone else take a look at this? The process seems to have stalled.
Hello @acozzette, I've seen you around the previous varint opt PRs, so maybe you'd be willing to take a look? Thanks in advance.
Whoops, I did something wrong in the rebase, haha. Sorry, everyone. The test failures seem unrelated to the code itself, so I did a rebase hoping it might help.
This PR builds on #10646 and #13158 to provide a more optimized varint decoder while addressing previous concerns.
This code does not use any intrinsics and is concise in order to fit in a single cache line for maximum efficiency.
See https://godbolt.org/z/zdxhzr1oY for more details (note that it doesn't fit here but inlining will help).
Edit: I've since fixed the clang codegen issue.
Explanation
We read a uint32 and extract the relevant bits via bit manipulation, taking the cold path only if the varint is longer than 4 bytes. This integer size is picked because it strikes a reasonable balance between bit-manipulation overhead and branch probability.
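A minimal sketch of this approach (hypothetical code for illustration, not the PR's actual implementation; it assumes 4 readable bytes at `p` and a little-endian target) might look like:

```cpp
#include <cstdint>
#include <cstring>
#include <utility>

// Decode up to a 4-byte varint starting at `p` using only plain bit
// manipulation, no intrinsics. Returns {value, bytes consumed};
// {0, 0} signals a varint longer than 4 bytes (the cold path).
std::pair<uint32_t, int> DecodeVarint32Fast(const unsigned char* p) {
  uint32_t chunk;
  std::memcpy(&chunk, p, sizeof(chunk));  // single unaligned 4-byte load
  if ((chunk & 0x80) == 0) {              // 1-byte varint: the common case
    return {chunk & 0x7F, 1};
  }
  // Strip the continuation bits and compact the 7-bit payload groups.
  uint32_t value = (chunk & 0x7F) | ((chunk >> 1) & 0x3F80);
  if ((chunk & 0x8000) == 0) return {value, 2};
  value |= (chunk >> 2) & 0x1FC000;
  if ((chunk & 0x800000) == 0) return {value, 3};
  value |= (chunk >> 3) & 0xFE00000;
  if ((chunk & 0x80000000) == 0) return {value, 4};
  return {0, 0};  // longer than 4 bytes: caller falls back to the cold path
}
```

Four 7-bit groups give the 28-bit hot-path range mentioned below; the single load plus shifts/masks is what keeps the function small enough to inline well.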
This gives us a range of 0..268,435,456 on the hot path, or 0..33,554,432 if it's tagged for a field.
Benchmarking details
libprotobuf compiler: gcc 11.4
allocator: jemalloc
arena: none
cpu: amd 5800x
linking: static
lto: no
input message: ~40kb, varies in message types, varints, and bytes
results: average of 10 runs, each run is 500,000 iterations
op: construct, parse from memory array, destruct
control: 23.2 microsecs/op
optimized: 22.4 microsecs/op
microbench: 1.7-2.0 nanosecs/call
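The per-op methodology above (construct, parse from a memory buffer, destruct, averaged over many iterations) could be reproduced with a harness along these lines. This is a hypothetical sketch, not the benchmark actually used; `Msg` stands in for any generated protobuf message class with a `ParseFromString`-style method.

```cpp
#include <chrono>
#include <string>

// Times `iters` rounds of construct + parse + destruct and returns the
// average cost per op in microseconds. `Msg` is a placeholder for the
// generated message type used in the real benchmark.
template <typename Msg>
double MicrosPerOp(const std::string& data, int iters) {
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    Msg msg;                    // construct
    msg.ParseFromString(data);  // parse from the in-memory buffer
  }                             // destruct at end of scope
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::micro>(end - start).count() /
         iters;
}
```

Averaging several independent runs of such a loop, as described above, helps smooth out allocator and frequency-scaling noise.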
About a 3-4% improvement overall. Not much, but for servers that may parse millions of protobuf messages, it's not insignificant.