Optimize varint decoding without intrinsics #20531
base: main
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
CLA signed
I haven't looked much into the TailCallTable parsing mechanism, but replacing some code in the non-fast table could also improve performance. I'm not sure how feasible it is to replace the whole thing. Edit: I have looked, and I don't think it's particularly useful.
Sorry, I had to force-push to change details about the commit author.
Could I have someone else take a look at this? The process seems to have stalled.
Hello, @acozzette, I've seen you around the previous varint opt PRs, so maybe you'd be willing to take a look? Thanks in advance.
Whoops, I did something wrong in the rebase, haha. Sorry, everyone. The test failures seem unrelated to the code itself, so I did a rebase hoping it might help.
Looks like it's still not building, more merge issues? |
@@ -929,15 +939,60 @@ template <typename T>
    return p + 2;
  }
  return VarintParseSlowArm(p, out, first8);
-#else  // __aarch64__
+#elif defined(__x86_64__)  // __aarch64__
Is this optimization meant for x86_64 or arm64? If it's x86_64, the build error is because you're using ValueBarrier, which we only define for arm64 builds. If it's for arm64, you need to fix this line and probably call that out in the PR description. We tend to be willing to accept a lot of ugly code for x86_64 optimizations that deliver, but for arm64 the bar would likely be higher.
Aha. This is an optimization for x86_64. I'll fix that soon.
Cool, also note that this varint parser is not particularly load-bearing anymore. ShiftMixParseVarint is our table-driven parser's implementation that's used for parsing most protobuf data on the wire.
I'm aware. I tried to look into whether this would be helpful for table parsing but concluded it wasn't. I was still surprised it managed an extra 4%.
A little unrelated but I was writing a protobuf library in Rust which uses this in a load-bearing manner :)
src/google/protobuf/parse_context.h (Outdated)
// should be faster in the general case

// Input is guaranteed at least 10 bytes
uint32_t value = *reinterpret_cast<const uint32_t*>(p);
We're seeing crashes here from alignment issues, does using memcpy affect the results?
I couldn't find any good resources on the alignment requirements of C++ pointers. For x86_64, a memcpy should optimize to the same code, but I did notice UnalignedLoad having poor codegen on LLVM (though I can't remember whether I chose a big-endian target for that; I opted out of everything but x86_64, since it's the only platform I've tested and the only one where I have confidence in the results).
Are the crashes from the address sanitizer? There's no reason it should crash during normal runs.
I'm seeing SIGILLs from clang, with and without sanitizers. From what I found, this is strictly UB in C++ due to the alignment requirements of uint32_t.
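A minimal sketch of the memcpy-based alternative under discussion (the helper name `UnalignedLoad32` is my own, not necessarily the one in protobuf): copying into a local has no alignment requirement on the source pointer, and on x86_64 compilers lower it to a single unaligned load.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper illustrating the memcpy fix: the copy into a local
// uint32_t avoids the misaligned-pointer UB that dereferencing a
// reinterpret_cast'd uint32_t* incurs. On x86_64 this compiles to one mov.
inline uint32_t UnalignedLoad32(const void* p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}
```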
Ah. I'll check it out. Thanks for your time.
In the meantime I've moved it. I did remember to run the tests this time; they seem to have passed.
Oh great, thanks Jason. The integrations have also passed.
This PR builds on #10646 and #13158 to provide a more optimized varint decoder while addressing previous concerns.
This code does not use any intrinsics and is concise in order to fit in a single cache line for maximum efficiency.
See https://godbolt.org/z/zdxhzr1oY for more details (note that it doesn't fit here but inlining will help).
Edit: I've since fixed the clang codegen issue.
Explanation
We read a uint32 and extract the relevant bits via bit manipulation, taking the cold path only if the varint is longer than 4 bytes. This integer size is picked because it strikes a reasonable balance between bit-manipulation overhead and branch probability.
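As a rough illustration of the scheme just described (a sketch under my own assumptions, not the PR's actual code; the name `DecodeVarint32Sketch` is hypothetical): bit 7 of each encoded byte is its continuation bit, so after one 32-bit load we can test those bits and compact the 7-bit groups with shifts and masks, signalling the cold path when byte 3 still has its continuation bit set.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative sketch only -- not the PR's exact implementation.
// Decodes a varint whose encoding fits in <= 4 bytes using a single
// 32-bit load plus bit manipulation; returns false to signal the cold
// path (varint longer than 4 bytes). The caller must guarantee at least
// 4 readable bytes at p (the real parser guarantees 10).
inline bool DecodeVarint32Sketch(const uint8_t* p, uint32_t* out) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));  // alignment-safe load
  if ((v & 0x80u) == 0) {               // 1 byte: bits 0-6
    *out = v & 0x7Fu;
  } else if ((v & 0x8000u) == 0) {      // 2 bytes: bits 8-14 -> 7-13
    *out = (v & 0x7Fu) | ((v & 0x7F00u) >> 1);
  } else if ((v & 0x800000u) == 0) {    // 3 bytes: bits 16-22 -> 14-20
    *out = (v & 0x7Fu) | ((v & 0x7F00u) >> 1) | ((v & 0x7F0000u) >> 2);
  } else if ((v & 0x80000000u) == 0) {  // 4 bytes: bits 24-30 -> 21-27
    *out = (v & 0x7Fu) | ((v & 0x7F00u) >> 1) | ((v & 0x7F0000u) >> 2) |
           ((v & 0x7F000000u) >> 3);
  } else {
    return false;  // longer than 4 bytes: take the cold path
  }
  return true;
}
```

The 4-byte case yields 28 payload bits, which is where the 0..268,435,456 hot-path range below comes from.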
This gives us a range of 0..268,435,456 on the hot path, or 0..33,554,432 if it's tagged for a field.

Benchmarking details
libprotobuf compiler: gcc 11.4
allocator: jemalloc
arena: none
cpu: amd 5800x
linking: static
lto: no
input message: ~40kb, varies in message types, varints, and bytes
results: average of 10 runs, each run is 500,000 iterations
op: construct, parse from memory array, destruct
control: 23.2 microsecs/op
optimized: 22.4 microsecs/op
microbench: 1.7-2.0 nanosecs/call
About a 3-4% improvement overall. Not much, but for servers that may parse millions of protobuf messages, it's not insignificant.