Optimize varint decoding without intrinsics #20531
base: main
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
CLA signed
I haven't looked much into the TailCallTable parsing mechanism, but replacing some code in the non-fast table could also improve performance. I'm not sure how feasible it is to replace the whole thing. Edit: I have looked, and I don't think it's particularly useful.
Sorry, I had to force-push to change details about the commit author.
Could I have someone else take a look at this? The process seems to have stalled.
Hello, @acozzette, I've seen you around the previous varint opt PRs, so maybe you'd be willing to take a look? Thanks in advance.
Whoops, I did something wrong in the rebase, haha. Sorry, everyone. The test failures seem unrelated to the code itself, so I did a rebase hoping it might help.
Looks like it's still not building, more merge issues? |
@@ -929,15 +939,60 @@ template <typename T>
    return p + 2;
  }
  return VarintParseSlowArm(p, out, first8);
-#else  // __aarch64__
+#elif defined(__x86_64__)  // __aarch64__
Is this optimization meant for x86_64 or arm64? If it's x86_64, the build error is because you're using ValueBarrier, which we only define for arm64 builds. If it's for arm64, you need to fix this line and probably call that out in the PR description. We tend to be willing to accept a lot of ugly code for x86_64 optimizations that deliver, but for arm64 the bar would likely be higher.
Aha. This is an optimization for x86_64. I'll fix that soon.
Cool, also note that this varint parser is not particularly load-bearing anymore. ShiftMixParseVarint is our table-driven parser's implementation that's used for parsing most protobuf data on the wire.
I'm aware. I tried to look into whether this would be helpful for table parsing but concluded it wasn't. I was still surprised it managed an extra 4%.
A little unrelated but I was writing a protobuf library in Rust which uses this in a load-bearing manner :)
src/google/protobuf/parse_context.h (Outdated)
// should be faster in the general case

// Input is guaranteed at least 10 bytes
uint32_t value = *reinterpret_cast<const uint32_t*>(p);
We're seeing crashes here from alignment issues, does using memcpy affect the results?
I couldn't find any good resources on the alignment requirements of C++ pointers. For x86_64, a memcpy should optimize to the same code, but I did notice UnalignedLoad having poor codegen on LLVM (though I can't remember whether I chose a big-endian target for that; I opted out of everything but x86_64, since it's the only platform I've tested and the only one where I have confidence in the results).
Are the crashes from the address sanitizer? There's no reason it should crash during normal runs.
I'm seeing SIGILLs from clang, with and without sanitizers. From what I found, this is strictly UB in C++ due to the alignment requirements of uint32_t.
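A minimal sketch of the memcpy-based alternative under discussion (the helper name `UnalignedLoad32` is my own, not necessarily the one in protobuf): copying into a local has no alignment requirement on the source pointer, and on x86_64 compilers lower it to a single unaligned load.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper illustrating the memcpy fix: the copy into a local
// uint32_t avoids the misaligned-pointer UB that dereferencing a
// reinterpret_cast'd uint32_t* incurs. On x86_64 this compiles to one mov.
inline uint32_t UnalignedLoad32(const void* p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}
```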
Ah. I'll check it out. Thanks for your time.
In the meantime I've moved it. I did remember to run the tests this time; they seem to have passed.
Oh great, thanks Jason. The integrations have also passed.
This PR builds on #10646 and #13158 to provide a more optimized varint decoder while addressing previous concerns.
This code does not use any intrinsics and is concise in order to fit in a single cache line for maximum efficiency.
See https://godbolt.org/z/zdxhzr1oY for more details (note that it doesn't fit here but inlining will help).
Edit: I've since fixed the clang codegen issue.
Explanation
We read a uint32 and extract the relevant bits via bit manipulation, taking the cold path only if the varint is longer than 4 bytes. This integer size is picked because it strikes a reasonable balance between bit-manipulation overhead and branch probability.
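As a rough illustration of the scheme just described (a sketch under my own assumptions, not the PR's actual code; the name `DecodeVarint32Sketch` is hypothetical): bit 7 of each encoded byte is its continuation bit, so after one 32-bit load we can test those bits and compact the 7-bit groups with shifts and masks, signalling the cold path when byte 3 still has its continuation bit set.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative sketch only -- not the PR's exact implementation.
// Decodes a varint whose encoding fits in <= 4 bytes using a single
// 32-bit load plus bit manipulation; returns false to signal the cold
// path (varint longer than 4 bytes). The caller must guarantee at least
// 4 readable bytes at p (the real parser guarantees 10).
inline bool DecodeVarint32Sketch(const uint8_t* p, uint32_t* out) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));  // alignment-safe load
  if ((v & 0x80u) == 0) {               // 1 byte: bits 0-6
    *out = v & 0x7Fu;
  } else if ((v & 0x8000u) == 0) {      // 2 bytes: bits 8-14 -> 7-13
    *out = (v & 0x7Fu) | ((v & 0x7F00u) >> 1);
  } else if ((v & 0x800000u) == 0) {    // 3 bytes: bits 16-22 -> 14-20
    *out = (v & 0x7Fu) | ((v & 0x7F00u) >> 1) | ((v & 0x7F0000u) >> 2);
  } else if ((v & 0x80000000u) == 0) {  // 4 bytes: bits 24-30 -> 21-27
    *out = (v & 0x7Fu) | ((v & 0x7F00u) >> 1) | ((v & 0x7F0000u) >> 2) |
           ((v & 0x7F000000u) >> 3);
  } else {
    return false;  // longer than 4 bytes: take the cold path
  }
  return true;
}
```

The 4-byte case yields 28 payload bits, which is where the 0..268,435,456 hot-path range below comes from.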
This gives us a range of 0..268,435,456 on the hot path, or 0..33,554,432 if it's tagged for a field.

Benchmarking details
libprotobuf compiler: gcc 11.4
allocator: jemalloc
arena: none
cpu: amd 5800x
linking: static
lto: no
input message: ~40kb, varies in message types, varints, and bytes
results: average of 10 runs, each run is 500,000 iterations
op: construct, parse from memory array, destruct
control: 23.2 microsecs/op
optimized: 22.4 microsecs/op
microbench: 1.7-2.0 nanosecs/call
About a 3-4% improvement overall. Not much, but for servers that may parse millions of protobuf messages, it's not insignificant.