Home
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.
—Antoine de Saint-Exupery
μpb (or more commonly, “upb”) is an implementation of the Protocol Buffers serialization format released by Google in mid-2008. The Greek letter mu (μ) is the SI prefix for “micro”, which reflects the goal of keeping upb as small as possible while providing a great deal of flexibility and functionality.
upb is written in 2300 SLOC of C, and compiles to just under 30 KB of object code on x86.
The Google implementation of Protocol Buffers is open source, released under a liberal license (BSD). Other people have written implementations also, such as protobuf-c. Why did I write a completely new implementation from scratch? Why should anybody use my implementation?
Most protobuf implementations focus on code generation as their primary means of achieving speed. “Code generation” in this context means using a compiler to translate a .proto file to C or C++ code that is specific to those .proto types. A C or C++ compiler is then used to output machine code that can parse, serialize, or manipulate those types.
Code generation can achieve high speeds, but also has a high cost:
- The generated code can be large
- descriptor.proto, which can be represented as a 3.5kb protobuf, compiles to >150kb of machine code on x86. If you have a binary that processes lots of message types, this code can really add up.
- You have to link in any message types you want to parse
- This means you have to decide ahead of time what messages you might possibly want to process, and you pay the size and compile time hit for all of them. Whenever they change, you have to recompile.
- There is an extra step in your edit/compile/run cycle
- Or worse, if you didn’t have an edit/compile/run cycle before (like with interpreted languages), you do now.
- The generated code is inflexible
- Generated code achieves its speed by compiling for one very specific configuration. In other words, it takes all your decisions about how you want to parse and fixes them at compile time. This means that the generated code is only good for one very specific purpose. Want to change the set of fields you care about? Recompile. Want to reference the input strings instead of copying them? Recompile. Want to do callback-based parsing instead of parsing into the stock data structures? Recompile.
upb was designed with the belief that protobuf parsing without code generation could achieve speeds comparable to code generation. If this can be achieved, we can avoid the drawbacks of code generation. Programs need only compile the upb core (<50k object code), and all .proto files can be loaded at runtime as they are needed.
Current benchmarks indicate that upb always runs at 70% or more of the speed of the official release of protobufs (with the official release doing code generation and upb dynamically loading .proto types). In some tests, the gap is even smaller.
Even if upb can’t achieve 100% the speed of code generation in an apples-to-apples comparison, upb can come out ahead by offering the flexibility to perform optimizations that are not easy or practical with current code generation approaches. The most significant examples are:
- Skipping fields/submessages you don’t need.
- The protobuf format makes it possible to skip submessages very efficiently. If you are only reading a small portion of a large, nested protobuf, you can get the fields you need in orders of magnitude less time than it would take to parse the whole thing.
- Lazy parsing of submessages (not implemented yet).
- This is a slightly different take on the previous point: submessages are parsed only if and when they are accessed. This can achieve the same speeds as skipping, without requiring you to statically analyze the set of fields you need. The downside is that parse errors surface later and unsynchronized reads are no longer thread-safe.
- Referencing input string data instead of copying.
- If the input contains strings, it is possible to reference them from the input string instead of paying for malloc() and memcpy(). This might be desirable in some cases but not others; a non-code-generation approach lets you decide at runtime.
- Callback/Event-based parsing
- Event-based parsing (like SAX in XML) can be much more efficient than parsing into a data structure.
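
The first, third, and fourth points above can be illustrated with a small wire-format scanner. This is a minimal sketch, not upb's actual API: the names `read_varint`, `scan`, `bytes_cb`, `struct view`, and `save_view` are hypothetical and exist only for this example. It relies only on the documented protobuf wire format, in which every field starts with a varint key (`field_number << 3 | wire_type`) and length-delimited fields (strings, bytes, submessages) carry their byte length up front.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: receives a length-delimited field as a view
 * into the input buffer, so no malloc() or memcpy() is needed. */
typedef void bytes_cb(uint32_t fieldnum, const uint8_t *data, size_t len,
                      void *closure);

/* Decodes a base-128 varint. Returns the new position, or NULL on error. */
static const uint8_t *read_varint(const uint8_t *p, const uint8_t *end,
                                  uint64_t *out) {
  uint64_t val = 0;
  for (int shift = 0; shift < 64 && p < end; shift += 7) {
    uint8_t b = *p++;
    val |= (uint64_t)(b & 0x7f) << shift;
    if (!(b & 0x80)) { *out = val; return p; }
  }
  return NULL; /* truncated or longer than 10 bytes */
}

/* Scans one message. Length-delimited fields numbered `wanted` are reported
 * through `cb`; every other field is skipped without being interpreted.
 * Returns 0 on success, -1 on malformed input. */
static int scan(const uint8_t *p, const uint8_t *end, uint32_t wanted,
                bytes_cb *cb, void *closure) {
  assert(cb != NULL);
  while (p < end) {
    uint64_t key, len;
    if (!(p = read_varint(p, end, &key))) return -1;
    uint32_t fieldnum = (uint32_t)(key >> 3);
    switch (key & 7) {
      case 0: /* varint: decode and discard */
        if (!(p = read_varint(p, end, &len))) return -1;
        break;
      case 1: /* fixed 64-bit */
        if (end - p < 8) return -1;
        p += 8;
        break;
      case 2: /* length-delimited: string, bytes, or submessage */
        if (!(p = read_varint(p, end, &len))) return -1;
        if (len > (uint64_t)(end - p)) return -1;
        if (fieldnum == wanted) cb(fieldnum, p, (size_t)len, closure);
        p += len; /* skipping a whole submessage is one pointer bump */
        break;
      case 5: /* fixed 32-bit */
        if (end - p < 4) return -1;
        p += 4;
        break;
      default:
        return -1; /* groups (wire types 3 and 4) are not handled here */
    }
  }
  return 0;
}

/* Example closure: remembers where the wanted field lives in the input. */
struct view { const uint8_t *data; size_t len; };

static void save_view(uint32_t fieldnum, const uint8_t *data, size_t len,
                      void *closure) {
  (void)fieldnum;
  struct view *v = closure;
  v->data = data;
  v->len = len;
}

/* Sample message: field 1 = 150 (varint), field 2 = "testing" (string),
 * field 3 = a submessage that itself contains field 1 = 150. */
static const uint8_t sample[] = {
  0x08, 0x96, 0x01,
  0x12, 0x07, 't', 'e', 's', 't', 'i', 'n', 'g',
  0x1a, 0x03, 0x08, 0x96, 0x01
};
```

Calling `scan(sample, sample + sizeof sample, 2, save_view, &v)` with a `struct view v` leaves `v.data` pointing into `sample` itself rather than at a copy, and the submessage in field 3 is stepped over without being decoded. The caller chooses at runtime which field to extract and what the callback does with it, which is the kind of decision generated code would fix at compile time.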
The dynamic nature of upb is especially useful in the context of dynamic or interpreted languages. upb is specifically designed to be an ideal target for dynamic language extensions.
Protocol Buffers has enormous potential to be useful to users of dynamic languages. It provides a format that languages can use to exchange data in a very efficient way. It provides the efficiency benefits of built-in serialization formats such as Python’s “Pickle”, Perl’s “Storable”, Ruby’s “Marshal”, and JavaScript’s “JSON”, but with a more explicit schema and greater interoperability across languages.
Despite this promise, Protocol Buffers haven’t seen much adoption in dynamic languages because the existing implementations aren’t very efficient. upb was designed from the outset to be an ideal foundation for very fast Protocol Buffers implementations in dynamic languages. This is much of the reason upb is focused on making the runtime dynamic and configurable (i.e., no code generation), so that .proto types are easy to load at runtime and flexible in the ways you can process them.
Another important feature is developing memory-management interfaces that can integrate with the memory managers of dynamic languages. This is no easy task, because each language runtime does memory management differently. Some use reference counting, some use garbage collection, some use a combination, and the interfaces for interacting with the memory managers are different for every runtime. A key goal of upb was to design a memory management scheme that could gracefully integrate with all of these.
upb is designed to be a toolbox of paradigms for manipulating protocol buffer data. upb is built in layers, and every layer is available for clients to use as they see fit. For example, the lowest layer (event parsing) can be built and used completely independently of any of the layers that sit on top of it.
Each of these layers is documented in its own wiki page. These wiki pages are the best way to understand the design of upb.