Make builder use memory that I allocated, before #335

aeJuergenS · 2025-04-24T14:06:44Z

Hi all,

Usually, the flatcc builder operates on the stack of my embedded firmware.
Is there a way that I allocate my own memory (e.g. with malloc()) and make the flatcc builder use it?

I found a reference to a function flatcc_builder_use_memory() that takes a pointer to my pre-allocated memory, but I could not find it in builder.c or in any other documentation. GCC throws an error ("implicit declaration of function").

Can anyone help, please?
Thanks and greetings!
Juergen

mikkelfj (Maintainer) · 2025-04-24T15:44:14Z

Yes, you can control all levels of allocation. If all you want is to replace malloc with some embedded equivalent, just redefine the settings in: https://github.com/dvidelabs/flatcc/blob/master/include/flatcc/flatcc_alloc.h
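As a rough illustration of the easy route, here is a minimal sketch of a fixed-arena allocator mapped onto the allocation macros. The macro names (FLATCC_ALLOC, FLATCC_FREE, FLATCC_REALLOC, FLATCC_CALLOC) are given as I recall them and should be checked against flatcc_alloc.h; the header may define further macros (for example for aligned allocation) that this sketch does not cover. The arena, its size, and all `arena_*` names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical arena size; profile your own use case before picking one. */
#define ARENA_SIZE 4096u

/* Header stored in front of each block so realloc knows the old size.
   The union keeps every payload aligned for any fundamental type. */
typedef union { size_t size; max_align_t align; } arena_hdr_t;

static _Alignas(max_align_t) unsigned char arena[ARENA_SIZE];
static size_t arena_used;

static size_t arena_round_up(size_t n)
{
    size_t a = sizeof(arena_hdr_t);
    return (n + a - 1) / a * a;
}

/* Bump allocation: returns NULL when the arena is exhausted. */
void *arena_alloc(size_t n)
{
    size_t need;
    arena_hdr_t *h;

    if (n > ARENA_SIZE) return NULL;
    need = sizeof(arena_hdr_t) + arena_round_up(n);
    if (need > ARENA_SIZE - arena_used) return NULL;
    h = (arena_hdr_t *)(arena + arena_used);
    h->size = n;
    arena_used += need;
    return h + 1;
}

/* Individual blocks are never returned; memory is reclaimed by arena_reset. */
void arena_free(void *p) { (void)p; }

/* Growing copies into a fresh block; shrinking keeps the block in place. */
void *arena_realloc(void *p, size_t n)
{
    arena_hdr_t *h;
    void *q;

    if (!p) return arena_alloc(n);
    h = (arena_hdr_t *)p - 1;
    if (n <= h->size) return p;
    q = arena_alloc(n);
    if (q) memcpy(q, p, h->size);
    return q;
}

void *arena_calloc(size_t count, size_t size)
{
    void *p;

    if (size && count > (size_t)-1 / size) return NULL;
    p = arena_alloc(count * size);
    if (p) memset(p, 0, count * size);
    return p;
}

/* Call only when no allocation from the arena is still in use. */
void arena_reset(void) { arena_used = 0; }

/* Assumed macro names; define them before flatcc_alloc.h is included. */
#define FLATCC_ALLOC(n) arena_alloc(n)
#define FLATCC_FREE(p) arena_free(p)
#define FLATCC_REALLOC(p, n) arena_realloc(p, n)
#define FLATCC_CALLOC(nm, n) arena_calloc(nm, n)
```

A bump allocator wastes the old block on every growing realloc, so it only makes sense when the stacks settle at a stable size after the first buffer. It is a sketch of the mechanism, not a recommendation for a particular allocator.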

The above is easy. If you need more, you probably need more guidance, but below are a few hints:

If you need full control, you can do more. The builder has maybe 5 separate stacks that grow as necessary. You can profile your use case and allocate a fixed block for each type of allocation, or change the default initial allocation for each stack, then add a custom allocator for that. This is mostly relevant if you only have a few KB to play with.

The emitter is a backend that converts the builder stacks into linear buffer memory. The default emitter uses pages and allocates a full buffer at the end. You can easily create your own, for example one that transmits partial buffers and recombines them at the other end of a network, or you can just use a separate allocator, which depends on flatcc_alloc.h.

aeJuergenS (Author) · 2025-05-02T14:04:06Z

I think that replacing the malloc() calls with my own RTOS malloc() does not help improve speed.

I do not yet have the expertise to write my own emitter, but as far as I understand the FlatBuffers concept, the main part of the work (and time) is spent in the builder and not in the emitter.

Is there another means of improving speed?
(On my ARM Cortex-A15 running at 1000 MHz, exporting my (nested) C struct with 400 elements (mostly floats) into a FlatBuffer takes approx. 6.5 ms, which seems very long for that clock speed.)

Another question:
I noticed that the size of the generated FlatBuffer (the size reported by flatcc_builder_finalize_buffer()) changes from call to call although the structure of the data does not change (only the values). I get sizes between approx. 1900 and 2000 bytes.
Using flatcc_builder_finalize_aligned_buffer() instead didn't help.

Thanks!
Juergen

mikkelfj (Maintainer) · 2025-05-03T10:24:04Z

Modern allocators are very fast. Embedded special purpose allocators are typically slower and are optimized for code size or less memory overhead.

You usually replace an allocator because you have to, not because of speed.

FlatCC already caches allocations. For example, it keeps a stack of information to track positions and sizes of different objects until it can write the information to the emitter and reduce the stack. If the initial allocation is too small, it will just reallocate the stack, but that will not happen every time you write data. By creating your own allocator, you can fine-tune this to avoid allocating too much or too little initially. It is rarely worth doing.

If you write multiple buffers, you can reset the builder instead of clearing it and creating a new one. If done right, the already allocated memory will be reused, both in the emitter and the builder. builder.h and doc should mention this.

Before covering performance optimizations, let me answer your second point, because it likely affects performance:

No, the size should NOT change. The binary output should, with few exceptions, be exactly the same every time you build a buffer with the same builder code and input data. (You can build buffers differently using different API calls; changing the order of 8-bit and 64-bit values might affect padding, just as one example. But if your code does the same thing each time, the output should be the same.)

This means that your code is very likely incorrect in some important way, or that you are misinterpreting what you are doing, e.g. you are using different data when you think you are not. Until you have addressed this issue, it makes no sense to discuss performance.
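One way to narrow this down is to build the buffer twice from identical input and compare the results byte by byte. The helper below is a minimal sketch for that comparison; `buffers_equal` is a hypothetical name, and the two buffers and sizes would come from two finalize calls in your own code.

```c
#include <stddef.h>

/* Returns 1 if both buffers have the same size and content, 0 otherwise.
   If offset is non-NULL it receives the position of the first differing
   byte, or the shorter length when one buffer is a prefix of the other. */
int buffers_equal(const void *a, size_t na,
                  const void *b, size_t nb, size_t *offset)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    size_t n = na < nb ? na : nb;
    size_t i = 0;

    while (i < n && pa[i] == pb[i]) {
        i++;
    }
    if (offset) {
        *offset = i;
    }
    return na == nb && i == n;
}
```

If the sizes differ, the input or the sequence of builder calls differs between the two runs. If only the content differs at a few offsets, uninitialized bytes are a more likely cause.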

The only reason I can think of where the binary output differs each time (but not the size) is if padding is not zeroed. This can happen if you use an optimized path for writing structs where they are copied from source to destination in raw form, because your C compiler is not guaranteed to zero the padding between values. This can leak information and lead to different buffers, which is best avoided unless performance is absolutely critical. The direct copy approach also assumes that you are always running on little-endian hardware, because no translation is performed.
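To make the padding issue concrete, here is a small sketch in plain C, independent of flatcc. The struct `sample_t` and the function name are hypothetical. A raw memcpy of the struct would also copy whatever happens to sit in the padding bytes; copying field by field into a zeroed byte buffer gives the same bytes for the same values every time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical struct: most ABIs insert padding between tag and value. */
typedef struct {
    uint8_t tag;
    double value;
} sample_t;

/* Copies the fields into dst at their native offsets and leaves every
   padding byte zero. Byte order is still that of the host. */
void sample_copy_zero_padded(uint8_t dst[sizeof(sample_t)], const sample_t *src)
{
    memset(dst, 0, sizeof(sample_t));
    memcpy(dst + offsetof(sample_t, tag), &src->tag, sizeof src->tag);
    memcpy(dst + offsetof(sample_t, value), &src->value, sizeof src->value);
}
```

Two structs that hold equal field values but different leftover bytes in their padding produce identical output through this function, whereas `memcpy(dst, src, sizeof *src)` may not.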

I do not recall all the API details for achieving this, but if you are interested we can figure that out. However, Flatbuffers is already much faster than nearly any other serialization protocol, so you have to ask yourself if it is that important.

Now let's look at your numbers:

If you only have one struct, and it has 400 elements, let's say 32-bit values, then it is roughly comparable to writing 400 integers. You can do it faster with a raw copy of the buffer as discussed above. If you use struct buffers instead of flatbuffers, you avoid the root table and just write a struct in a standard format, but that only works for flatcc, not other languages. This is almost as fast as writing the raw buffer, but you likely don't need it.

If you look at https://github.com/dvidelabs/flatcc/tree/master/test/benchmark
you will see the numbers to expect. I haven't really touched this in a decade. benchraw is the theoretical limit without using flatbuffers, just constructing a raw struct in memory for comparison. benchflatcc is the main benchmark.
The published benchmarks are over 10 years old, I think on a 2.2GHz core i7 laptop.
https://github.com/dvidelabs/flatcc/blob/master/doc/benchmarks.md

NOTE: it is REALLY hard to write a benchmark where the C compiler does not cheat and optimize out some code, unless you disable optimizations, which defeats the purpose. It is not guaranteed that the benchmark is valid with current compilers.

If you look at the published benchmark you will see about 55 ns per op, vs. 12 ns for raw struct writes. I don't recall exactly what one op is; it might include a string copy mixed into the average. It will probably be roughly twice that for your hardware at 1 GHz, maybe more if it is a less capable embedded processor (less instruction-level parallelism). However, we can roughly say 100 ns per struct element in your case; then 400 * 100 ns is 40 us, which is more than 100x faster than the 6.5 ms you are getting.

Now, the benchmark is for a small buffer, and you might experience reallocation of the internal stacks with your data set. If you reset the builder and repeat, so that the internal stacks (and the pages for the emitter) are already warm, you might see some improvement.
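A measurement along those lines could look like the sketch below: run the workload a few times untimed so that allocations are warm, then time many repetitions and report the average. All names are hypothetical, and `fill_floats` is only a stand-in workload that writes 400 floats; in practice the callback would reset the builder and build one buffer. The standard `clock()` is used for portability; on an embedded target a hardware cycle counter would likely give better resolution.

```c
#include <time.h>

/* Workload callback: ctx is whatever state the workload needs. */
typedef void (*bench_fn)(void *ctx);

/* Runs fn warmup times untimed, then iters times timed, and returns the
   average CPU time per call in nanoseconds. */
double bench_ns_per_call(bench_fn fn, void *ctx, unsigned warmup, unsigned iters)
{
    unsigned i;
    clock_t t0, t1;

    if (iters == 0) {
        return 0.0;
    }
    for (i = 0; i < warmup; i++) {
        fn(ctx);
    }
    t0 = clock();
    for (i = 0; i < iters; i++) {
        fn(ctx);
    }
    t1 = clock();
    return (double)(t1 - t0) * 1e9 / (double)CLOCKS_PER_SEC / (double)iters;
}

#define N_VALUES 400

/* Stand-in workload: ctx must point to at least N_VALUES floats. The
   volatile pointer keeps the compiler from optimizing the stores away. */
void fill_floats(void *ctx)
{
    volatile float *out = ctx;
    int i;

    for (i = 0; i < N_VALUES; i++) {
        out[i] = (float)i * 0.5f;
    }
}
```

Calling `bench_ns_per_call(fill_floats, buf, 10, 100000)` with `float buf[N_VALUES]` gives a raw-write baseline to set against the time measured for one buffer build.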

As for the overhead of writing one element:

Each value written in a struct must be read, converted to little endian, then written. This will be optimized to a single write on little-endian platforms, but the entire struct is not copied in a single operation. The difference might be small since a struct copy still has to move memory.
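The read, convert, write sequence can be sketched in portable C as follows. This is not flatcc's generated code, only an illustration of the idea; `reading_t` and the function names are hypothetical, and a 32-bit IEEE-754 `float` is assumed.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical two-field struct, for illustration only. */
typedef struct {
    uint32_t id;
    float value;
} reading_t;

/* Stores v least significant byte first, whatever the host byte order. */
static void put_u32_le(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v);
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Reinterprets the float's bits as an integer, then stores it as above. */
static void put_f32_le(uint8_t *p, float f)
{
    uint32_t bits;

    memcpy(&bits, &f, sizeof bits); /* assumes 32-bit IEEE-754 float */
    put_u32_le(p, bits);
}

/* Writes the struct field by field into an 8-byte little-endian record. */
void reading_write_le(uint8_t out[8], const reading_t *r)
{
    put_u32_le(out, r->id);
    put_f32_le(out + 4, r->value);
}
```

On a little-endian target an optimizing compiler will typically reduce each `put_u32_le` call to a single 32-bit store, which matches the point above that the per-field cost is small.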

If you build buffers where you nest construction instead of first completely building child elements before writing them, you are forcing the flatcc builder to use a stack instead of writing directly to the emitter. However, when nesting is convenient, you are likely just moving complexity into your own code if you try to avoid it. You will find monster examples in C++ and other languages where child elements are constructed first. The C monster example has two versions, one of which uses nesting and simplifies the code.

Understanding exactly what is stacked and what is written directly is not trivial, and it is rarely worth dealing with. If you really need performance, you start by figuring out how to get data directly to the emitter as fast as possible, and then you potentially optimize the emitter. The default is not slow, but it does make an extra copy from pages to a final buffer.

If reallocation of stacks is a concern, you can change some constants in builder.h that define initial sizes, I think, though I don't recall exactly. You don't have to create a new allocator. Before doing that, it is best to add some instrumentation to log when reallocation happens.
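Such instrumentation can be as simple as a counting wrapper around realloc, sketched below with a toy growable stack standing in for one of the builder's internal stacks. Everything here is hypothetical and independent of flatcc; the wrapper could be wired in through the allocation settings in flatcc_alloc.h.

```c
#include <stdio.h>
#include <stdlib.h>

static unsigned long realloc_calls; /* number of (re)allocations so far */
static size_t realloc_peak;         /* largest size requested so far */

/* Drop-in replacement for realloc that counts and logs every call. */
void *counting_realloc(void *p, size_t n)
{
    realloc_calls++;
    if (n > realloc_peak) {
        realloc_peak = n;
    }
    fprintf(stderr, "realloc #%lu: %zu bytes\n", realloc_calls, n);
    return realloc(p, n);
}

/* Toy stack that doubles its capacity, starting at 16 elements. */
typedef struct {
    int *data;
    size_t count;
    size_t cap;
} int_stack_t;

int stack_push(int_stack_t *s, int v)
{
    if (s->count == s->cap) {
        size_t cap = s->cap ? 2 * s->cap : 16;
        int *d = counting_realloc(s->data, cap * sizeof *d);
        if (!d) {
            return -1;
        }
        s->data = d;
        s->cap = cap;
    }
    s->data[s->count++] = v;
    return 0;
}

/* Reset keeps the memory for the next round; clear gives it back. */
void stack_reset(int_stack_t *s) { s->count = 0; }

void stack_clear(int_stack_t *s)
{
    free(s->data);
    s->data = NULL;
    s->count = 0;
    s->cap = 0;
}
```

Pushing 1000 values into an empty stack triggers a handful of reallocations while the capacity doubles up to 1024. After `stack_reset`, pushing the same 1000 values again triggers none, which is the warm-stack effect that resetting instead of clearing is meant to achieve.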
