Conversation

@LalitMaganti (Member) commented Dec 31, 2025

Note: this CL does *not* change any production code to actually use
this. All it does is add the implementation, benchmarks and tests.
A future CL will actually do the switch.

Our current FlatHashMap implementation has served us well.
It was originally designed just to replace std::unordered_map inside of
TraceProcessor's StringPool but, due to its nice perf characteristics,
it's come to be used throughout the Perfetto codebase.

But over time I've realised that it has some flaws which make it not
suitable to be a truly general-purpose hash table:

  • Very sensitive to tombstones: we've made some modifications to
    rehash if there are tombstones, but in heavy insert/erase workloads
    it's not enough.
  • Very sensitive to collisions: if there is a collision, the default is
    quadratic probing, which can cause some quite sub-optimal memory
    access patterns. And using linear probing doesn't work for high load
    factor hash maps.
  • Suboptimal memory layout: putting the ctrl/keys/values in different
    arrays means that we're jumping around memory instead of staying in a
    single place, which matters especially for small maps (see the layout
    sketch just after this list).
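
To make the layout point concrete, here is a minimal illustrative sketch.
This is not the real FlatHashMap or FlatHashMapV2 code; the type and member
names (SeparateArrays, SingleAllocation, Slot) are made up for illustration.

// Illustrative sketch only, assuming made-up names; not the real layouts.
#include <cstdint>
#include <memory>

// V1-style: three separate allocations. A single lookup can touch three
// unrelated cache lines (tag, key, value), which is what hurts small maps.
struct SeparateArrays {
  std::unique_ptr<uint8_t[]> tags;
  std::unique_ptr<int64_t[]> keys;
  std::unique_ptr<int64_t[]> values;
};

// Swiss-table-style: one allocation where the control bytes are immediately
// followed by the key/value slots, so a probe and the element it finds are
// likely to share a few adjacent cache lines.
struct SingleAllocation {
  struct Slot { int64_t key; int64_t value; };
  std::unique_ptr<uint8_t[]> storage;  // [ctrl bytes ...][Slot array ...]
  uint8_t* ctrl = nullptr;             // points into storage
  Slot* slots = nullptr;               // points into storage, after ctrl
};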

All in all, these are all things that absl::flat_hash_map and Swiss
tables have solved. There's no reason for us not to take advantage of at
least their memory layout and organisation ideas even if we cannot use
them directly. If they're good enough for much of the C++ world (and
definitely all of Google), they should be good enough for us too!

So this CL adds a new FlatHashMapV2 which is based on Swiss Table
concepts. Specifically:

  • The idea of storing the control bytes, keys and values all in the same
    array.
  • The logic to probe using triangular numbers and do group-based
    probing.
  • Using SIMD to access 8/16 elements at a time and do comparisons of
    hashes.
  • The idea of storing 7 bits of the hash in the control byte to do fast
    checks and reduce the probability of a false match to 1/128 before
    even comparing keys (a minimal sketch of this follows the list).
  • Using SWAR for non-SIMD platforms (including ARM64, though we might
    optimize this with NEON in the future depending on what we see).
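
To make the hash-splitting and group-matching ideas concrete, here is a
minimal sketch, assuming SSE2 is available and a 16-slot group is used. The
names (kEmpty, H1, H2, MatchH2) are illustrative, not the FlatHashMapV2 API.

// Minimal sketch, assuming SSE2 and a 16-byte control group.
#include <cstdint>
#include <cstdio>
#include <emmintrin.h>  // SSE2

constexpr int8_t kEmpty = -128;  // High bit set marks a non-full slot.

// H1 picks the starting group; H2 is the 7 bits stored in the control byte.
inline uint64_t H1(uint64_t hash) { return hash >> 7; }
inline int8_t H2(uint64_t hash) { return static_cast<int8_t>(hash & 0x7f); }

// Bit i of the result is set iff ctrl[i] holds a full slot whose H2 == h2.
inline uint32_t MatchH2(const int8_t* ctrl, int8_t h2) {
  __m128i group = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ctrl));
  __m128i needle = _mm_set1_epi8(h2);
  return static_cast<uint32_t>(
      _mm_movemask_epi8(_mm_cmpeq_epi8(group, needle)));
}

int main() {
  int8_t ctrl[16];
  for (int8_t& c : ctrl) c = kEmpty;
  ctrl[3] = 0x2a;  // Pretend slot 3 holds an element whose H2 is 0x2a.
  ctrl[9] = 0x2a;  // A different key colliding on H2 in slot 9.

  // One SIMD compare screens 16 slots at once; only the set bits (false
  // matches happen with probability ~1/128 per full slot) need a key compare.
  uint32_t mask = MatchH2(ctrl, 0x2a);
  for (; mask != 0; mask &= mask - 1)
    printf("candidate slot: %d\n", __builtin_ctz(mask));

  // If no match (and no empty slot) is found in this group, probing moves on
  // by triangular numbers: group offsets 0, 1, 3, 6, 10, ... which visits
  // every group when the group count is a power of two.
}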

The benchmarks are quite good for general cases.

For TraceStrings:

BM_HashMap_InsertTraceStrings_AppendOnly                        554254588 ns    553651249 ns            1 409.798M/s   226.885M   1.17868M
BM_HashMap_InsertTraceStrings<Ours_PreHashed>                   698876584 ns    698092350 ns            1 325.008M/s   226.885M   1.17868M
BM_HashMap_InsertTraceStrings<OursV2_PreHashed>                 662225759 ns    661307563 ns            1 343.086M/s   226.885M   1.17868M
BM_HashMap_InsertTraceStrings<StdUnorderedMap_PreHashed>       2376921965 ns   2374903076 ns            1 95.5346M/s   226.885M   1.17868M
BM_HashMap_InsertTraceStrings<AbslFlatHashMap_PreHashed>        631832647 ns    631347455 ns            1 359.367M/s   226.885M   1.17868M

This is the only case where the old implementation wins (which makes
sense because it was specifically designed for it!). But I don't think a
loss here justifies giving up the major wins elsewhere, especially because
in the "normal" case cache misses would drown out the small differences
here anyway, given how widespread interning is in practice. It also only
wins with AppendOnly, which is rarely used.

For TraceTids:

BM_HashMap_TraceTids<Ours_Tid>                                  741954852 ns    741389092 ns            1 183.063M/s    16.099k
BM_HashMap_TraceTids<OursV2_Tid>                                504387549 ns    503475605 ns            1 269.568M/s    16.099k
BM_HashMap_TraceTids<StdUnorderedMap_Tid>                      1401707453 ns   1400083835 ns            1 96.9378M/s    16.099k
BM_HashMap_TraceTids<AbslFlatHashMap_Tid>                       510576964 ns    510105976 ns            1 266.064M/s    16.099k

We're significantly better than V1 here: tids are exactly the sort of
"general behaviour" which is very common to see in Perfetto and where
V1 behaves less well than I'd like.

For LookupRandInts:

BM_HashMap_LookupRandInts<Ours_Default>                         574181027 ns    573859405 ns            1 17.4259M/s   10.7378P
BM_HashMap_LookupRandInts<OursV2_Default>                       350956969 ns    350711465 ns            2 28.5135M/s   10.7378P
BM_HashMap_LookupRandInts<StdUnorderedMap_Default>              571851786 ns    571499561 ns            1 17.4978M/s   10.7378P
BM_HashMap_LookupRandInts<AbslFlatHashMap_Default>              366032708 ns    365819315 ns            2 27.3359M/s   10.7378P
BM_HashMap_LookupRandInts<StdUnorderedMap_Murmur>               611111313 ns    610638342 ns            1 16.3763M/s   10.7378P
BM_HashMap_LookupRandInts<AbslFlatHashMap_Murmur>               363292801 ns    362867428 ns            2 27.5583M/s   10.7378P
BM_HashMap_LookupRandInts<Ours_PreHashed>                       713060262 ns    712048102 ns            1  14.044M/s   10.7378P
BM_HashMap_LookupRandInts<OursV2_PreHashed>                     329546786 ns    329117625 ns            2 30.3843M/s   10.7378P
BM_HashMap_LookupRandInts<StdUnorderedMap_PreHashed>            580294840 ns    579547604 ns            1 17.2548M/s   10.7378P
BM_HashMap_LookupRandInts<AbslFlatHashMap_PreHashed>            340158501 ns    339918114 ns            2 29.4188M/s   10.7378P

Another handy win which appears regardless of whether the hash function
is good or not. Notice how in the V1 case prehashing causes performance
to degrade badly, whereas V2 actually gets faster.

For InsertCollidingInts:

BM_HashMap_InsertCollidingInts<Ours_Default>                    701989492 ns    701578722 ns            1 14.2536M/s
BM_HashMap_InsertCollidingInts<OursV2_Default>                  427725299 ns    427302016 ns            2 23.4027M/s
BM_HashMap_InsertCollidingInts<StdUnorderedMap_Default>         896047311 ns    895384570 ns            1 11.1684M/s
BM_HashMap_InsertCollidingInts<AbslFlatHashMap_Default>         420245635 ns    420006692 ns            2 23.8091M/s

Another big win for V2. Another similar example to the sort we'd see in
practice in Perfetto.

For InsertDupeInts:

BM_HashMap_InsertDupeInts<Ours_Default>                          63926278 ns     63874785 ns           11 156.556M/s
BM_HashMap_InsertDupeInts<OursV2_Default>                        40291231 ns     40268380 ns           17 248.334M/s
BM_HashMap_InsertDupeInts<StdUnorderedMap_Default>               37554643 ns     37527602 ns           19 266.471M/s
BM_HashMap_InsertDupeInts<AbslFlatHashMap_Default>               29026714 ns     29008928 ns           24 344.721M/s
BM_HashMap_InsertDupeInts<AbslFlatHashMap_Murmur>                38883999 ns     38857011 ns           18 257.354M/s

Same as above, another common case. Surprisingly, std does very well
here. And we seem to be bottlenecked on the hash function, given how much
better absl does with its own hash vs murmur (its hash is slightly lower
quality but also faster). This is the only benchmark where this happens
though; others don't show this behaviour.

For *String:

BM_HashMap_HeterogeneousLookup_String<Ours_String>              575812136 ns    575383946 ns            1 17.3797M/s
BM_HashMap_HeterogeneousLookup_String<OursV2_String>            420531571 ns    420120838 ns            2 23.8027M/s
BM_HashMap_RegularLookup_String<Ours_String>                    820167931 ns    819494830 ns            1 12.2026M/s
BM_HashMap_RegularLookup_String<OursV2_String>                  405975726 ns    405482422 ns            2  24.662M/s
BM_HashMap_RegularLookup_String<StdUnorderedMap_String>        1577833963 ns   1576585547 ns            1 6.34282M/s
BM_HashMap_RegularLookup_String<AbslFlatHashMap_String>         491132758 ns    490802459 ns            2 20.3748M/s
BM_HashMap_RegularLookup_String<AbslFlatHashMap_String_Murmur>  571520036 ns    571134272 ns            1  17.509M/s

All clear wins for us. We're actually doing much better than even absl
here!

InsertVaryingSizes:

BM_HashMap_InsertVaryingSize<Ours_Default>/100                        754 ns          754 ns       932167 132.677M/s
BM_HashMap_InsertVaryingSize<Ours_Default>/10000                   200216 ns       200096 ns         3531 49.9759M/s
BM_HashMap_InsertVaryingSize<Ours_Default>/1000000               54744741 ns     54674135 ns           13 18.2902M/s
BM_HashMap_InsertVaryingSize<OursV2_Default>/100                     1381 ns         1380 ns       505094 72.4606M/s
BM_HashMap_InsertVaryingSize<OursV2_Default>/10000                 164971 ns       164851 ns         4250 60.6608M/s
BM_HashMap_InsertVaryingSize<OursV2_Default>/1000000             38351821 ns     38310617 ns           18 26.1024M/s
BM_HashMap_InsertVaryingSize<AbslFlatHashMap_Default>/100            1268 ns         1267 ns       551576 78.9371M/s
BM_HashMap_InsertVaryingSize<AbslFlatHashMap_Default>/10000        137933 ns       137845 ns         5065 72.5454M/s
BM_HashMap_InsertVaryingSize<AbslFlatHashMap_Default>/1000000    39980643 ns     39943874 ns           18 25.0351M/s

Note how V1 does better for very small maps but drops off as maps get
bigger. I think this is probably the effect of being able to have
everything in cache even with the separate key/value/ctrl arrays, plus
SIMD not being as useful with such small maps.

LookupWithMisses:

BM_HashMap_LookupWithMisses<Ours_Default>/0                     359571712 ns    359090314 ns            2 27.8481M/s
BM_HashMap_LookupWithMisses<Ours_Default>/50                    426399559 ns    425828348 ns            2 23.4836M/s
BM_HashMap_LookupWithMisses<Ours_Default>/100                   308777908 ns    308628319 ns            2 32.4014M/s
BM_HashMap_LookupWithMisses<OursV2_Default>/0                   195633544 ns    195481786 ns            4 51.1557M/s
BM_HashMap_LookupWithMisses<OursV2_Default>/50                  221380263 ns    221180357 ns            3  45.212M/s
BM_HashMap_LookupWithMisses<OursV2_Default>/100                 139350634 ns    139236152 ns            5 71.8204M/s
BM_HashMap_LookupWithMisses<AbslFlatHashMap_Default>/0          201567396 ns    201412429 ns            4 49.6494M/s
BM_HashMap_LookupWithMisses<AbslFlatHashMap_Default>/50         206378283 ns    206259753 ns            3 48.4826M/s
BM_HashMap_LookupWithMisses<AbslFlatHashMap_Default>/100        123735273 ns    123601106 ns            6 80.9054M/s

Notice how much better we do when the miss rate is very low or very high.
This is indicative of the branch predictor doing a much better job of
actually dealing with our use cases. But even at 50% we are far
superior to V1.

InsertSequential:

BM_HashMap_InsertSequentialInts<Ours_Default>                   781732694 ns    781402467 ns            1 12.7975M/s
BM_HashMap_InsertSequentialInts<OursV2_Default>                 423644733 ns    423098276 ns            2 23.6352M/s
BM_HashMap_InsertSequentialInts<StdUnorderedMap_Default>        473308044 ns    472810618 ns            2 21.1501M/s
BM_HashMap_InsertSequentialInts<AbslFlatHashMap_Default>        420184095 ns    419919524 ns            2 23.8141M/s

Huge win for what is quite a common case in trace processor. Even std is
doing better than V1 here, which is quite surprising!

LookupSequential:

BM_HashMap_LookupSequentialInts<Ours_Default>                   586390209 ns    585999772 ns            1 17.0649M/s
BM_HashMap_LookupSequentialInts<OursV2_Default>                 344488692 ns    344312148 ns            2 29.0434M/s
BM_HashMap_LookupSequentialInts<StdUnorderedMap_Default>         49338334 ns     49299886 ns           15  202.84M/s
BM_HashMap_LookupSequentialInts<AbslFlatHashMap_Default>        230591313 ns    230421841 ns            3 43.3987M/s

Std destroys everyone here, which I find highly surprising and don't have
a good explanation for. Maybe it's because it effectively has a perfect
hash function (since by default std::hash just returns the int?); a quick
illustration of that hypothesis follows. Regardless, we're doing better
than V1.
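
A minimal sketch of the identity-hash hypothesis. This assumes the common
libstdc++/libc++ behaviour where std::hash of an integer is just the value,
which is implementation-defined rather than guaranteed by the standard.

// Minimal sketch, assuming std::hash<int> is the identity on this toolchain.
#include <cstdio>
#include <functional>

int main() {
  std::hash<int> h;
  for (int i = 0; i < 4; ++i) {
    // On such implementations this prints 0, 1, 2, 3: sequential keys land in
    // sequential buckets, so a sequential-lookup benchmark becomes an almost
    // perfectly predictable, cache-friendly walk through the table.
    printf("std::hash<int>()(%d) = %zu\n", i, h(i));
  }
}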

@LalitMaganti requested a review from a team as a code owner December 31, 2025 00:02
@github-actions

🎨 Perfetto UI Build

✅ UI build is ready: https://storage.googleapis.com/perfetto-ci-artifacts/gh-20608617586-1-ui/ui/index.html
