base: add a new FlatHashMap based on Swiss Table concepts #4205
Open
LalitMaganti wants to merge 1 commit into dev/lalitm/fhm from dev/lalitm/fhm-2
+1,045 −325
Conversation
Note: this CL does *not* change any production code to actually use this. All it does is add the implementation, benchmarks and tests. A future CL will actually do the switch.

Our current FlatHashMap implementation has served us well. It was originally designed just to replace std::unordered_map inside of TraceProcessor's StringPool but, due to its nice perf characteristics, it's come to be used throughout the Perfetto codebase.

But over time I've realised that it has some flaws which make it not suitable as a truly general purpose hash table:

* Very sensitive to tombstones: we've made some modifications to rehash if there are tombstones, but in heavy insert/erase workloads it's not enough.
* Very sensitive to collisions: on a collision, the default is quadratic probing, which can cause some quite sub-optimal memory access patterns, and using linear probing doesn't work for high load factor hash maps.
* Suboptimal memory layout: putting the ctrl/keys/values in different arrays means we're jumping around memory instead of staying in a single place, which is especially important for small maps.

All in all, these are all things that absl::flat_hash_map and Swiss tables have solved. There's no reason for us not to take advantage of at least their memory layout and organisation ideas even if we cannot use them directly. If they're good enough for much of the C++ world (and definitely all of Google) they should be good enough for us too!

So this CL adds a new FlatHashMapV2 which is based on Swiss Table concepts. Specifically:

* The idea of storing the control bytes, keys and values all in the same array.
* The logic to probe using triangular numbers and do group-based probing.
* Using SIMD to access 8/16 elements at a time and do comparisons of hashes.
* The idea of storing 7 bits of the hash in each control byte to do fast checks and reduce the probability of a false match to 1/128 before even checking keys.
* Using SWAR on non-SIMD platforms (including ARM64, though we might optimize this with NEON in the future depending on what we see).
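To make these ideas concrete, here is a minimal sketch of a Swiss-table-style group probe, assuming an SSE2 target with a SWAR fallback. This is illustrative only and is not the actual FlatHashMapV2 code: names like `kEmpty`, `H1`, `H2`, `MatchGroup` and `ProbeSeq` are invented for the example.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Control bytes: empty/deleted sentinels have the sign bit set, while full
// slots store a non-negative 7-bit tag derived from the hash. The concrete
// sentinel values here are illustrative.
constexpr int8_t kEmpty = -128;
constexpr int8_t kDeleted = -2;

// Split the hash: H1 picks the starting group, H2 is the 7-bit tag stored in
// the control byte. A tag match is a false positive only 1 time in 128.
inline uint64_t H1(uint64_t hash) { return hash >> 7; }
inline int8_t H2(uint64_t hash) { return static_cast<int8_t>(hash & 0x7f); }

// SSE2 path: compare all 16 control bytes of a group against the tag at once
// and return a bitmask of candidate slots. The caller still compares full
// keys for each set bit.
inline uint32_t MatchGroup(const int8_t* ctrl, int8_t tag) {
  __m128i group = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ctrl));
  __m128i eq = _mm_cmpeq_epi8(group, _mm_set1_epi8(tag));
  return static_cast<uint32_t>(_mm_movemask_epi8(eq));
}

// SWAR fallback: the same idea on 8 control bytes packed into a uint64_t.
// May report an occasional extra candidate, which is harmless because keys
// are compared before a match is accepted.
inline uint64_t MatchGroupSwar(uint64_t group, uint8_t tag) {
  constexpr uint64_t kLsbs = 0x0101010101010101ULL;
  constexpr uint64_t kMsbs = 0x8080808080808080ULL;
  uint64_t x = group ^ (kLsbs * tag);  // matching bytes become zero
  return (x - kLsbs) & ~x & kMsbs;     // high bit set for candidate bytes
}

// Probe over groups using triangular numbers: 0, 1, 3, 6, 10, ... groups away
// from the home group. With a power-of-two group count this visits every
// group exactly once.
struct ProbeSeq {
  uint64_t offset;
  uint64_t step = 0;
  uint64_t mask;  // num_groups - 1, where num_groups is a power of two
  ProbeSeq(uint64_t h1, uint64_t num_groups_mask)
      : offset(h1 & num_groups_mask), mask(num_groups_mask) {}
  uint64_t group() const { return offset; }
  void next() {
    step += 1;
    offset = (offset + step) & mask;
  }
};
```

On a lookup you would compute `H1`/`H2` for the key, walk the `ProbeSeq`, call `MatchGroup` on each group's control bytes, compare keys only for the set bits, and stop as soon as a group contains an empty slot.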
The benchmarks are quite good for the general cases.

For TraceStrings:

| Benchmark | Time (ns) | CPU (ns) | Iterations | Rate | Counters |
|---|---:|---:|---:|---:|---:|
| `BM_HashMap_InsertTraceStrings_AppendOnly` | 554254588 | 553651249 | 1 | 409.798M/s | 226.885M, 1.17868M |
| `BM_HashMap_InsertTraceStrings<Ours_PreHashed>` | 698876584 | 698092350 | 1 | 325.008M/s | 226.885M, 1.17868M |
| `BM_HashMap_InsertTraceStrings<OursV2_PreHashed>` | 662225759 | 661307563 | 1 | 343.086M/s | 226.885M, 1.17868M |
| `BM_HashMap_InsertTraceStrings<StdUnorderedMap_PreHashed>` | 2376921965 | 2374903076 | 1 | 95.5346M/s | 226.885M, 1.17868M |
| `BM_HashMap_InsertTraceStrings<AbslFlatHashMap_PreHashed>` | 631832647 | 631347455 | 1 | 359.367M/s | 226.885M, 1.17868M |

This is the only case where the old implementation wins (which makes sense because it was specifically designed for it!). But I don't think a loss here justifies the major wins elsewhere (especially because in the "normal" case, cache misses would drown out the small differences here anyway, given how widespread interning is in practice). It also only wins with AppendOnly, which is rarely used.

For TraceTids:

| Benchmark | Time (ns) | CPU (ns) | Iterations | Rate | Counters |
|---|---:|---:|---:|---:|---:|
| `BM_HashMap_TraceTids<Ours_Tid>` | 741954852 | 741389092 | 1 | 183.063M/s | 16.099k |
| `BM_HashMap_TraceTids<OursV2_Tid>` | 504387549 | 503475605 | 1 | 269.568M/s | 16.099k |
| `BM_HashMap_TraceTids<StdUnorderedMap_Tid>` | 1401707453 | 1400083835 | 1 | 96.9378M/s | 16.099k |
| `BM_HashMap_TraceTids<AbslFlatHashMap_Tid>` | 510576964 | 510105976 | 1 | 266.064M/s | 16.099k |

We're significantly better than V1 here: tids is exactly the sort of "general behaviour" which is very common to see in Perfetto and where V1 behaves less well than I'd like.

For LookupRandInts:

| Benchmark | Time (ns) | CPU (ns) | Iterations | Rate | Counters |
|---|---:|---:|---:|---:|---:|
| `BM_HashMap_LookupRandInts<Ours_Default>` | 574181027 | 573859405 | 1 | 17.4259M/s | 10.7378P |
| `BM_HashMap_LookupRandInts<OursV2_Default>` | 350956969 | 350711465 | 2 | 28.5135M/s | 10.7378P |
| `BM_HashMap_LookupRandInts<StdUnorderedMap_Default>` | 571851786 | 571499561 | 1 | 17.4978M/s | 10.7378P |
| `BM_HashMap_LookupRandInts<AbslFlatHashMap_Default>` | 366032708 | 365819315 | 2 | 27.3359M/s | 10.7378P |
| `BM_HashMap_LookupRandInts<StdUnorderedMap_Murmur>` | 611111313 | 610638342 | 1 | 16.3763M/s | 10.7378P |
| `BM_HashMap_LookupRandInts<AbslFlatHashMap_Murmur>` | 363292801 | 362867428 | 2 | 27.5583M/s | 10.7378P |
| `BM_HashMap_LookupRandInts<Ours_PreHashed>` | 713060262 | 712048102 | 1 | 14.044M/s | 10.7378P |
| `BM_HashMap_LookupRandInts<OursV2_PreHashed>` | 329546786 | 329117625 | 2 | 30.3843M/s | 10.7378P |
| `BM_HashMap_LookupRandInts<StdUnorderedMap_PreHashed>` | 580294840 | 579547604 | 1 | 17.2548M/s | 10.7378P |
| `BM_HashMap_LookupRandInts<AbslFlatHashMap_PreHashed>` | 340158501 | 339918114 | 2 | 29.4188M/s | 10.7378P |

Another handy win which appears regardless of whether the hash function is good or not. Notice how in the V1 case pre-hashing noticeably degrades performance, whereas V2 actually gets *faster*.

For InsertCollidingInts:

| Benchmark | Time (ns) | CPU (ns) | Iterations | Rate |
|---|---:|---:|---:|---:|
| `BM_HashMap_InsertCollidingInts<Ours_Default>` | 701989492 | 701578722 | 1 | 14.2536M/s |
| `BM_HashMap_InsertCollidingInts<OursV2_Default>` | 427725299 | 427302016 | 2 | 23.4027M/s |
| `BM_HashMap_InsertCollidingInts<StdUnorderedMap_Default>` | 896047311 | 895384570 | 1 | 11.1684M/s |
| `BM_HashMap_InsertCollidingInts<AbslFlatHashMap_Default>` | 420245635 | 420006692 | 2 | 23.8091M/s |

Another big win for V2, and another example similar to the sort we'd see in practice in Perfetto.

For InsertDupeInts:

| Benchmark | Time (ns) | CPU (ns) | Iterations | Rate |
|---|---:|---:|---:|---:|
| `BM_HashMap_InsertDupeInts<Ours_Default>` | 63926278 | 63874785 | 11 | 156.556M/s |
| `BM_HashMap_InsertDupeInts<OursV2_Default>` | 40291231 | 40268380 | 17 | 248.334M/s |
| `BM_HashMap_InsertDupeInts<StdUnorderedMap_Default>` | 37554643 | 37527602 | 19 | 266.471M/s |
| `BM_HashMap_InsertDupeInts<AbslFlatHashMap_Default>` | 29026714 | 29008928 | 24 | 344.721M/s |
| `BM_HashMap_InsertDupeInts<AbslFlatHashMap_Murmur>` | 38883999 | 38857011 | 18 | 257.354M/s |

Same as above, another common case. Surprisingly, std does very well here. We also seem to be bottlenecked on the hash function, given how much better absl does with its own hash vs murmur (its hash is slightly lower quality but also faster). This is the only benchmark where this happens though; the others don't show this behaviour.

For *String:

| Benchmark | Time (ns) | CPU (ns) | Iterations | Rate |
|---|---:|---:|---:|---:|
| `BM_HashMap_HeterogeneousLookup_String<Ours_String>` | 575812136 | 575383946 | 1 | 17.3797M/s |
| `BM_HashMap_HeterogeneousLookup_String<OursV2_String>` | 420531571 | 420120838 | 2 | 23.8027M/s |
| `BM_HashMap_RegularLookup_String<Ours_String>` | 820167931 | 819494830 | 1 | 12.2026M/s |
| `BM_HashMap_RegularLookup_String<OursV2_String>` | 405975726 | 405482422 | 2 | 24.662M/s |
| `BM_HashMap_RegularLookup_String<StdUnorderedMap_String>` | 1577833963 | 1576585547 | 1 | 6.34282M/s |
| `BM_HashMap_RegularLookup_String<AbslFlatHashMap_String>` | 491132758 | 490802459 | 2 | 20.3748M/s |
| `BM_HashMap_RegularLookup_String<AbslFlatHashMap_String_Murmur>` | 571520036 | 571134272 | 1 | 17.509M/s |

All clear wins for us. We're actually doing much better than even absl here!
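For context on the `HeterogeneousLookup_String` benchmarks above: heterogeneous lookup means querying a map keyed by `std::string` with a `std::string_view` without materialising a temporary `std::string`. A minimal sketch of the idea using the standard library's C++20 transparent-lookup machinery (purely illustrative, not the FlatHashMapV2 API):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <unordered_map>

// Transparent hasher: the `is_transparent` tag opts into C++20 heterogeneous
// lookup, so find() can take a std::string_view directly.
struct TransparentStringHash {
  using is_transparent = void;
  size_t operator()(std::string_view s) const {
    return std::hash<std::string_view>{}(s);
  }
};

int main() {
  std::unordered_map<std::string, int, TransparentStringHash, std::equal_to<>>
      map;
  map.emplace("frame_ts", 42);

  std::string_view key = "frame_ts";
  auto it = map.find(key);  // no temporary std::string is constructed
  return it != map.end() ? 0 : 1;
}
```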
InsertVaryingSizes:

| Benchmark | Time (ns) | CPU (ns) | Iterations | Rate |
|---|---:|---:|---:|---:|
| `BM_HashMap_InsertVaryingSize<Ours_Default>/100` | 754 | 754 | 932167 | 132.677M/s |
| `BM_HashMap_InsertVaryingSize<Ours_Default>/10000` | 200216 | 200096 | 3531 | 49.9759M/s |
| `BM_HashMap_InsertVaryingSize<Ours_Default>/1000000` | 54744741 | 54674135 | 13 | 18.2902M/s |
| `BM_HashMap_InsertVaryingSize<OursV2_Default>/100` | 1381 | 1380 | 505094 | 72.4606M/s |
| `BM_HashMap_InsertVaryingSize<OursV2_Default>/10000` | 164971 | 164851 | 4250 | 60.6608M/s |
| `BM_HashMap_InsertVaryingSize<OursV2_Default>/1000000` | 38351821 | 38310617 | 18 | 26.1024M/s |
| `BM_HashMap_InsertVaryingSize<AbslFlatHashMap_Default>/100` | 1268 | 1267 | 551576 | 78.9371M/s |
| `BM_HashMap_InsertVaryingSize<AbslFlatHashMap_Default>/10000` | 137933 | 137845 | 5065 | 72.5454M/s |
| `BM_HashMap_InsertVaryingSize<AbslFlatHashMap_Default>/1000000` | 39980643 | 39943874 | 18 | 25.0351M/s |

Note how V1 does better for very small maps but drops off as maps get bigger. I think this is probably because everything fits in cache even with the separate key/value/ctrl arrays, and SIMD is not as useful with such small maps.

LookupWithMisses:

| Benchmark | Time (ns) | CPU (ns) | Iterations | Rate |
|---|---:|---:|---:|---:|
| `BM_HashMap_LookupWithMisses<Ours_Default>/0` | 359571712 | 359090314 | 2 | 27.8481M/s |
| `BM_HashMap_LookupWithMisses<Ours_Default>/50` | 426399559 | 425828348 | 2 | 23.4836M/s |
| `BM_HashMap_LookupWithMisses<Ours_Default>/100` | 308777908 | 308628319 | 2 | 32.4014M/s |
| `BM_HashMap_LookupWithMisses<OursV2_Default>/0` | 195633544 | 195481786 | 4 | 51.1557M/s |
| `BM_HashMap_LookupWithMisses<OursV2_Default>/50` | 221380263 | 221180357 | 3 | 45.212M/s |
| `BM_HashMap_LookupWithMisses<OursV2_Default>/100` | 139350634 | 139236152 | 5 | 71.8204M/s |
| `BM_HashMap_LookupWithMisses<AbslFlatHashMap_Default>/0` | 201567396 | 201412429 | 4 | 49.6494M/s |
| `BM_HashMap_LookupWithMisses<AbslFlatHashMap_Default>/50` | 206378283 | 206259753 | 3 | 48.4826M/s |
| `BM_HashMap_LookupWithMisses<AbslFlatHashMap_Default>/100` | 123735273 | 123601106 | 6 | 80.9054M/s |

Notice how much, much better we do when the miss rate is either very low or very high. This is indicative of the branch predictor doing a much better job of dealing with our use cases. But even at 50% misses we are far superior.

InsertSequential:

| Benchmark | Time (ns) | CPU (ns) | Iterations | Rate |
|---|---:|---:|---:|---:|
| `BM_HashMap_InsertSequentialInts<Ours_Default>` | 781732694 | 781402467 | 1 | 12.7975M/s |
| `BM_HashMap_InsertSequentialInts<OursV2_Default>` | 423644733 | 423098276 | 2 | 23.6352M/s |
| `BM_HashMap_InsertSequentialInts<StdUnorderedMap_Default>` | 473308044 | 472810618 | 2 | 21.1501M/s |
| `BM_HashMap_InsertSequentialInts<AbslFlatHashMap_Default>` | 420184095 | 419919524 | 2 | 23.8141M/s |

Huge win for what is quite a common case in trace processor. Even std does better than V1 here, which is quite surprising!
LookupSequential:

| Benchmark | Time (ns) | CPU (ns) | Iterations | Rate |
|---|---:|---:|---:|---:|
| `BM_HashMap_LookupSequentialInts<Ours_Default>` | 586390209 | 585999772 | 1 | 17.0649M/s |
| `BM_HashMap_LookupSequentialInts<OursV2_Default>` | 344488692 | 344312148 | 2 | 29.0434M/s |
| `BM_HashMap_LookupSequentialInts<StdUnorderedMap_Default>` | 49338334 | 49299886 | 15 | 202.84M/s |
| `BM_HashMap_LookupSequentialInts<AbslFlatHashMap_Default>` | 230591313 | 230421841 | 3 | 43.3987M/s |

Std destroys everyone here, which I find highly surprising and don't have a good explanation for. Maybe it's because it basically has a perfect hash function (since by default std::hash just returns the int?). Regardless, we're doing better than V1.
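For what it's worth, on libstdc++ and libc++ `std::hash` for integral types is typically the identity function, which would indeed make sequential lookups behave like a perfect-hash table. A quick, illustrative way to check this on a given toolchain (not part of this CL):

```cpp
#include <cstdio>
#include <functional>

int main() {
  // On libstdc++ and libc++ this typically prints "42 1000000", i.e. the hash
  // is the value itself; other implementations (e.g. MSVC's STL) mix the bits.
  std::printf("%zu %zu\n", std::hash<int>{}(42), std::hash<int>{}(1000000));
}
```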
🎨 Perfetto UI Build
✅ UI build is ready: https://storage.googleapis.com/perfetto-ci-artifacts/gh-20608617586-1-ui/ui/index.html