Bugfixes, benchmarks and improvements to FlatMap #1882
Open
kennyweiss wants to merge 28 commits into
Adds typed tests covering assignment over a non-empty target, source preservation, and self-assignment.
Removing it cannot break callers since this would not have compiled. Const callers should use find()/at()/count()/contains(). at() throws std::out_of_range on a missing key.
DeviceHashHelper returned axom::IndexType and integer keys were converted before the 64-bit mixer ran. With AXOM_USE_64BIT_INDEXTYPE=OFF every key wider than 32 bits is truncated first, so keys equal mod 2^32 produce identical final hashes. This was happening in the Morton codes in spin's SparseOctreeLevel and in numerics/quadrature.
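The effect of narrowing before mixing can be reproduced in isolation. The sketch below is not Axom's code: `mix64` is the splitmix64 finalizer used as a stand-in for the real mixer, and `hashTruncated` / `hashFull` are hypothetical names for the buggy and fixed shapes.

```cpp
#include <cstdint>

// Stand-in 64-bit mixer (splitmix64 finalizer); the real mixer may differ.
inline std::uint64_t mix64(std::uint64_t x)
{
  x ^= x >> 30;
  x *= 0xbf58476d1ce4e5b9ULL;
  x ^= x >> 27;
  x *= 0x94d049bb133111ebULL;
  x ^= x >> 31;
  return x;
}

// Buggy shape: the key is narrowed to 32 bits before the mixer sees it,
// as happens when the intermediate type is a 32-bit index type.
inline std::uint64_t hashTruncated(std::uint64_t key)
{
  const std::uint32_t narrowed = static_cast<std::uint32_t>(key);
  return mix64(narrowed);
}

// Fixed shape: all 64 bits of the key reach the mixer.
inline std::uint64_t hashFull(std::uint64_t key)
{
  return mix64(key);
}
```

With these definitions, keys `5` and `5 + 2^32` produce the same value from `hashTruncated` but different values from `hashFull`, because the mixer is a bijection on 64-bit inputs.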
The floating-point specialization returned the key converted to an integer. Every key sharing an integer part therefore collided -- e.g. every key strictly between -1 and 1 converted to the integer 0, so a FlatMap keyed on fractional floats degenerated into one probe chain with O(size) inserts and finds.
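A minimal sketch of the difference between the two approaches, showing only the value handed to the mixer. The function names are hypothetical, and hashing the bit pattern is one common fix rather than necessarily the one used in this PR.

```cpp
#include <cstdint>
#include <cstring>

// Old shape: value conversion discards the fractional part,
// so 0.25, 0.5 and -0.75 all map to 0.
inline std::uint64_t keyByValueConversion(double key)
{
  return static_cast<std::uint64_t>(static_cast<std::int64_t>(key));
}

// Alternative: feed the mixer the key's bit pattern instead.
// -0.0 and +0.0 compare equal, so they are collapsed to one pattern first.
inline std::uint64_t keyByBitPattern(double key)
{
  if(key == 0.0)
  {
    key = 0.0;  // maps -0.0 onto +0.0
  }
  static_assert(sizeof(std::uint64_t) == sizeof(double), "expects a 64-bit double");
  std::uint64_t bits = 0;
  std::memcpy(&bits, &key, sizeof(bits));
  return bits;
}
```

The result of `keyByBitPattern` would still go through the mixer; the point is only that distinct fractional keys no longer collapse to the same input.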
The quadratic probe advance in probeIndex and probeEmptyIndex wrapped using a mod (%) operator. Since the group count is always a power of two, we can use a bitmask instead. Adds a cross-group probe stress test: a degenerate hash drives 600 keys through one initial group so inserts, lookups, misses, erases, and reinserts all walk and wrap the group sequence.
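The equivalence the change relies on can be sketched as follows. The step rule (advance by the iteration count) and the function names are assumptions for illustration; the actual probe advance in `probeIndex` may differ.

```cpp
#include <cstddef>

// Wrap with the modulo operator: works for any group count.
inline std::size_t nextGroupMod(std::size_t group, std::size_t iteration, std::size_t groupCount)
{
  return (group + iteration) % groupCount;
}

// Wrap with a bitmask: valid only when groupCount is a power of two,
// because then (groupCount - 1) has exactly the low bits set.
inline std::size_t nextGroupMask(std::size_t group, std::size_t iteration, std::size_t groupCount)
{
  return (group + iteration) & (groupCount - 1);
}
```

For any power-of-two `groupCount` the two functions return the same index, and the mask form avoids an integer division on every probe step.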
BM_Find_Hit looks keys up in the order they were inserted. Since node-based maps walk the heap nearly sequentially, the hardware prefetcher hides their pointer-chasing latency. This commit adds find_hit_shuffled (same keys, independently shuffled lookup order) and find_hit_randkeys (distinct pseudorandom 64-bit keys, shuffled lookup order) to better exhibit expected lookup behavior.
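One way to build such a lookup order is sketched below. The helper name and the choice of `std::mt19937_64` are assumptions, not the benchmark's actual code.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Returns the inserted keys in a shuffled order, leaving the insertion
// order untouched so the map is still built the same way.
inline std::vector<std::int64_t> makeShuffledLookupOrder(const std::vector<std::int64_t>& insertedKeys,
                                                         std::uint64_t seed)
{
  std::vector<std::int64_t> order(insertedKeys);
  std::mt19937_64 rng(seed);
  std::shuffle(order.begin(), order.end(), rng);
  return order;
}
```

The benchmark loop would then call `find()` on each element of the returned vector, so consecutive lookups no longer follow allocation order.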
When find_with_hash() is not inlined, every lookup is more expensive (extra registers, and a stack spill for the key) and requires loop-invariant setup that cannot be hoisted out of the caller's lookup loop. Forcing the probe path inline removed 20-40% of find_hit time and 15-35% of find_miss time for FlatMap<int64,int64> at n = 2^16 and 2^20.
`getEmplacePos()` computed `Hash{}(key)`, then called `find(key)`,
which hashed the same key a second time.
It then performed a floating-point division against MAX_LOAD_FACTOR
on every insertion to decide whether to grow.
Note: This reduced the instruction count, but the performance improvements
were within run-to-run noise in our measurements.
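For the load-factor check, one way to avoid the per-insertion division is to cross-multiply and compare integers. This is a sketch under assumptions: the 3/4 ratio and both function names are hypothetical, and the PR may instead cache a precomputed threshold.

```cpp
#include <cstddef>

// Hypothetical maximum load factor of 3/4, written as a ratio
// so the comparison can stay in integer arithmetic.
constexpr std::size_t MAX_LOAD_NUM = 3;
constexpr std::size_t MAX_LOAD_DEN = 4;

// Shape of the old check: one floating-point division per insertion.
inline bool needsGrowFloat(std::size_t size, std::size_t capacity)
{
  const double maxLoad = static_cast<double>(MAX_LOAD_NUM) / static_cast<double>(MAX_LOAD_DEN);
  return static_cast<double>(size + 1) / static_cast<double>(capacity) > maxLoad;
}

// Integer form of (size + 1) / capacity > NUM / DEN, cross-multiplied.
inline bool needsGrowInt(std::size_t size, std::size_t capacity)
{
  return (size + 1) * MAX_LOAD_DEN > capacity * MAX_LOAD_NUM;
}
```

The two predicates agree for the table sizes of interest; the integer form needs only two multiplications by small constants.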
FlatMap rounds its group count up to a power of two, so for a fixed element count the achievable load factors form a geometric ladder and a nominal target is quantized to the next rung at or below it. At n = 2^16 the 0.70 target and the default reserve(n) geometry coincide (actual load factor 0.533, which is why find_hit_lf0p70 reproduced find_hit to within noise), and the 0.50 target lands at 0.267 -- a table twice as large. That scenario was really measuring a larger working set, not a shorter probe sequence.
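The quantization can be reproduced with a few lines of arithmetic. The sizing rule below is an assumption (divide by the target, round up to whole groups of 15 slots, then round the group count up to a power of two), but it yields the load factors quoted above.

```cpp
#include <cmath>
#include <cstddef>

constexpr std::size_t SLOTS_PER_GROUP = 15;  // slot count taken from the 15-slot groups mentioned in this PR

// Smallest power of two that is >= v.
inline std::size_t roundUpPow2(std::size_t v)
{
  std::size_t p = 1;
  while(p < v)
  {
    p <<= 1;
  }
  return p;
}

// Assumed sizing rule: enough slots for n elements at the target load,
// rounded up to whole groups, then to a power-of-two group count.
inline std::size_t groupsForTarget(std::size_t n, double targetLoad)
{
  const std::size_t slotsNeeded = static_cast<std::size_t>(std::ceil(static_cast<double>(n) / targetLoad));
  const std::size_t groups = (slotsNeeded + SLOTS_PER_GROUP - 1) / SLOTS_PER_GROUP;
  return roundUpPow2(groups);
}

// Load factor actually achieved once the group count has been rounded.
inline double actualLoadFactor(std::size_t n, double targetLoad)
{
  return static_cast<double>(n) / static_cast<double>(groupsForTarget(n, targetLoad) * SLOTS_PER_GROUP);
}
```

For n = 2^16, a 0.70 target needs 6242 groups and rounds to 8192 (load factor 65536 / 122880, about 0.533), while a 0.50 target needs 8739 groups and rounds to 16384 (about 0.267).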
The SSE2 path of GroupBucket::visitHashBucket() stops visiting as soon as the visitor returns false, but the scalar fallback (including the GPU path) ignored the return value and kept scanning all 15 slots. In-tree visitors and the duplicate check in the batched insert path return false to mean 'stop', and each extra visit loads and compares a key, which can incur a cache miss per probe group.
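The corrected scalar behavior can be sketched as below. `ScalarGroup` and `visitHashBucketScalar` are hypothetical stand-ins; the real GroupBucket layout and visitor signature are not shown in this PR text.

```cpp
#include <cstdint>

constexpr int SLOTS = 15;

// Hypothetical group: one reduced-hash tag byte per slot.
struct ScalarGroup
{
  std::uint8_t tags[SLOTS];
};

// Calls visitor(slot) for each slot whose tag matches, and stops
// as soon as the visitor returns false.
template <typename Visitor>
void visitHashBucketScalar(const ScalarGroup& group, std::uint8_t tag, Visitor&& visitor)
{
  for(int slot = 0; slot < SLOTS; ++slot)
  {
    if(group.tags[slot] == tag)
    {
      if(!visitor(slot))
      {
        return;  // honor the 'stop' signal instead of scanning the rest
      }
    }
  }
}
```

A visitor that returns false on its first call is invoked exactly once, even if every slot in the group carries a matching tag.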
Emplacing a new key walked the probe sequence twice -- first to check for an existing key and then to find an empty slot for the key. We now do both within a single call.
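The single-pass idea, reduced to a toy table, looks roughly like this. It is a sketch only: it uses linear probing over `std::optional` slots instead of FlatMap's grouped layout, assumes a power-of-two table with at least one empty slot, and does not model erased slots.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

struct ProbeResult
{
  std::size_t index;  // slot holding the key, or the first empty slot on its probe path
  bool found;         // true if the key is already present
};

// One walk of the probe sequence answers both questions: is the key
// present, and if not, where should it be placed?
inline ProbeResult findOrEmpty(const std::vector<std::optional<int>>& table, int key, std::size_t hash)
{
  const std::size_t mask = table.size() - 1;
  std::size_t i = hash & mask;
  while(true)
  {
    if(!table[i].has_value())
    {
      return {i, false};
    }
    if(*table[i] == key)
    {
      return {i, true};
    }
    i = (i + 1) & mask;
  }
}
```

The caller inserts at `index` when `found` is false and otherwise reuses the existing entry, so the probe sequence is never walked a second time.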
* Disables sequential find_hit search by default since it is not representative.
* Guards several tests by the feature they are testing
Also adds more device hashing tests
Also improves device hashing of floating point types (float and long double).
Summary
std::unordered_map, google sparsehash and std::map
Results/comparisons
Serial benchmark results using a RelWithDebInfo config (lower is better)
We now get roughly comparable or better results in serial -- compared to std::unordered_map, std::map and our vendored google sparsehash.
`FlatMap` uses our default hash function; `FlatMapFastHash` uses a different hash function that appears to be somewhat faster.
Hashing 32K pairs ($2^{15}$)
Hashing 1M pairs ($2^{20}$)
Serial vs. OMP vs GPU
This branch has some modest speedups vs. develop (showing serial and omp for this branch against axom@develop)
Showing SEQ and OMP with {1,2,4,8,16,32,64} threads and run with
Hashing 32K pairs ($2^{15}$)
Hashing 1M pairs ($2^{20}$)
Hashing 32M pairs ($2^{25}$)