Bugfixes, benchmarks and improvements to FlatMap #1882

Open: kennyweiss wants to merge 28 commits into develop from feature/kweiss/flatmap-improvements

@kennyweiss commented Jun 11, 2026

Summary

  • This PR adds bugfixes and performance improvements to axom::FlatMap
  • It also adds an initial benchmark suite comparing FlatMap against std::unordered_map, google sparsehash and std::map
  • Bugfixes:
    • Fixed bugs in truncating hashes to 32 bits (when IndexType is 32 bits), in casting from float to int when hashing, in using operator[] on const maps, and in the copy-assign operator.
  • Optimizations:
    • Since the group count is always a power of 2, we can use a bitmask rather than mod (%)
    • Specialized batch insertion for the sequential exec policy, where we don't need to worry about synchronization

Results/comparisons

Serial benchmark results using a RelWithDebInfo config (lower is better)

Serial results are now roughly comparable to or better than std::unordered_map, std::map and our vendored google sparsehash. In the plots, FlatMap uses our default hash function and FlatMapFastHash uses a different hash function that appears to be somewhat faster.

Hashing 32K pairs ($2^{15}$)

image

Hashing 1M pairs ($2^{20}$)

image

Serial vs. OMP vs GPU

This branch has some modest speedups vs. develop. The plots compare serial (SEQ) and OpenMP (OMP) runs on this branch against axom@develop.
OMP results use {1,2,4,8,16,32,64} threads and were run with

OMP_NUM_THREADS=<n> OMP_PLACES=cores OMP_PROC_BIND=close

Hashing 32K pairs ($2^{15}$)

core_flatmap_speedup_wall_N32768

Hashing 1M pairs ($2^{20}$)

core_flatmap_speedup_wall_N1048576

Hashing 32M pairs ($2^{25}$)

core_flatmap_speedup_wall_N33554432

Adds typed tests for the copy-assignment operator covering assignment over a non-empty target,
source preservation, and self-assignment.

Removing the const operator[] overload cannot break callers, since any call to it would not have compiled.
Const callers should use find()/at()/count()/contains().
at() throws std::out_of_range on a missing key.
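
A minimal sketch of the const-access pattern described above. It uses std::unordered_map as a stand-in, since it also has no const operator[] and its at() also throws std::out_of_range. Whether axom::FlatMap's signatures match exactly is an assumption, and the helper names `lookupOr` and `atThrowsOnMissing` are hypothetical.

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>

// Stand-in map type (assumption: FlatMap exposes find()/at() similarly).
using Map = std::unordered_map<int, std::string>;

// Hypothetical helper: read from a const map without ever inserting.
inline std::string lookupOr(const Map& m, int key, const std::string& fallback)
{
  const auto it = m.find(key);
  return it != m.end() ? it->second : fallback;
}

// at() on a const map throws std::out_of_range when the key is missing.
inline bool atThrowsOnMissing(const Map& m, int key)
{
  try
  {
    (void)m.at(key);
    return false;
  }
  catch(const std::out_of_range&)
  {
    return true;
  }
}
```

Use find() when a missing key is an expected case, and at() when a missing key is an error.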

DeviceHashHelper returned axom::IndexType, and integer keys were converted to that type
before the 64-bit mixer ran. With AXOM_USE_64BIT_INDEXTYPE=OFF, every key wider than 32 bits
was truncated first, so keys equal mod 2^32 produced identical final hashes.
This was happening with the Morton codes in spin's SparseOctreeLevel and in numerics/quadrature.
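
The effect of truncating before mixing can be sketched as follows. The mixer is a splitmix64-style finalizer, chosen only as an example of a 64-bit mixer; it is an assumption and not necessarily the one FlatMap uses. The function names are hypothetical.

```cpp
#include <cstdint>

// Example 64-bit mixer (splitmix64 finalizer). Assumption: stands in for
// whatever mixer the real hash uses.
inline std::uint64_t mix64(std::uint64_t x)
{
  x ^= x >> 30;
  x *= 0xbf58476d1ce4e5b9ULL;
  x ^= x >> 27;
  x *= 0x94d049bb133111ebULL;
  x ^= x >> 31;
  return x;
}

// Buggy ordering: narrow the key to 32 bits, then mix.
// Keys that are equal mod 2^32 become indistinguishable.
inline std::uint64_t hashTruncateFirst(std::uint64_t key)
{
  return mix64(static_cast<std::uint32_t>(key));
}

// Fixed ordering: mix the full 64-bit key.
inline std::uint64_t hashFullWidth(std::uint64_t key)
{
  return mix64(key);
}
```

For example, the keys 42 and 42 + 2^32 collide under `hashTruncateFirst` but not under `hashFullWidth`, because each step of this mixer is invertible on 64-bit values.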

The floating-point specialization returned the key converted to an integer.
Every key sharing an integer part therefore collided,
e.g. all numbers strictly between -1 and 1 converted to the integer 0,
so a FlatMap keyed on fractional floats degenerated into one probe chain with O(size) inserts and finds.
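
A sketch of the collision and of one common way to avoid it: hashing the bit pattern of the float instead of its converted value. This is an illustration under assumptions, not necessarily how this PR implements the fix; both function names are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// Buggy approach: value conversion discards the fractional part,
// so 0.25f, 0.75f and -0.75f all map to 0.
inline std::uint64_t hashByConversion(float key)
{
  return static_cast<std::uint64_t>(static_cast<std::int64_t>(key));
}

// Sketch of an alternative: use the bit pattern as the integer to be mixed.
inline std::uint64_t hashByBits(float key)
{
  // -0.0f and +0.0f compare equal, so fold them onto one bit pattern.
  if(key == 0.0f)
  {
    key = 0.0f;
  }
  std::uint32_t bits = 0;
  std::memcpy(&bits, &key, sizeof(bits));
  return bits;  // would then be fed to the integer mixer
}
```

Keys that compare equal must hash equal, which is why the two zeros are folded together before the bits are read.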

The quadratic probe advance in probeIndex and probeEmptyIndex wrapped
using a mod (%) operator. Since the group count is always a power of two,
we can use a bitmask instead.
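
The equivalence this relies on can be sketched as below. The function names are hypothetical and do not correspond to the actual probeIndex code.

```cpp
#include <cstddef>

// The mask form is only valid when the count is a power of two.
inline bool isPowerOfTwo(std::size_t n)
{
  return n != 0 && (n & (n - 1)) == 0;
}

// Wrap an index with the modulo operator.
inline std::size_t wrapMod(std::size_t i, std::size_t groupCount)
{
  return i % groupCount;
}

// Wrap an index with a bitmask; equals wrapMod when groupCount is a power of two.
inline std::size_t wrapMask(std::size_t i, std::size_t groupCount)
{
  return i & (groupCount - 1);
}
```

For a non-power-of-two count such as 48, the two forms disagree, so the mask is only safe because the group count is always rounded up to a power of two.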

Adds a cross-group probe stress test: a degenerate hash drives 600
keys through one initial group so inserts, lookups, misses, erases,
and reinserts all walk and wrap the group sequence.
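
The shape of such a test can be sketched with a constant hash functor. This sketch uses std::unordered_map as a stand-in container, so it exercises one long collision chain rather than FlatMap's group sequence; `DegenerateHash` and `runDegenerateStress` are hypothetical names.

```cpp
#include <cstddef>
#include <unordered_map>

// Every key hashes to the same value, forcing all keys to collide.
struct DegenerateHash
{
  std::size_t operator()(int) const noexcept { return 0; }
};

// Returns true if inserts, hits, misses, erases and reinserts all behave.
inline bool runDegenerateStress(int numKeys)
{
  std::unordered_map<int, int, DegenerateHash> m;
  for(int k = 0; k < numKeys; ++k) { m.emplace(k, 2 * k); }
  if(m.size() != static_cast<std::size_t>(numKeys)) { return false; }

  for(int k = 0; k < numKeys; ++k)
  {
    const auto it = m.find(k);
    if(it == m.end() || it->second != 2 * k) { return false; }  // hit
    if(m.count(k + numKeys) != 0) { return false; }             // miss
  }

  // Erase the even keys, then check that only the odd keys remain.
  for(int k = 0; k < numKeys; k += 2) { m.erase(k); }
  for(int k = 0; k < numKeys; ++k)
  {
    const std::size_t expected = (k % 2 == 0) ? 0 : 1;
    if(m.count(k) != expected) { return false; }
  }

  // Reinsert the even keys with new values.
  for(int k = 0; k < numKeys; k += 2) { m.emplace(k, -k); }
  if(m.size() != static_cast<std::size_t>(numKeys)) { return false; }
  for(int k = 0; k < numKeys; k += 2)
  {
    if(m.at(k) != -k) { return false; }
  }
  return true;
}
```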

@kennyweiss self-assigned this Jun 11, 2026
@kennyweiss added the bug, Core and Performance labels Jun 11, 2026

BM_Find_Hit looks keys up in the order they were inserted.
With that access pattern, node-based maps walk the heap nearly sequentially,
so the hardware prefetcher hides their pointer-chasing latency.

This commit adds find_hit_shuffled (same keys, independently shuffled lookup order)
and find_hit_randkeys (distinct pseudorandom 64-bit keys, shuffled lookup order)
to better exhibit expected lookup behavior.
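
A minimal sketch of building an independently shuffled lookup order for a benchmark like find_hit_shuffled. The function name and the choice of std::mt19937_64 are assumptions, not taken from the benchmark source.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Returns the keys 0..n-1 in a pseudorandom order that depends on the seed.
// The map would be filled in insertion order and then queried in this order.
inline std::vector<std::int64_t> shuffledLookupOrder(std::size_t n, std::uint64_t seed)
{
  std::vector<std::int64_t> keys(n);
  std::iota(keys.begin(), keys.end(), std::int64_t {0});
  std::mt19937_64 rng(seed);
  std::shuffle(keys.begin(), keys.end(), rng);
  return keys;
}
```

Building the order outside the timed loop keeps the cost of the shuffle out of the measurement.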

When find_with_hash() is not inlined, every lookup is more expensive
(extra registers, and a stack spill for the key) and requires loop-invariant setup
that cannot be hoisted out of the caller's lookup loop.

Forcing the probe path inline removed 20-40% of find_hit time and 15-35%
of find_miss time for FlatMap<int64,int64> at n = 2^16 and 2^20.
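
One common way to force inlining is a compiler-specific macro, sketched below. The macro name `SKETCH_FORCE_INLINE` and the function `probeStart` are hypothetical; this is not necessarily the mechanism used in this PR.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical macro: map to each compiler's "always inline" spelling.
#if defined(_MSC_VER)
  #define SKETCH_FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
  #define SKETCH_FORCE_INLINE inline __attribute__((always_inline))
#else
  #define SKETCH_FORCE_INLINE inline
#endif

// A small hot-path helper that benefits from being inlined into the
// caller's loop: groupCount - 1 can then be computed once, outside the loop.
SKETCH_FORCE_INLINE std::size_t probeStart(std::uint64_t hash, std::size_t groupCount)
{
  return static_cast<std::size_t>(hash) & (groupCount - 1);
}
```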

`getEmplacePos()` computed `Hash{}(key)`, then called `find(key)`,
which hashed the same key a second time.
It then performed a floating-point division against MAX_LOAD_FACTOR
on every insertion to decide whether to grow.

Note: This reduced the instruction count, but the performance improvements
were within run-to-run noise in our measurements.
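
One way to avoid a per-insertion division is to compare cross-multiplied integers, sketched below. The 7/8 ratio is a made-up value for illustration (the actual MAX_LOAD_FACTOR is not stated here), and this is not necessarily the approach taken in this PR.

```cpp
#include <cstddef>

// Hypothetical load-factor limit expressed as an integer ratio (7/8).
constexpr std::size_t kLoadNum = 7;
constexpr std::size_t kLoadDen = 8;

// Floating-point form: one division per insertion.
inline bool needsGrowFloat(std::size_t size, std::size_t capacity)
{
  const double limit = static_cast<double>(kLoadNum) / static_cast<double>(kLoadDen);
  return static_cast<double>(size + 1) / static_cast<double>(capacity) > limit;
}

// Integer form: (size + 1) / capacity > num / den, cross-multiplied.
// Assumes the products do not overflow std::size_t.
inline bool needsGrowInt(std::size_t size, std::size_t capacity)
{
  return (size + 1) * kLoadDen > capacity * kLoadNum;
}
```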

FlatMap rounds its group count up to a power of two, so for a fixed
element count the achievable load factors form a geometric ladder, and a
nominal target is quantized to the next rung at or below it. At n = 2^16
the 0.70 target and the default reserve(n) geometry coincide (actual load
factor 0.533, which is why find_hit_lf0p70 reproduced find_hit to within
noise), and the 0.50 target lands at 0.267, i.e. a table twice as large.
The 0.50 scenario was therefore measuring a larger working set, not a shorter
probe sequence.
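
The ladder can be reproduced with a short calculation. This sketch assumes 15 slots per group (the slot count mentioned for the scalar bucket scan in this PR) and assumes the table picks the smallest power-of-two group count whose load factor does not exceed the target; the function names are hypothetical.

```cpp
#include <cstddef>

// Assumption: 15 slots per group.
constexpr std::size_t kSlotsPerGroup = 15;

// Load factor for n elements stored in the given number of groups.
inline double actualLoadFactor(std::size_t n, std::size_t groups)
{
  return static_cast<double>(n) / static_cast<double>(groups * kSlotsPerGroup);
}

// Smallest power-of-two group count whose load factor is at or below target.
inline std::size_t groupsForTarget(std::size_t n, double target)
{
  std::size_t groups = 1;
  while(actualLoadFactor(n, groups) > target)
  {
    groups *= 2;
  }
  return groups;
}
```

For n = 2^16 this gives 8192 groups for a 0.70 target (65536 / 122880 ≈ 0.533) and 16384 groups for a 0.50 target (65536 / 245760 ≈ 0.267), matching the figures above.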

The SSE2 path of GroupBucket::visitHashBucket() stops visiting as soon as
the visitor returns false, but the scalar fallback (including the GPU path)
ignored the return value and kept scanning all 15 slots.

In-tree visitors and the duplicate check in the batched insert path
return false to mean 'stop'. Each extra visit loads and compares a key,
which could incur a cache miss per probe group.
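
A sketch of a scalar visit loop that honors the visitor's return value. The data layout (one byte tag per slot) and the name `visitMatching` are assumptions for illustration, not the actual GroupBucket code.

```cpp
#include <array>
#include <cstdint>

constexpr int kSlots = 15;  // slots per group, as stated above

// Calls visitor(slot) for each slot whose tag matches `wanted`, and stops
// as soon as the visitor returns false. Returns the number of visitor calls.
template <typename Visitor>
int visitMatching(const std::array<std::uint8_t, kSlots>& tags,
                  std::uint8_t wanted,
                  Visitor&& visitor)
{
  int visits = 0;
  for(int slot = 0; slot < kSlots; ++slot)
  {
    if(tags[slot] != wanted)
    {
      continue;
    }
    ++visits;
    if(!visitor(slot))
    {
      break;  // 'false' means stop: skip the remaining slots
    }
  }
  return visits;
}
```

In a lookup, the visitor would compare the stored key and return false once it finds a match, so the remaining matching tags are never dereferenced.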

Emplacing a new key walked the probe sequence twice: first to check
whether the key was already present, and then to find an empty slot for it.
We now do both within a single call.

* Disables sequential find_hit search by default since it is not representative.
* Guards several tests by the feature they are testing.

Also adds more device hashing tests.

Also improves device hashing of floating-point types (float and long double).