perf(raw_table): optimize find_inner with manual bitmask drain #707
0xdeafbeef wants to merge 1 commit into rust-lang:master
Conversation
Actually just resolving my comments here because I didn't read the issue correctly. I guess it's unfortunate that you have to add extra methods that must be called correctly, but if it's a performance win, that's okay.
Yep, it's unfortunate that the for loop gives a huge penalty. There is still a win when using it, but it's not as big. It would be nice if someone who knows how LLVM works could take a look. My guess is that I can share
Change `RawTable::find_inner()` to walk `match_tag()` results with `lowest_set_bit()` / `remove_lowest_bit()` instead of the iterator path, and compute `has_empty` before checking matching buckets. Add `BitMask::normalize_for_iteration()` so manual draining stays correct on backends where one logical match may occupy multiple raw bits.
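The manual drain described above can be sketched as follows. Note that `BitMask`, `lowest_set_bit`, and `remove_lowest_bit` here are simplified stand-ins for hashbrown's internals (which operate on SIMD-derived masks), not the real types:

```rust
// Simplified stand-in for hashbrown's BitMask, assuming one raw bit per slot.
#[derive(Copy, Clone)]
struct BitMask(u16);

impl BitMask {
    /// Index of the lowest set bit, or None if the mask is empty.
    fn lowest_set_bit(self) -> Option<usize> {
        if self.0 == 0 {
            None
        } else {
            Some(self.0.trailing_zeros() as usize)
        }
    }

    /// Clear the lowest set bit using the classic `x & (x - 1)` idiom.
    fn remove_lowest_bit(self) -> BitMask {
        BitMask(self.0 & self.0.wrapping_sub(1))
    }
}

/// Collect candidate slot indices in a group, lowest bit first — the same
/// walk order the iterator path produces, but as an explicit loop.
fn drain_matches(mut mask: BitMask) -> Vec<usize> {
    let mut out = Vec::new();
    while let Some(bit) = mask.lowest_set_bit() {
        out.push(bit);
        mask = mask.remove_lowest_bit();
    }
    out
}

fn main() {
    // Bits 1, 4, and 9 set: slots 1, 4, and 9 matched the tag.
    assert_eq!(drain_matches(BitMask(0b10_0001_0010)), vec![1, 4, 9]);
}
```

The explicit loop gives the optimizer a simpler control-flow shape than the generic iterator adaptor, which is plausibly where the codegen win comes from.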
Could you also include an LLVM IR diff? It might be easier to read than an asm diff.
I compared the following benchmark:

```rust
pub fn lookup_serial(iterations: usize) {
    let mut m: FoldHashMap<usize, DropType> = FoldHashMap::default();
    for i in 0..SIZE {
        m.insert(i, DropType(i));
    }
    for _ in 0..iterations {
        for i in 0..SIZE {
            black_box(m.get(&i));
        }
    }
}
```
where the hashmap comes from either master or my fork as a path dependency. The diff was obtained with llvm-diff.

in function FUNC_reserve_rehash:
in block %bb15.i:
> tail call void @_RNvNtCsc7xFPsoYjqI_4core9panicking9panic_fmt(ptr noundef nonnull @alloc_70fa19237e466d7d1e23911705586cf6, ptr noundef nonnull inttoptr (i64 57 to ptr), ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_1817aa4d0833b30e03989513e9b3c01a) #17, !noalias !121
< tail call void @_RNvNtCsc7xFPsoYjqI_4core9panicking9panic_fmt(ptr noundef nonnull @alloc_70fa19237e466d7d1e23911705586cf6, ptr noundef nonnull inttoptr (i64 57 to ptr), ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_684637c4c4086c221313203cd7ec9cff) #17, !noalias !117
in function FUNC_fallible_with_capacity:
in block %bb11:
> tail call void @_RNvNtCsc7xFPsoYjqI_4core9panicking9panic_fmt(ptr noundef nonnull @alloc_70fa19237e466d7d1e23911705586cf6, ptr noundef nonnull inttoptr (i64 57 to ptr), ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_1817aa4d0833b30e03989513e9b3c01a) #17
< tail call void @_RNvNtCsc7xFPsoYjqI_4core9panicking9panic_fmt(ptr noundef nonnull @alloc_70fa19237e466d7d1e23911705586cf6, ptr noundef nonnull inttoptr (i64 57 to ptr), ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_684637c4c4086c221313203cd7ec9cff) #17
in block %bb13.i:
> tail call void @_RNvNtCsc7xFPsoYjqI_4core9panicking9panic_fmt(ptr noundef nonnull @alloc_70fa19237e466d7d1e23911705586cf6, ptr noundef nonnull inttoptr (i64 57 to ptr), ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_1817aa4d0833b30e03989513e9b3c01a) #17, !noalias !5
< tail call void @_RNvNtCsc7xFPsoYjqI_4core9panicking9panic_fmt(ptr noundef nonnull @alloc_70fa19237e466d7d1e23911705586cf6, ptr noundef nonnull inttoptr (i64 57 to ptr), ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_684637c4c4086c221313203cd7ec9cff) #17, !noalias !5
in function FUNC_lookup_serial:
in block %bb20:
> %iter4.sroa.0.021 = phi i64 [ 0, %bb17 ], [ %_28, %_RINvMs6_NtCsbTX5VcptrMo_23hashbrown_clean_compare3rawINtB6_8RawTableTjNtCs7m6UUWt6V5j_22hashbrown_path_compare8DropTypeEE4findNCINvNtB8_3map14equivalent_keyjjB16_E0EB18_.exit.i ]
> %iter2.sroa.0.020 = phi i64 [ 1000, %bb17 ], [ %49, %_RINvMs6_NtCsbTX5VcptrMo_23hashbrown_clean_compare3rawINtB6_8RawTableTjNtCs7m6UUWt6V5j_22hashbrown_path_compare8DropTypeEE4findNCINvNtB8_3map14equivalent_keyjjB16_E0EB18_.exit.i ]
< %iter4.sroa.0.019 = phi i64 [ 0, %bb17 ], [ %_28, %_RINvMs6_NtCs8UPypCBVm3R_24hashbrown_master_compare3rawINtB6_8RawTableTjNtCs7m6UUWt6V5j_22hashbrown_path_compare8DropTypeEE4findNCINvNtB8_3map14equivalent_keyjjB17_E0EB19_.exit.i ]
< %iter2.sroa.0.018 = phi i64 [ 1000, %bb17 ], [ %49, %_RINvMs6_NtCs8UPypCBVm3R_24hashbrown_master_compare3rawINtB6_8RawTableTjNtCs7m6UUWt6V5j_22hashbrown_path_compare8DropTypeEE4findNCINvNtB8_3map14equivalent_keyjjB17_E0EB19_.exit.i ]
in block %bb1.i.i.i:
> %hash.pn.i.i.i = phi i64 [ %34, %bb20 ], [ %47, %bb24.i.i.i ]
> %probe_seq1.sroa.0.0.i.i.i = phi i64 [ 0, %bb20 ], [ %46, %bb24.i.i.i ]
> %probe_seq.sroa.0.0.i.i.i = and i64 %hash.pn.i.i.i, %bucket_mask.i.i.i
> %_20.i.i.i = getelementptr inbounds nuw i8, ptr %_22.i.i.i, i64 %probe_seq.sroa.0.0.i.i.i
> %dst.sroa.0.0.copyload.i17.i.i = load <16 x i8>, ptr %_20.i.i.i, align 1, !noalias !50
> %37 = icmp eq <16 x i8> %dst.sroa.0.0.copyload.i17.i.i, splat (i8 -1)
> %38 = bitcast <16 x i1> %37 to i16
> %has_empty.not.i.i.i = icmp eq i16 %38, 0
> %39 = icmp eq <16 x i8> %dst.sroa.0.0.copyload.i17.i.i, %36
> %40 = bitcast <16 x i1> %39 to i16
> %41 = icmp eq i16 %40, 0
> br i1 %41, label %bb4.i.i.i, label %bb16.i.i.i, !prof !6
< %probe_seq1.sroa.0.0.i.i.i = phi i64 [ 0, %bb20 ], [ %46, %bb20.i.i.i ]
< %hash.pn.i.i.i = phi i64 [ %34, %bb20 ], [ %47, %bb20.i.i.i ]
< %probe_seq.sroa.0.0.i.i.i = and i64 %hash.pn.i.i.i, %_16.i.i.i
< %_17.i.i.i = getelementptr inbounds nuw i8, ptr %_19.i.i.i, i64 %probe_seq.sroa.0.0.i.i.i
< %dst.sroa.0.0.copyload.i18.i.i = load <16 x i8>, ptr %_17.i.i.i, align 1, !noalias !48
< %37 = icmp eq <16 x i8> %dst.sroa.0.0.copyload.i18.i.i, %36
< %38 = bitcast <16 x i1> %37 to i16
< br label %bb2.i.i.i
in block %_RINvMs6_NtCs8UPypCBVm3R_24hashbrown_master_compare3rawINtB6_8RawTableTjNtCs7m6UUWt6V5j_22hashbrown_path_compare8DropTypeEE4findNCINvNtB8_3map14equivalent_keyjjB17_E0EB19_.exit.i / %_RINvMs6_NtCsbTX5VcptrMo_23hashbrown_clean_compare3rawINtB6_8RawTableTjNtCs7m6UUWt6V5j_22hashbrown_path_compare8DropTypeEE4findNCINvNtB8_3map14equivalent_keyjjB16_E0EB18_.exit.i:
> %48 = phi ptr [ %43, %bb16.i.i.i ], [ null, %bb4.i.i.i ]
> %.not.i = icmp eq ptr %48, null
> %v.i = getelementptr inbounds i8, ptr %48, i64 -8
> %_0.sroa.0.1.i = select i1 %.not.i, ptr null, ptr %v.i
< %_0.sroa.3.0.i.i.i = phi i64 [ %index.i.i.i, %bb10.i.i.i ], [ undef, %bb11.i.i.i ]
< %_18.i.i = sub nsw i64 0, %_0.sroa.3.0.i.i.i
< %48 = getelementptr inbounds { i64, i64 }, ptr %_19.i.i.i, i64 %_18.i.i
< %_0.sroa.0.0.i.i = select i1 %.not.i.not.i.i, ptr null, ptr %48
< %v.i = getelementptr inbounds i8, ptr %_0.sroa.0.0.i.i, i64 -8
< %_0.sroa.0.1.i = select i1 %.not.i.not.i.i, ptr null, ptr %v.i
%_28 = add nuw nsw i64 %iter4.sroa.0.019, 1
%49 = add nsw i64 %iter2.sroa.0.018, -1
call void @llvm.lifetime.start.p0(ptr nonnull %0)
> store ptr %_0.sroa.0.1.i, ptr %0, align 8
< store ptr %_0.sroa.0.1.i, ptr %0, align 8
Ok, I've started cleaning up the criterion bench suite and found that unfortunately this change only improves the numbers with the nightly feature enabled. So my guess is that the nightly feature makes hashbrown use the […]. Without the nightly feature it also gives up to 20%, but in the opposite direction :) Thanks, and sorry for the noise!
Would love to merge those benchmark updates into the repo if you're willing to adapt them for that.
This PR reduces overhead in the hot `RawTable` probe loops. The main improvement shows up in `find_inner()`. It replaces the iterator-based walk over the tag match mask with a manual `lowest_set_bit` / `remove_lowest_bit` drain, and adds `BitMask::normalize_for_iteration()` so that the drain stays correct on backends where one logical match spans multiple raw bits.

I've tried to apply the same optimisations to `find_or_find_insert_index_inner()`, but they gave no measurable difference, so I've left the original for loop there. In contrast, hoisting in `find_or_find_insert_index_inner()` gives a measurable 4% difference, so it's preserved there but removed from `find_inner()`. Hashing, probing behavior, and observable side effects are unchanged.
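The role of a normalization step before a manual drain can be illustrated with a minimal sketch. This is a hypothetical stand-in, assuming a byte-replicated backend where each matching slot sets all 8 bits of its byte in the raw mask (the real hashbrown backend constants and types may differ):

```rust
// Raw bits per logical slot — an assumption for this sketch only.
const STRIDE: u32 = 8;

/// Keep one bit per byte so the lowest-bit drain visits each slot exactly once.
fn normalize_for_iteration(raw: u64) -> u64 {
    // Mask with 0x01 repeated in every byte.
    raw & 0x0101_0101_0101_0101
}

/// Drain the normalized mask, mapping raw bit positions back to slot indices.
fn drain_slots(raw: u64) -> Vec<usize> {
    let mut mask = normalize_for_iteration(raw);
    let mut out = Vec::new();
    while mask != 0 {
        out.push((mask.trailing_zeros() / STRIDE) as usize);
        mask &= mask - 1; // clear the lowest set bit
    }
    out
}

fn main() {
    // Slots 0 and 2 matched: bytes 0 and 2 are 0xFF in the raw mask.
    assert_eq!(drain_slots(0x0000_0000_00FF_00FF), vec![0, 2]);
}
```

Without the normalization, the drain would report each matching slot eight times (once per raw bit), which is why the manual path needs this extra step while the iterator path handled it internally.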
I think the improvement comes from better codegen around the hot `dyn FnMut` predicate call: less probe state stays live across the candidate walk, and the loop operates on cached locals instead of repeatedly reloading from `self`.

Benches
Summary:
Tests
Added regression tests covering:
I also saw a strange regression when running the crate's nightly `#[bench]` tests, but a separate harness using Criterion and valgrind did not reproduce it for `grow_insert_foldhash_highbits` (valgrind gave the same cycle-by-cycle numbers). I can push it as a separate repo if needed (basically it's a copy-paste of the benches ported to Criterion).

asm diff for `find_inner` (pretty hard to cleanly compare them because of inlining):

Looks like the loop becomes tighter due to the manual iteration, with more state living in registers.
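The "cached locals" effect mentioned above can be shown with a small hypothetical sketch; the `Table` type and `count_matching` function below are invented for illustration and are not `RawTable`'s actual code path:

```rust
// Invented stand-in demonstrating the hoisting pattern: copy a field into a
// local once before the hot loop, so the optimizer does not have to prove
// that `self` is unmodified across an opaque predicate call.
struct Table {
    bucket_mask: usize,
}

impl Table {
    fn count_matching(&self, hashes: &[u64], mut eq: impl FnMut(usize) -> bool) -> usize {
        // Hoisted into a register-friendly local; without this, the compiler
        // may conservatively reload `self.bucket_mask` on every iteration
        // when it cannot see through the predicate call.
        let bucket_mask = self.bucket_mask;
        let mut hits = 0;
        for &h in hashes {
            let slot = (h as usize) & bucket_mask;
            if eq(slot) {
                hits += 1;
            }
        }
        hits
    }
}

fn main() {
    let t = Table { bucket_mask: 7 };
    // Hashes 1, 9, and 17 all land in slot 1 under an 8-slot mask.
    assert_eq!(t.count_matching(&[1, 9, 17], |s| s == 1), 3);
}
```

With a `dyn FnMut` predicate the call is fully opaque to the optimizer, so keeping probe state in locals rather than behind `self` plausibly explains the tighter loop seen in the asm diff.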