perf(raw_table): optimize find_inner with manual bitmask drain #707

Closed
0xdeafbeef wants to merge 1 commit into rust-lang:master from
0xdeafbeef:faster-gets

Conversation

@0xdeafbeef
Contributor

This PR reduces overhead in the hot RawTable probe loops. The main improvement shows up in find_inner().

It replaces the iterator-based walk over the tag match mask with a manual lowest_set_bit / remove_lowest_bit drain, and adds BitMask::normalize_for_iteration() so that the drain stays correct on backends where one logical match spans multiple raw bits.
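As a standalone illustration of the drain pattern (a toy u16 mask in the SSE2 pmovmskb style; the names mirror the BitMask API but this is not hashbrown's actual code):

```rust
// Hypothetical sketch: a 16-bit match mask where each set bit marks a
// candidate slot in the group.
fn lowest_set_bit(mask: u16) -> Option<u32> {
    if mask == 0 {
        None
    } else {
        Some(mask.trailing_zeros())
    }
}

fn remove_lowest_bit(mask: u16) -> u16 {
    // Classic bit trick: clears exactly the lowest set bit.
    mask & (mask - 1)
}

// Manual drain: peel off one candidate index per iteration instead of
// going through the Iterator machinery.
fn drain_positions(mut mask: u16) -> Vec<u32> {
    let mut out = Vec::new();
    while let Some(bit) = lowest_set_bit(mask) {
        out.push(bit);
        mask = remove_lowest_bit(mask);
    }
    out
}

fn main() {
    // Bits 0, 5, and 7 are set.
    assert_eq!(drain_positions(0b1010_0001), vec![0, 5, 7]);
    println!("{:?}", drain_positions(0b1010_0001));
}
```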

I tried applying the same optimisations to find_or_find_insert_index_inner(), but they made no measurable difference there, so I've kept its original for loop. In contrast, hoisting in find_or_find_insert_index_inner gives a measurable 4% improvement, so it's preserved there but removed from find_inner.

Hashing, probing behavior and observable side effects are unchanged.

I think the improvement comes from better codegen around the hot dyn FnMut predicate call: less probe state stays live across the candidate walk, and the loop operates on cached locals instead of repeatedly reloading from self.
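A minimal sketch of the hoisting idea (a toy linear-probe table with invented names; hashbrown's real find_inner is group-based and more involved):

```rust
// Toy table: `bucket_mask` is capacity - 1 for a power-of-two table.
struct Table {
    bucket_mask: usize,
    slots: Vec<u64>,
}

impl Table {
    // The hot loop reads cached locals (`bucket_mask`, `slots`) instead
    // of reloading the fields through `self` around the `dyn FnMut`
    // call, which the compiler cannot always prove leaves them intact.
    fn find(&self, eq: &mut dyn FnMut(u64) -> bool) -> Option<usize> {
        let bucket_mask = self.bucket_mask; // hoisted once
        let slots = self.slots.as_slice();  // hoisted once
        for probe in 0..=bucket_mask {
            let i = probe & bucket_mask;
            if eq(slots[i]) {
                return Some(i);
            }
        }
        None
    }
}

fn main() {
    let t = Table {
        bucket_mask: 7,
        slots: vec![9, 4, 7, 1, 0, 3, 8, 2],
    };
    let mut is_three = |v: u64| v == 3;
    assert_eq!(t.find(&mut is_three), Some(5));
    println!("found at {:?}", t.find(&mut is_three));
}
```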

Benches

Summary:

  • successful lookups improve by roughly 18–25%
  • failed lookups improve by roughly 3–13% in most cases
  • non-lookup benches are mostly flat
case                             master_ns      patch_ns   delta_pct
clone_from_large                   2939.25       2949.47        0.35
clone_from_small                     19.67         19.46       -1.07
clone_large                        2967.42       2960.48       -0.23
clone_small                          24.17         24.40        0.92
grow_insert_highbits              18896.80      19079.76        0.97
grow_insert_random                18743.52      18742.87       -0.00
grow_insert_serial                16797.08      16847.38        0.30
insert_erase_highbits             12540.49      12447.08       -0.74
insert_erase_random               10689.10      10461.05       -2.13
insert_erase_serial               10460.40      10209.30       -2.40
insert_highbits                   11443.34      11370.29       -0.64
insert_random                     10290.13      10322.59        0.32
insert_serial                     10215.50      10157.46       -0.57
iter_highbits                       656.22        658.88        0.41
iter_random                         655.23        659.15        0.60
iter_serial                         656.92        655.08       -0.28
loadfactor_lookup_14500            1251.21       1015.28      -18.86
loadfactor_lookup_16500            1240.77       1017.01      -18.03
loadfactor_lookup_18500            1256.04       1023.11      -18.54
loadfactor_lookup_20500            1246.81       1019.73      -18.21
loadfactor_lookup_22500            1244.30       1018.81      -18.12
loadfactor_lookup_24500            1249.99       1013.97      -18.88
loadfactor_lookup_26500            1253.29       1016.19      -18.92
loadfactor_lookup_28500            1242.99       1023.86      -17.63
loadfactor_lookup_fail_14500       1142.94       1029.81       -9.90
loadfactor_lookup_fail_16500       1205.04       1098.98       -8.80
loadfactor_lookup_fail_18500       1295.55       1209.64       -6.63
loadfactor_lookup_fail_20500       1460.41       1381.45       -5.41
loadfactor_lookup_fail_22500       1779.98       1721.94       -3.26
loadfactor_lookup_fail_24500       2360.56       2318.27       -1.79
loadfactor_lookup_fail_26500       3495.25       3488.16       -0.20
loadfactor_lookup_fail_28500       5386.85       5485.60        1.83
lookup_fail_highbits               2157.43       1881.92      -12.77
lookup_fail_random                 2284.68       2075.13       -9.17
lookup_fail_serial                 2107.32       1865.77      -11.46
lookup_highbits                    2081.42       1558.49      -25.12
lookup_random                      2259.22       1748.12      -22.62
lookup_serial                      1967.77       1498.16      -23.87
rehash_in_place                  142606.74     141617.22       -0.69

Tests

Added regression tests covering:

  • lookup through a tombstone collision cluster
  • duplicate insert preferring an existing entry over a tombstone
  • predicate call count for a single matching bucket
  • lookup continuing past a full, non-matching group
  • bitmask normalization / manual drain yielding the expected positions

I also saw a strange regression in grow_insert_foldhash_highbits when running the crate's nightly #[bench] tests, but a separate harness using Criterion and valgrind did not reproduce it (valgrind gave the same cycle-by-cycle numbers). I can push it as a separate repo if needed; it's basically a copy-paste of the benches ported to Criterion.

asm diff for find_inner (pretty hard to compare them cleanly because of inlining):
@@ -1,18 +1,18 @@
-hashbrown_path_compare::master_backend::lookup_serial:
+hashbrown_path_compare::clean_backend::lookup_serial:
 	push rbp
 	push r15
 	push r14
 	push r13
 	push r12
 	push rbx
-	sub rsp, 72
-	mov qword ptr [rsp], rdi
+	sub rsp, 56
+	mov rbx, rdi
 	call qword ptr [rip + foldhash::seed::gen_per_hasher_seed@GOTPCREL]
 	mov r13, qword ptr [rip + foldhash::seed::global::GLOBAL_SEED_STORAGE@GOTPCREL]
 	movzx ecx, byte ptr [r13 + 48]
 	cmp cl, 2
-	jne .LBB160_1
-.LBB160_2:
+	jne .LBB130_1
+.LBB130_2:
 	mov qword ptr [rsp + 48], rax
 	movups xmm0, xmmword ptr [rip + .Lanon.766d3937d4fd495137a5bc8d3881f52a.31]
 	movaps xmmword ptr [rsp + 16], xmm0
@@ -20,169 +20,159 @@
 	movdqa xmmword ptr [rsp + 32], xmm0
 	xor r14d, r14d
 	lea r15, [rsp + 16]
-	mov rbx, qword ptr [rip + hashbrown_path_compare::SIDE_EFFECT@GOTPCREL]
-	jmp .LBB160_3
-.LBB160_6:
+	mov r12, qword ptr [rip + hashbrown_path_compare::SIDE_EFFECT@GOTPCREL]
+	jmp .LBB130_3
+.LBB130_6:
 	inc r14
 	cmp r14, 1000
-	je .LBB160_7
-.LBB160_3:
+	je .LBB130_7
+.LBB130_3:
 	mov rdi, r15
 	mov rsi, r14
 	mov rdx, r14
-	call <hashbrown_master_compare::map::HashMap<usize, hashbrown_path_compare::DropType>>::insert
+	call <hashbrown_clean_compare::map::HashMap<usize, hashbrown_path_compare::DropType>>::insert
 	test rax, rax
-	je .LBB160_6
-	lock add qword ptr [rbx], rdx
-	jmp .LBB160_6
-.LBB160_7:
-	cmp qword ptr [rsp], 0
-	je .LBB160_14
+	je .LBB130_6
+	lock add qword ptr [r12], rdx
+	jmp .LBB130_6
+.LBB130_7:
+	test rbx, rbx
+	je .LBB130_14
 	cmp qword ptr [rsp + 40], 0
-	je .LBB160_10
+	je .LBB130_10
 	mov rcx, qword ptr [rsp + 48]
-	mov r10, qword ptr [rsp + 16]
+	mov rsi, qword ptr [rsp + 16]
 	mov rdi, qword ptr [rsp + 24]
 	xor r8d, r8d
 	pcmpeqd xmm0, xmm0
 	lea r9, [rsp + 8]
+.LBB130_24:
+	mov r10d, 1000
+	xor r11d, r11d
+.LBB130_25:
+	mov rax, rcx
+	xor rax, r11
+	mul qword ptr [r13]
+	xor rdx, rax
+	mov rax, rdx
+	shr rax, 57
+	movd xmm1, eax
+	punpcklbw xmm1, xmm1
+	pshuflw xmm1, xmm1, 0
+	pshufd xmm1, xmm1, 68
 	xor eax, eax
-	jmp .LBB160_24
-.LBB160_23:
-	mov rax, qword ptr [rsp + 64]
-	inc rax
-	cmp rax, qword ptr [rsp]
-	je .LBB160_14
-.LBB160_24:
-	mov qword ptr [rsp + 64], rax
-	mov r11d, 1000
-	xor r14d, r14d
-	jmp .LBB160_25
-.LBB160_26:
+.LBB130_26:
 	and rdx, rdi
-	movdqu xmm2, xmmword ptr [r10 + rdx]
+	movdqu xmm2, xmmword ptr [rsi + rdx]
 	movdqa xmm3, xmm2
 	pcmpeqb xmm3, xmm1
-	pmovmskb r12d, xmm3
-.LBB160_27:
-	mov r15d, r12d
-	test r12w, r12w
-	je .LBB160_31
-	lea r12d, [r15 - 1]
-	rep bsf ebp, r15d
-	and r12d, r15d
-	add rbp, rdx
-	and rbp, rdi
-	mov rbx, rbp
-	shl rbx, 4
-	mov rsi, r10
-	sub rsi, rbx
-	cmp r14, qword ptr [rsi - 16]
-	je .LBB160_29
-	jmp .LBB160_27
-.LBB160_31:
+	pmovmskb r14d, xmm3
+	test r14d, r14d
+	jne .LBB130_31
+.LBB130_27:
 	pcmpeqb xmm2, xmm0
-	pmovmskb esi, xmm2
-	test esi, esi
-	jne .LBB160_29
+	pmovmskb ebp, xmm2
+	test bp, bp
+	jne .LBB130_28
 	add rdx, rax
 	add rdx, 16
 	add rax, 16
-	jmp .LBB160_26
-.LBB160_29:
+	jmp .LBB130_26
+.LBB130_31:
+	rep bsf ebp, r14d
+	add rbp, rdx
+	and rbp, rdi
 	shl rbp, 4
-	mov rax, r10
-	sub rax, rbp
-	test r15w, r15w
-	cmove rax, r8
-	add rax, -8
-	test r15w, r15w
-	cmove rax, r8
-	inc r14
+	mov r15, rsi
+	sub r15, rbp
+	cmp r11, qword ptr [r15 - 16]
+	je .LBB130_29
+	lea ebp, [r14 - 1]
+	and bp, r14w
+	mov r14d, ebp
+	jne .LBB130_31
+	jmp .LBB130_27
+.LBB130_28:
+	xor r15d, r15d
+.LBB130_29:
+	lea rax, [r15 - 8]
+	test r15, r15
+	cmove rax, r15
+	inc r11
 	mov qword ptr [rsp + 8], rax
 	#APP
 	#NO_APP
-	dec r11
-	je .LBB160_23
-.LBB160_25:
-	mov rax, rcx
-	xor rax, r14
-	mul qword ptr [r13]
-	xor rdx, rax
-	mov rax, rdx
-	shr rax, 57
-	movd xmm1, eax
-	punpcklbw xmm1, xmm1
-	pshuflw xmm1, xmm1, 0
-	pshufd xmm1, xmm1, 68
-	xor eax, eax
-	jmp .LBB160_26
-.LBB160_10:
+	dec r10
+	jne .LBB130_25
+	inc r8
+	cmp r8, rbx
+	jne .LBB130_24
+	jmp .LBB130_14
+.LBB130_10:
 	xor eax, eax
 	lea rcx, [rsp + 8]
-.LBB160_11:
+.LBB130_11:
 	mov rdx, -1000
-.LBB160_12:
+.LBB130_12:
 	mov qword ptr [rsp + 8], 0
 	#APP
 	#NO_APP
 	inc rdx
-	jne .LBB160_12
+	jne .LBB130_12
 	inc rax
-	cmp rax, qword ptr [rsp]
-	jne .LBB160_11
-.LBB160_14:
-	mov r10, qword ptr [rip + hashbrown_path_compare::SIDE_EFFECT@GOTPCREL]
-	mov rbx, qword ptr [r10]
+	cmp rax, rbx
+	jne .LBB130_11
+.LBB130_14:
+	mov rbx, qword ptr [r12]
 	mov rsi, qword ptr [rsp + 24]
 	test rsi, rsi
-	je .LBB160_22
+	je .LBB130_22
 	mov rax, qword ptr [rsp + 40]
 	test rax, rax
-	je .LBB160_20
+	je .LBB130_20
 	mov rcx, qword ptr [rsp + 16]
 	movdqa xmm0, xmmword ptr [rcx]
 	lea rdx, [rcx + 16]
 	pmovmskb edi, xmm0
 	not edi
-	jmp .LBB160_17
-.LBB160_19:
+	jmp .LBB130_17
+.LBB130_19:
 	rep bsf r8d, edi
 	shl r8d, 4
 	mov r9, rcx
 	sub r9, r8
 	mov r8, qword ptr [r9 - 8]
-	lock add qword ptr [r10], r8
+	lock add qword ptr [r12], r8
 	lea r8d, [rdi - 1]
 	and r8d, edi
 	mov edi, r8d
 	dec rax
-	je .LBB160_20
-.LBB160_17:
+	je .LBB130_20
+.LBB130_17:
 	test di, di
-	jne .LBB160_19
-.LBB160_18:
+	jne .LBB130_19
+.LBB130_18:
 	movdqa xmm0, xmmword ptr [rdx]
 	add rcx, -256
 	add rdx, 16
 	pmovmskb edi, xmm0
 	xor edi, 65535
-	je .LBB160_18
-	jmp .LBB160_19
-.LBB160_20:
+	je .LBB130_18
+	jmp .LBB130_19
+.LBB130_20:
 	mov rax, rsi
 	shl rax, 4
 	add rsi, rax
 	add rsi, 33
-	je .LBB160_22
+	je .LBB130_22
 	mov rdi, qword ptr [rsp + 16]
 	sub rdi, rax
 	add rdi, -16
 	mov edx, 16
 	call qword ptr [rip + __rustc::__rust_dealloc@GOTPCREL]
-.LBB160_22:
+.LBB130_22:
 	mov rax, rbx
-	add rsp, 72
+	add rsp, 56
 	pop rbx
 	pop r12
 	pop r13
@@ -190,11 +180,11 @@
 	pop r15
 	pop rbp
 	ret
-.LBB160_1:
+.LBB130_1:
 	mov r14, rax
 	call qword ptr [rip + <foldhash::seed::global::GlobalSeed>::init_slow@GOTPCREL]
 	mov rax, r14
-	jmp .LBB160_2
+	jmp .LBB130_2
 	mov rbx, rax
 	lea rdi, [rsp + 16]
 	call core::ptr::drop_in_place::<hashbrown_clean_compare::map::HashMap<usize, hashbrown_path_compare::DropType>>

It looks like the loop becomes tighter thanks to the manual iteration, and more state lives in registers.

@0xdeafbeef 0xdeafbeef changed the title perf: hoist invariant state in RawTable probe loops perf(raw_table): optimize find_inner with manual bitmask drain Mar 25, 2026
@clarfonthey
Contributor

Actually just resolving my comments here because I didn't read the issue correctly.

I guess that it's unfortunate that you have to add in extra methods that must be called correctly, but if it's a performance win, that's okay.

@0xdeafbeef
Contributor Author

Actually just resolving my comments here because I didn't read the issue correctly.

I guess that it's unfortunate that you have to add in extra methods that must be called correctly, but if it's a performance win, that's okay.

Yep, it's unfortunate that the for loop gives a huge penalty.

  • lookup_serial: +12.46%
  • lookup_highbits: +13.50%
  • lookup_random: +11.49%
  • loadfactor_lookup_14500: +7.49%
  • loadfactor_lookup_16500: +7.36%
  • loadfactor_lookup_20500: +7.50%
  • loadfactor_lookup_26500: +7.83%

There is still a win when using it, but it's not as big.

It would be nice if someone who knows how LLVM works could take a look. My guess is that the while let form is easier for the optimizer.

I can share a cargo asm diff for this case.
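For reference, the two loop shapes under discussion, on a toy u16 mask (standalone names, not hashbrown's code; both compute the same result, they just lower differently):

```rust
// Iterator wrapper over set-bit positions, consumed via `for`.
struct BitIter(u16);

impl Iterator for BitIter {
    type Item = u32;
    fn next(&mut self) -> Option<u32> {
        if self.0 == 0 {
            return None;
        }
        let bit = self.0.trailing_zeros();
        self.0 &= self.0 - 1; // clear the lowest set bit
        Some(bit)
    }
}

// The original iterator-based walk.
fn first_match_for(mask: u16, pred: impl Fn(u32) -> bool) -> Option<u32> {
    for bit in BitIter(mask) {
        if pred(bit) {
            return Some(bit);
        }
    }
    None
}

// The manual drain the PR switched to.
fn first_match_manual(mut mask: u16, pred: impl Fn(u32) -> bool) -> Option<u32> {
    while mask != 0 {
        let bit = mask.trailing_zeros();
        if pred(bit) {
            return Some(bit);
        }
        mask &= mask - 1;
    }
    None
}

fn main() {
    let mask = 0b0100_1010u16; // bits 1, 3, 6 set
    assert_eq!(first_match_for(mask, |b| b > 2), Some(3));
    assert_eq!(first_match_manual(mask, |b| b > 2), Some(3));
    println!("both shapes agree");
}
```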

Change `RawTable::find_inner()` to walk `match_tag()` results with
`lowest_set_bit()` / `remove_lowest_bit()` instead of the iterator path,
and compute `has_empty` before checking matching buckets.

Add `BitMask::normalize_for_iteration()` so manual draining stays
correct on backends where one logical match may occupy multiple raw
bits.
@Amanieu
Copy link
Copy Markdown
Member

Amanieu commented Mar 26, 2026

Could you also include an LLVM IR diff? It might be easier to read than an asm diff.

@0xdeafbeef
Copy link
Copy Markdown
Contributor Author

Could you also include an LLVM IR diff? It might be easier to read than an asm diff.

llvm-diff, or just a plain diff of the LLVM IR?

@0xdeafbeef
Copy link
Copy Markdown
Contributor Author

I compared

pub fn lookup_serial(iterations: usize) {
    let mut m: FoldHashMap<usize, DropType> = FoldHashMap::default();
    for i in (0..).take(SIZE) {
        m.insert(i, DropType(i));
    }

    for _ in 0..iterations {
        for i in (0..).take(SIZE) {
            black_box(m.get(&i));
        }
    }
}

where the hashmap comes either from master or from my fork as a path dependency.

obtained with

cargo rustc --lib --release -- --emit=llvm-ir

followed by

llvm-extract -S --recursive \
       --func=_RNvNtCs7m6UUWt6V5j_22hashbrown_path_compare13clean_backend13lookup_serial

then llvm-diff:

in function FUNC_reserve_rehash:
  in block %bb15.i:
    >   tail call void @_RNvNtCsc7xFPsoYjqI_4core9panicking9panic_fmt(ptr noundef nonnull @alloc_70fa19237e466d7d1e23911705586cf6, ptr noundef nonnull inttoptr (i64 57 to ptr), ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_1817aa4d0833b30e03989513e9b3c01a) #17, !noalias !121
    <   tail call void @_RNvNtCsc7xFPsoYjqI_4core9panicking9panic_fmt(ptr noundef nonnull @alloc_70fa19237e466d7d1e23911705586cf6, ptr noundef nonnull inttoptr (i64 57 to ptr), ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_684637c4c4086c221313203cd7ec9cff) #17, !noalias !117

in function FUNC_fallible_with_capacity:
  in block %bb11:
    >   tail call void @_RNvNtCsc7xFPsoYjqI_4core9panicking9panic_fmt(ptr noundef nonnull @alloc_70fa19237e466d7d1e23911705586cf6, ptr noundef nonnull inttoptr (i64 57 to ptr), ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_1817aa4d0833b30e03989513e9b3c01a) #17
    <   tail call void @_RNvNtCsc7xFPsoYjqI_4core9panicking9panic_fmt(ptr noundef nonnull @alloc_70fa19237e466d7d1e23911705586cf6, ptr noundef nonnull inttoptr (i64 57 to ptr), ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_684637c4c4086c221313203cd7ec9cff) #17
  in block %bb13.i:
    >   tail call void @_RNvNtCsc7xFPsoYjqI_4core9panicking9panic_fmt(ptr noundef nonnull @alloc_70fa19237e466d7d1e23911705586cf6, ptr noundef nonnull inttoptr (i64 57 to ptr), ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_1817aa4d0833b30e03989513e9b3c01a) #17, !noalias !5
    <   tail call void @_RNvNtCsc7xFPsoYjqI_4core9panicking9panic_fmt(ptr noundef nonnull @alloc_70fa19237e466d7d1e23911705586cf6, ptr noundef nonnull inttoptr (i64 57 to ptr), ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_684637c4c4086c221313203cd7ec9cff) #17, !noalias !5

in function FUNC_lookup_serial:
  in block %bb20:
    >   %iter4.sroa.0.021 = phi i64 [ 0, %bb17 ], [ %_28, %_RINvMs6_NtCsbTX5VcptrMo_23hashbrown_clean_compare3rawINtB6_8RawTableTjNtCs7m6UUWt6V5j_22hashbrown_path_compare8DropTypeEE4findNCINvNtB8_3map14equivalent_keyjjB16_E0EB18_.exit.i ]
    >   %iter2.sroa.0.020 = phi i64 [ 1000, %bb17 ], [ %49, %_RINvMs6_NtCsbTX5VcptrMo_23hashbrown_clean_compare3rawINtB6_8RawTableTjNtCs7m6UUWt6V5j_22hashbrown_path_compare8DropTypeEE4findNCINvNtB8_3map14equivalent_keyjjB16_E0EB18_.exit.i ]
    <   %iter4.sroa.0.019 = phi i64 [ 0, %bb17 ], [ %_28, %_RINvMs6_NtCs8UPypCBVm3R_24hashbrown_master_compare3rawINtB6_8RawTableTjNtCs7m6UUWt6V5j_22hashbrown_path_compare8DropTypeEE4findNCINvNtB8_3map14equivalent_keyjjB17_E0EB19_.exit.i ]
    <   %iter2.sroa.0.018 = phi i64 [ 1000, %bb17 ], [ %49, %_RINvMs6_NtCs8UPypCBVm3R_24hashbrown_master_compare3rawINtB6_8RawTableTjNtCs7m6UUWt6V5j_22hashbrown_path_compare8DropTypeEE4findNCINvNtB8_3map14equivalent_keyjjB17_E0EB19_.exit.i ]
  in block %bb1.i.i.i:
    >   %hash.pn.i.i.i = phi i64 [ %34, %bb20 ], [ %47, %bb24.i.i.i ]
    >   %probe_seq1.sroa.0.0.i.i.i = phi i64 [ 0, %bb20 ], [ %46, %bb24.i.i.i ]
    >   %probe_seq.sroa.0.0.i.i.i = and i64 %hash.pn.i.i.i, %bucket_mask.i.i.i
    >   %_20.i.i.i = getelementptr inbounds nuw i8, ptr %_22.i.i.i, i64 %probe_seq.sroa.0.0.i.i.i
    >   %dst.sroa.0.0.copyload.i17.i.i = load <16 x i8>, ptr %_20.i.i.i, align 1, !noalias !50
    >   %37 = icmp eq <16 x i8> %dst.sroa.0.0.copyload.i17.i.i, splat (i8 -1)
    >   %38 = bitcast <16 x i1> %37 to i16
    >   %has_empty.not.i.i.i = icmp eq i16 %38, 0
    >   %39 = icmp eq <16 x i8> %dst.sroa.0.0.copyload.i17.i.i, %36
    >   %40 = bitcast <16 x i1> %39 to i16
    >   %41 = icmp eq i16 %40, 0
    >   br i1 %41, label %bb4.i.i.i, label %bb16.i.i.i, !prof !6
    <   %probe_seq1.sroa.0.0.i.i.i = phi i64 [ 0, %bb20 ], [ %46, %bb20.i.i.i ]
    <   %hash.pn.i.i.i = phi i64 [ %34, %bb20 ], [ %47, %bb20.i.i.i ]
    <   %probe_seq.sroa.0.0.i.i.i = and i64 %hash.pn.i.i.i, %_16.i.i.i
    <   %_17.i.i.i = getelementptr inbounds nuw i8, ptr %_19.i.i.i, i64 %probe_seq.sroa.0.0.i.i.i
    <   %dst.sroa.0.0.copyload.i18.i.i = load <16 x i8>, ptr %_17.i.i.i, align 1, !noalias !48
    <   %37 = icmp eq <16 x i8> %dst.sroa.0.0.copyload.i18.i.i, %36
    <   %38 = bitcast <16 x i1> %37 to i16
    <   br label %bb2.i.i.i
  in block %_RINvMs6_NtCs8UPypCBVm3R_24hashbrown_master_compare3rawINtB6_8RawTableTjNtCs7m6UUWt6V5j_22hashbrown_path_compare8DropTypeEE4findNCINvNtB8_3map14equivalent_keyjjB17_E0EB19_.exit.i / %_RINvMs6_NtCsbTX5VcptrMo_23hashbrown_clean_compare3rawINtB6_8RawTableTjNtCs7m6UUWt6V5j_22hashbrown_path_compare8DropTypeEE4findNCINvNtB8_3map14equivalent_keyjjB16_E0EB18_.exit.i:
    >   %48 = phi ptr [ %43, %bb16.i.i.i ], [ null, %bb4.i.i.i ]
    >   %.not.i = icmp eq ptr %48, null
    >   %v.i = getelementptr inbounds i8, ptr %48, i64 -8
    >   %_0.sroa.0.1.i = select i1 %.not.i, ptr null, ptr %v.i
    <   %_0.sroa.3.0.i.i.i = phi i64 [ %index.i.i.i, %bb10.i.i.i ], [ undef, %bb11.i.i.i ]
    <   %_18.i.i = sub nsw i64 0, %_0.sroa.3.0.i.i.i
    <   %48 = getelementptr inbounds { i64, i64 }, ptr %_19.i.i.i, i64 %_18.i.i
    <   %_0.sroa.0.0.i.i = select i1 %.not.i.not.i.i, ptr null, ptr %48
    <   %v.i = getelementptr inbounds i8, ptr %_0.sroa.0.0.i.i, i64 -8
    <   %_0.sroa.0.1.i = select i1 %.not.i.not.i.i, ptr null, ptr %v.i
        %_28 = add nuw nsw i64 %iter4.sroa.0.019, 1
        %49 = add nsw i64 %iter2.sroa.0.018, -1
        call void @llvm.lifetime.start.p0(ptr nonnull %0)
    >   store ptr %_0.sroa.0.1.i, ptr %0, align 8
    <   store ptr %_0.sroa.0.1.i, ptr %0, align 8

@0xdeafbeef
Contributor Author

OK, I've started cleaning up the Criterion bench suite and found that, unfortunately, this change only improves numbers with the nightly feature.

So my guess is that the nightly feature makes hashbrown use the likely and unlikely intrinsics instead of polyfills, and with my changes that results in tighter codegen.

Without the nightly feature it also gives up to 20%, but in the opposite direction :)
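A sketch of the nightly-vs-stable split being described (hedged: this mirrors the common likely/unlikely pattern rather than hashbrown's exact source; on stable, the hint degrades to an identity function, so no branch-weight information reaches the optimizer):

```rust
// With a `nightly` cargo feature, the real intrinsic feeds branch
// weights to LLVM; without it, the polyfill is a behavior-preserving
// no-op. (`nightly` here is an assumed feature name for illustration.)
#[cfg(feature = "nightly")]
#[inline]
fn likely(b: bool) -> bool {
    core::intrinsics::likely(b)
}

#[cfg(not(feature = "nightly"))]
#[inline(always)]
fn likely(b: bool) -> bool {
    b // stable polyfill: no hint reaches the optimizer
}

fn main() {
    // Behavior is identical either way; only codegen hints differ.
    assert!(likely(1 + 1 == 2));
    assert!(!likely(false));
    println!("likely behaves as identity");
}
```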

Thanks, and sorry for the noise!
I'll share the suite a bit later.

case base (ns) work (ns) delta
clone_from_large 2397.85 2395.22 -0.11%
clone_from_small 15.95 15.99 +0.25%
clone_large 2419.14 2420.28 +0.05%
clone_small 19.66 19.76 +0.53%
grow_insert_foldhash_highbits 11997.57 11892.50 -0.88%
grow_insert_foldhash_random 14974.83 15035.60 +0.41%
grow_insert_foldhash_serial 13897.03 13972.21 +0.54%
grow_insert_std_highbits 29832.64 29013.84 -2.74%
grow_insert_std_random 29790.52 29205.39 -1.96%
grow_insert_std_serial 29704.48 29138.58 -1.91%
insert_erase_foldhash_highbits 8325.18 8297.00 -0.34%
insert_erase_foldhash_random 8632.92 8628.69 -0.05%
insert_erase_foldhash_serial 8546.35 8612.16 +0.77%
insert_erase_std_highbits 20593.51 20644.33 +0.25%
insert_erase_std_random 20583.46 20719.48 +0.66%
insert_erase_std_serial 20450.24 20455.20 +0.02%
insert_foldhash_highbits 8037.78 8063.86 +0.32%
insert_foldhash_random 8105.20 8093.70 -0.14%
insert_foldhash_serial 7687.46 7692.17 +0.06%
insert_std_highbits 14092.63 14146.07 +0.38%
insert_std_random 14549.32 14482.59 -0.46%
insert_std_serial 14209.41 14244.03 +0.24%
iter_foldhash_highbits 544.72 545.86 +0.21%
iter_foldhash_random 546.06 547.41 +0.25%
iter_foldhash_serial 545.67 547.64 +0.36%
iter_std_highbits 543.59 545.82 +0.41%
iter_std_random 544.30 544.78 +0.09%
iter_std_serial 543.60 544.79 +0.22%
loadfactor_lookup_14500 937.18 1037.88 +10.75%
loadfactor_lookup_16500 936.18 1038.76 +10.96%
loadfactor_lookup_18500 937.83 1035.83 +10.45%
loadfactor_lookup_20500 934.93 1040.30 +11.27%
loadfactor_lookup_22500 936.50 1036.40 +10.67%
loadfactor_lookup_24500 939.08 1041.53 +10.91%
loadfactor_lookup_26500 939.31 1042.90 +11.03%
loadfactor_lookup_28500 939.62 1041.36 +10.83%
loadfactor_lookup_fail_14500 844.12 834.90 -1.09%
loadfactor_lookup_fail_16500 892.25 887.16 -0.57%
loadfactor_lookup_fail_18500 973.96 960.65 -1.37%
loadfactor_lookup_fail_20500 1102.00 1095.21 -0.62%
loadfactor_lookup_fail_22500 1347.93 1369.53 +1.60%
loadfactor_lookup_fail_24500 1827.08 1873.22 +2.53%
loadfactor_lookup_fail_26500 2758.85 2833.50 +2.71%
loadfactor_lookup_fail_28500 4443.26 4417.72 -0.57%
lookup_fail_foldhash_highbits 1632.38 1541.58 -5.56%
lookup_fail_foldhash_random 1714.53 1707.67 -0.40%
lookup_fail_foldhash_serial 1577.81 1504.75 -4.63%
lookup_fail_std_highbits 8655.76 8339.67 -3.65%
lookup_fail_std_random 8520.04 8609.68 +1.05%
lookup_fail_std_serial 8623.13 8313.63 -3.59%
lookup_foldhash_highbits 1500.92 1657.26 +10.42%
lookup_foldhash_random 1627.71 1825.31 +12.14%
lookup_foldhash_serial 1451.98 1735.49 +19.53%
lookup_std_highbits 8560.06 8954.85 +4.61%
lookup_std_random 8777.84 7771.27 -11.47%
lookup_std_serial 8400.87 7388.00 -12.06%
rehash_in_place 118804.85 119256.62 +0.38%

@0xdeafbeef 0xdeafbeef closed this Mar 28, 2026
@0xdeafbeef
Contributor Author

https://github.com/0xdeafbeef/hashbrown-benches

@clarfonthey
Contributor

Would love to merge those benchmark updates to the repo if you're willing to adapt them for that.

