perf(raw_table): optimize find_inner with manual bitmask drain #707
0xdeafbeef wants to merge 1 commit into rust-lang:master
Conversation
Actually just resolving my comments here because I didn't read the issue correctly. I guess it's unfortunate that you have to add extra methods that must be called correctly, but if it's a performance win, that's okay.
Yep, it's unfortunate that the for loop gives a huge penalty. There is still a win when using it, but it's not as big. It would be nice if someone who knows how LLVM works could take a look. My guess is that I can share
Change `RawTable::find_inner()` to walk `match_tag()` results with `lowest_set_bit()` / `remove_lowest_bit()` instead of the iterator path, and compute `has_empty` before checking matching buckets. Add `BitMask::normalize_for_iteration()` so manual draining stays correct on backends where one logical match may occupy multiple raw bits.
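The manual drain described above can be sketched as follows. Note that `BitMask`, `lowest_set_bit`, and `remove_lowest_bit` here are simplified stand-ins for hashbrown's internals (which operate on SIMD-derived masks), not the real types:

```rust
// Simplified stand-in for hashbrown's BitMask, assuming one raw bit per slot.
#[derive(Copy, Clone)]
struct BitMask(u16);

impl BitMask {
    /// Index of the lowest set bit, or None if the mask is empty.
    fn lowest_set_bit(self) -> Option<usize> {
        if self.0 == 0 {
            None
        } else {
            Some(self.0.trailing_zeros() as usize)
        }
    }

    /// Clear the lowest set bit using the classic `x & (x - 1)` idiom.
    fn remove_lowest_bit(self) -> BitMask {
        BitMask(self.0 & self.0.wrapping_sub(1))
    }
}

/// Collect candidate slot indices in a group, lowest bit first — the same
/// walk order the iterator path produces, but as an explicit loop.
fn drain_matches(mut mask: BitMask) -> Vec<usize> {
    let mut out = Vec::new();
    while let Some(bit) = mask.lowest_set_bit() {
        out.push(bit);
        mask = mask.remove_lowest_bit();
    }
    out
}

fn main() {
    // Bits 1, 4, and 9 set: slots 1, 4, and 9 matched the tag.
    assert_eq!(drain_matches(BitMask(0b10_0001_0010)), vec![1, 4, 9]);
}
```

The explicit loop gives the optimizer a simpler control-flow shape than the generic iterator adaptor, which is plausibly where the codegen win comes from.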
Could you also include an LLVM IR diff? It might be easier to read than an asm diff.
I compared the following benchmark:

```rust
pub fn lookup_serial(iterations: usize) {
    let mut m: FoldHashMap<usize, DropType> = FoldHashMap::default();
    for i in 0..SIZE {
        m.insert(i, DropType(i));
    }
    for _ in 0..iterations {
        for i in 0..SIZE {
            black_box(m.get(&i));
        }
    }
}
```
where the hashmap comes from either master or my fork as a path dependency. The diff was obtained with llvm-diff.

in function FUNC_reserve_rehash:
in block %bb15.i:
> tail call void @_RNvNtCsc7xFPsoYjqI_4core9panicking9panic_fmt(ptr noundef nonnull @alloc_70fa19237e466d7d1e23911705586cf6, ptr noundef nonnull inttoptr (i64 57 to ptr), ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_1817aa4d0833b30e03989513e9b3c01a) #17, !noalias !121
< tail call void @_RNvNtCsc7xFPsoYjqI_4core9panicking9panic_fmt(ptr noundef nonnull @alloc_70fa19237e466d7d1e23911705586cf6, ptr noundef nonnull inttoptr (i64 57 to ptr), ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_684637c4c4086c221313203cd7ec9cff) #17, !noalias !117
in function FUNC_fallible_with_capacity:
in block %bb11:
> tail call void @_RNvNtCsc7xFPsoYjqI_4core9panicking9panic_fmt(ptr noundef nonnull @alloc_70fa19237e466d7d1e23911705586cf6, ptr noundef nonnull inttoptr (i64 57 to ptr), ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_1817aa4d0833b30e03989513e9b3c01a) #17
< tail call void @_RNvNtCsc7xFPsoYjqI_4core9panicking9panic_fmt(ptr noundef nonnull @alloc_70fa19237e466d7d1e23911705586cf6, ptr noundef nonnull inttoptr (i64 57 to ptr), ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_684637c4c4086c221313203cd7ec9cff) #17
in block %bb13.i:
> tail call void @_RNvNtCsc7xFPsoYjqI_4core9panicking9panic_fmt(ptr noundef nonnull @alloc_70fa19237e466d7d1e23911705586cf6, ptr noundef nonnull inttoptr (i64 57 to ptr), ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_1817aa4d0833b30e03989513e9b3c01a) #17, !noalias !5
< tail call void @_RNvNtCsc7xFPsoYjqI_4core9panicking9panic_fmt(ptr noundef nonnull @alloc_70fa19237e466d7d1e23911705586cf6, ptr noundef nonnull inttoptr (i64 57 to ptr), ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_684637c4c4086c221313203cd7ec9cff) #17, !noalias !5
in function FUNC_lookup_serial:
in block %bb20:
> %iter4.sroa.0.021 = phi i64 [ 0, %bb17 ], [ %_28, %_RINvMs6_NtCsbTX5VcptrMo_23hashbrown_clean_compare3rawINtB6_8RawTableTjNtCs7m6UUWt6V5j_22hashbrown_path_compare8DropTypeEE4findNCINvNtB8_3map14equivalent_keyjjB16_E0EB18_.exit.i ]
> %iter2.sroa.0.020 = phi i64 [ 1000, %bb17 ], [ %49, %_RINvMs6_NtCsbTX5VcptrMo_23hashbrown_clean_compare3rawINtB6_8RawTableTjNtCs7m6UUWt6V5j_22hashbrown_path_compare8DropTypeEE4findNCINvNtB8_3map14equivalent_keyjjB16_E0EB18_.exit.i ]
< %iter4.sroa.0.019 = phi i64 [ 0, %bb17 ], [ %_28, %_RINvMs6_NtCs8UPypCBVm3R_24hashbrown_master_compare3rawINtB6_8RawTableTjNtCs7m6UUWt6V5j_22hashbrown_path_compare8DropTypeEE4findNCINvNtB8_3map14equivalent_keyjjB17_E0EB19_.exit.i ]
< %iter2.sroa.0.018 = phi i64 [ 1000, %bb17 ], [ %49, %_RINvMs6_NtCs8UPypCBVm3R_24hashbrown_master_compare3rawINtB6_8RawTableTjNtCs7m6UUWt6V5j_22hashbrown_path_compare8DropTypeEE4findNCINvNtB8_3map14equivalent_keyjjB17_E0EB19_.exit.i ]
in block %bb1.i.i.i:
> %hash.pn.i.i.i = phi i64 [ %34, %bb20 ], [ %47, %bb24.i.i.i ]
> %probe_seq1.sroa.0.0.i.i.i = phi i64 [ 0, %bb20 ], [ %46, %bb24.i.i.i ]
> %probe_seq.sroa.0.0.i.i.i = and i64 %hash.pn.i.i.i, %bucket_mask.i.i.i
> %_20.i.i.i = getelementptr inbounds nuw i8, ptr %_22.i.i.i, i64 %probe_seq.sroa.0.0.i.i.i
> %dst.sroa.0.0.copyload.i17.i.i = load <16 x i8>, ptr %_20.i.i.i, align 1, !noalias !50
> %37 = icmp eq <16 x i8> %dst.sroa.0.0.copyload.i17.i.i, splat (i8 -1)
> %38 = bitcast <16 x i1> %37 to i16
> %has_empty.not.i.i.i = icmp eq i16 %38, 0
> %39 = icmp eq <16 x i8> %dst.sroa.0.0.copyload.i17.i.i, %36
> %40 = bitcast <16 x i1> %39 to i16
> %41 = icmp eq i16 %40, 0
> br i1 %41, label %bb4.i.i.i, label %bb16.i.i.i, !prof !6
< %probe_seq1.sroa.0.0.i.i.i = phi i64 [ 0, %bb20 ], [ %46, %bb20.i.i.i ]
< %hash.pn.i.i.i = phi i64 [ %34, %bb20 ], [ %47, %bb20.i.i.i ]
< %probe_seq.sroa.0.0.i.i.i = and i64 %hash.pn.i.i.i, %_16.i.i.i
< %_17.i.i.i = getelementptr inbounds nuw i8, ptr %_19.i.i.i, i64 %probe_seq.sroa.0.0.i.i.i
< %dst.sroa.0.0.copyload.i18.i.i = load <16 x i8>, ptr %_17.i.i.i, align 1, !noalias !48
< %37 = icmp eq <16 x i8> %dst.sroa.0.0.copyload.i18.i.i, %36
< %38 = bitcast <16 x i1> %37 to i16
< br label %bb2.i.i.i
in block %_RINvMs6_NtCs8UPypCBVm3R_24hashbrown_master_compare3rawINtB6_8RawTableTjNtCs7m6UUWt6V5j_22hashbrown_path_compare8DropTypeEE4findNCINvNtB8_3map14equivalent_keyjjB17_E0EB19_.exit.i / %_RINvMs6_NtCsbTX5VcptrMo_23hashbrown_clean_compare3rawINtB6_8RawTableTjNtCs7m6UUWt6V5j_22hashbrown_path_compare8DropTypeEE4findNCINvNtB8_3map14equivalent_keyjjB16_E0EB18_.exit.i:
> %48 = phi ptr [ %43, %bb16.i.i.i ], [ null, %bb4.i.i.i ]
> %.not.i = icmp eq ptr %48, null
> %v.i = getelementptr inbounds i8, ptr %48, i64 -8
> %_0.sroa.0.1.i = select i1 %.not.i, ptr null, ptr %v.i
< %_0.sroa.3.0.i.i.i = phi i64 [ %index.i.i.i, %bb10.i.i.i ], [ undef, %bb11.i.i.i ]
< %_18.i.i = sub nsw i64 0, %_0.sroa.3.0.i.i.i
< %48 = getelementptr inbounds { i64, i64 }, ptr %_19.i.i.i, i64 %_18.i.i
< %_0.sroa.0.0.i.i = select i1 %.not.i.not.i.i, ptr null, ptr %48
< %v.i = getelementptr inbounds i8, ptr %_0.sroa.0.0.i.i, i64 -8
< %_0.sroa.0.1.i = select i1 %.not.i.not.i.i, ptr null, ptr %v.i
%_28 = add nuw nsw i64 %iter4.sroa.0.019, 1
%49 = add nsw i64 %iter2.sroa.0.018, -1
call void @llvm.lifetime.start.p0(ptr nonnull %0)
> store ptr %_0.sroa.0.1.i, ptr %0, align 8
< store ptr %_0.sroa.0.1.i, ptr %0, align 8
Ok, I've started cleaning up the criterion bench suite and found that unfortunately this change only improves the numbers with the nightly feature enabled. So my guess is that the nightly feature makes hashbrown use the […]. Without the nightly feature it also gives up to 20%, but in the opposite direction :) Thanks, and sorry for the noise!
Would love to merge those benchmark updates into the repo if you're willing to adapt them for that.
This PR reduces overhead in the hot `RawTable` probe loops. The main improvement shows up in `find_inner()`. It replaces the iterator-based walk over the tag match mask with a manual `lowest_set_bit` / `remove_lowest_bit` drain, and adds `BitMask::normalize_for_iteration()` so that the drain stays correct on backends where one logical match spans multiple raw bits.

I've tried to apply the same optimisations to `find_or_find_insert_index_inner()`, but they gave no measurable difference, so I've left the original for loop there. In contrast, hoisting in `find_or_find_insert_index_inner()` gives a measurable 4% difference, so it's preserved there but removed from `find_inner()`. Hashing, probing behavior, and observable side effects are unchanged.
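The role of a normalization step before a manual drain can be illustrated with a minimal sketch. This is a hypothetical stand-in, assuming a byte-replicated backend where each matching slot sets all 8 bits of its byte in the raw mask (the real hashbrown backend constants and types may differ):

```rust
// Raw bits per logical slot — an assumption for this sketch only.
const STRIDE: u32 = 8;

/// Keep one bit per byte so the lowest-bit drain visits each slot exactly once.
fn normalize_for_iteration(raw: u64) -> u64 {
    // Mask with 0x01 repeated in every byte.
    raw & 0x0101_0101_0101_0101
}

/// Drain the normalized mask, mapping raw bit positions back to slot indices.
fn drain_slots(raw: u64) -> Vec<usize> {
    let mut mask = normalize_for_iteration(raw);
    let mut out = Vec::new();
    while mask != 0 {
        out.push((mask.trailing_zeros() / STRIDE) as usize);
        mask &= mask - 1; // clear the lowest set bit
    }
    out
}

fn main() {
    // Slots 0 and 2 matched: bytes 0 and 2 are 0xFF in the raw mask.
    assert_eq!(drain_slots(0x0000_0000_00FF_00FF), vec![0, 2]);
}
```

Without the normalization, the drain would report each matching slot eight times (once per raw bit), which is why the manual path needs this extra step while the iterator path handled it internally.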
I think the improvement comes from better codegen around the hot `dyn FnMut` predicate call: less probe state stays live across the candidate walk, and the loop operates on cached locals instead of repeatedly reloading from `self`.

Benches
Summary:
Tests
Added regression tests covering:
I also saw a strange regression when running the crate's nightly `#[bench]` tests, but a separate harness using Criterion and valgrind did not reproduce it for `grow_insert_foldhash_highbits` (valgrind gave the same cycle-by-cycle numbers). I can push it as a separate repo if needed (basically it's a copy-paste of the benches ported to Criterion).

asm diff for `find_inner` (pretty hard to cleanly compare them because of inlining):

Looks like the loop becomes tighter due to the manual iteration, with more state living in registers.
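The "cached locals" effect mentioned above can be shown with a small hypothetical sketch; the `Table` type and `count_matching` function below are invented for illustration and are not `RawTable`'s actual code path:

```rust
// Invented stand-in demonstrating the hoisting pattern: copy a field into a
// local once before the hot loop, so the optimizer does not have to prove
// that `self` is unmodified across an opaque predicate call.
struct Table {
    bucket_mask: usize,
}

impl Table {
    fn count_matching(&self, hashes: &[u64], mut eq: impl FnMut(usize) -> bool) -> usize {
        // Hoisted into a register-friendly local; without this, the compiler
        // may conservatively reload `self.bucket_mask` on every iteration
        // when it cannot see through the predicate call.
        let bucket_mask = self.bucket_mask;
        let mut hits = 0;
        for &h in hashes {
            let slot = (h as usize) & bucket_mask;
            if eq(slot) {
                hits += 1;
            }
        }
        hits
    }
}

fn main() {
    let t = Table { bucket_mask: 7 };
    // Hashes 1, 9, and 17 all land in slot 1 under an 8-slot mask.
    assert_eq!(t.count_matching(&[1, 9, 17], |s| s == 1), 3);
}
```

With a `dyn FnMut` predicate the call is fully opaque to the optimizer, so keeping probe state in locals rather than behind `self` plausibly explains the tighter loop seen in the asm diff.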