
perf: optimize PartialEq implementation #54

Closed
klkvr wants to merge 1 commit into main from klkvr/partial-eq-perf

@klkvr

klkvr commented Feb 10, 2026

Member

Because the most significant limbs are filled first, the current comparison is suboptimal for smaller values.
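To illustrate what "most significant limbs filled first" means for short values, here is a minimal sketch. The `PackedNibbles` type, its field names, and the helper methods are hypothetical stand-ins rather than the crate's actual definitions; the sketch assumes 4 bits per nibble, four 64-bit limbs, and that index 3 is the most significant limb.

```rust
/// Hypothetical stand-in for a packed nibble sequence (not the crate's real type).
#[derive(Clone, Copy, Debug, Default)]
struct PackedNibbles {
    /// Number of nibbles stored, at most 64.
    len: usize,
    /// Four 64-bit limbs; `limbs[3]` is assumed to be the most significant.
    limbs: [u64; 4],
}

impl PackedNibbles {
    /// Packs nibbles (each in 0..=15) starting at the top bits of the
    /// most significant limb, 16 nibbles per limb.
    fn from_nibbles(nibbles: &[u8]) -> Self {
        assert!(nibbles.len() <= 64, "at most 64 nibbles fit in 256 bits");
        let mut limbs = [0u64; 4];
        for (i, &n) in nibbles.iter().enumerate() {
            assert!(n < 16, "a nibble is a 4-bit value");
            let limb = 3 - i / 16; // nibbles 0..16 land in limbs[3]
            let shift = 60 - 4 * (i % 16); // first nibble of a limb takes its top 4 bits
            limbs[limb] |= u64::from(n) << shift;
        }
        Self { len: nibbles.len(), limbs }
    }

    /// Number of limbs that can hold data for the current length.
    fn used_limbs(&self) -> usize {
        (self.len + 15) / 16
    }
}
```

Under these assumptions, any value of 16 nibbles or fewer touches only `limbs[3]`, so the three lower limbs are always zero and comparing them does no useful work.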

@codspeed-hq

codspeed-hq Bot commented Feb 10, 2026


Merging this PR will improve performance by 29.41%

⚡ 4 improved benchmarks
✅ 102 untouched benchmarks
⏩ 111 skipped benchmarks [1]

Performance Changes

| Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|
| eq/8 | 538.9 µs | 416.4 µs | +29.41% |
| eq/16 | 538.9 µs | 416.4 µs | +29.41% |
| eq/32 | 511.1 µs | 416.4 µs | +22.73% |
| eq/64 | 511.1 µs | 416.4 µs | +22.73% |

Comparing klkvr/partial-eq-perf (8df38fb) with main (ff87f1c)

Footnotes

  1. 111 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived to remove them from the performance reports.

@DaniPopes

Member

@klkvr can you show asm diff

@klkvr

klkvr commented Feb 10, 2026

Member Author

@DaniPopes amp got me this

--- a/a.s
+++ b/b.s
@@ -4,13 +4,32 @@
 	.type	nibbles_eq,@function
 nibbles_eq:
 	.cfi_startproc
-	vmovdqu	(%rdi), %ymm0
-	vmovq	32(%rdi), %xmm1
-	vmovq	32(%rsi), %xmm2
-	vpxor	%ymm2, %ymm1, %ymm1
-	vpxor	(%rsi), %ymm0, %ymm0
-	vpor	%ymm1, %ymm0, %ymm0
-	vptest	%ymm0, %ymm0
-	sete	%al
-	vzeroupper
+	mov	rcx, qword ptr [rdi]
+	cmp	rcx, qword ptr [rsi]
+	jne	.LBB0_4
+	mov	rax, qword ptr [rdi + 32]
+	cmp	rax, qword ptr [rsi + 32]
+	jne	.LBB0_4
+	mov	al, 1
+	cmp	rcx, 17
+	jae	.LBB0_6
+.LBB0_3:
 	ret
+.LBB0_6:
+	mov	rdx, qword ptr [rdi + 24]
+	cmp	rdx, qword ptr [rsi + 24]
+	jne	.LBB0_4
+	cmp	rcx, 33
+	jb	.LBB0_3
+	mov	rdx, qword ptr [rdi + 16]
+	cmp	rdx, qword ptr [rsi + 16]
+	jne	.LBB0_4
+	cmp	rcx, 49
+	jb	.LBB0_3
+	mov	rax, qword ptr [rdi + 8]
+	cmp	rax, qword ptr [rsi + 8]
+	sete	al
+	ret
+.LBB0_4:
+	xor	eax, eax
+	ret

@DaniPopes

Member

Currently it's branchless, so it doesn't matter how long the value is. With this change it's probably slightly faster when the workload consists only of small values, and even then I doubt it.
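The two strategies being weighed can be sketched in Rust as follows. This is only an approximation of what the two sides of the asm diff appear to do, not the code from either branch: the `PackedNibbles` layout (a length followed by four limbs, with `limbs[3]` most significant) is inferred from the offsets and the 17/33/49 thresholds in the diff, and the function names are made up for illustration.

```rust
/// Hypothetical layout inferred from the asm offsets; not the crate's actual type.
#[derive(Clone, Copy, Debug)]
struct PackedNibbles {
    /// Number of nibbles stored, at most 64.
    len: usize,
    /// `limbs[3]` is assumed to be the most significant limb and filled first.
    limbs: [u64; 4],
}

/// Branchless style: always fold the length and all four limbs into one
/// difference word, regardless of how many nibbles are stored.
fn eq_branchless(a: &PackedNibbles, b: &PackedNibbles) -> bool {
    let mut diff = (a.len ^ b.len) as u64;
    for i in 0..4 {
        diff |= a.limbs[i] ^ b.limbs[i];
    }
    diff == 0
}

/// Early-exit style: compare the length and the most significant limb,
/// then only the lower limbs that the length says can hold data.
fn eq_early_exit(a: &PackedNibbles, b: &PackedNibbles) -> bool {
    if a.len != b.len || a.limbs[3] != b.limbs[3] {
        return false;
    }
    // 16 nibbles fit in one limb, so each further limb matters only past
    // 16, 32, and 48 nibbles respectively.
    if a.len > 16 && a.limbs[2] != b.limbs[2] {
        return false;
    }
    if a.len > 32 && a.limbs[1] != b.limbs[1] {
        return false;
    }
    if a.len > 48 && a.limbs[0] != b.limbs[0] {
        return false;
    }
    true
}
```

The two functions agree only if limbs beyond the stored length are always zero, which is an assumption of this sketch. The early-exit version trades a fixed amount of straight-line work for up to several data-dependent branches, which is the trade-off raised in the comment above.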

klkvr closed this Feb 23, 2026