Improve on NEON path of no memmove #12

Merged
xjb714 merged 1 commit into xjb714:main from Antares0982:neon-opt
Apr 27, 2026
@Antares0982

Contributor

Closes #3. The memmove bottleneck is completely solved now.

This PR improves performance on the no-memmove path and also refactors the code. The XJB_NO_MEMMOVE=0 path is unaffected (it generates the same assembly).

According to my investigation, the load-store unit (LSU) on Apple Silicon handles memmove with little to no performance penalty, so the XJB_NO_MEMMOVE switch is disabled by default on Apple Silicon chips. The main optimizations in this PR are:

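  1. 0x30302e30 cannot be encoded as a logical (bitmask) immediate on AArch64, so the compiler has to build it in a register first. That costs two extra instructions: the mov/movk pair on the + side of the diff below, followed by a register-form orr. 0x30303030 is encodable and needs only a single orr with an immediate operand (the - side). Revert the constant to 0x30303030 and restore the original write order.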
-	ldr	q2, [x4, x1]
 	cmgt.16b	v1, v1, #0
+	ldr	q2, [x4, x1]
+	tbl.16b	v2, { v0 }, v2
 	shrn.8b	v1, v1, #4
 	fmov	x1, d1
 	rbit	x1, x1
@@ -316,10 +317,11 @@ LBB1_5:
 	csel	x15, x17, x1, eq
 	ldrb	w5, [x4, x5]
 	ldrb	w1, [x4, x15]
-	tbl.16b	v1, { v0 }, v2
 	add	x15, x0, x2
-	str	q1, [x15]
-	orr	w16, w16, #0x30303030
+	str	q2, [x15]
+	mov	w4, #11824                      ; =0x2e30
+	movk	w4, #12336, lsl #16
+	orr	w16, w16, w4
  2. The compiler does not spot the optimization available in memset(buf, '0', 8): reusing a vector register that already exists. Rewriting it explicitly as vst1q_s8((int8_t*)buf, vdupq_n_s8('0')) enables that reuse and removes a few instructions.
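
The register reuse described in the memset item above can be sketched roughly as follows. This is not the code from the PR: `write_padded_digits` is a hypothetical helper, it uses the unsigned `vdupq_n_u8`/`vst1q_u8` intrinsics rather than the signed ones quoted above, and it falls back to scalar code when `__ARM_NEON` is not defined. The point is only that a single vector of ASCII '0' bytes can serve both as the padding store and as the offset added to the digit values.

```c
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper (an assumption, not taken from the PR).
 * Writes 16 bytes of ASCII '0' to out[0..15] and the ASCII form of
 * 16 digit values (each in 0..9) to out[16..31].
 * out must have at least 32 writable bytes. */
static void write_padded_digits(uint8_t *out, const uint8_t digits[16]) {
#if defined(__ARM_NEON)
    /* One vector of '0' bytes, used twice below. */
    uint8x16_t ascii_zero = vdupq_n_u8('0');
    vst1q_u8(out, ascii_zero);                                   /* padding  */
    vst1q_u8(out + 16, vaddq_u8(vld1q_u8(digits), ascii_zero));  /* digits   */
#else
    /* Scalar fallback with the same observable result. */
    memset(out, '0', 16);
    for (int i = 0; i < 16; i++) {
        out[16 + i] = (uint8_t)('0' + digits[i]);
    }
#endif
}
```

Note that a `vst1q` store always writes 16 bytes, whereas `memset(buf, '0', 8)` writes 8, so this substitution is only safe when the buffer has at least 16 writable bytes at that position.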

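To see why the two constants in item 1 above behave differently, here is a minimal sketch of a check for whether a 32-bit value can be encoded as an AArch64 logical (bitmask) immediate. The function names are my own and nothing here comes from the PR. It relies on the architectural rule that such an immediate is an element of 2, 4, 8, 16 or 32 bits, replicated across the register, where the element is a rotated contiguous run of ones (neither all zeros nor all ones).

```c
#include <stdbool.h>
#include <stdint.h>

/* Count set bits without relying on compiler builtins. */
static unsigned popcount32(uint32_t x) {
    unsigned n = 0;
    while (x) {
        x &= x - 1u;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Returns true if v is encodable as the immediate operand of a 32-bit
 * AArch64 logical instruction (and/orr/eor with a w register). */
static bool is_logical_imm32(uint32_t v) {
    if (v == 0u || v == 0xFFFFFFFFu) {
        return false;  /* all-zeros and all-ones are not encodable */
    }
    for (unsigned e = 2; e <= 32; e *= 2) {
        uint32_t mask = (e == 32) ? 0xFFFFFFFFu : ((1u << e) - 1u);
        uint32_t elt = v & mask;

        /* v must be the e-bit element repeated across all 32 bits. */
        bool replicated = true;
        for (unsigned i = e; i < 32; i += e) {
            if (((v >> i) & mask) != elt) {
                replicated = false;
                break;
            }
        }
        if (!replicated) {
            continue;
        }

        /* Rotate the element right by one within e bits. A single circular
         * run of ones has exactly two positions where neighbours differ. */
        uint32_t rot = ((elt >> 1) | ((elt & 1u) << (e - 1))) & mask;
        if (popcount32(elt ^ rot) == 2) {
            return true;
        }
    }
    return false;
}
```

With this check, `is_logical_imm32(0x30303030u)` is true (the byte 0x30 is a run of two ones, repeated four times), while `is_logical_imm32(0x30302e30u)` is false, which matches the extra mov/movk pair in the diff.
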
Other refactoring: code reuse, removal of comments and unreachable code, and the addition of an assume macro. No actual logic changes were introduced.
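
The assume macro itself is not shown in this description. As an illustration only, a common way to write one looks like the sketch below; `SKETCH_ASSUME` and `digit_count_1_to_3` are hypothetical names, and the macro added in the PR may be named and defined differently.

```c
/* Hypothetical assume macro (an assumption, not the one from the PR).
 * It tells the optimizer that cond always holds. If cond is ever false
 * at run time, the behavior is undefined. */
#if defined(__clang__)
#  define SKETCH_ASSUME(cond) __builtin_assume(cond)
#elif defined(__GNUC__)
#  define SKETCH_ASSUME(cond) \
      do { if (!(cond)) __builtin_unreachable(); } while (0)
#elif defined(_MSC_VER)
#  define SKETCH_ASSUME(cond) __assume(cond)
#else
#  define SKETCH_ASSUME(cond) ((void)0)
#endif

/* Hypothetical example: the caller guarantees 1 <= n <= 999, so the
 * compiler may drop handling for values outside that range. */
static unsigned digit_count_1_to_3(unsigned n) {
    SKETCH_ASSUME(n >= 1 && n <= 999);
    if (n < 10) {
        return 1;
    }
    if (n < 100) {
        return 2;
    }
    return 3;
}
```

A macro like this replaces explicit unreachable branches with a stated precondition, which fits the removal of unreachable code mentioned above.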

Performance on Apple M4:

  xjb                       min    4.20  P1    4.35  med    4.64  mean    4.64 ns/call  (sink=7868610120)
  xjb-new-nomove            min    4.22  P1    4.28  med    4.64  mean    4.63 ns/call  (sink=7868610120)
  xjb-old-nomove            min    4.28  P1    4.40  med    4.70  mean    4.70 ns/call  (sink=7868610120)

@xjb714 xjb714 merged commit 8cdf743 into xjb714:main Apr 27, 2026
3 checks passed
@Antares0982 Antares0982 deleted the neon-opt branch April 27, 2026 11:02
Development

Successfully merging this pull request may close these issues.

memmove introduce performance bottleneck in fixed writer