Improve on NEON path of no memmove by Antares0982 · Pull Request #12 · xjb714/xjb

Antares0982 · 2026-04-26T13:14:28Z

Closes #3. The memmove bottleneck is now completely resolved.

This PR improves performance on the no-memmove path and also refactors the code. The XJB_NO_MEMMOVE=0 path is not affected (it generates the same assembly).

According to my investigation, the design of the Apple Silicon load-store unit (LSU) results in little to no performance penalty for memmove operations; therefore this switch is disabled by default on Apple Silicon chips. The main optimizations in this PR are:

0x30302e30 cannot be encoded as a logical immediate on AArch64, which causes the compiler to emit two additional instructions (a mov/movk pair that materializes the constant in a register before the orr). Revert it to 0x30303030 and restore the original write order.

-	ldr	q2, [x4, x1]
 	cmgt.16b	v1, v1, #0
+	ldr	q2, [x4, x1]
+	tbl.16b	v2, { v0 }, v2
 	shrn.8b	v1, v1, #4
 	fmov	x1, d1
 	rbit	x1, x1
@@ -316,10 +317,11 @@ LBB1_5:
 	csel	x15, x17, x1, eq
 	ldrb	w5, [x4, x5]
 	ldrb	w1, [x4, x15]
-	tbl.16b	v1, { v0 }, v2
 	add	x15, x0, x2
-	str	q1, [x15]
-	orr	w16, w16, #0x30303030
+	str	q2, [x15]
+	mov	w4, #11824                      ; =0x2e30
+	movk	w4, #12336, lsl #16
+	orr	w16, w16, w4
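
To make the immediate issue concrete, here is a minimal sketch of the encodability rule for 32-bit logical immediates as I understand it: the value must be a repetition of a 2-, 4-, 8-, 16- or 32-bit element, and that element must be a rotated run of ones (neither all zeros nor all ones). The function name is hypothetical and is not part of xjb; this is an illustration, not the compiler's actual check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Portable population count, to avoid compiler builtins. */
static unsigned popcount32(uint32_t v)
{
    unsigned n = 0;
    while (v) {
        v &= v - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Hypothetical helper: returns true if imm fits the AArch64 32-bit
 * logical-immediate rule sketched above. */
bool is_a64_logical_imm32(uint32_t imm)
{
    if (imm == 0 || imm == UINT32_MAX)
        return false; /* all zeros and all ones are not encodable */

    /* Find the smallest power-of-two period of the bit pattern. */
    unsigned size = 32;
    while (size > 2) {
        unsigned half = size / 2;
        uint32_t mask = (1u << half) - 1;
        if ((imm & mask) != ((imm >> half) & mask))
            break;
        size = half;
    }

    uint32_t emask = (size == 32) ? UINT32_MAX : ((1u << size) - 1);
    uint32_t elem = imm & emask;

    /* Rotate the element right by one bit within its own width. A single
     * circular run of ones has exactly two positions where neighbouring
     * bits differ. */
    uint32_t rot = ((elem >> 1) | (elem << (size - 1))) & emask;
    return popcount32(elem ^ rot) == 2;
}
```

Under this rule 0x30303030 passes (the byte 0x30 repeated, with 0x30 being a run of two ones), while 0x30302e30 fails because its halves differ and the full 32-bit value contains several separate runs of ones.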

The compiler does not recognize the potential optimization in memset(buf, '0', 8): reusing an existing vector register. Rewriting it explicitly as vst1q_s8((int8_t*)buf, vdupq_n_s8('0')) enables register reuse and removes a few instructions.
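
A minimal sketch of that rewrite, wrapped in a hypothetical helper (the name fill_ascii_zeros16 is not from xjb). Note that the 128-bit store writes 16 bytes rather than 8, so this form is only valid if buf has at least 16 writable bytes and the extra bytes are overwritten later or ignored; I am assuming that holds at the call site. The non-NEON branch exists only so the sketch compiles elsewhere.

```c
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: writes 16 bytes of ASCII '0' starting at buf.
 * The caller must guarantee at least 16 writable bytes. */
static inline void fill_ascii_zeros16(char *buf)
{
#if defined(__aarch64__) && defined(__ARM_NEON)
    /* Splat '0' across a vector register. If the surrounding code already
     * holds this constant in a register, the compiler can reuse it instead
     * of materializing a scalar for the memset. */
    int8x16_t zeros = vdupq_n_s8('0');
    vst1q_s8((int8_t *)buf, zeros);
#else
    /* Portable fallback with the same 16-byte effect. */
    memset(buf, '0', 16);
#endif
}
```

Whether the register is actually reused depends on the surrounding code and the compiler, so it is worth confirming in the generated assembly.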

Other refactoring: code reuse, removal of comments and unreachable code, and the addition of an assume macro. No actual logic changes were introduced.
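
For readers unfamiliar with assume macros, a common shape looks like the sketch below. The macro name ASSUME_SKETCH and the example function are hypothetical; the macro actually added in this PR may be named and defined differently.

```c
/* Hypothetical assume macro: tells the optimizer that cond always holds.
 * Behaviour is undefined if cond is false at run time. */
#if defined(__clang__)
#define ASSUME_SKETCH(cond) __builtin_assume(cond)
#elif defined(__GNUC__)
#define ASSUME_SKETCH(cond) \
    do { if (!(cond)) __builtin_unreachable(); } while (0)
#elif defined(_MSC_VER)
#define ASSUME_SKETCH(cond) __assume(cond)
#else
#define ASSUME_SKETCH(cond) ((void)0)
#endif

static const char HEX_DIGITS_SKETCH[] = "0123456789abcdef";

/* Example use: the caller promises nibble < 16, so the compiler may drop
 * any range handling it would otherwise keep. */
char hex_digit_sketch(unsigned nibble)
{
    ASSUME_SKETCH(nibble < 16);
    return HEX_DIGITS_SKETCH[nibble];
}
```

Such a macro changes no logic by itself; it only gives the optimizer a fact it could not prove, which matches the note that no actual logic changes were introduced.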

Performance on Apple M4:

  xjb                       min    4.20  P1    4.35  med    4.64  mean    4.64 ns/call  (sink=7868610120)
  xjb-new-nomove            min    4.22  P1    4.28  med    4.64  mean    4.63 ns/call  (sink=7868610120)
  xjb-old-nomove            min    4.28  P1    4.40  med    4.70  mean    4.70 ns/call  (sink=7868610120)

Improve on NEON path

1293053

Antares0982 force-pushed the neon-opt branch from 85f169c to 1293053 Compare April 27, 2026 08:48

xjb714 merged commit 8cdf743 into xjb714:main Apr 27, 2026
3 checks passed

Antares0982 deleted the neon-opt branch April 27, 2026 11:02

xjb714 merged 1 commit into xjb714:main from Antares0982:neon-opt