Improve successful find speed by 1 cycle on Aarch64 #9726

Open
Nicoshev wants to merge 2 commits into facebook:master from Nicoshev:export-D94030304

@Nicoshev
Contributor

Summary:
The result of SparseMaskIter's next is often used as an index into an array of 8-byte elements.
In this case, the index needs to be shifted left by 3 to form the byte offset of the desired element.
The return statement of that function contains i >> 2.
The compiler simplifies the two shifts by issuing only a lsl 1 and omitting the lsr 2.
However, it then ANDs the shifted value with 0xf8 to ensure correctness when variable i is not a multiple of 4.
We know that variable i will always be a multiple of 4.
We add an assume clause stating this, so the compiler avoids emitting the & 0xf8.
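
The shift reasoning can be checked with a small sketch. The function names are illustrative only, and the range 0..63 for i is an assumption (bit positions within a 64-bit mask) that the summary does not state explicitly.

```cpp
#include <cstdint>

// Byte offset as written in the source: index (i >> 2) scaled to 8-byte elements.
constexpr std::uint32_t offsetAsWritten(std::uint32_t i) {
  return (i >> 2) << 3;
}

// What the compiler emitted before the change: lsl #1 followed by and #0xf8.
constexpr std::uint32_t offsetWithAnd(std::uint32_t i) {
  return (i << 1) & 0xf8u;
}

// What it can emit once it knows i % 4 == 0: lsl #1 only.
constexpr std::uint32_t offsetShiftOnly(std::uint32_t i) {
  return i << 1;
}

// Checks both equivalences over the assumed range of i.
constexpr bool checkAll() {
  for (std::uint32_t i = 0; i < 64; ++i) {
    // The and-masked form is always equivalent to the written form.
    if (offsetAsWritten(i) != offsetWithAnd(i)) return false;
    // The shift-only form is equivalent only for multiples of 4.
    if (i % 4 == 0 && offsetAsWritten(i) != offsetShiftOnly(i)) return false;
  }
  return true;
}

static_assert(checkAll(), "shift identities hold for i in [0, 64)");
```

For a value such as i = 6, which is not a multiple of 4, the written form gives 8 while the shift-only form gives 12; this is the case the and instruction guards against.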

Before the change, the assembly looked like this:

clz x16, x16
lsl x16, x16, #1
and x16, x16, #0xf8
ldr x16, [x14, x16]

After the change, we verified that the and instruction is omitted:

clz x16, x16
lsl x16, x16, #1
ldr x16, [x14, x16]

Removing the and instruction from the dependency chain between the lsl and the ldr reduces execution latency by 1 cycle 🤗
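
A minimal sketch of the kind of hint described above, assuming a Clang or GCC toolchain. SKETCH_ASSUME, slotIndex, and loadSlot are hypothetical names, not identifiers from this PR, and the actual change may use a different assume mechanism.

```cpp
#include <cstdint>

// Hypothetical helper, not taken from the PR: tells the optimizer that
// `cond` holds. Behavior is undefined if it does not.
#if defined(__clang__)
#define SKETCH_ASSUME(cond) __builtin_assume(cond)
#elif defined(__GNUC__)
#define SKETCH_ASSUME(cond) \
  do {                      \
    if (!(cond)) __builtin_unreachable(); \
  } while (0)
#else
#define SKETCH_ASSUME(cond) ((void)0)
#endif

// Stand-in for the tail of SparseMaskIter's next: the caller guarantees
// that `i` is a multiple of 4, as the summary states.
inline unsigned slotIndex(unsigned i) {
  SKETCH_ASSUME((i & 3u) == 0);
  return i >> 2;
}

// Typical use: index into an array of 8-byte elements. With the hint, the
// byte offset (i >> 2) << 3 can be computed as i << 1 with no masking.
inline std::uint64_t loadSlot(const std::uint64_t* items, unsigned i) {
  return items[slotIndex(i)];
}
```

Whether a given compiler actually drops the and depends on its version and optimization level, so the generated assembly should still be inspected.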

Differential Revision: D94030304

Summary:

X-link: facebook/folly#2588

The idea is to use the MATCH instruction right after loading the tags.
MATCH checks whether any byte loaded from memory is equal to the needle and sets the flags accordingly.
We can use it to branch quickly when no byte is equal to the needle.

The emitted asm looks like this:

  2c3c70:	a40e4141 	ld1b	{z1.b}, p0/z, [x10, x14]
  2c3c74:	45208021 	match	p1.b, p0/z, z1.b, z0.b
  2c3c78:	540001e0 	b.eq	2c3cb4 <_ZN30F14Set_equalityRefinement_Test8TestBodyEv+0x348>  // b.none
  2c3c7c:	6e208c21 	cmeq	v1.16b, v1.16b, v0.16b
  2c3c80:	910041ae 	add	x14, x13, #0x10
  2c3c84:	05800701 	and	z1.b, z1.b, #0x11
  2c3c88:	0f0c8421 	shrn	v1.8b, v1.8h, #4
  ...
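
The control flow in the listing above can be modeled with a portable scalar sketch. This is not SVE code and not the PR's implementation: the function names are hypothetical, all 16 loaded bytes are treated as lanes for simplicity, and the 4-bit spacing of the result mask is my reading of the and #0x11 / shrn #4 pair.

```cpp
#include <cstddef>
#include <cstdint>

// Number of bytes loaded by ld1b and compared by cmeq in the listing.
constexpr std::size_t kLanes = 16;

// Models MATCH followed by b.none: true if any loaded byte equals the needle.
inline bool anyLaneMatches(const std::uint8_t* tags, std::uint8_t needle) {
  for (std::size_t k = 0; k < kLanes; ++k) {
    if (tags[k] == needle) return true;
  }
  return false;
}

// Models the slower path (cmeq, and #0x11, shrn #4): one bit per matching
// lane, spaced 4 bits apart, packed into a 64-bit mask.
inline std::uint64_t buildSparseMask(const std::uint8_t* tags,
                                     std::uint8_t needle) {
  std::uint64_t mask = 0;
  for (std::size_t k = 0; k < kLanes; ++k) {
    if (tags[k] == needle) mask |= std::uint64_t{1} << (4 * k);
  }
  return mask;
}

// Early exit when nothing matches; build the mask only when needed.
inline std::uint64_t tagMatchMask(const std::uint8_t* tags,
                                  std::uint8_t needle) {
  if (!anyLaneMatches(tags, needle)) return 0;
  return buildSparseMask(tags, needle);
}
```

In scalar code the early exit saves little; the point is only to show where the branch sits relative to the mask construction.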

Instruction cmeq is likely to be executed speculatively alongside match.
I also tried using the output of match in a broadcast instruction: svdup_n_u8_z(outPred, 17);
The dup replaces the cmeq+and, but it still measured equal to or slower than cmeq+and, further suggesting that the cmeq executes alongside match. I'll ask ARM engineers which of the two sequences they recommend.

On newer CPUs implementing SVE2.1, we will be able to move the predicate into a SIMD register.

Maybe the code layout can be improved to be less ugly.

Benchmark shows up to ~10% reduction in find latency (6-10% in most rows; under 3% for f14vec<std::string>):

  Before:

  Find  f14node<NonSSOString, a[128]>[11]                    28.08ns    35.61M
  Find   f14val<NonSSOString, a[128]>[11]         98.784%    28.43ns    35.18M
  Find   f14vec<NonSSOString, a[128]>[11]         101.61%    27.64ns    36.18M
  ----------------------------------------------------------------------------
  Find  f14node<NonSSOString, a[1]>[11]                      28.07ns    35.62M
  Find   f14val<NonSSOString, a[1]>[11]           98.209%    28.58ns    34.99M
  Find   f14vec<NonSSOString, a[1]>[11]           101.54%    27.65ns    36.17M
  ----------------------------------------------------------------------------
  Find  f14node<std::string, a[128]>[11]                     28.07ns    35.62M
  Find   f14val<std::string, a[128]>[11]          97.935%    28.66ns    34.89M
  Find   f14vec<std::string, a[128]>[11]          101.42%    27.68ns    36.13M
  ----------------------------------------------------------------------------
  Find  f14node<std::string, a[1]>[11]                       28.05ns    35.65M
  Find   f14val<std::string, a[1]>[11]            97.259%    28.84ns    34.67M
  Find   f14vec<std::string, a[1]>[11]            100.40%    27.94ns    35.79M

  After:

  Find  f14node<NonSSOString, a[128]>[11]                    25.81ns    38.75M
  Find   f14val<NonSSOString, a[128]>[11]         99.176%    26.02ns    38.43M
  Find   f14vec<NonSSOString, a[128]>[11]         100.04%    25.80ns    38.76M
  ----------------------------------------------------------------------------
  Find  f14node<NonSSOString, a[1]>[11]                      25.81ns    38.75M
  Find   f14val<NonSSOString, a[1]>[11]           99.176%    26.02ns    38.43M
  Find   f14vec<NonSSOString, a[1]>[11]           100.02%    25.80ns    38.76M
  ----------------------------------------------------------------------------
  Find  f14node<std::string, a[128]>[11]                     26.33ns    37.98M
  Find   f14val<std::string, a[128]>[11]          101.23%    26.01ns    38.45M
  Find   f14vec<std::string, a[128]>[11]          95.719%    27.50ns    36.36M
  ----------------------------------------------------------------------------
  Find  f14node<std::string, a[1]>[11]                       26.36ns    37.93M
  Find   f14val<std::string, a[1]>[11]            101.80%    25.90ns    38.62M
  Find   f14vec<std::string, a[1]>[11]            96.690%    27.26ns    36.68M


The improvement is likely to be higher on dense maps and lower on sparse maps whose tags are not in cache.

Reviewed By: yfeldblum

Differential Revision: D93997423
@meta-codesync
meta-codesync bot commented Feb 22, 2026

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D94030304.
