Improve successful find speed by 1 cycle on Aarch64 #9726
Open
Nicoshev wants to merge 2 commits into facebook:master
Conversation
Summary: X-link: facebook/folly#2588

The idea is to use the MATCH instruction right after loading the tags. MATCH checks whether any byte loaded from memory is equal to the needle, setting the condition flags accordingly. We can use it to branch quickly when no byte equals the needle. The emitted asm looks like this:

```asm
2c3c70: a40e4141  ld1b  {z1.b}, p0/z, [x10, x14]
2c3c74: 45208021  match p1.b, p0/z, z1.b, z0.b
2c3c78: 540001e0  b.eq  2c3cb4 <_ZN30F14Set_equalityRefinement_Test8TestBodyEv+0x348>  // b.none
2c3c7c: 6e208c21  cmeq  v1.16b, v1.16b, v0.16b
2c3c80: 910041ae  add   x14, x13, #0x10
2c3c84: 05800701  and   z1.b, z1.b, #0x11
2c3c88: 0f0c8421  shrn  v1.8b, v1.8h, #4
...
```

The cmeq instruction is likely to execute speculatively alongside match. I also tried using the output of match in a broadcast instruction: svdup_n_u8_z(outPred, 17). The dup replaces the cmeq+and pair, but it still benchmarked equal to or slower than cmeq+and, further suggesting that cmeq executes alongside match. I'll ask ARM engineers whether they recommend any sequence between the two. On newer CPUs implementing SVE2.1, we will be able to move the predicate into a SIMD register.

Maybe the code layout can be improved to be less ugly.
Benchmark shows ~10% reduction in find latency.

Before:

```
Find f14node<NonSSOString, a[128]>[11]              28.08ns   35.61M
Find f14val<NonSSOString, a[128]>[11]    98.784%    28.43ns   35.18M
Find f14vec<NonSSOString, a[128]>[11]    101.61%    27.64ns   36.18M
----------------------------------------------------------------------------
Find f14node<NonSSOString, a[1]>[11]                28.07ns   35.62M
Find f14val<NonSSOString, a[1]>[11]      98.209%    28.58ns   34.99M
Find f14vec<NonSSOString, a[1]>[11]      101.54%    27.65ns   36.17M
----------------------------------------------------------------------------
Find f14node<std::string, a[128]>[11]               28.07ns   35.62M
Find f14val<std::string, a[128]>[11]     97.935%    28.66ns   34.89M
Find f14vec<std::string, a[128]>[11]     101.42%    27.68ns   36.13M
----------------------------------------------------------------------------
Find f14node<std::string, a[1]>[11]                 28.05ns   35.65M
Find f14val<std::string, a[1]>[11]       97.259%    28.84ns   34.67M
Find f14vec<std::string, a[1]>[11]       100.40%    27.94ns   35.79M
```

After:

```
Find f14node<NonSSOString, a[128]>[11]              25.81ns   38.75M
Find f14val<NonSSOString, a[128]>[11]    99.176%    26.02ns   38.43M
Find f14vec<NonSSOString, a[128]>[11]    100.04%    25.80ns   38.76M
----------------------------------------------------------------------------
Find f14node<NonSSOString, a[1]>[11]                25.81ns   38.75M
Find f14val<NonSSOString, a[1]>[11]      99.176%    26.02ns   38.43M
Find f14vec<NonSSOString, a[1]>[11]      100.02%    25.80ns   38.76M
----------------------------------------------------------------------------
Find f14node<std::string, a[128]>[11]               26.33ns   37.98M
Find f14val<std::string, a[128]>[11]     101.23%    26.01ns   38.45M
Find f14vec<std::string, a[128]>[11]     95.719%    27.50ns   36.36M
----------------------------------------------------------------------------
Find f14node<std::string, a[1]>[11]                 26.36ns   37.93M
Find f14val<std::string, a[1]>[11]       101.80%    25.90ns   38.62M
Find f14vec<std::string, a[1]>[11]       96.690%    27.26ns   36.68M
```

The improvement is likely to be higher on dense maps, and lower on sparse maps where the tags are not in cache.
Reviewed By: yfeldblum

Differential Revision: D93997423
Summary:
The result of SparseMaskIter's next() is often used as an index into an array of 8-byte elements.
In that case, the index needs to be shifted left by 3 to compute the desired memory offset.
The return statement of next() contains i >> 2.
The compiler simplifies the two shifts into a single lsl #1, omitting the lsr #2.
However, it then ANDs the shifted value with 0xf8 to ensure correctness when variable i is not a
multiple of 4.
We know that variable i is always a multiple of 4, so we add an assume clause and the compiler
avoids emitting the AND with 0xf8.

Before, the assembly looked like this:

```asm
clz x16, x16
lsl x16, x16, #1
and x16, x16, #0xf8
ldr x16, [x14, x16]
```

After the changes, we verified the and is omitted:

```asm
clz x16, x16
lsl x16, x16, #1
ldr x16, [x14, x16]
```

By removing one pipelined instruction from the codepath, execution latency is reduced by 1 cycle 🤗

Differential Revision: D94030304