Improve successful find speed by 1 cycle on Aarch64 #9726
Open
Nicoshev wants to merge 2 commits into facebook:master
Conversation
Summary: X-link: facebook/folly#2588

The idea is to use the MATCH instruction right after loading the tags. MATCH checks whether any byte loaded from memory is equal to the needle, setting the condition flags accordingly. We can use it to branch quickly when no byte equals the needle. The emitted asm looks like this:

```asm
2c3c70: a40e4141  ld1b  {z1.b}, p0/z, [x10, x14]
2c3c74: 45208021  match p1.b, p0/z, z1.b, z0.b
2c3c78: 540001e0  b.eq  2c3cb4 <_ZN30F14Set_equalityRefinement_Test8TestBodyEv+0x348>  // b.none
2c3c7c: 6e208c21  cmeq  v1.16b, v1.16b, v0.16b
2c3c80: 910041ae  add   x14, x13, #0x10
2c3c84: 05800701  and   z1.b, z1.b, #0x11
2c3c88: 0f0c8421  shrn  v1.8b, v1.8h, #4
...
```

The cmeq instruction is likely to execute speculatively alongside match. I also tried using the output of match in a broadcast instruction: svdup_n_u8_z(outPred, 17). The dup replaces the cmeq+and pair, but it still benchmarked equal to or slower than cmeq+and, further suggesting that cmeq executes alongside match. I'll ask ARM engineers whether they recommend any sequence between the two. On newer CPUs implementing SVE2.1, we will be able to move the predicate into a SIMD register.

Maybe the code layout can be improved to be less ugly.
Benchmark shows ~10% reduction in find latency.

Before:

```
Find f14node<NonSSOString, a[128]>[11]              28.08ns   35.61M
Find f14val<NonSSOString, a[128]>[11]    98.784%    28.43ns   35.18M
Find f14vec<NonSSOString, a[128]>[11]    101.61%    27.64ns   36.18M
----------------------------------------------------------------------------
Find f14node<NonSSOString, a[1]>[11]                28.07ns   35.62M
Find f14val<NonSSOString, a[1]>[11]      98.209%    28.58ns   34.99M
Find f14vec<NonSSOString, a[1]>[11]      101.54%    27.65ns   36.17M
----------------------------------------------------------------------------
Find f14node<std::string, a[128]>[11]               28.07ns   35.62M
Find f14val<std::string, a[128]>[11]     97.935%    28.66ns   34.89M
Find f14vec<std::string, a[128]>[11]     101.42%    27.68ns   36.13M
----------------------------------------------------------------------------
Find f14node<std::string, a[1]>[11]                 28.05ns   35.65M
Find f14val<std::string, a[1]>[11]       97.259%    28.84ns   34.67M
Find f14vec<std::string, a[1]>[11]       100.40%    27.94ns   35.79M
```

After:

```
Find f14node<NonSSOString, a[128]>[11]              25.81ns   38.75M
Find f14val<NonSSOString, a[128]>[11]    99.176%    26.02ns   38.43M
Find f14vec<NonSSOString, a[128]>[11]    100.04%    25.80ns   38.76M
----------------------------------------------------------------------------
Find f14node<NonSSOString, a[1]>[11]                25.81ns   38.75M
Find f14val<NonSSOString, a[1]>[11]      99.176%    26.02ns   38.43M
Find f14vec<NonSSOString, a[1]>[11]      100.02%    25.80ns   38.76M
----------------------------------------------------------------------------
Find f14node<std::string, a[128]>[11]               26.33ns   37.98M
Find f14val<std::string, a[128]>[11]     101.23%    26.01ns   38.45M
Find f14vec<std::string, a[128]>[11]     95.719%    27.50ns   36.36M
----------------------------------------------------------------------------
Find f14node<std::string, a[1]>[11]                 26.36ns   37.93M
Find f14val<std::string, a[1]>[11]       101.80%    25.90ns   38.62M
Find f14vec<std::string, a[1]>[11]       96.690%    27.26ns   36.68M
```

The improvement is likely to be higher on dense maps, and lower on sparse maps where the tags are not in cache.
Reviewed By: yfeldblum

Differential Revision: D93997423
Summary:
The result of SparseMaskIter's next() is often used as an index into an array of 8-byte elements.
In that case, the index needs to be shifted left by 3 to compute the desired memory offset.
The return statement of next() contains i >> 2.
The compiler simplifies the two shifts into a single lsl #1, omitting the lsr #2.
However, it then ANDs the shifted value with 0xf8 to ensure correctness when variable i is not a
multiple of 4.
We know that variable i is always a multiple of 4, so we add an assume clause and the compiler
avoids emitting the AND with 0xf8.

Before, the assembly looked like this:

```asm
clz x16, x16
lsl x16, x16, #1
and x16, x16, #0xf8
ldr x16, [x14, x16]
```

After the changes, we verified the and is omitted:

```asm
clz x16, x16
lsl x16, x16, #1
ldr x16, [x14, x16]
```

By removing one pipelined instruction from the codepath, execution latency is reduced by 1 cycle 🤗

Differential Revision: D94030304