1. Executive Summary
The Problem: sse2neon currently relies on a direct 1:1 mapping of intrinsics. While this ensures functional correctness, it misses high-level SIMD idioms (patterns of multiple instructions). On x86, instructions like PMOVMSKB are highly specific; emulating them in NEON via a naive 1:1 mapping results in long dependency chains and performance degradation (often 1.5x to 2x slower for specific operations).
The Solution: We propose sse2neon-opt, an optional build-time tool based on Python and libclang. This tool performs source-to-source rewriting to recognize SSE patterns (e.g., compare + movemask) and replaces them with semantically equivalent, high-performance s2n_* helpers or canonical intrinsics.
The Goal: Evolve sse2neon from a pure compatibility layer into a performance optimization tool, unlocking NEON-specific instruction sequences (like vshrn for bitmasks) without forcing users to manually rewrite their legacy SSE codebases.
2. Motivation & Background
2.1 The Movemask Bottleneck
The primary motivator for this proposal is the performance gap identified by Danila Kutenin in his article "Porting x86 vector bitmask optimizations to Arm NEON" (Arm Developer Blog).
Kutenin points out that:
- No Direct Equivalent: NEON has no single instruction equivalent to x86's PMOVMSKB (which moves the sign bit of every byte into a scalar register).
- The Performance Cost: Emulating PMOVMSKB generically requires 4 to 6 NEON instructions (vsra, vand, vpadd, etc.); a representative sequence is sketched after this list.
- The Solution: The fastest way to handle bitmasks on NEON is to change the data format entirely. Instead of a dense 16-bit mask, NEON performs best with a Nibble Mask, where each byte is represented by 4 bits.
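For illustration, here is a minimal sketch of one common emulation strategy on AArch64 (not necessarily the exact sequence in sse2neon.h; the function name and weight table are illustrative): each byte's sign bit is isolated, weighted by its lane position, and horizontally summed, producing exactly the kind of dependency chain described above.

```c
#include <arm_neon.h>
#include <stdint.h>

/* Illustrative generic PMOVMSKB emulation (AArch64 NEON).
 * Isolate each byte's MSB, weight it by its bit position within its
 * 64-bit half, then horizontally add each half and combine. */
static inline int movemask_epi8_generic(uint8x16_t v)
{
    static const uint8_t weights[16] = {1, 2, 4, 8, 16, 32, 64, 128,
                                        1, 2, 4, 8, 16, 32, 64, 128};
    uint8x16_t msb = vshrq_n_u8(v, 7);                  /* 0 or 1 per byte    */
    uint8x16_t w   = vmulq_u8(msb, vld1q_u8(weights));  /* bit value per lane */
    int lo = vaddv_u8(vget_low_u8(w));                  /* sum low 8 lanes    */
    int hi = vaddv_u8(vget_high_u8(w));                 /* sum high 8 lanes   */
    return lo | (hi << 8);                              /* dense 16-bit mask  */
}
```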
2.2 Why the Current Header-Only Approach Fails
The current sse2neon.h cannot implement Kutenin's optimization because it lacks context.
- The Header Sees: _mm_movemask_epi8(var). It does not know where var came from, so it must assume var contains arbitrary data.
- The Reality: In 90% of string and hash algorithms, var is the result of _mm_cmpeq_epi8 and therefore contains only 0x00 or 0xFF bytes.
- The Missed Opportunity: If we knew the input was a comparison result, we could use the NEON instruction vshrn_n_u16 to create the mask in just 2 instructions (see the sketch after this list).
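As a minimal sketch of that idea (the full helper with an x86 fallback appears in section 5.3; the function name here is illustrative), the comparison and mask extraction collapse to two vector instructions plus a lane move:

```c
#include <arm_neon.h>
#include <stdint.h>

/* Sketch of the nibble-mask idiom: CMEQ + SHRN, then a move to a
 * general-purpose register. Each matching byte becomes a 0xF nibble
 * in the 64-bit result. */
static inline uint64_t cmpeq_nibble_mask(uint8x16_t a, uint8x16_t b)
{
    uint8x16_t eq     = vceqq_u8(a, b);                            /* 0x00 / 0xFF per byte  */
    uint8x8_t  narrow = vshrn_n_u16(vreinterpretq_u16_u8(eq), 4);  /* 0x0 / 0xF per nibble  */
    return vget_lane_u64(vreinterpret_u64_u8(narrow), 0);
}
```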
3. Landscape & Competitive Analysis
Why build a new tool instead of relying on existing solutions?
| Approach | Methodology | Limitation |
| --- | --- | --- |
| Current sse2neon / SIMDe | Header-only static inline functions. | Myopic: cannot see across statement boundaries (e.g., that cmp feeds into movemask). This forces generic, slow implementations. |
| Compiler auto-vectorization (LLVM/GCC) | Loop analysis and IR optimization. | Fragile: intrinsics are often treated as opaque commands, and compilers rarely reverse-engineer a sequence of explicit SSE intrinsics back into high-level intent to re-emit optimized NEON. |
| Manual porting (e.g., Rust memchr) | Hand-written NEON intrinsics or assembly. | High maintenance: requires keeping two distinct code paths in sync. Our proposal aims to automate the logic found in manual ports. |
4. Technical Case Studies: The Instruction Gap
Case A: Compare & Movemask (The Nibble Mask Idiom)
Common in: strlen, memchr, JSON parsers, SimdJson.
Original SSE Source:
```c
__m128i cmp = _mm_cmpeq_epi8(vec, target);
int mask = _mm_movemask_epi8(cmp);
```
Comparison:
- x86 Native: Uses PMOVMSKB. Low latency, high throughput. Result is a 16-bit integer.
- Current sse2neon: Uses vsra, vsra, vand, vpadd, vget. High latency due to dependency chains. Result is a 16-bit integer.
- Proposed sse2neon-opt: Uses vceqq + vshrn. Low latency. Result is a 64-bit Nibble Mask.

The Transformation:
The tool rewrites the C code to:
```c
uint64_t mask = s2n_cmpmask_eq_epi8(vec, target);
```
The s2n helper uses vshrn_n_u16 (a narrowing right shift), which is the technique recommended by the Arm blog. A sketch of how callers typically consume the resulting nibble mask is shown below.
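For completeness, here is a hedged sketch of how a 64-bit nibble mask is typically consumed when the caller wants the index of the first matching byte (the function name is illustrative; __builtin_ctzll is a GCC/Clang builtin). Because each input byte occupies 4 bits of the mask, the count of trailing zero bits is divided by 4:

```c
#include <stdint.h>

/* Sketch: locate the first matching byte from a 64-bit nibble mask.
 * Each byte maps to one 4-bit nibble, so ctz(mask) / 4 gives the byte index.
 * (With a dense 16-bit PMOVMSKB-style mask, the divide by 4 is omitted.) */
static inline int first_match_index(uint64_t nibble_mask)
{
    if (nibble_mask == 0)
        return -1;                              /* no byte matched     */
    return __builtin_ctzll(nibble_mask) >> 2;   /* trailing zeros / 4  */
}
```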
Case B: Vector Rotation
Common in: Cryptography, Hashing (MurmurHash, CityHash).
Original SSE Source:
```c
__m128i rot = _mm_or_si128(
    _mm_slli_si128(v, 4),
    _mm_srli_si128(v, 12)  // 16 - 4 = 12
);
```
The Transformation:
The tool recognizes that immL + immR == 16, identifying the pair of shifts as a byte rotation, and rewrites it to:
```c
__m128i rot = _mm_alignr_epi8(v, v, 12);
```
Benefit: On NEON, _mm_alignr_epi8 maps directly to vextq_u8 (1 instruction), while the original sequence expands to 3 instructions (shift + shift + or).
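Conceptually, the rewritten intrinsic lowers to a single EXT instruction on NEON. A minimal sketch in plain NEON types (the function name is illustrative; sse2neon's __m128i reinterpret helpers are omitted for clarity):

```c
#include <arm_neon.h>

/* Sketch: rotate the 16 bytes of v left by 4 byte positions with one EXT.
 * result[i] = v[(i + 12) & 15], matching (v << 4 bytes) | (v >> 12 bytes). */
static inline uint8x16_t rotate_bytes_left_4(uint8x16_t v)
{
    return vextq_u8(v, v, 12);
}
```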
5. Implementation Strategy: Python + libclang
We utilize libclang because a standard text-based regex search is unsafe for C code. Libclang provides a full Abstract Syntax Tree (AST), allowing us to track variable definitions and ensure transformations are semantically safe.
5.1 The Workflow
- Analyze: The user runs sse2neon-opt.py input.c -o output.c.
- Rewrite: The script identifies recognized AST patterns and replaces the source text with s2n_* helpers (a hypothetical before/after fragment is shown after this list).
- Compile: The build system compiles output.c with -include sse2neon.h.
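To make the rewrite step concrete, here is a hypothetical before/after fragment (chunk, needle, and the wrapper functions are illustrative names, and it assumes the proposed s2n_cmpmask_eq_epi8 helper from section 5.3). The transformation is safest where the mask is only tested against zero, since the dense and nibble formats agree on "zero vs. non-zero":

```c
#include "sse2neon.h"  /* provides __m128i and, per this proposal, s2n_* helpers */

/* Before the pass: dense 16-bit mask, only tested for "any match". */
static int any_match_before(__m128i chunk, __m128i needle)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle)) != 0;
}

/* After the pass (hypothetical rewriter output): 64-bit nibble mask.
 * Zero vs. non-zero is preserved, so the caller's branch behaves identically. */
static int any_match_after(__m128i chunk, __m128i needle)
{
    return s2n_cmpmask_eq_epi8(chunk, needle) != 0;
}
```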
5.2 Proof of Concept (PoC) Code
Below is a functional script detecting the movemask(cmpeq(...)) pattern.
```python
#!/usr/bin/env python3
# sse2neon-opt: SSE Idiom Normalizer
import argparse

from clang import cindex


def find_movemask_idioms(node, rewrites, source_bytes):
    """
    Traverse the AST to find: _mm_movemask_epi8(_mm_cmpeq_epi8(a, b))
    """
    if node.kind == cindex.CursorKind.CALL_EXPR and node.spelling == "_mm_movemask_epi8":
        args = list(node.get_arguments())
        if len(args) == 1:
            arg_cursor = args[0]
            # Recursively check whether the argument is (or wraps) a call to
            # _mm_cmpeq_epi8; this handles slight nesting or parentheses.
            child_call = None
            for child in arg_cursor.walk_preorder():
                if child.kind == cindex.CursorKind.CALL_EXPR and child.spelling == "_mm_cmpeq_epi8":
                    child_call = child
                    break
            if child_call:
                # Found the pattern: movemask(cmpeq(a, b))
                cmp_args = list(child_call.get_arguments())
                if len(cmp_args) == 2:
                    # Extract the raw source text of the arguments to
                    # preserve variable names/expressions verbatim.
                    def get_text(cursor):
                        return source_bytes[cursor.extent.start.offset:cursor.extent.end.offset].decode("utf-8")

                    a_text = get_text(cmp_args[0])
                    b_text = get_text(cmp_args[1])
                    replacement = f"s2n_cmpmask_eq_epi8({a_text}, {b_text})"
                    # Record the rewrite as (start offset, end offset, replacement).
                    rewrites.append((node.extent.start.offset, node.extent.end.offset, replacement))
                    return  # Stop descending to avoid double matches.
    for child in node.get_children():
        find_movemask_idioms(child, rewrites, source_bytes)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("input", help="Source file")
    parser.add_argument("-o", "--output", required=True, help="Output file")
    parser.add_argument("--clang-inc", action="append", help="Include paths for clang")
    args = parser.parse_args()

    index = cindex.Index.create()
    clang_args = []
    if args.clang_inc:
        for inc in args.clang_inc:
            clang_args.append(f"-I{inc}")
    tu = index.parse(args.input, args=clang_args)

    with open(args.input, "rb") as f:
        source_bytes = f.read()

    rewrites = []
    find_movemask_idioms(tu.cursor, rewrites, source_bytes)

    # Apply rewrites back-to-front so earlier offsets stay valid.
    rewrites.sort(key=lambda x: x[0], reverse=True)
    out_bytes = bytearray(source_bytes)
    for start, end, text in rewrites:
        out_bytes[start:end] = text.encode("utf-8")

    with open(args.output, "wb") as f:
        f.write(out_bytes)
    print(f"Applied {len(rewrites)} optimizations.")


if __name__ == "__main__":
    main()
```
5.3 The Helper Implementation (in sse2neon.h)
We add the optimized implementation for the rewritten code:
```c
/* Added to sse2neon.h */
// Optimized for NEON: uses vceqq + vshrn to produce a 64-bit "nibble mask".
// Each matching byte in the comparison becomes a 0xF nibble in the output.
static inline uint64_t s2n_cmpmask_eq_epi8(__m128i a, __m128i b)
{
#if defined(__ARM_NEON)
    // 1. Compare (results in 0x00 or 0xFF per byte).
    //    Inside sse2neon, __m128i is a NEON vector type, so we go through the
    //    library's reinterpret helpers rather than raw vreinterpretq casts.
    uint8x16_t cmp = vceqq_u8(vreinterpretq_u8_m128i(a), vreinterpretq_u8_m128i(b));
    // 2. Narrow 128 bits to 64 bits.
    //    vshrn_n_u16 shifts each 16-bit element right by 4 and narrows to 8 bits,
    //    effectively turning each 0xFF byte into a 0xF nibble.
    uint8x8_t narrow = vshrn_n_u16(vreinterpretq_u16_u8(cmp), 4);
    // 3. Move to a scalar register.
    return vget_lane_u64(vreinterpret_u64_u8(narrow), 0);
#else
    // x86 fallback for compatibility. Note the format differs here: this is the
    // dense 16-bit PMOVMSKB layout, not the nibble mask produced above.
    return (uint64_t)_mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
#endif
}
```
6. References
- Kutenin, Danila. "Bit twiddling with Arm Neon: beating SSE movemasks, counting bits and more." Arm Community Blog, 2022.
- Gallant, Andrew (BurntSushi). memchr crate source code (Rust). GitHub. (Reference for the Nibble Mask technique).
- Google Abseil. SwissTable SIMD implementations. (Reference for probing optimization).