1. Executive Summary
The Problem: sse2neon currently relies on a direct 1:1 mapping of intrinsics. While this ensures functional correctness, it misses high-level SIMD idioms (patterns of multiple instructions). On x86, instructions like PMOVMSKB are highly specific; emulating them in NEON via a naive 1:1 mapping results in long dependency chains and performance degradation (often 1.5x to 2x slower for specific operations).
The Solution: We propose sse2neon-opt, an optional build-time tool based on Python and libclang. This tool performs source-to-source rewriting to recognize SSE patterns (e.g., compare + movemask) and replaces them with semantically equivalent, high-performance s2n_* helpers or canonical intrinsics.
The Goal: Evolve sse2neon from a pure compatibility layer into a performance optimization tool, unlocking NEON-specific instruction sequences (like vshrn for bitmasks) without forcing users to manually rewrite their legacy SSE codebases.
2. Motivation & Background
2.1 The Movemask Bottleneck
The primary motivator for this proposal is the performance gap identified by Danila Kutenin in his article "Porting x86 vector bitmask optimizations to Arm NEON" (Arm Developer Blog).
Kutenin points out that:
- No Direct Equivalent: NEON has no single instruction equivalent to x86's PMOVMSKB (which moves the sign bit of every byte into a scalar register).
- The Performance Cost: Emulating PMOVMSKB generically requires 4 to 6 NEON instructions (vsra, vand, vpadd, etc.); a representative sequence is sketched after this list.
- The Solution: The fastest way to handle bitmasks on NEON is to change the data format entirely. Instead of a dense 16-bit mask, NEON performs best with a Nibble Mask, where each byte is represented by 4 bits.
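For illustration, here is a minimal sketch of one common emulation strategy on AArch64 (not necessarily the exact sequence in sse2neon.h; the function name and weight table are illustrative): each byte's sign bit is isolated, weighted by its lane position, and horizontally summed, producing exactly the kind of dependency chain described above.

```c
#include <arm_neon.h>
#include <stdint.h>

/* Illustrative generic PMOVMSKB emulation (AArch64 NEON).
 * Isolate each byte's MSB, weight it by its bit position within its
 * 64-bit half, then horizontally add each half and combine. */
static inline int movemask_epi8_generic(uint8x16_t v)
{
    static const uint8_t weights[16] = {1, 2, 4, 8, 16, 32, 64, 128,
                                        1, 2, 4, 8, 16, 32, 64, 128};
    uint8x16_t msb = vshrq_n_u8(v, 7);                  /* 0 or 1 per byte    */
    uint8x16_t w   = vmulq_u8(msb, vld1q_u8(weights));  /* bit value per lane */
    int lo = vaddv_u8(vget_low_u8(w));                  /* sum low 8 lanes    */
    int hi = vaddv_u8(vget_high_u8(w));                 /* sum high 8 lanes   */
    return lo | (hi << 8);                              /* dense 16-bit mask  */
}
```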
2.2 Why the Current Header-Only Approach Fails
The current sse2neon.h cannot implement Kutenin's optimization because it lacks context.
- The Header Sees: _mm_movemask_epi8(var). It does not know where var came from, so it must assume var contains arbitrary data.
- The Reality: In 90% of string and hash algorithms, var is the result of _mm_cmpeq_epi8 and therefore contains only 0x00 or 0xFF bytes.
- The Missed Opportunity: If we knew the input was a comparison result, we could use the NEON instruction vshrn_n_u16 to create the mask in just 2 instructions (see the sketch after this list).
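As a minimal sketch of that idea (the full helper with an x86 fallback appears in section 5.3; the function name here is illustrative), the comparison and mask extraction collapse to two vector instructions plus a lane move:

```c
#include <arm_neon.h>
#include <stdint.h>

/* Sketch of the nibble-mask idiom: CMEQ + SHRN, then a move to a
 * general-purpose register. Each matching byte becomes a 0xF nibble
 * in the 64-bit result. */
static inline uint64_t cmpeq_nibble_mask(uint8x16_t a, uint8x16_t b)
{
    uint8x16_t eq     = vceqq_u8(a, b);                            /* 0x00 / 0xFF per byte  */
    uint8x8_t  narrow = vshrn_n_u16(vreinterpretq_u16_u8(eq), 4);  /* 0x0 / 0xF per nibble  */
    return vget_lane_u64(vreinterpret_u64_u8(narrow), 0);
}
```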
3. Landscape & Competitive Analysis
Why build a new tool instead of relying on existing solutions?
| Approach | Methodology | Limitation |
| --- | --- | --- |
| Current sse2neon / SIMDe | Header-only static inline functions. | Myopic: cannot see across statement boundaries (e.g., that cmp feeds into movemask). This forces generic, slow implementations. |
| Compiler auto-vectorization (LLVM/GCC) | Loop analysis and IR optimization. | Fragile: intrinsics are often treated as opaque commands, and compilers rarely reverse-engineer a sequence of explicit SSE intrinsics back into high-level intent to re-emit optimized NEON. |
| Manual porting (e.g., Rust memchr) | Hand-written NEON intrinsics or assembly. | High maintenance: requires keeping two distinct code paths in sync. Our proposal aims to automate the logic found in manual ports. |
4. Technical Case Studies: The Instruction Gap
Case A: Compare & Movemask (The Nibble Mask Idiom)
Common in: strlen, memchr, JSON parsers, SimdJson.
Original SSE Source:
```c
__m128i cmp = _mm_cmpeq_epi8(vec, target);
int mask = _mm_movemask_epi8(cmp);
```
Comparison:
- x86 Native: Uses PMOVMSKB. Low latency, high throughput. Result is a 16-bit integer.
- Current sse2neon: Uses vsra, vsra, vand, vpadd, vget. High latency due to dependency chains. Result is a 16-bit integer.
- Proposed sse2neon-opt: Uses vceqq + vshrn. Low latency. Result is a 64-bit Nibble Mask.

The Transformation:
The tool rewrites the C code to:
```c
uint64_t mask = s2n_cmpmask_eq_epi8(vec, target);
```
The s2n helper uses vshrn_n_u16 (a narrowing right shift), which is the technique recommended by the Arm blog. A sketch of how callers typically consume the resulting nibble mask is shown below.
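For completeness, here is a hedged sketch of how a 64-bit nibble mask is typically consumed when the caller wants the index of the first matching byte (the function name is illustrative; __builtin_ctzll is a GCC/Clang builtin). Because each input byte occupies 4 bits of the mask, the count of trailing zero bits is divided by 4:

```c
#include <stdint.h>

/* Sketch: locate the first matching byte from a 64-bit nibble mask.
 * Each byte maps to one 4-bit nibble, so ctz(mask) / 4 gives the byte index.
 * (With a dense 16-bit PMOVMSKB-style mask, the divide by 4 is omitted.) */
static inline int first_match_index(uint64_t nibble_mask)
{
    if (nibble_mask == 0)
        return -1;                              /* no byte matched     */
    return __builtin_ctzll(nibble_mask) >> 2;   /* trailing zeros / 4  */
}
```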
Case B: Vector Rotation
Common in: Cryptography, Hashing (MurmurHash, CityHash).
Original SSE Source:
```c
__m128i rot = _mm_or_si128(
    _mm_slli_si128(v, 4),
    _mm_srli_si128(v, 12)  // 16 - 4 = 12
);
```
The Transformation:
The tool recognizes that immL + immR == 16, identifying the pair of shifts as a byte rotation, and rewrites it to:
```c
__m128i rot = _mm_alignr_epi8(v, v, 12);
```
Benefit: On NEON, _mm_alignr_epi8 maps directly to vextq_u8 (1 instruction), while the original sequence expands to 3 instructions (shift + shift + or).
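Conceptually, the rewritten intrinsic lowers to a single EXT instruction on NEON. A minimal sketch in plain NEON types (the function name is illustrative; sse2neon's __m128i reinterpret helpers are omitted for clarity):

```c
#include <arm_neon.h>

/* Sketch: rotate the 16 bytes of v left by 4 byte positions with one EXT.
 * result[i] = v[(i + 12) & 15], matching (v << 4 bytes) | (v >> 12 bytes). */
static inline uint8x16_t rotate_bytes_left_4(uint8x16_t v)
{
    return vextq_u8(v, v, 12);
}
```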
5. Implementation Strategy: Python + libclang
We utilize libclang because a standard text-based regex search is unsafe for C code. Libclang provides a full Abstract Syntax Tree (AST), allowing us to track variable definitions and ensure transformations are semantically safe.
5.1 The Workflow
- Analyze: The user runs sse2neon-opt.py input.c -o output.c.
- Rewrite: The script identifies recognized AST patterns and replaces the source text with s2n_* helpers (a hypothetical before/after fragment is shown after this list).
- Compile: The build system compiles output.c with -include sse2neon.h.
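To make the rewrite step concrete, here is a hypothetical before/after fragment (chunk, needle, and the wrapper functions are illustrative names, and it assumes the proposed s2n_cmpmask_eq_epi8 helper from section 5.3). The transformation is safest where the mask is only tested against zero, since the dense and nibble formats agree on "zero vs. non-zero":

```c
#include "sse2neon.h"  /* provides __m128i and, per this proposal, s2n_* helpers */

/* Before the pass: dense 16-bit mask, only tested for "any match". */
static int any_match_before(__m128i chunk, __m128i needle)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle)) != 0;
}

/* After the pass (hypothetical rewriter output): 64-bit nibble mask.
 * Zero vs. non-zero is preserved, so the caller's branch behaves identically. */
static int any_match_after(__m128i chunk, __m128i needle)
{
    return s2n_cmpmask_eq_epi8(chunk, needle) != 0;
}
```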
5.2 Proof of Concept (PoC) Code
Below is a functional script detecting the movemask(cmpeq(...)) pattern.
```python
#!/usr/bin/env python3
# sse2neon-opt: SSE Idiom Normalizer
import argparse

from clang import cindex


def find_movemask_idioms(node, rewrites, source_bytes):
    """
    Traverse the AST to find: _mm_movemask_epi8(_mm_cmpeq_epi8(a, b))
    """
    if node.kind == cindex.CursorKind.CALL_EXPR and node.spelling == "_mm_movemask_epi8":
        args = list(node.get_arguments())
        if len(args) == 1:
            arg_cursor = args[0]
            # Recursively check whether the argument is (or wraps) a call to
            # _mm_cmpeq_epi8; this handles slight nesting or parentheses.
            child_call = None
            for child in arg_cursor.walk_preorder():
                if child.kind == cindex.CursorKind.CALL_EXPR and child.spelling == "_mm_cmpeq_epi8":
                    child_call = child
                    break
            if child_call:
                # Found the pattern: movemask(cmpeq(a, b))
                cmp_args = list(child_call.get_arguments())
                if len(cmp_args) == 2:
                    # Extract the raw source text of the arguments to
                    # preserve variable names/expressions verbatim.
                    def get_text(cursor):
                        return source_bytes[cursor.extent.start.offset:cursor.extent.end.offset].decode("utf-8")

                    a_text = get_text(cmp_args[0])
                    b_text = get_text(cmp_args[1])
                    replacement = f"s2n_cmpmask_eq_epi8({a_text}, {b_text})"
                    # Record the rewrite as (start offset, end offset, replacement).
                    rewrites.append((node.extent.start.offset, node.extent.end.offset, replacement))
                    return  # Stop descending to avoid double matches.
    for child in node.get_children():
        find_movemask_idioms(child, rewrites, source_bytes)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("input", help="Source file")
    parser.add_argument("-o", "--output", required=True, help="Output file")
    parser.add_argument("--clang-inc", action="append", help="Include paths for clang")
    args = parser.parse_args()

    index = cindex.Index.create()
    clang_args = []
    if args.clang_inc:
        for inc in args.clang_inc:
            clang_args.append(f"-I{inc}")
    tu = index.parse(args.input, args=clang_args)

    with open(args.input, "rb") as f:
        source_bytes = f.read()

    rewrites = []
    find_movemask_idioms(tu.cursor, rewrites, source_bytes)

    # Apply rewrites back-to-front so earlier offsets stay valid.
    rewrites.sort(key=lambda x: x[0], reverse=True)
    out_bytes = bytearray(source_bytes)
    for start, end, text in rewrites:
        out_bytes[start:end] = text.encode("utf-8")

    with open(args.output, "wb") as f:
        f.write(out_bytes)
    print(f"Applied {len(rewrites)} optimizations.")


if __name__ == "__main__":
    main()
```
5.3 The Helper Implementation (in sse2neon.h)
We add the optimized implementation for the rewritten code:
```c
/* Added to sse2neon.h */
// Optimized for NEON: uses vceqq + vshrn to produce a 64-bit "nibble mask".
// Each matching byte in the comparison becomes a 0xF nibble in the output.
static inline uint64_t s2n_cmpmask_eq_epi8(__m128i a, __m128i b)
{
#if defined(__ARM_NEON)
    // 1. Compare (results in 0x00 or 0xFF per byte).
    //    Inside sse2neon, __m128i is a NEON vector type, so we go through the
    //    library's reinterpret helpers rather than raw vreinterpretq casts.
    uint8x16_t cmp = vceqq_u8(vreinterpretq_u8_m128i(a), vreinterpretq_u8_m128i(b));
    // 2. Narrow 128 bits to 64 bits.
    //    vshrn_n_u16 shifts each 16-bit element right by 4 and narrows to 8 bits,
    //    effectively turning each 0xFF byte into a 0xF nibble.
    uint8x8_t narrow = vshrn_n_u16(vreinterpretq_u16_u8(cmp), 4);
    // 3. Move to a scalar register.
    return vget_lane_u64(vreinterpret_u64_u8(narrow), 0);
#else
    // x86 fallback for compatibility. Note the format differs here: this is the
    // dense 16-bit PMOVMSKB layout, not the nibble mask produced above.
    return (uint64_t)_mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
#endif
}
```
6. References
- Kutenin, Danila. "Bit twiddling with Arm Neon: beating SSE movemasks, counting bits and more." Arm Community Blog, 2022.
- Gallant, Andrew (BurntSushi). memchr crate source code (Rust). GitHub. (Reference for the Nibble Mask technique).
- Google Abseil. SwissTable SIMD implementations. (Reference for probing optimization).