Description
In wasm builds, the QuickSilver check accounts for ~40% of the protocol runtime, so it may make sense to accelerate gf2_128 multiplication with WASM SIMD.
Claude suggested the following approaches. It probably overestimates the performance gains, but at least a 2x improvement seems plausible:
WASM SIMD Acceleration for soft32 Carry-Less Multiplication
Executive Summary
Yes, the soft32 implementation can be accelerated using 128-bit WASM SIMD, but the achievable speedup depends heavily on the workload pattern:
| Scenario | Expected Speedup | Recommended Approach |
|---|---|---|
| Single clmul operation | 1.5-2x | Parallel multiplications |
| Small batches (2-8 ops) | 2-3x | Interleaved SIMD |
| Large batches (128+ ops) | 10-20x | Bitsliced implementation |
| GHASH/GCM streams | 15-25x | Bitsliced + pipelining |
Background: The soft32 Algorithm
The soft32.rs implementation almost certainly uses the BearSSL-style "holes" algorithm for constant-time carry-less multiplication:
Algorithm: bmul32(x, y) → 64-bit carry-less product
1. Create bitmasks with 3-bit holes:
m1 = 0x11111111 (0001_0001_0001_...)
m2 = 0x22222222 (0010_0010_0010_...)
m4 = 0x44444444 (0100_0100_0100_...)
m8 = 0x88888888 (1000_1000_1000_...)
2. Extract every 4th bit: x0 = x & m1, x1 = x & m2, etc.
3. Perform 16 integer multiplications with XOR combining:
z0 = (x0*y0) ^ (x1*y3) ^ (x2*y2) ^ (x3*y1)
z1 = (x0*y1) ^ (x1*y0) ^ (x2*y3) ^ (x3*y2)
z2 = (x0*y2) ^ (x1*y1) ^ (x2*y0) ^ (x3*y3)
z3 = (x0*y3) ^ (x1*y2) ^ (x2*y1) ^ (x3*y0)
4. Mask and merge: result = (z0 & m1_64) | (z1 & m2_64) | ...
Why it works: The 3-bit holes prevent carries from "spilling" between data bits. With 8 data bits per operand (32/4), the worst-case column sum is 8 = 0b1000, which spans only the data bit plus its 3-bit hole, so carries never reach the next data bit four positions away.
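For reference, a minimal scalar sketch of the holes algorithm (modeled on BearSSL's bmul; the function name and exact structure are assumptions, not necessarily what soft32.rs does):

```rust
/// Constant-time 32x32 -> 64-bit carry-less multiply via the "holes" trick.
/// Sketch only; the masks and z-word formulas follow the steps above.
fn bmul32(x: u32, y: u32) -> u64 {
    let (m1, m2, m4, m8) = (0x1111_1111u32, 0x2222_2222, 0x4444_4444, 0x8888_8888);
    let (x0, x1, x2, x3) = ((x & m1) as u64, (x & m2) as u64, (x & m4) as u64, (x & m8) as u64);
    let (y0, y1, y2, y3) = ((y & m1) as u64, (y & m2) as u64, (y & m4) as u64, (y & m8) as u64);
    // 16 integer multiplies, combined with XOR (step 3).
    let z0 = (x0 * y0) ^ (x1 * y3) ^ (x2 * y2) ^ (x3 * y1);
    let z1 = (x0 * y1) ^ (x1 * y0) ^ (x2 * y3) ^ (x3 * y2);
    let z2 = (x0 * y2) ^ (x1 * y1) ^ (x2 * y0) ^ (x3 * y3);
    let z3 = (x0 * y3) ^ (x1 * y2) ^ (x2 * y1) ^ (x3 * y0);
    // Mask out the carry garbage in the holes and merge (step 4).
    (z0 & 0x1111_1111_1111_1111)
        | (z1 & 0x2222_2222_2222_2222)
        | (z2 & 0x4444_4444_4444_4444)
        | (z3 & 0x8888_8888_8888_8888)
}
```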
WASM SIMD Capabilities
WebAssembly SIMD provides 128-bit vectors with these relevant operations:
| Operation | What it does | Useful for |
|---|---|---|
| `i64x2_extmul_low_u32x4` | 2× widening 32→64 multiply (lanes 0-1) | Parallel muls |
| `i64x2_extmul_high_u32x4` | 2× widening 32→64 multiply (lanes 2-3) | Parallel muls |
| `i32x4_mul` | 4× truncating 32-bit multiply | Limited use |
| `v128_xor` | 128-bit XOR | Core operation |
| `v128_and` | 128-bit AND | Masking |
| `i64x2_shl` / `i64x2_shr` | Shift lanes | Bitslicing |
Critical limitation: WASM SIMD has no CLMUL instruction. We must implement carry-less multiplication using regular arithmetic.
Approach 1: Parallelize the Multiplications
The soft32 algorithm performs 16 scalar 32×32→64 multiplications. With WASM SIMD's i64x2_extmul, we can do 2 at a time.
Implementation Sketch
use core::arch::wasm32::*;

// Pack operands for parallel multiplication (x0..x3, y0..y3 are the
// masked words from step 2 of the holes algorithm).
let x_vec = u32x4(x0, x1, x2, x3);
let y_for_z0 = u32x4(y0, y3, y2, y1);
// Two widening multiplies at once: lanes [0,1] and lanes [2,3] → two u64 results each
let z0_lo = i64x2_extmul_low_u32x4(x_vec, y_for_z0); // x0*y0, x1*y3
let z0_hi = i64x2_extmul_high_u32x4(x_vec, y_for_z0); // x2*y2, x3*y1
// XOR the four partial products together
let z0_xor = v128_xor(z0_lo, z0_hi);
let z0 = u64x2_extract_lane::<0>(z0_xor) ^ u64x2_extract_lane::<1>(z0_xor);
Performance Analysis
- Multiplications: 16 scalar → 8 SIMD ops (2x parallel)
- Overhead: Lane extraction and scalar XOR for final reduction
- Expected speedup: 1.5-2x for single operations
Pros/Cons
✅ Drop-in replacement for scalar version
✅ No data transposition needed
❌ Horizontal XOR reduction is expensive
❌ Limited parallelism (only 2x per multiply group)
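A fuller sketch of the same idea, computing all four z-words (the function name and structure are hypothetical; assumes a wasm32 target with the `simd128` feature):

```rust
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
fn clmul32_simd(x: u32, y: u32) -> u64 {
    use core::arch::wasm32::*;

    const M: [u32; 4] = [0x1111_1111, 0x2222_2222, 0x4444_4444, 0x8888_8888];
    const M64: [u64; 4] = [
        0x1111_1111_1111_1111,
        0x2222_2222_2222_2222,
        0x4444_4444_4444_4444,
        0x8888_8888_8888_8888,
    ];

    let x_vec = u32x4(x & M[0], x & M[1], x & M[2], x & M[3]);
    let ys: [u32; 4] = [y & M[0], y & M[1], y & M[2], y & M[3]];
    // Lane orderings of y so that lane i pairs with x_i for each z-word.
    let y_perm = [
        u32x4(ys[0], ys[3], ys[2], ys[1]), // z0 = x0*y0 ^ x1*y3 ^ x2*y2 ^ x3*y1
        u32x4(ys[1], ys[0], ys[3], ys[2]), // z1
        u32x4(ys[2], ys[1], ys[0], ys[3]), // z2
        u32x4(ys[3], ys[2], ys[1], ys[0]), // z3
    ];

    let mut out = 0u64;
    for k in 0..4 {
        let lo = i64x2_extmul_low_u32x4(x_vec, y_perm[k]);  // lanes 0,1
        let hi = i64x2_extmul_high_u32x4(x_vec, y_perm[k]); // lanes 2,3
        let z = v128_xor(lo, hi);
        let zk = u64x2_extract_lane::<0>(z) ^ u64x2_extract_lane::<1>(z);
        out |= zk & M64[k];
    }
    out
}
```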
Approach 2: Batch Processing
Process multiple independent clmul operations together to better utilize SIMD width.
Example: Processing 4 clmul operations
// Interleave 4 independent clmul operations
// Pack: [x0_a, x0_b, x0_c, x0_d] as u32x4
// This gives 4-way parallelism on the masking/extraction phase
Expected speedup: 2-3x per operation in batches of 4+
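A minimal sketch of the packing idea for four independent inputs (names hypothetical):

```rust
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
mod batch {
    use core::arch::wasm32::*;

    /// Run the mask/extract phase of the holes algorithm on four
    /// independent inputs with one v128_and per mask.
    pub fn extract_holes_batch4(xs: [u32; 4]) -> [v128; 4] {
        let x = u32x4(xs[0], xs[1], xs[2], xs[3]);
        [
            v128_and(x, u32x4_splat(0x1111_1111)), // x0 of all four inputs
            v128_and(x, u32x4_splat(0x2222_2222)), // x1
            v128_and(x, u32x4_splat(0x4444_4444)), // x2
            v128_and(x, u32x4_splat(0x8888_8888)), // x3
        ]
    }
}
```

The multiply phase still goes through two-wide extmuls, which is why the gain tops out around 2-3x rather than a full 4x.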
Approach 3: Bitsliced Implementation (Recommended for Streams)
For maximum SIMD efficiency, bitslice the data. This transposes the representation so that bit N of 128 different inputs is stored in one v128.
How Bitslicing Works
Normal representation:
value_0 = [bit0, bit1, bit2, ..., bit31]
value_1 = [bit0, bit1, bit2, ..., bit31]
...
value_127 = [bit0, bit1, bit2, ..., bit31]
Bitsliced representation:
slice_0 = [value_0.bit0, value_1.bit0, ..., value_127.bit0] // 128 bits
slice_1 = [value_0.bit1, value_1.bit1, ..., value_127.bit1]
...
slice_31 = [value_0.bit31, ..., value_127.bit31]
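Getting data into this form requires a transpose. A hypothetical helper, written bit-by-bit for clarity (a production version would use a word-level transpose instead of the O(32·128) loop):

```rust
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
mod bitslice {
    use core::arch::wasm32::*;

    /// Transpose 128 u32 values into 32 slices: bit k of input n
    /// becomes bit n of slice k.
    pub fn bitslice128(values: &[u32; 128]) -> [v128; 32] {
        let mut slices = [[0u64; 2]; 32];
        for (n, &v) in values.iter().enumerate() {
            let (word, bit) = (n / 64, n % 64);
            for k in 0..32 {
                slices[k][word] |= (((v >> k) & 1) as u64) << bit;
            }
        }
        slices.map(|s| u64x2(s[0], s[1]))
    }
}
```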
Bitsliced Carry-Less Multiplication
Once data is bitsliced, clmul becomes a series of XOR and AND operations:
// a_bits and b_bits are [v128; 32] slices; result is [v128; 63], zeroed.
// For each bit position i in a and j in b: if both bits are set (per slot),
// XOR into result bit (i + j).
for i in 0..32 {
    for j in 0..32 {
        let product_bit = v128_and(a_bits[i], b_bits[j]);
        result[i + j] = v128_xor(result[i + j], product_bit);
    }
}
Performance Analysis
- Operations: 32 × 32 = 1024 AND + 1024 XOR for 128 parallel clmuls
- Per-operation cost: ~8 AND + 8 XOR (amortized)
- Expected speedup: 10-20x over scalar for large batches
Pros/Cons
✅ Maximum SIMD utilization (128-way parallelism)
✅ All operations are simple XOR/AND (very fast)
✅ Naturally constant-time
❌ Requires data transposition (expensive for small batches)
❌ Memory bandwidth for bitsliced representation
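As a self-contained sketch, the 128-way kernel, taking slices from a transpose like the one above (names hypothetical):

```rust
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
mod bitsliced_clmul {
    use core::arch::wasm32::*;

    /// 128 independent 32x32 -> 63-bit carry-less products, one per bit slot.
    pub fn clmul_bitsliced(a_bits: &[v128; 32], b_bits: &[v128; 32]) -> [v128; 63] {
        let mut result = [u64x2(0, 0); 63];
        for i in 0..32 {
            for j in 0..32 {
                let product_bit = v128_and(a_bits[i], b_bits[j]);
                result[i + j] = v128_xor(result[i + j], product_bit);
            }
        }
        result
    }
}
```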
Approach 4: Karatsuba Decomposition
For 64-bit or 128-bit clmul (as used in GHASH), use Karatsuba to reduce multiplications:
64-bit clmul using 32-bit building blocks:
a = a_hi * 2^32 + a_lo
b = b_hi * 2^32 + b_lo
Normal: 4 multiplications
a_lo * b_lo, a_lo * b_hi, a_hi * b_lo, a_hi * b_hi
Karatsuba: 3 multiplications
p_lo = clmul32(a_lo, b_lo)
p_hi = clmul32(a_hi, b_hi)
p_mid = clmul32(a_lo ^ a_hi, b_lo ^ b_hi)
cross = p_mid ^ p_lo ^ p_hi // = a_lo*b_hi ^ a_hi*b_lo
result = p_lo ^ (cross << 32) ^ (p_hi << 64) // XOR, not OR: the terms overlap
Benefit: 25% fewer multiplications, which compounds with SIMD parallelization.
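A sketch of this split, building on the hypothetical bmul32 from the background section; the 128-bit product is returned as two u64 halves:

```rust
/// 64x64 -> 128-bit carry-less multiply via Karatsuba over a 32-bit
/// building block (the bmul32 sketch above).
fn clmul64_karatsuba(a: u64, b: u64) -> (u64, u64) {
    let (a_lo, a_hi) = (a as u32, (a >> 32) as u32);
    let (b_lo, b_hi) = (b as u32, (b >> 32) as u32);

    let p_lo = bmul32(a_lo, b_lo); // occupies bits 0..62
    let p_hi = bmul32(a_hi, b_hi); // occupies bits 64..126
    let p_mid = bmul32(a_lo ^ a_hi, b_lo ^ b_hi);
    let cross = p_mid ^ p_lo ^ p_hi; // = a_lo*b_hi ^ a_hi*b_lo

    // Combine with XOR: `cross` straddles the two halves.
    (p_lo ^ (cross << 32), p_hi ^ (cross >> 32))
}
```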
Recommendation for mpz/clmul
Given that this is for TLSNotary (a GHASH/GCM workload), I recommend:
1. For Single Operations: Parallel Multiplies (Approach 1)
- Immediate ~1.5x speedup
- Minimal code changes
2. For Block Processing (GHASH): Bitsliced (Approach 3)
- Restructure to process 128 blocks at a time
- Use Karatsuba for the 128-bit field multiplication
- Expected 15-25x speedup over scalar soft32
3. Implementation Priority
wasm32-simd/
├── soft32_simd.rs # Parallel multiply version (quick win)
├── bitsliced.rs # For batch processing
└── ghash_simd.rs # GHASH-specific optimizations
Comparison with Other Approaches
| Implementation | Cycles/clmul32 (est.) | Notes |
|---|---|---|
| Scalar soft32 | ~50-80 | Baseline |
| WASM SIMD parallel | ~30-50 | Approach 1 |
| WASM SIMD bitsliced | ~4-8 | Per-op amortized |
| x86 PCLMULQDQ | ~3-7 | Hardware (not available in WASM) |
Code Example
See the accompanying wasm_simd_clmul.rs for complete implementations of:
- Parallel multiply version (`clmul_soft32_simd_v1`)
- Batched processing (`clmul_batch_2`)
- Bitsliced multiplication (`clmul_bitsliced_32x128`)
- Karatsuba 64-bit clmul (`clmul64_karatsuba`)
Conclusion
WASM SIMD can meaningfully accelerate the soft32 clmul implementation:
- Quick win (1.5-2x): Use `i64x2_extmul` to parallelize the 16 multiplications
- Maximum performance (10-20x): Bitslice the data for stream processing
- Combine with Karatsuba for 64/128-bit operations
The best approach depends on your access patterns. For GHASH workloads processing many blocks, the bitsliced approach will dominate. For occasional single operations, the parallel multiply approach offers good improvement with minimal complexity.