
wasm acceleration for gf2_128_mul used in qs check #365

@themighty1

Description

In wasm builds, the QuickSilver check accounts for ~40% of the protocol runtime, so it may make sense to accelerate gf2_128 multiplication with WASM SIMD.

Claude suggested the following approaches. It probably overestimates the perf gains, but at least a 2x improvement seems plausible:

WASM SIMD Acceleration for soft32 Carry-Less Multiplication

Executive Summary

Yes, the soft32 implementation can be accelerated using 128-bit WASM SIMD, but the achievable speedup depends heavily on the workload pattern:

| Scenario | Expected Speedup | Recommended Approach |
| --- | --- | --- |
| Single clmul operation | 1.5-2x | Parallel multiplications |
| Small batches (2-8 ops) | 2-3x | Interleaved SIMD |
| Large batches (128+ ops) | 10-20x | Bitsliced implementation |
| GHASH/GCM streams | 15-25x | Bitsliced + pipelining |

Background: The soft32 Algorithm

The soft32.rs implementation almost certainly uses the BearSSL-style "holes" algorithm for constant-time carry-less multiplication:

Algorithm: bmul32(x, y) → 64-bit carry-less product

1. Create bitmasks with 3-bit holes:
   m1 = 0x11111111  (0001_0001_0001_...)
   m2 = 0x22222222  (0010_0010_0010_...)
   m4 = 0x44444444  (0100_0100_0100_...)
   m8 = 0x88888888  (1000_1000_1000_...)

2. Extract every 4th bit: x0 = x & m1, x1 = x & m2, etc.

3. Perform 16 integer multiplications with XOR combining:
   z0 = (x0*y0) ^ (x1*y3) ^ (x2*y2) ^ (x3*y1)
   z1 = (x0*y1) ^ (x1*y0) ^ (x2*y3) ^ (x3*y2)
   z2 = (x0*y2) ^ (x1*y1) ^ (x2*y0) ^ (x3*y3)
   z3 = (x0*y3) ^ (x1*y2) ^ (x2*y1) ^ (x3*y0)

4. Mask and merge: result = (z0 & m1_64) | (z1 & m2_64) | ...

Why it works: The 3-bit holes prevent carries from "spilling" between data bits. With 8 data bits per operand (32/4), the worst-case column accumulation is 8 = 0b1000, which fits in the 4 bits available at each position (the data bit plus its 3-bit hole), so carries never reach the next data bit; the final masking step discards the residue left in the holes.
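
For reference, a minimal scalar sketch of this algorithm in Rust (assuming soft32.rs follows BearSSL's bmul; the actual code may differ in details):

fn bmul32(x: u32, y: u32) -> u64 {
    // Input masks with 3-bit holes between data bits.
    const M1: u32 = 0x1111_1111;
    const M2: u32 = 0x2222_2222;
    const M4: u32 = 0x4444_4444;
    const M8: u32 = 0x8888_8888;

    let (x0, x1, x2, x3) = ((x & M1) as u64, (x & M2) as u64, (x & M4) as u64, (x & M8) as u64);
    let (y0, y1, y2, y3) = ((y & M1) as u64, (y & M2) as u64, (y & M4) as u64, (y & M8) as u64);

    // 16 integer multiplications; the holes absorb all carries.
    let z0 = (x0 * y0) ^ (x1 * y3) ^ (x2 * y2) ^ (x3 * y1);
    let z1 = (x0 * y1) ^ (x1 * y0) ^ (x2 * y3) ^ (x3 * y2);
    let z2 = (x0 * y2) ^ (x1 * y1) ^ (x2 * y0) ^ (x3 * y3);
    let z3 = (x0 * y3) ^ (x1 * y2) ^ (x2 * y1) ^ (x3 * y0);

    // Mask out the carry residue left in the holes and merge.
    (z0 & 0x1111_1111_1111_1111)
        | (z1 & 0x2222_2222_2222_2222)
        | (z2 & 0x4444_4444_4444_4444)
        | (z3 & 0x8888_8888_8888_8888)
}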


WASM SIMD Capabilities

WebAssembly SIMD provides 128-bit vectors with these relevant operations:

| Operation | What it does | Useful for |
| --- | --- | --- |
| u64x2_extmul_low_u32x4 | 2x widening 32→64 multiply (lanes 0-1) | Parallel muls |
| u64x2_extmul_high_u32x4 | 2x widening 32→64 multiply (lanes 2-3) | Parallel muls |
| i32x4_mul | 4x truncating 32-bit multiply | Limited use |
| v128_xor | 128-bit XOR | Core operation |
| v128_and | 128-bit AND | Masking |
| i64x2_shl / u64x2_shr | Shift lanes | Bitslicing |

(Names as exposed by Rust's core::arch::wasm32.)

Critical limitation: WASM SIMD has no CLMUL instruction. We must implement carry-less multiplication using regular arithmetic.


Approach 1: Parallelize the Multiplications

The soft32 algorithm performs 16 scalar 32×32→64 multiplications. With WASM SIMD's i64x2_extmul, we can do 2 at a time.

Implementation Sketch

use core::arch::wasm32::*;

// Pack operands for parallel multiplication.
let x_vec = u32x4(x0, x1, x2, x3);
let y_for_z0 = u32x4(y0, y3, y2, y1);

// Two widening multiplies per instruction: "low" covers lanes [0,1],
// "high" covers lanes [2,3], each yielding two u64 results.
let z0_lo = u64x2_extmul_low_u32x4(x_vec, y_for_z0);   // x0*y0, x1*y3
let z0_hi = u64x2_extmul_high_u32x4(x_vec, y_for_z0);  // x2*y2, x3*y1

// XOR the four partial products down to a single u64.
let z0_xor = v128_xor(z0_lo, z0_hi);
let z0 = u64x2_extract_lane::<0>(z0_xor) ^ u64x2_extract_lane::<1>(z0_xor);

Performance Analysis

  • Multiplications: 16 scalar → 8 SIMD ops (2x parallel)
  • Overhead: Lane extraction and scalar XOR for final reduction
  • Expected speedup: 1.5-2x for single operations

Pros/Cons

✅ Drop-in replacement for scalar version
✅ No data transposition needed
❌ Horizontal XOR reduction is expensive
❌ Limited parallelism (only 2x per multiply group)


Approach 2: Batch Processing

Process multiple independent clmul operations together to better utilize SIMD width.

Example: Processing 4 clmul operations

// Interleave 4 independent clmul operations
// Pack: [x0_a, x0_b, x0_c, x0_d] as u32x4
// This gives 4-way parallelism on the masking/extraction phase

Expected speedup: 2-3x per operation in batches of 4+
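
A minimal sketch of the parallel masking step, assuming four independent 32-bit operands packed into one v128 (the extract_holes helper name is hypothetical):

use core::arch::wasm32::*;

// Compute the four "holed" views of four packed operands at once.
// xs holds one operand from each of four independent clmul operations.
fn extract_holes(xs: v128) -> [v128; 4] {
    let m1 = u32x4_splat(0x1111_1111);
    let m2 = u32x4_splat(0x2222_2222);
    let m4 = u32x4_splat(0x4444_4444);
    let m8 = u32x4_splat(0x8888_8888);
    [
        v128_and(xs, m1),
        v128_and(xs, m2),
        v128_and(xs, m4),
        v128_and(xs, m8),
    ]
}

The 16 multiplications per operation then proceed with extmul pairs as in Approach 1.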


Approach 3: Bitsliced Implementation (Recommended for Streams)

For maximum SIMD efficiency, bitslice the data. This transposes the representation so that bit N of 128 different inputs is stored in one v128.

How Bitslicing Works

Normal representation:
  value_0 = [bit0, bit1, bit2, ..., bit31]
  value_1 = [bit0, bit1, bit2, ..., bit31]
  ...
  value_127 = [bit0, bit1, bit2, ..., bit31]

Bitsliced representation:
  slice_0 = [value_0.bit0, value_1.bit0, ..., value_127.bit0]  // 128 bits
  slice_1 = [value_0.bit1, value_1.bit1, ..., value_127.bit1]
  ...
  slice_31 = [value_0.bit31, ..., value_127.bit31]
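
A straightforward (unoptimized) transpose sketch in Rust; production code would likely use a butterfly-style transpose instead of this bit-by-bit loop:

use core::arch::wasm32::*;

// Collect bit i of each of the 128 input values into slice i.
fn bitslice(values: &[u32; 128]) -> [v128; 32] {
    let mut lanes = [[0u64; 2]; 32];
    for (n, &v) in values.iter().enumerate() {
        for i in 0..32 {
            let bit = ((v >> i) & 1) as u64;
            lanes[i][n / 64] |= bit << (n % 64);
        }
    }
    core::array::from_fn(|i| u64x2(lanes[i][0], lanes[i][1]))
}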

Bitsliced Carry-Less Multiplication

Once data is bitsliced, clmul becomes a series of XOR and AND operations:

// a_bits, b_bits: [v128; 32] bitsliced operands (slice i holds bit i of
// 128 independent values). The 64-bit products need 63 result slices.
let mut result = [u64x2_splat(0); 63];
for i in 0..32 {
    for j in 0..32 {
        // If bit i of a and bit j of b are both set, XOR into result bit (i+j).
        let product_bit = v128_and(a_bits[i], b_bits[j]);
        result[i + j] = v128_xor(result[i + j], product_bit);
    }
}

Performance Analysis

  • Operations: 32 × 32 = 1024 AND + 1024 XOR for 128 parallel clmuls
  • Per-operation cost: ~8 AND + 8 XOR (amortized)
  • Expected speedup: 10-20x over scalar for large batches

Pros/Cons

✅ Maximum SIMD utilization (128-way parallelism)
✅ All operations are simple XOR/AND (very fast)
✅ Naturally constant-time
❌ Requires data transposition (expensive for small batches)
❌ Memory bandwidth for bitsliced representation


Approach 4: Karatsuba Decomposition

For 64-bit or 128-bit clmul (as used in GHASH), use Karatsuba to reduce multiplications:

64-bit clmul using 32-bit building blocks:

a = a_hi * 2^32 + a_lo
b = b_hi * 2^32 + b_lo

Normal: 4 multiplications
  a_lo * b_lo, a_lo * b_hi, a_hi * b_lo, a_hi * b_hi

Karatsuba: 3 multiplications
  p_lo  = clmul32(a_lo, b_lo)
  p_hi  = clmul32(a_hi, b_hi)
  p_mid = clmul32(a_lo ^ a_hi, b_lo ^ b_hi)

  cross = p_mid ^ p_lo ^ p_hi                  // = clmul(a_lo, b_hi) ^ clmul(a_hi, b_lo)
  result = p_lo ^ (cross << 32) ^ (p_hi << 64) // 128-bit result; the shifted terms overlap, so XOR, not OR

Benefit: 25% fewer multiplications, which compounds with SIMD parallelization.
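
A sketch in Rust, assuming a 32-bit carry-less multiply building block such as the bmul32 above (or its SIMD equivalent):

fn clmul64_karatsuba(a: u64, b: u64) -> u128 {
    let (a_lo, a_hi) = (a as u32, (a >> 32) as u32);
    let (b_lo, b_hi) = (b as u32, (b >> 32) as u32);

    let p_lo = bmul32(a_lo, b_lo);
    let p_hi = bmul32(a_hi, b_hi);
    let p_mid = bmul32(a_lo ^ a_hi, b_lo ^ b_hi);

    // In GF(2) addition is XOR, so the cross term falls out of p_mid.
    let cross = p_mid ^ p_lo ^ p_hi;

    // The shifted terms overlap in the middle bits, so combine with XOR.
    (p_lo as u128) ^ ((cross as u128) << 32) ^ ((p_hi as u128) << 64)
}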


Recommendation for mpz/clmul

Given that this is for TLSNotary (a GHASH/GCM workload), I recommend:

1. For Single Operations: Parallel Multiplies (Approach 1)

  • Immediate ~1.5x speedup
  • Minimal code changes

2. For Block Processing (GHASH): Bitsliced (Approach 3)

  • Restructure to process 128 blocks at a time
  • Use Karatsuba for the 128-bit field multiplication
  • Expected 15-25x speedup over scalar soft32

3. Implementation Priority

wasm32-simd/
├── soft32_simd.rs      # Parallel multiply version (quick win)
├── bitsliced.rs        # For batch processing
└── ghash_simd.rs       # GHASH-specific optimizations

Comparison with Other Approaches

| Implementation | Cycles/clmul32 (est.) | Notes |
| --- | --- | --- |
| Scalar soft32 | ~50-80 | Baseline |
| WASM SIMD parallel | ~30-50 | Approach 1 |
| WASM SIMD bitsliced | ~4-8 | Per-op amortized |
| x86 PCLMULQDQ | ~3-7 | Hardware (not available in WASM) |

Code Example

See the accompanying wasm_simd_clmul.rs for complete implementations of:

  • Parallel multiply version (clmul_soft32_simd_v1)
  • Batched processing (clmul_batch_2)
  • Bitsliced multiplication (clmul_bitsliced_32x128)
  • Karatsuba 64-bit clmul (clmul64_karatsuba)

Conclusion

WASM SIMD can meaningfully accelerate the soft32 clmul implementation:

  1. Quick win (1.5-2x): Use i64x2_extmul to parallelize the 16 multiplications
  2. Maximum performance (10-20x): Bitslice the data for stream processing
  3. Combine with Karatsuba for 64/128-bit operations

The best approach depends on your access patterns. For GHASH workloads processing many blocks, the bitsliced approach will dominate. For occasional single operations, the parallel multiply approach offers good improvement with minimal complexity.
