Description
In wasm builds, the QuickSilver check accounts for ~40% of the protocol runtime, so it may make sense to accelerate gf2_128 multiplication with WASM SIMD.
Claude suggested the following approaches. It probably overestimates the performance gains, but at least a 2x improvement seems plausible:
WASM SIMD Acceleration for soft32 Carry-Less Multiplication
Executive Summary
Yes, the soft32 implementation can be accelerated using 128-bit WASM SIMD, but the achievable speedup depends heavily on the workload pattern:
| Scenario | Expected Speedup | Recommended Approach |
|---|---|---|
| Single clmul operation | 1.5-2x | Parallel multiplications |
| Small batches (2-8 ops) | 2-3x | Interleaved SIMD |
| Large batches (128+ ops) | 10-20x | Bitsliced implementation |
| GHASH/GCM streams | 15-25x | Bitsliced + pipelining |
Background: The soft32 Algorithm
The soft32.rs implementation almost certainly uses the BearSSL-style "holes" algorithm for constant-time carry-less multiplication:
Algorithm: bmul32(x, y) → 64-bit carry-less product
1. Create bitmasks with 3-bit holes:
m1 = 0x11111111 (0001_0001_0001_...)
m2 = 0x22222222 (0010_0010_0010_...)
m4 = 0x44444444 (0100_0100_0100_...)
m8 = 0x88888888 (1000_1000_1000_...)
2. Extract every 4th bit: x0 = x & m1, x1 = x & m2, etc.
3. Perform 16 integer multiplications with XOR combining:
z0 = (x0*y0) ^ (x1*y3) ^ (x2*y2) ^ (x3*y1)
z1 = (x0*y1) ^ (x1*y0) ^ (x2*y3) ^ (x3*y2)
z2 = (x0*y2) ^ (x1*y1) ^ (x2*y0) ^ (x3*y3)
z3 = (x0*y3) ^ (x1*y2) ^ (x2*y1) ^ (x3*y0)
4. Mask and merge: result = (z0 & m1_64) | (z1 & m2_64) | ...
Why it works: The 3-bit holes prevent carries from "spilling" between data bits. With 8 data bits per operand (32/4), the worst-case column sum is 8 = 0b1000, which spans only the data bit plus its 3-bit hole, so carries never reach the next data bit four positions away.
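For reference, a minimal scalar sketch of the holes algorithm (modeled on BearSSL's bmul; the function name and exact structure are assumptions, not necessarily what soft32.rs does):

```rust
/// Constant-time 32x32 -> 64-bit carry-less multiply via the "holes" trick.
/// Sketch only; the masks and z-word formulas follow the steps above.
fn bmul32(x: u32, y: u32) -> u64 {
    let (m1, m2, m4, m8) = (0x1111_1111u32, 0x2222_2222, 0x4444_4444, 0x8888_8888);
    let (x0, x1, x2, x3) = ((x & m1) as u64, (x & m2) as u64, (x & m4) as u64, (x & m8) as u64);
    let (y0, y1, y2, y3) = ((y & m1) as u64, (y & m2) as u64, (y & m4) as u64, (y & m8) as u64);
    // 16 integer multiplies, combined with XOR (step 3).
    let z0 = (x0 * y0) ^ (x1 * y3) ^ (x2 * y2) ^ (x3 * y1);
    let z1 = (x0 * y1) ^ (x1 * y0) ^ (x2 * y3) ^ (x3 * y2);
    let z2 = (x0 * y2) ^ (x1 * y1) ^ (x2 * y0) ^ (x3 * y3);
    let z3 = (x0 * y3) ^ (x1 * y2) ^ (x2 * y1) ^ (x3 * y0);
    // Mask out the carry garbage in the holes and merge (step 4).
    (z0 & 0x1111_1111_1111_1111)
        | (z1 & 0x2222_2222_2222_2222)
        | (z2 & 0x4444_4444_4444_4444)
        | (z3 & 0x8888_8888_8888_8888)
}
```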
WASM SIMD Capabilities
WebAssembly SIMD provides 128-bit vectors with these relevant operations:
| Operation | What it does | Useful for |
|---|---|---|
| `i64x2_extmul_low_u32x4` | 2× widening 32→64 multiply (lanes 0-1) | Parallel muls |
| `i64x2_extmul_high_u32x4` | 2× widening 32→64 multiply (lanes 2-3) | Parallel muls |
| `i32x4_mul` | 4× truncating 32-bit multiply | Limited use |
| `v128_xor` | 128-bit XOR | Core operation |
| `v128_and` | 128-bit AND | Masking |
| `i64x2_shl` / `i64x2_shr` | Shift lanes | Bitslicing |
Critical limitation: WASM SIMD has no CLMUL instruction. We must implement carry-less multiplication using regular arithmetic.
Approach 1: Parallelize the Multiplications
The soft32 algorithm performs 16 scalar 32×32→64 multiplications. With WASM SIMD's i64x2_extmul, we can do 2 at a time.
Implementation Sketch
use core::arch::wasm32::*;

// Pack operands for parallel multiplication (x0..x3, y0..y3 are the
// masked words from step 2 of the holes algorithm).
let x_vec = u32x4(x0, x1, x2, x3);
let y_for_z0 = u32x4(y0, y3, y2, y1);
// Two widening multiplies at once: lanes [0,1] and lanes [2,3] → two u64 results each
let z0_lo = i64x2_extmul_low_u32x4(x_vec, y_for_z0); // x0*y0, x1*y3
let z0_hi = i64x2_extmul_high_u32x4(x_vec, y_for_z0); // x2*y2, x3*y1
// XOR the four partial products together
let z0_xor = v128_xor(z0_lo, z0_hi);
let z0 = u64x2_extract_lane::<0>(z0_xor) ^ u64x2_extract_lane::<1>(z0_xor);
Performance Analysis
- Multiplications: 16 scalar → 8 SIMD ops (2x parallel)
- Overhead: Lane extraction and scalar XOR for final reduction
- Expected speedup: 1.5-2x for single operations
Pros/Cons
✅ Drop-in replacement for scalar version
✅ No data transposition needed
❌ Horizontal XOR reduction is expensive
❌ Limited parallelism (only 2x per multiply group)
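A fuller sketch of the same idea, computing all four z-words (the function name and structure are hypothetical; assumes a wasm32 target with the `simd128` feature):

```rust
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
fn clmul32_simd(x: u32, y: u32) -> u64 {
    use core::arch::wasm32::*;

    const M: [u32; 4] = [0x1111_1111, 0x2222_2222, 0x4444_4444, 0x8888_8888];
    const M64: [u64; 4] = [
        0x1111_1111_1111_1111,
        0x2222_2222_2222_2222,
        0x4444_4444_4444_4444,
        0x8888_8888_8888_8888,
    ];

    let x_vec = u32x4(x & M[0], x & M[1], x & M[2], x & M[3]);
    let ys: [u32; 4] = [y & M[0], y & M[1], y & M[2], y & M[3]];
    // Lane orderings of y so that lane i pairs with x_i for each z-word.
    let y_perm = [
        u32x4(ys[0], ys[3], ys[2], ys[1]), // z0 = x0*y0 ^ x1*y3 ^ x2*y2 ^ x3*y1
        u32x4(ys[1], ys[0], ys[3], ys[2]), // z1
        u32x4(ys[2], ys[1], ys[0], ys[3]), // z2
        u32x4(ys[3], ys[2], ys[1], ys[0]), // z3
    ];

    let mut out = 0u64;
    for k in 0..4 {
        let lo = i64x2_extmul_low_u32x4(x_vec, y_perm[k]);  // lanes 0,1
        let hi = i64x2_extmul_high_u32x4(x_vec, y_perm[k]); // lanes 2,3
        let z = v128_xor(lo, hi);
        let zk = u64x2_extract_lane::<0>(z) ^ u64x2_extract_lane::<1>(z);
        out |= zk & M64[k];
    }
    out
}
```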
Approach 2: Batch Processing
Process multiple independent clmul operations together to better utilize SIMD width.
Example: Processing 4 clmul operations
// Interleave 4 independent clmul operations
// Pack: [x0_a, x0_b, x0_c, x0_d] as u32x4
// This gives 4-way parallelism on the masking/extraction phase
Expected speedup: 2-3x per operation in batches of 4+
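A minimal sketch of the packing idea for four independent inputs (names hypothetical):

```rust
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
mod batch {
    use core::arch::wasm32::*;

    /// Run the mask/extract phase of the holes algorithm on four
    /// independent inputs with one v128_and per mask.
    pub fn extract_holes_batch4(xs: [u32; 4]) -> [v128; 4] {
        let x = u32x4(xs[0], xs[1], xs[2], xs[3]);
        [
            v128_and(x, u32x4_splat(0x1111_1111)), // x0 of all four inputs
            v128_and(x, u32x4_splat(0x2222_2222)), // x1
            v128_and(x, u32x4_splat(0x4444_4444)), // x2
            v128_and(x, u32x4_splat(0x8888_8888)), // x3
        ]
    }
}
```

The multiply phase still goes through two-wide extmuls, which is why the gain tops out around 2-3x rather than a full 4x.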
Approach 3: Bitsliced Implementation (Recommended for Streams)
For maximum SIMD efficiency, bitslice the data. This transposes the representation so that bit N of 128 different inputs is stored in one v128.
How Bitslicing Works
Normal representation:
value_0 = [bit0, bit1, bit2, ..., bit31]
value_1 = [bit0, bit1, bit2, ..., bit31]
...
value_127 = [bit0, bit1, bit2, ..., bit31]
Bitsliced representation:
slice_0 = [value_0.bit0, value_1.bit0, ..., value_127.bit0] // 128 bits
slice_1 = [value_0.bit1, value_1.bit1, ..., value_127.bit1]
...
slice_31 = [value_0.bit31, ..., value_127.bit31]
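Getting data into this form requires a transpose. A hypothetical helper, written bit-by-bit for clarity (a production version would use a word-level transpose instead of the O(32·128) loop):

```rust
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
mod bitslice {
    use core::arch::wasm32::*;

    /// Transpose 128 u32 values into 32 slices: bit k of input n
    /// becomes bit n of slice k.
    pub fn bitslice128(values: &[u32; 128]) -> [v128; 32] {
        let mut slices = [[0u64; 2]; 32];
        for (n, &v) in values.iter().enumerate() {
            let (word, bit) = (n / 64, n % 64);
            for k in 0..32 {
                slices[k][word] |= (((v >> k) & 1) as u64) << bit;
            }
        }
        slices.map(|s| u64x2(s[0], s[1]))
    }
}
```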
Bitsliced Carry-Less Multiplication
Once data is bitsliced, clmul becomes a series of XOR and AND operations:
// a_bits and b_bits are [v128; 32] slices; result is [v128; 63], zeroed.
// For each bit position i in a and j in b: if both bits are set (per slot),
// XOR into result bit (i + j).
for i in 0..32 {
    for j in 0..32 {
        let product_bit = v128_and(a_bits[i], b_bits[j]);
        result[i + j] = v128_xor(result[i + j], product_bit);
    }
}
Performance Analysis
- Operations: 32 × 32 = 1024 AND + 1024 XOR for 128 parallel clmuls
- Per-operation cost: ~8 AND + 8 XOR (amortized)
- Expected speedup: 10-20x over scalar for large batches
Pros/Cons
✅ Maximum SIMD utilization (128-way parallelism)
✅ All operations are simple XOR/AND (very fast)
✅ Naturally constant-time
❌ Requires data transposition (expensive for small batches)
❌ Memory bandwidth for bitsliced representation
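As a self-contained sketch, the 128-way kernel, taking slices from a transpose like the one above (names hypothetical):

```rust
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
mod bitsliced_clmul {
    use core::arch::wasm32::*;

    /// 128 independent 32x32 -> 63-bit carry-less products, one per bit slot.
    pub fn clmul_bitsliced(a_bits: &[v128; 32], b_bits: &[v128; 32]) -> [v128; 63] {
        let mut result = [u64x2(0, 0); 63];
        for i in 0..32 {
            for j in 0..32 {
                let product_bit = v128_and(a_bits[i], b_bits[j]);
                result[i + j] = v128_xor(result[i + j], product_bit);
            }
        }
        result
    }
}
```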
Approach 4: Karatsuba Decomposition
For 64-bit or 128-bit clmul (as used in GHASH), use Karatsuba to reduce multiplications:
64-bit clmul using 32-bit building blocks:
a = a_hi * 2^32 + a_lo
b = b_hi * 2^32 + b_lo
Normal: 4 multiplications
a_lo * b_lo, a_lo * b_hi, a_hi * b_lo, a_hi * b_hi
Karatsuba: 3 multiplications
p_lo = clmul32(a_lo, b_lo)
p_hi = clmul32(a_hi, b_hi)
p_mid = clmul32(a_lo ^ a_hi, b_lo ^ b_hi)
cross = p_mid ^ p_lo ^ p_hi // = a_lo*b_hi ^ a_hi*b_lo
result = p_lo ^ (cross << 32) ^ (p_hi << 64) // XOR, not OR: the terms overlap
Benefit: 25% fewer multiplications, which compounds with SIMD parallelization.
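A sketch of this split, building on the hypothetical bmul32 from the background section; the 128-bit product is returned as two u64 halves:

```rust
/// 64x64 -> 128-bit carry-less multiply via Karatsuba over a 32-bit
/// building block (the bmul32 sketch above).
fn clmul64_karatsuba(a: u64, b: u64) -> (u64, u64) {
    let (a_lo, a_hi) = (a as u32, (a >> 32) as u32);
    let (b_lo, b_hi) = (b as u32, (b >> 32) as u32);

    let p_lo = bmul32(a_lo, b_lo); // occupies bits 0..62
    let p_hi = bmul32(a_hi, b_hi); // occupies bits 64..126
    let p_mid = bmul32(a_lo ^ a_hi, b_lo ^ b_hi);
    let cross = p_mid ^ p_lo ^ p_hi; // = a_lo*b_hi ^ a_hi*b_lo

    // Combine with XOR: `cross` straddles the two halves.
    (p_lo ^ (cross << 32), p_hi ^ (cross >> 32))
}
```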
Recommendation for mpz/clmul
Given that this is for TLSNotary (a GHASH/GCM workload), I recommend:
1. For Single Operations: Parallel Multiplies (Approach 1)
- Immediate ~1.5x speedup
- Minimal code changes
2. For Block Processing (GHASH): Bitsliced (Approach 3)
- Restructure to process 128 blocks at a time
- Use Karatsuba for the 128-bit field multiplication
- Expected 15-25x speedup over scalar soft32
3. Implementation Priority
wasm32-simd/
├── soft32_simd.rs # Parallel multiply version (quick win)
├── bitsliced.rs # For batch processing
└── ghash_simd.rs # GHASH-specific optimizations
Comparison with Other Approaches
| Implementation | Cycles/clmul32 (est.) | Notes |
|---|---|---|
| Scalar soft32 | ~50-80 | Baseline |
| WASM SIMD parallel | ~30-50 | Approach 1 |
| WASM SIMD bitsliced | ~4-8 | Per-op amortized |
| x86 PCLMULQDQ | ~3-7 | Hardware (not available in WASM) |
Code Example
See the accompanying wasm_simd_clmul.rs for complete implementations of:
- Parallel multiply version (`clmul_soft32_simd_v1`)
- Batched processing (`clmul_batch_2`)
- Bitsliced multiplication (`clmul_bitsliced_32x128`)
- Karatsuba 64-bit clmul (`clmul64_karatsuba`)
Conclusion
WASM SIMD can meaningfully accelerate the soft32 clmul implementation:
- Quick win (1.5-2x): Use `i64x2_extmul` to parallelize the 16 multiplications
- Maximum performance (10-20x): Bitslice the data for stream processing
- Combine with Karatsuba for 64/128-bit operations
The best approach depends on your access patterns. For GHASH workloads processing many blocks, the bitsliced approach will dominate. For occasional single operations, the parallel multiply approach offers good improvement with minimal complexity.