f16-f32acc-approxgelu for WebAssembly and native #10403

copybara-service[bot] merged 1 commit into master from test_924400397 on Jun 12, 2026

copybara-service Bot commented on Jun 5, 2026

f16-f32acc-approxgelu for WebAssembly and native

copybara-service Bot changed the title from "f16-approxgelu add f16-f32acc for WebAssembly" to "f16-f32acc-approxgelu for WebAssembly and native" on Jun 11, 2026
copybara-service Bot force-pushed the test_924400397 branch 3 times, most recently from 3bfeee3 to a4f3c69, on June 12, 2026 01:24
gemma4 12b on AMD Zen4 is 7.38% faster end to end:
```
Was scalar
Inference (avg):      133760.34 ms (2 runs)
 3.32%  xnn_f16_vapproxgelu_ukernel__scalar_rational_6_4_div_u4

Now avx512skx
Inference (avg):      124561.43 ms (2 runs)
 0.03% xnn_f16_f32acc_vapproxgelu_ukernel__avx512f_rational_6_4_div_u32
```
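
The 7.38% figure appears to be the ratio of the two average inference times rather than the fraction of wall time saved. A minimal sketch of both readings follows; the function names are hypothetical and chosen only for illustration.

```c
/* Speedup expressed as a throughput gain: (old / new - 1) * 100. */
static double speedup_percent(double old_ms, double new_ms) {
  return (old_ms / new_ms - 1.0) * 100.0;
}

/* Fraction of wall time saved: (1 - new / old) * 100. */
static double time_saved_percent(double old_ms, double new_ms) {
  return (1.0 - new_ms / old_ms) * 100.0;
}
```

With the averages above, `speedup_percent(133760.34, 124561.43)` is about 7.385, which matches the headline number, while `time_saved_percent` with the same arguments is about 6.88.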

| Architecture      | Status   | Details                               |
| :---------------- | :------- | :------------------------------------ |
| WASM relaxed SIMD | Added    | New native FP16 and f32acc kernels.   |
| x86 avx512fp16    | Added    | New native FP16 microkernel (u32).    |
| x86 avx512f       | Added    | New microkernel using f16_f32acc.     |
| x86 f16c          | Added    | New microkernel using f16_f32acc.     |
| Hexagon hvx       | Added    | New f16_f32acc microkernel (u128).    |
| ARM neonfp16      | Added    | New fallback using f16_f32acc.        |
| RISC-V rvvfp16    | Improved | Increased loop unroll u2v to u8v.     |
| Fallback scalar   | Improved | Switched to f16_f32acc for precision. |
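
Several rows mention f16_f32acc, and the scalar fallback is described as switching to it for precision. The sketch below illustrates why that can matter, assuming f16_f32acc means f16 inputs and outputs with f32 intermediates, as the name suggests. It cubes one value two ways: rounding every product to f16, and rounding once at the end. All helper names are hypothetical and are not XNNPACK APIs, and the cube is only a stand-in for the kernel's rational approximation, which is not reproduced here.

```c
#include <stdint.h>
#include <string.h>

/* Widen IEEE 754 binary16 bits to float; every f16 value is exact in f32. */
static float f16_to_f32(uint16_t h) {
  uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
  uint32_t exp = (h >> 10) & 0x1Fu;
  uint32_t man = h & 0x3FFu;
  uint32_t bits;
  if (exp == 0x1Fu) {                 /* infinity or NaN */
    bits = sign | 0x7F800000u | (man << 13);
  } else if (exp != 0) {              /* normal: rebias exponent from 15 to 127 */
    bits = sign | ((exp + 112u) << 23) | (man << 13);
  } else if (man == 0) {              /* signed zero */
    bits = sign;
  } else {                            /* subnormal: normalize the significand */
    uint32_t e = 113u;
    while ((man & 0x400u) == 0) { man <<= 1; e--; }
    bits = sign | (e << 23) | ((man & 0x3FFu) << 13);
  }
  float f;
  memcpy(&f, &bits, sizeof f);
  return f;
}

/* Narrow float to binary16 bits with round-to-nearest-even. */
static uint16_t f32_to_f16(float f) {
  uint32_t x;
  memcpy(&x, &f, sizeof x);
  uint32_t sign = (x >> 16) & 0x8000u;
  uint32_t h;
  x &= 0x7FFFFFFFu;
  if (x > 0x7F800000u) {              /* NaN becomes a quiet NaN */
    h = 0x7E00u;
  } else if (x >= 0x47800000u) {      /* magnitude >= 65536, or infinity */
    h = 0x7C00u;
  } else if (x >= 0x38800000u) {      /* normal f16 range */
    uint32_t rem = x & 0x1FFFu;
    h = (x - 0x38000000u) >> 13;
    if (rem > 0x1000u || (rem == 0x1000u && (h & 1u))) h++;  /* carry may reach infinity */
  } else if (x >= 0x33000000u) {      /* subnormal f16 range */
    uint32_t shift = 126u - (x >> 23);
    uint32_t m = (x & 0x7FFFFFu) | 0x800000u;
    uint32_t rem = m & ((1u << shift) - 1u);
    uint32_t half = 1u << (shift - 1u);
    h = m >> shift;
    if (rem > half || (rem == half && (h & 1u))) h++;
  } else {                            /* too small: rounds to zero */
    h = 0;
  }
  return (uint16_t)(sign | h);
}

/* x^3 with each product rounded to f16, emulating arithmetic done in f16. */
static uint16_t cube_f16(uint16_t xh) {
  float x = f16_to_f32(xh);
  float x2 = f16_to_f32(f32_to_f16(x * x));
  return f32_to_f16(x2 * x);
}

/* x^3 with f32 intermediates and a single final rounding (the f32acc pattern). */
static uint16_t cube_f16_f32acc(uint16_t xh) {
  float x = f16_to_f32(xh);
  return f32_to_f16(x * x * x);
}
```

For the input `0x3C0F` (1 + 15/1024), `cube_f16` returns `0x3C2D` while `cube_f16_f32acc` returns `0x3C2E`, which is the correctly rounded cube. The per-step version loses one unit in the last place because the rounding error of the first product carries into the second.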

PiperOrigin-RevId: 930857873
copybara-service Bot merged commit ff27209 into master on Jun 12, 2026
copybara-service Bot deleted the test_924400397 branch on June 12, 2026 01:52