x86: optimize Reduction with SIMD by crafcat7 · Pull Request #6743 · Tencent/ncnn

crafcat7 · 2026-05-25T15:00:26Z

Summary

Add an x86 Reduction layer override for fp32 pack1 tensors.
Cover all 11 Reduction operations with no generic fallback.
Cover contiguous and strided reductions across 1D, 2D, 3D, and 4D inputs.
Add a focused perf_reduction target using the shared perf harness style.

Implementation

New files: src/layer/x86/reduction_x86.h and src/layer/x86/reduction_x86.cpp.
Accelerated operations: SUM, ASUM, SUMSQ, MEAN, MAX, MIN, PROD, L1, L2, LogSum, LogSumExp.
The x86 dispatch explicitly covers all non-empty reduction flag combinations for 2D, 3D, and 4D tensors.
Added tests/perf/perf_reduction.cpp and registered it in tests/perf/CMakeLists.txt.
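
For readers unfamiliar with the pattern, a contiguous SUM reduction is typically vectorized as a wide accumulation loop followed by a horizontal add and a scalar tail. The following is a minimal SSE sketch of that idea only. The function name `sum_contiguous_sse` is hypothetical and the code is not taken from reduction_x86.cpp, which may also use AVX/AVX-512 paths and a different structure.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical helper (not from the PR): sum `size` contiguous fp32 values.
static float sum_contiguous_sse(const float* ptr, std::size_t size)
{
    assert(ptr != nullptr || size == 0);

    // Accumulate four lanes at a time.
    __m128 _sum = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 3 < size; i += 4)
    {
        _sum = _mm_add_ps(_sum, _mm_loadu_ps(ptr + i));
    }

    // Horizontal add: fold the four partial sums into one scalar.
    float lanes[4];
    _mm_storeu_ps(lanes, _sum);
    float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    // Scalar tail for the remaining 0..3 elements.
    for (; i < size; i++)
    {
        sum += ptr[i];
    }
    return sum;
}
```

Because the lanes are summed independently before being combined, the result can differ from a strictly left-to-right scalar sum by a small rounding error.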

Performance

Measured with perf_reduction on Windows/MSVC Release with NCNN_OPENMP=OFF and NCNN_VULKAN=OFF. The perf harness reports aggregate timings over an inner-loop count (xN), so each table value is the reported fp32 minimum divided by N.

Case	Baseline Generic (ms)	X86 Optimized (ms)	Speedup
`[1048576]` sum axes=all keep=0	0.04947	0.02844	1.74x
`[1024,1024]` sum axis=1 keep=0	0.04723	0.02963	1.59x
`[1024,1024]` sum axis=0 keep=0	3.70000	0.03230	114.55x
`[56,56,64]` mean axes=1,2 keep=1	0.00952	0.00462	2.06x
`[56,56,64]` max axis=0 keep=0	0.03587	0.00443	8.10x
`[64,8,8,32]` sum axes=0,1,2 keep=0	0.05068	0.00663	7.64x
`[16,16,8,64]` l2 axis=0 keep=0	0.14680	0.00716	20.50x
`[16,16,8,64]` asum axes=2,3 keep=0	0.00561	0.00453	1.24x
`[16,16,8,64]` sumsq axes=0,2,3 keep=1	0.00639	0.00409	1.56x
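
The `[1024,1024]` sum axis=0 case reduces along the strided dimension. One common way to vectorize this kind of reduction is to walk the input row by row and accumulate into the output vector, so every load stays contiguous. The sketch below illustrates that general technique under stated assumptions: a dense row-major `h x w` buffer with no row padding, and a hypothetical helper name `sum_axis0_sse`. It is not claimed to be how the PR implements this case.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical helper (not from the PR): out[x] = sum over y of in[y * w + x].
// Assumes a dense row-major h x w buffer with no padding between rows.
static void sum_axis0_sse(const float* in, std::size_t h, std::size_t w, float* out)
{
    assert(in != nullptr && out != nullptr);

    for (std::size_t x = 0; x < w; x++)
    {
        out[x] = 0.f;
    }

    for (std::size_t y = 0; y < h; y++)
    {
        const float* row = in + y * w;
        std::size_t x = 0;
        // Add four columns of this row into the running output at once.
        for (; x + 3 < w; x += 4)
        {
            __m128 _acc = _mm_loadu_ps(out + x);
            _acc = _mm_add_ps(_acc, _mm_loadu_ps(row + x));
            _mm_storeu_ps(out + x, _acc);
        }
        // Scalar tail for the remaining 0..3 columns.
        for (; x < w; x++)
        {
            out[x] += row[x];
        }
    }
}
```

Each output element is still summed in row order here, so the result matches a scalar per-column loop; only the memory access pattern changes.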

Summary: Add a dedicated perf target for the Reduction layer following the existing perf harness style (PerfMat, perf_layer, perfutil). This enables standardized before/after performance measurement for the Reduction x86 optimization work.

Changes:
1. Add tests/perf/perf_reduction.cpp with representative Reduction cases
2. Register perf_reduction target in tests/perf/CMakeLists.txt

Summary: Implement an x86-specific Reduction layer override for fp32 pack1 tensors with SIMD-accelerated reduction loops. This provides significant speedups over the generic scalar implementation, especially for strided and multi-axis reductions.

Changes:
1. Add src/layer/x86/reduction_x86.h with Reduction_x86 class declaration
2. Add src/layer/x86/reduction_x86.cpp with optimized forward dispatch
3. Support SUM, ASUM, SUMSQ, MEAN, MAX, MIN, L1, L2, LogSum operations
4. Cover all non-empty reduction flag combinations for 2D/3D/4D inputs
5. Fall back to generic Reduction for unsupported ops (PROD, LogSumExp)

Summary: Add PROD and LogSumExp operation support to the x86 Reduction layer and remove the generic fallback in forward(). The x86 implementation now covers all 11 Reduction operations and all dimension/flag combinations for fp32 pack1 tensors, eliminating the need for the base class forward dispatch.

Changes:
1. Add SIMD-accelerated PROD reduction with element-wise multiplication
2. Add SIMD-accelerated LogSumExp reduction using exp/sum/log pipeline
3. Include sse_mathfun.h, avx_mathfun.h, avx512_mathfun.h for exp_ps
4. Remove generic Reduction::forward fallback from Reduction_x86::forward
5. Remove unused reduction_x86_shape_supported function

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 868ec02e58

chatgpt-codex-connector · 2026-05-25T15:07:53Z

+    for (; i + 3 < size; i += 4)
+    {
+        __m128 _p = _mm_loadu_ps(ptr);
+        _p = _mm_exp_ps(_p);


Replace undefined SIMD exp calls with mathfun helpers

This introduces compile-time failures on x86 because reduction_x86_sumexp (and the analogous LogSumExp accumulation path) calls _mm_exp_ps/_mm256_exp_ps/_mm512_exp_ps, but those symbols are not provided by sse_mathfun.h/avx_mathfun.h/avx512_mathfun.h (the available helpers are exp_ps, exp256_ps, and exp512_ps). I confirmed this by building test_reduction in this repo (cmake --build build-review --target test_reduction), which fails in reduction_x86.cpp with “not declared in this scope” errors.
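
To make the suggested fix concrete, here is a standalone sketch of the exp/sum/log pipeline written against a local stand-in for the helper. `exp_ps_standin` and `logsumexp_sse` are hypothetical names introduced only so the example compiles outside ncnn. Inside reduction_x86.cpp the call would presumably be `exp_ps(_p)` from sse_mathfun.h, with `exp256_ps` and `exp512_ps` on the AVX and AVX-512 paths, as named in the comment above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics

// Stand-in for exp_ps from sse_mathfun.h so this sketch builds on its own.
// It applies std::exp per lane and is NOT the real vectorized implementation.
static __m128 exp_ps_standin(__m128 x)
{
    float lanes[4];
    _mm_storeu_ps(lanes, x);
    for (int k = 0; k < 4; k++)
    {
        lanes[k] = std::exp(lanes[k]);
    }
    return _mm_loadu_ps(lanes);
}

// Hypothetical helper (not from the PR): log(sum(exp(x))) over `size` floats.
static float logsumexp_sse(const float* ptr, std::size_t size)
{
    assert(ptr != nullptr && size > 0);

    __m128 _sum = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 3 < size; i += 4)
    {
        __m128 _p = _mm_loadu_ps(ptr + i);
        _p = exp_ps_standin(_p); // inside ncnn: exp_ps(_p), not _mm_exp_ps(_p)
        _sum = _mm_add_ps(_sum, _p);
    }

    // Fold the four partial sums, then handle the scalar tail.
    float lanes[4];
    _mm_storeu_ps(lanes, _sum);
    float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    for (; i < size; i++)
    {
        sum += std::exp(ptr[i]);
    }
    return std::log(sum);
}
```

This plain pipeline does not subtract the maximum before exponentiating, so it can overflow for large inputs; whether the PR guards against that is not stated here.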

crafcat7 added 3 commits May 25, 2026 07:48

github-actions Bot added test x86 labels May 25, 2026

chatgpt-codex-connector Bot reviewed May 25, 2026

[fix] reduction: correct SIMD exp function names for x86

ad5d0cd

crafcat7 wants to merge 4 commits into Tencent:master from crafcat7:feat/x86-reduction
