x86: optimize Reduction with SIMD #6743
Summary: Add a dedicated perf target for the Reduction layer following the existing perf harness style (PerfMat, perf_layer, perfutil). This enables standardized before/after performance measurement for the Reduction x86 optimization work.

Changes:
1. Add tests/perf/perf_reduction.cpp with representative Reduction cases
2. Register perf_reduction target in tests/perf/CMakeLists.txt
Summary: Implement an x86-specific Reduction layer override for fp32 pack1 tensors with SIMD-accelerated reduction loops. This provides significant speedups over the generic scalar implementation, especially for strided and multi-axis reductions.

Changes:
1. Add src/layer/x86/reduction_x86.h with Reduction_x86 class declaration
2. Add src/layer/x86/reduction_x86.cpp with optimized forward dispatch
3. Support SUM, ASUM, SUMSQ, MEAN, MAX, MIN, L1, L2, LogSum operations
4. Cover all non-empty reduction flag combinations for 2D/3D/4D inputs
5. Fall back to generic Reduction for unsupported ops (PROD, LogSumExp)
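As a rough illustration of what a SIMD-accelerated reduction loop looks like, the sketch below sums fp32 data with SSE, once over a contiguous buffer and once over axis 0 of a row-major matrix (the strided case). The function names `sketch_reduce_sum` and `sketch_reduce_sum_axis0` are hypothetical and do not come from the PR; the actual code in reduction_x86.cpp may be organized differently and also has AVX and AVX-512 paths.

```cpp
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2
#endif

// Hypothetical helper: sums `size` contiguous floats.
static float sketch_reduce_sum(const float* ptr, std::size_t size)
{
    float sum = 0.f;
    std::size_t i = 0;
#ifdef SKETCH_HAVE_SSE2
    // Accumulate four lanes at a time, then fold the lanes together.
    __m128 _sum = _mm_setzero_ps();
    for (; i + 3 < size; i += 4)
    {
        _sum = _mm_add_ps(_sum, _mm_loadu_ps(ptr + i));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, _sum);
    sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#endif
    // Scalar tail (and the whole loop when SSE2 is unavailable).
    for (; i < size; i++)
    {
        sum += ptr[i];
    }
    return sum;
}

// Hypothetical helper: sums a row-major [h, w] matrix over axis 0,
// writing w results to `out`. Vectorizing along w keeps every load
// contiguous even though the reduction itself walks a strided axis.
static void sketch_reduce_sum_axis0(const float* ptr, std::size_t h, std::size_t w, float* out)
{
    for (std::size_t j = 0; j < w; j++)
    {
        out[j] = 0.f;
    }
    for (std::size_t y = 0; y < h; y++)
    {
        const float* row = ptr + y * w;
        std::size_t j = 0;
#ifdef SKETCH_HAVE_SSE2
        for (; j + 3 < w; j += 4)
        {
            __m128 _acc = _mm_loadu_ps(out + j);
            _acc = _mm_add_ps(_acc, _mm_loadu_ps(row + j));
            _mm_storeu_ps(out + j, _acc);
        }
#endif
        for (; j < w; j++)
        {
            out[j] += row[j];
        }
    }
}
```

Note that lane-wise accumulation changes the order of floating-point additions, so results can differ from a purely scalar loop in the last bits for general inputs.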
Summary: Add PROD and LogSumExp operation support to the x86 Reduction layer and remove the generic fallback in forward(). The x86 implementation now covers all 11 Reduction operations and all dimension/flag combinations for fp32 pack1 tensors, eliminating the need for the base class forward dispatch.

Changes:
1. Add SIMD-accelerated PROD reduction with element-wise multiplication
2. Add SIMD-accelerated LogSumExp reduction using exp/sum/log pipeline
3. Include sse_mathfun.h, avx_mathfun.h, avx512_mathfun.h for exp_ps
4. Remove generic Reduction::forward fallback from Reduction_x86::forward
5. Remove unused reduction_x86_shape_supported function
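For reference, the two operations added here can be written as plain scalar loops. The sketch below is a portable baseline one might use to check a SIMD path against, assuming the straightforward definitions (product of all elements, and log of the sum of exponentials); the names `sketch_reduce_prod` and `sketch_reduce_logsumexp` are hypothetical and are not taken from the PR.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical scalar reference for PROD: multiplies all elements together.
// The identity value 1 is returned for an empty range.
static float sketch_reduce_prod(const float* ptr, std::size_t size)
{
    float prod = 1.f;
    for (std::size_t i = 0; i < size; i++)
    {
        prod *= ptr[i];
    }
    return prod;
}

// Hypothetical scalar reference for LogSumExp, following the
// exp -> sum -> log pipeline named in the commit message.
static float sketch_reduce_logsumexp(const float* ptr, std::size_t size)
{
    float sum = 0.f;
    for (std::size_t i = 0; i < size; i++)
    {
        sum += std::exp(ptr[i]); // exp stage, accumulated into the sum stage
    }
    return std::log(sum); // log stage
}
```

This direct form can overflow for large inputs; subtracting the maximum before exponentiating is a common remedy, but the commit message does not say whether the PR does so.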
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 868ec02e58
    for (; i + 3 < size; i += 4)
    {
        __m128 _p = _mm_loadu_ps(ptr);
        _p = _mm_exp_ps(_p);
Replace undefined SIMD exp calls with mathfun helpers
This introduces compile-time failures on x86 because reduction_x86_sumexp (and the analogous LogSumExp accumulation path) calls _mm_exp_ps/_mm256_exp_ps/_mm512_exp_ps, but those symbols are not provided by sse_mathfun.h/avx_mathfun.h/avx512_mathfun.h (the available helpers are exp_ps, exp256_ps, and exp512_ps). I confirmed this by building test_reduction in this repo (cmake --build build-review --target test_reduction), which fails in reduction_x86.cpp with “not declared in this scope” errors.
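The shape of the suggested fix can be sketched as follows. Since ncnn's sse_mathfun.h is not available in a standalone snippet, `sketch_exp_ps` below is a hypothetical stand-in that applies `std::exp` lane by lane; in the PR the call would instead be the `exp_ps` helper named in the review. The function `sketch_reduce_sumexp` is likewise illustrative and only mirrors the loop in the excerpt.

```cpp
#include <cmath>
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2
#endif

#ifdef SKETCH_HAVE_SSE2
// Stand-in for the exp_ps helper from sse_mathfun.h (assumption: same
// __m128 -> __m128 shape). It is not a vectorized exp; it only keeps
// this sketch self-contained.
static __m128 sketch_exp_ps(__m128 x)
{
    float lanes[4];
    _mm_storeu_ps(lanes, x);
    for (int k = 0; k < 4; k++)
    {
        lanes[k] = std::exp(lanes[k]);
    }
    return _mm_loadu_ps(lanes);
}
#endif

// Hypothetical sum-of-exponentials over `size` contiguous floats.
static float sketch_reduce_sumexp(const float* ptr, std::size_t size)
{
    float sum = 0.f;
    std::size_t i = 0;
#ifdef SKETCH_HAVE_SSE2
    __m128 _sum = _mm_setzero_ps();
    for (; i + 3 < size; i += 4)
    {
        __m128 _p = _mm_loadu_ps(ptr + i);
        // The review's point: call the mathfun helper here (exp_ps),
        // not the undeclared _mm_exp_ps.
        _p = sketch_exp_ps(_p);
        _sum = _mm_add_ps(_sum, _p);
    }
    float lanes[4];
    _mm_storeu_ps(lanes, _sum);
    sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#endif
    // Scalar tail (and the whole loop when SSE2 is unavailable).
    for (; i < size; i++)
    {
        sum += std::exp(ptr[i]);
    }
    return sum;
}
```

By the review's description, the AVX and AVX-512 paths would need the analogous substitution with `exp256_ps` and `exp512_ps`.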
Summary
- `Reduction` layer override for fp32 pack1 tensors.
- `perf_reduction` target using the shared perf harness style.

Implementation
- Added `src/layer/x86/reduction_x86.h` and `src/layer/x86/reduction_x86.cpp`.
- Supported operations: `SUM`, `ASUM`, `SUMSQ`, `MEAN`, `MAX`, `MIN`, `PROD`, `L1`, `L2`, `LogSum`, `LogSumExp`.
- Added `tests/perf/perf_reduction.cpp` and registered it in `tests/perf/CMakeLists.txt`.

Performance
Measured with `perf_reduction` on Windows/MSVC Release, `NCNN_OPENMP=OFF`, `NCNN_VULKAN=OFF`. The perf harness reports aggregate timings with an inner-loop count `(xN)`, so the table normalizes fp32 `min` as `reported min / N`.

- `[1048576]` sum axes=all keep=0
- `[1024,1024]` sum axis=1 keep=0
- `[1024,1024]` sum axis=0 keep=0
- `[56,56,64]` mean axes=1,2 keep=1
- `[56,56,64]` max axis=0 keep=0
- `[64,8,8,32]` sum axes=0,1,2 keep=0
- `[16,16,8,64]` l2 axis=0 keep=0
- `[16,16,8,64]` asum axes=2,3 keep=0
- `[16,16,8,64]` sumsq axes=0,2,3 keep=1
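The normalization described above is a single division. A minimal sketch, with a hypothetical helper name and made-up numbers (no measured values from the PR are used):

```cpp
// Hypothetical helper: converts an aggregate timing reported for N
// inner-loop iterations into a per-iteration figure.
// Precondition: inner_loops > 0.
static double sketch_normalized_min_ms(double reported_min_ms, int inner_loops)
{
    return reported_min_ms / inner_loops;
}
```

For example, a reported minimum of 12.0 ms with an inner-loop count of 4 normalizes to 3.0 ms per run.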