
x86: optimize Reduction with SIMD #6743

Open
crafcat7 wants to merge 4 commits into Tencent:master from crafcat7:feat/x86-reduction

Conversation

@crafcat7
Contributor

Summary

  • Add an x86 Reduction layer override for fp32 pack1 tensors.
  • Cover all 11 Reduction operations with no generic fallback.
  • Cover contiguous and strided reductions across 1D, 2D, 3D, and 4D inputs.
  • Add a focused perf_reduction target using the shared perf harness style.

Implementation

  • New files: src/layer/x86/reduction_x86.h and src/layer/x86/reduction_x86.cpp.
  • Accelerated operations: SUM, ASUM, SUMSQ, MEAN, MAX, MIN, PROD, L1, L2, LogSum, LogSumExp.
  • The x86 dispatch explicitly covers all non-empty reduction flag combinations for 2D, 3D, and 4D tensors.
  • Added tests/perf/perf_reduction.cpp and registered it in tests/perf/CMakeLists.txt.
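As a rough illustration of what a SIMD-accelerated contiguous reduction looks like, here is a minimal sketch of a SUM over a flat fp32 buffer. It is not the code from reduction_x86.cpp: the function name `sum_contiguous` and the `SUM_SKETCH_SSE2` macro are hypothetical, only SSE is shown, and the block falls back to a plain scalar loop when SSE2 is not available.

```cpp
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> // SSE/SSE2 intrinsics
#define SUM_SKETCH_SSE2 1
#endif

// Hypothetical helper (not from the PR): sum `size` contiguous floats.
static float sum_contiguous(const float* ptr, std::size_t size)
{
    float sum = 0.f;
    std::size_t i = 0;

#ifdef SUM_SKETCH_SSE2
    // Accumulate four lanes at a time with unaligned loads.
    __m128 _sum = _mm_setzero_ps();
    for (; i + 3 < size; i += 4)
    {
        _sum = _mm_add_ps(_sum, _mm_loadu_ps(ptr + i));
    }

    // Horizontal add: spill the four partial sums and combine them.
    float lanes[4];
    _mm_storeu_ps(lanes, _sum);
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif

    // Scalar tail (or the whole buffer when SSE2 is unavailable).
    for (; i < size; i++)
    {
        sum += ptr[i];
    }

    return sum;
}
```

The vector accumulator changes the order of additions relative to a scalar loop, so results can differ from the generic path in the last few bits for large inputs.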

Performance

Measured with perf_reduction on Windows/MSVC Release, NCNN_OPENMP=OFF, NCNN_VULKAN=OFF. The perf harness reports aggregate timings over an inner-loop count (xN), so each table entry is the reported fp32 min divided by N.

| Case | Baseline Generic (ms) | X86 Optimized (ms) | Speedup |
|---|---|---|---|
| [1048576] sum axes=all keep=0 | 0.04947 | 0.02844 | 1.74x |
| [1024,1024] sum axis=1 keep=0 | 0.04723 | 0.02963 | 1.59x |
| [1024,1024] sum axis=0 keep=0 | 3.70000 | 0.03230 | 114.55x |
| [56,56,64] mean axes=1,2 keep=1 | 0.00952 | 0.00462 | 2.06x |
| [56,56,64] max axis=0 keep=0 | 0.03587 | 0.00443 | 8.10x |
| [64,8,8,32] sum axes=0,1,2 keep=0 | 0.05068 | 0.00663 | 7.64x |
| [16,16,8,64] l2 axis=0 keep=0 | 0.14680 | 0.00716 | 20.50x |
| [16,16,8,64] asum axes=2,3 keep=0 | 0.00561 | 0.00453 | 1.24x |
| [16,16,8,64] sumsq axes=0,2,3 keep=1 | 0.00639 | 0.00409 | 1.56x |
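The largest gain is on the strided `[1024,1024] sum axis=0` case. The PR description does not say how the x86 path handles it, so the following is only one common way to vectorize such a reduction: walk the input row by row and accumulate into the output vector, so every load stays contiguous. The function name `sum_axis0`, the `AXIS0_SKETCH_SSE2` macro, and the assumption of a tightly packed row-major `h x w` buffer are all illustrative.

```cpp
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> // SSE/SSE2 intrinsics
#define AXIS0_SKETCH_SSE2 1
#endif

// Hypothetical helper (not from the PR): out[x] = sum over y of in[y * w + x].
// Assumes `in` is a tightly packed row-major h x w buffer and `out` holds w floats.
static void sum_axis0(const float* in, int h, int w, float* out)
{
    for (int x = 0; x < w; x++)
    {
        out[x] = 0.f;
    }

    for (int y = 0; y < h; y++)
    {
        const float* row = in + (std::size_t)y * w;
        int x = 0;

#ifdef AXIS0_SKETCH_SSE2
        // Add four adjacent columns of this row into the running output.
        for (; x + 3 < w; x += 4)
        {
            __m128 _acc = _mm_loadu_ps(out + x);
            _acc = _mm_add_ps(_acc, _mm_loadu_ps(row + x));
            _mm_storeu_ps(out + x, _acc);
        }
#endif

        // Scalar tail (or the whole row when SSE2 is unavailable).
        for (; x < w; x++)
        {
            out[x] += row[x];
        }
    }
}
```

Compared with summing one column at a time with stride `w`, this ordering reads memory sequentially, which tends to be friendlier to both the cache and the vector units.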

crafcat7 added 3 commits May 25, 2026 07:48
Summary:
  Add a dedicated perf target for the Reduction layer following the
  existing perf harness style (PerfMat, perf_layer, perfutil). This
  enables standardized before/after performance measurement for the
  Reduction x86 optimization work.

Changes:
  1. Add tests/perf/perf_reduction.cpp with representative Reduction cases
  2. Register perf_reduction target in tests/perf/CMakeLists.txt

Summary:
  Implement an x86-specific Reduction layer override for fp32 pack1
  tensors with SIMD-accelerated reduction loops. This provides
  significant speedups over the generic scalar implementation,
  especially for strided and multi-axis reductions.

Changes:
  1. Add src/layer/x86/reduction_x86.h with Reduction_x86 class declaration
  2. Add src/layer/x86/reduction_x86.cpp with optimized forward dispatch
  3. Support SUM, ASUM, SUMSQ, MEAN, MAX, MIN, L1, L2, LogSum operations
  4. Cover all non-empty reduction flag combinations for 2D/3D/4D inputs
  5. Fall back to generic Reduction for unsupported ops (PROD, LogSumExp)

Summary:
  Add PROD and LogSumExp operation support to the x86 Reduction layer
  and remove the generic fallback in forward(). The x86 implementation
  now covers all 11 Reduction operations and all dimension/flag
  combinations for fp32 pack1 tensors, eliminating the need for the
  base class forward dispatch.

Changes:
  1. Add SIMD-accelerated PROD reduction with element-wise multiplication
  2. Add SIMD-accelerated LogSumExp reduction using exp/sum/log pipeline
  3. Include sse_mathfun.h, avx_mathfun.h, avx512_mathfun.h for exp_ps
  4. Remove generic Reduction::forward fallback from Reduction_x86::forward
  5. Remove unused reduction_x86_shape_supported function
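The "exp/sum/log pipeline" named in the third commit can be sketched in scalar form as below. This is an illustration of the three stages only, not the PR's SIMD code; the function name `logsumexp_naive` is hypothetical.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper (not from the PR): log(sum(exp(x_i))) over a flat buffer.
static float logsumexp_naive(const float* ptr, std::size_t size)
{
    float sum = 0.f;
    for (std::size_t i = 0; i < size; i++)
    {
        sum += std::exp(ptr[i]); // exp stage, accumulated in the sum stage
    }
    return std::log(sum); // log stage
}
```

Written this way, `std::exp` overflows to infinity for large fp32 inputs. Subtracting the maximum before exponentiating avoids that, but the commit message does not say whether the PR does so.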

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 868ec02e58


Comment thread src/layer/x86/reduction_x86.cpp Outdated
    for (; i + 3 < size; i += 4)
    {
        __m128 _p = _mm_loadu_ps(ptr);
        _p = _mm_exp_ps(_p);

P0: Replace undefined SIMD exp calls with mathfun helpers

This introduces compile-time failures on x86 because reduction_x86_sumexp (and the analogous LogSumExp accumulation path) calls _mm_exp_ps/_mm256_exp_ps/_mm512_exp_ps, but those symbols are not provided by sse_mathfun.h/avx_mathfun.h/avx512_mathfun.h (the available helpers are exp_ps, exp256_ps, and exp512_ps). I confirmed this by building test_reduction in this repo (cmake --build build-review --target test_reduction), which fails in reduction_x86.cpp with “not declared in this scope” errors.
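A minimal sketch of the suggested change for the SSE loop follows. To keep it compilable on its own, it defines a local stand-in named `exp_ps` that computes each lane with `std::exp`; in the actual tree the call would resolve to the `exp_ps` helper that the comment says sse_mathfun.h provides. The function name `sumexp_sketch`, the `SUMEXP_SKETCH_SSE2` macro, and the accumulation around the quoted lines are assumptions, since only four lines of the original loop are visible in the thread.

```cpp
#include <cmath>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> // SSE/SSE2 intrinsics
#define SUMEXP_SKETCH_SSE2 1

// Stand-in for the exp_ps helper from sse_mathfun.h, so this sketch builds alone.
// It evaluates std::exp per lane and is not the library's implementation.
static inline __m128 exp_ps(__m128 x)
{
    float tmp[4];
    _mm_storeu_ps(tmp, x);
    for (int k = 0; k < 4; k++)
    {
        tmp[k] = std::exp(tmp[k]);
    }
    return _mm_loadu_ps(tmp);
}
#endif

// Hypothetical reconstruction (not the PR's code): sum of exp(x_i) over a flat buffer.
static float sumexp_sketch(const float* ptr, int size)
{
    float sum = 0.f;
    int i = 0;

#ifdef SUMEXP_SKETCH_SSE2
    __m128 _sum = _mm_setzero_ps();
    for (; i + 3 < size; i += 4)
    {
        __m128 _p = _mm_loadu_ps(ptr);
        _p = exp_ps(_p); // was _mm_exp_ps(_p), reported as not declared
        _sum = _mm_add_ps(_sum, _p);
        ptr += 4;
    }

    // Combine the four partial sums.
    float lanes[4];
    _mm_storeu_ps(lanes, _sum);
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif

    // Scalar tail (or the whole buffer when SSE2 is unavailable).
    for (; i < size; i++)
    {
        sum += std::exp(*ptr++);
    }

    return sum;
}
```

The AVX and AVX-512 paths would follow the same pattern with `exp256_ps` and `exp512_ps`, the helper names given in the comment.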

