ENH,SIMD Optimize MNNDeconvRunForUnitDepthWise with RVV Implementation #3843
This PR introduces an RVV (RISC-V Vector Extension) optimized implementation of the MNNDeconvRunForUnitDepthWise function, aiming to significantly accelerate depthwise deconvolution on RISC-V platforms by leveraging SIMD vector instructions.
Optimization Strategy
We focused on vectorizing the core computation using RVV intrinsics:
Vector length (vl): fixed at 4 floats, matching the 4-float channel packing (C4) that the function operates on.
Load/store optimization: the source (src) and weight (weight) arrays are loaded as vfloat32m1_t vectors.
Fused multiply-accumulate: each src vector is updated as vsrc = vsrc + vdst * vweight with a single vfmacc_vv_f32m1, as shown in the sketch below.
This approach replaces the previous Vec4 SIMD-based implementation for better performance on RISC-V vector hardware.
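For concreteness, the core loop looks roughly like the following. This is a minimal sketch, assuming the upstream parameter layout of MNNDeconvRunForUnitDepthWise (dst/src/weight pointers plus filter dimensions, weight row stride, and dilation steps) and using the current __riscv_-prefixed intrinsic names; older toolchains expose the same intrinsics without the prefix (e.g., vfmacc_vv_f32m1 as written above). It is not a verbatim copy of the submitted code.

```c
#include <riscv_vector.h>
#include <stddef.h>

/* Sketch of the vectorized kernel: scatter one C4-packed dst value
 * into the src plane across the fh x fw filter window. */
void MNNDeconvRunForUnitDepthWise_RVV(const float* dst, float* src,
                                      const float* weight,
                                      size_t fw, size_t fh,
                                      size_t weight_y_step,
                                      size_t dilateX_step,
                                      size_t dilateY_step) {
    size_t vl = __riscv_vsetvl_e32m1(4);              /* 4 floats = one C4 group */
    vfloat32m1_t vdst = __riscv_vle32_v_f32m1(dst, vl);
    for (size_t fy = 0; fy < fh; ++fy) {
        float* src_y          = src + fy * dilateY_step;
        const float* weight_y = weight + fy * weight_y_step;
        for (size_t fx = 0; fx < fw; ++fx) {
            float* src_x = src_y + fx * dilateX_step;
            vfloat32m1_t vsrc    = __riscv_vle32_v_f32m1(src_x, vl);
            vfloat32m1_t vweight = __riscv_vle32_v_f32m1(weight_y + 4 * fx, vl);
            /* vsrc += vdst * vweight in one fused multiply-accumulate. */
            vsrc = __riscv_vfmacc_vv_f32m1(vsrc, vdst, vweight, vl);
            __riscv_vse32_v_f32m1(src_x, vsrc, vl);
        }
    }
}
```

Because vl is fixed at 4 with LMUL=1, each iteration handles exactly one C4 channel group and no tail handling is required.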
Performance Summary
Test Hardware: Banana Pi BPI-F3
Operating System: EulixOS 3.0
Test Coverage: Test cases covered a wide range of matrix configurations, from small to large sizes, with various aspect ratios and strides.
Key Findings
Across all test scenarios, the non-unrolled scheme (Scheme 0) delivered the best and most stable performance of the unrolling variants evaluated, with its advantage most pronounced on large matrices.
For large square matrices (e.g., 512x512, 1024x1024, and 65536x65536), this scheme achieved a speedup of roughly 2.5x or better over the scalar version.
Representative Data
Detailed Performance for Large Inputs
The following data for large input sizes illustrates the advantage of the RVV-optimized implementation over the original implementation (measured as the scalar baseline below):
1024x1024 configuration (widthC4=1024, height=1024)
Scalar time: 0.007185 ms
RVV time: 0.002086 ms
Speedup: 3.44x

512x512 configuration (widthC4=512, height=512)
Scalar time: 0.003550 ms
RVV time: 0.000794 ms
Speedup: 4.47x

65536x65536 configuration (widthC4=65536, height=65536)
Scalar time: 0.495102 ms
RVV time: 0.192677 ms
Speedup: 2.57x
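For reference, a harness of the following shape can produce per-call averages like those above. This is an illustrative sketch, not the benchmark actually used: the function-pointer type, iteration count, and CLOCK_MONOTONIC timing are assumptions.

```c
#include <time.h>

/* Hypothetical signature shared by the scalar and RVV variants. */
typedef void (*DeconvFn)(const float*, float*, const float*,
                         size_t, size_t, size_t, size_t, size_t);

/* Time `iters` calls of `fn` and return the average in milliseconds. */
static double bench_ms(DeconvFn fn, const float* dst, float* src,
                       const float* weight, size_t fw, size_t fh,
                       size_t wYStep, size_t dxStep, size_t dyStep,
                       int iters) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; ++i) {
        fn(dst, src, weight, fw, fh, wYStep, dxStep, dyStep);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    return ms / iters;
}
```

The reported speedup is then the scalar average divided by the RVV average for the same configuration.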
Future Work
This submission is part of an ongoing effort to optimize MNN functions with RVV. We will continue optimizing other core functions to further enhance MNN's inference performance on the RISC-V platform.