
Optimize Silk encoder NEON paths and fix warped autocorrelation precision bug #473

Open
wangzihao3 wants to merge 1 commit into xiph:main from wangzihao3:main

wangzihao3 commented May 6, 2026

Summary

This PR applies targeted ARM NEON optimizations to two Silk encoder hot paths (about 25% faster encoding and about 30% fewer dynamic instructions) and fixes a precision-loss bug in the warped autocorrelation computation.

Changes

silk/LPC_analysis_filter.c — New NEON implementation

  • Add a NEON-vectorized silk_LPC_analysis_filter() that processes 8 filter taps per iteration using paired vmlal_s16 accumulators (int32x4_t).
  • The original scalar implementation is preserved as silk_LPC_analysis_filter_c() for reference and OPUS_CHECK_ASM validation.
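
For reviewers, here is a minimal sketch of the "8 taps per iteration, two `int32x4_t` accumulators" idea, checked against a scalar loop. It is not the code in this PR: the function names `lpc_analysis_scalar` and `lpc_analysis_by8`, the pre-reversed coefficient copy `Brev`, and the restriction to orders 8 and 16 are assumptions made to keep the example short. On non-NEON builds a scalar fallback with the same lane layout is used so the comparison still runs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Clamp a 32-bit value to the int16 range. */
static int16_t sat16(int32_t v)
{
    return (int16_t)(v > 32767 ? 32767 : (v < -32768 ? -32768 : v));
}

/* Shared tail: subtract the Q12 prediction, round to Q0, saturate.
 * Unsigned arithmetic gives wrap-around without signed-overflow UB. */
static int16_t residual(int16_t x, uint32_t pred_Q12)
{
    uint32_t x_Q12 = (uint32_t)((int32_t)x * 4096);
    int32_t r_Q12 = (int32_t)(x_Q12 - pred_Q12);
    /* Assumes arithmetic right shift of negative values (GCC/Clang). */
    return sat16(((r_Q12 >> 11) + 1) >> 1);
}

/* Scalar reference: one multiply-accumulate per tap. */
static void lpc_analysis_scalar(int16_t *out, const int16_t *in,
                                const int16_t *B, int len, int d)
{
    for (int ix = d; ix < len; ix++) {
        uint32_t pred_Q12 = 0;
        for (int j = 0; j < d; j++)
            pred_Q12 += (uint32_t)((int32_t)in[ix - 1 - j] * B[j]);
        out[ix] = residual(in[ix], pred_Q12);
    }
    memset(out, 0, (size_t)d * sizeof(out[0]));
}

/* Hypothetical vectorized variant: 8 taps per iteration.
 * out and in must not overlap. */
static void lpc_analysis_by8(int16_t *out, const int16_t *in,
                             const int16_t *B, int len, int d)
{
    int16_t Brev[16];
    assert(d == 8 || d == 16);          /* sketch-only restriction */
    for (int j = 0; j < d; j++)
        Brev[j] = B[d - 1 - j];         /* reversed so loads run forward */
    for (int ix = d; ix < len; ix++) {
        const int16_t *x = &in[ix - d];
        uint32_t pred_Q12;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        int32x4_t acc0 = vdupq_n_s32(0), acc1 = vdupq_n_s32(0);
        for (int j = 0; j < d; j += 8) {
            int16x8_t xv = vld1q_s16(x + j), cv = vld1q_s16(Brev + j);
            acc0 = vmlal_s16(acc0, vget_low_s16(xv), vget_low_s16(cv));
            acc1 = vmlal_s16(acc1, vget_high_s16(xv), vget_high_s16(cv));
        }
        int32x4_t s4 = vaddq_s32(acc0, acc1);
        int32x2_t s2 = vadd_s32(vget_low_s32(s4), vget_high_s32(s4));
        pred_Q12 = (uint32_t)vget_lane_s32(vpadd_s32(s2, s2), 0);
#else
        uint32_t acc[8] = {0};          /* same 2 x 4 lane layout, in scalar */
        for (int j = 0; j < d; j += 8)
            for (int k = 0; k < 8; k++)
                acc[k] += (uint32_t)((int32_t)x[j + k] * Brev[j + k]);
        pred_Q12 = 0;
        for (int k = 0; k < 8; k++)
            pred_Q12 += acc[k];
#endif
        out[ix] = residual(in[ix], pred_Q12);
    }
    memset(out, 0, (size_t)d * sizeof(out[0]));
}
```

Because the 32-bit accumulation wraps in both versions, the order in which partial sums are combined does not change the result, which is why a bit-exact comparison of this kind is possible.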

silk/arm/NSQ_del_dec_neon_intr.c — Multiple optimizations

  • LPC prediction: Replace 16 individual vld1q_s32 loads with ldp (load-pair) inline assembly, loading 8 data vectors + 2 coefficient vectors per block. Use
    vqdmulhq_laneq_s32 to index directly into 128-bit coefficient vectors, eliminating vget_low/high_s32 split overhead.
  • Allpass sections: Fully unroll the allpass loop via macro with compile-time branching on shapingLPCOrder, enabling precise register allocation and removing
    loop overhead.
  • Common subexpression elimination: Precompute Tilt_Q14_Q16 and LF_shp_Q14_Q15 outside the per-sample loop; cache LF_AR_Q14 to avoid redundant vector
    loads.
  • AR_shp_Q28 widening: Remove the scalar tail loop since MAX_SHAPE_LPC_ORDER (24) is exactly divisible by 8.
  • State replacement copy: Unroll the numOthers copy loop 16× with __builtin_prefetch to hide memory latency during the candidate state swap.
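
To illustrate the AR_shp_Q28 item, here is a small sketch of a widening loop that needs no scalar tail because the buffer length is a multiple of 8. It is an illustration, not the PR's code: the helper name `widen_q13_to_q28` and the assumption that the source coefficients are int16 in Q13 (so widening is a left shift by 15) are mine. It also assumes the source buffer always holds all `MAX_SHAPE_LPC_ORDER` entries, since the loop reads the full array regardless of the active order.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

#define MAX_SHAPE_LPC_ORDER 24  /* value quoted in the PR description */

/* Compile-time guard (C11): the tail loop may only be dropped if this holds. */
_Static_assert(MAX_SHAPE_LPC_ORDER % 8 == 0,
               "widening loop needs a multiple of 8");

/* Widen int16 Q13 coefficients (assumed format) to int32 Q28, 8 at a time. */
static void widen_q13_to_q28(int32_t dst_Q28[MAX_SHAPE_LPC_ORDER],
                             const int16_t src_Q13[MAX_SHAPE_LPC_ORDER])
{
    assert(dst_Q28 && src_Q13);
    for (int i = 0; i < MAX_SHAPE_LPC_ORDER; i += 8) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        const int16x8_t t = vld1q_s16(src_Q13 + i);
        vst1q_s32(dst_Q28 + i,     vshll_n_s16(vget_low_s16(t), 15));
        vst1q_s32(dst_Q28 + i + 4, vshll_n_s16(vget_high_s16(t), 15));
#else
        /* Multiply instead of shifting so negative inputs stay well defined. */
        for (int k = 0; k < 8; k++)
            dst_Q28[i + k] = (int32_t)src_Q13[i + k] * 32768;
#endif
    }
}
```

The static assertion makes the "exactly divisible by 8" precondition explicit, so a later change to `MAX_SHAPE_LPC_ORDER` would fail at compile time rather than silently skip elements.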

silk/fixed/arm/warped_autocorrelation_FIX_neon_intr.c — Fix precision loss in autocorrelation

  • The original code computed corr_QC[orderT] (the zero-lag autocorrelation) from the raw int16 input using vmull_s16, then left-shifted the result by QC.
    This lost precision because it operated on the un-warped source data instead of the warped signal that the rest of the function processes.
  • The fix computes corr_QC[orderT] from the already-warped input_QS (int32) data using vmull_s32 + vsraq_n_s64 with the correct right shift 2*QS - QC,
    consistent with how all other corr_QC[] entries are accumulated. This ensures the autocorrelation is derived from the same warped signal path, eliminating the
    precision discrepancy.
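
The arithmetic described above can be sketched in scalar C as follows. This is only an illustration of the fixed-point scaling, not the NEON code in this PR: the function name `zero_lag_from_qs` is hypothetical, and the values `QS = 13` and `QC = 10` are assumed here for the example. Each 64-bit square is shifted right by `2*QS - QC` before it is accumulated, which is the per-lane behaviour of a widening multiply followed by a shift-right-accumulate.

```c
#include <assert.h>
#include <stdint.h>

#define QS 13  /* assumed Q-format of input_QS */
#define QC 10  /* assumed Q-format of corr_QC  */

/* Zero-lag autocorrelation from Q(QS) int32 samples, returned in Q(QC). */
static int64_t zero_lag_from_qs(const int32_t *input_QS, int length)
{
    int64_t corr_QC = 0;
    assert(length >= 0);
    for (int n = 0; n < length; n++) {
        /* Square in 64 bits: the product is in Q(2*QS) and non-negative. */
        const int64_t sq = (int64_t)input_QS[n] * input_QS[n];
        /* Bring each term to Q(QC) before accumulating. */
        corr_QC += sq >> (2 * QS - QC);
    }
    return corr_QC;
}
```

As a sanity check on the scaling: when `input_QS[n]` is exactly an int16 sample multiplied by `1 << QS`, each term equals the squared sample multiplied by `1 << QC`, so no low-order bits are discarded in that case.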

Performance

| Metric | Improvement |
|---|---|
| Encoding speed | ~25% faster |
| Dynamic instruction count | ~30% fewer |

Signed-off-by: wangzihao <wangzihao18@huawei.com>
wangzihao3 changed the title from "feat: optimize silk encoding performance" to "Optimize Silk encoder NEON paths and fix warped autocorrelation precision bug" on May 6, 2026