Optimize Silk encoder NEON paths and fix warped autocorrelation precision bug by wangzihao3 · Pull Request #473 · xiph/opus

wangzihao3 · 2026-05-06T06:33:51Z

Summary

This PR applies targeted ARM NEON optimizations to two Silk encoder hot paths (about 25% faster encoding and about 30% fewer dynamic instructions) and fixes a precision-loss bug in the warped autocorrelation computation.

Changes

silk/LPC_analysis_filter.c — New NEON implementation

Add a NEON-vectorized silk_LPC_analysis_filter() that processes 8 filter taps per iteration using paired vmlal_s16 accumulators (int32x4_t).
The original scalar implementation is preserved as silk_LPC_analysis_filter_c() for reference and OPUS_CHECK_ASM validation.
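
To illustrate the accumulation pattern described above, here is a minimal portable sketch, not the code from the patch. It models the two `int32x4_t` accumulators as two 4-element arrays so it can run without NEON. The function names are hypothetical, and the sketch assumes the filter order `d` is a multiple of 8 (for example 16); how the patch handles other orders is not stated here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: saturate a 32-bit value to the int16 range. */
static int16_t sat16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Residual = input sample minus Q12 prediction, rounded to Q0.
 * Assumes >> on a negative int32_t is an arithmetic shift. */
static int16_t residual_from_pred(int16_t in, int32_t pred_Q12) {
    int32_t res_Q12 = (int32_t)in * 4096 - pred_Q12;
    return sat16((res_Q12 + 2048) >> 12);
}

/* Scalar reference: one multiply-accumulate per tap. */
static void lpc_residual_scalar(int16_t *out, const int16_t *in,
                                const int16_t *B, int len, int d) {
    for (int ix = d; ix < len; ix++) {
        int32_t pred_Q12 = 0;
        for (int j = 0; j < d; j++)
            pred_Q12 += (int32_t)B[j] * in[ix - 1 - j];
        out[ix] = residual_from_pred(in[ix], pred_Q12);
    }
    memset(out, 0, (size_t)d * sizeof(out[0])); /* first d outputs are zero */
}

/* Blocked version: 8 taps per iteration, split across two 4-lane
 * accumulators that stand in for a pair of int32x4_t registers. */
static void lpc_residual_blocked(int16_t *out, const int16_t *in,
                                 const int16_t *B, int len, int d) {
    assert(d % 8 == 0); /* assumption of this sketch only */
    for (int ix = d; ix < len; ix++) {
        int32_t acc_lo[4] = {0, 0, 0, 0};
        int32_t acc_hi[4] = {0, 0, 0, 0};
        for (int j = 0; j < d; j += 8) {
            for (int k = 0; k < 4; k++) {
                acc_lo[k] += (int32_t)B[j + k] * in[ix - 1 - (j + k)];
                acc_hi[k] += (int32_t)B[j + 4 + k] * in[ix - 1 - (j + 4 + k)];
            }
        }
        /* Horizontal reduction of all eight lanes. */
        int32_t pred_Q12 = 0;
        for (int k = 0; k < 4; k++)
            pred_Q12 += acc_lo[k] + acc_hi[k];
        out[ix] = residual_from_pred(in[ix], pred_Q12);
    }
    memset(out, 0, (size_t)d * sizeof(out[0]));
}
```

As long as the sums do not overflow, the lane-wise accumulation only reorders the additions, so the blocked version should produce the same output as the scalar reference. This is the kind of equivalence a reference implementation makes it possible to check.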

silk/arm/NSQ_del_dec_neon_intr.c — Multiple optimizations

LPC prediction: Replace 16 individual vld1q_s32 loads with ldp (load-pair) inline assembly, loading 8 data vectors + 2 coefficient vectors per block. Use vqdmulhq_laneq_s32 to index directly into 128-bit coefficient vectors, eliminating vget_low/high_s32 split overhead.
Allpass sections: Fully unroll the allpass loop via macro with compile-time branching on shapingLPCOrder, enabling precise register allocation and removing loop overhead.
Common subexpression elimination: Precompute Tilt_Q14_Q16 and LF_shp_Q14_Q15 outside the per-sample loop; cache LF_AR_Q14 to avoid redundant vector loads.
AR_shp_Q28 widening: Remove the scalar tail loop since MAX_SHAPE_LPC_ORDER (24) is exactly divisible by 8.
State replacement copy: Unroll the numOthers copy loop 16× with __builtin_prefetch to hide memory latency during the candidate state swap.
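
The tail-loop removal in the AR_shp_Q28 item can be sketched in portable C as follows. This is an illustration under stated assumptions rather than the patch itself. The names are hypothetical, the Q13 input format (a left shift by 15 to reach Q28) is assumed from the naming, and the sketch presumes the source buffer always holds the full 24 readable entries regardless of the active shaping order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHAPE_ORDER_MAX 24 /* mirrors MAX_SHAPE_LPC_ORDER as stated in the PR */
#define BLOCK 8            /* elements handled per (modeled) vector iteration */

/* Because the buffer length is an exact multiple of the block size,
 * the blocked loop covers every element and no scalar tail is needed. */
_Static_assert(SHAPE_ORDER_MAX % BLOCK == 0,
               "blocked loop must cover the whole buffer");

/* Hypothetical helper: widen 16-bit Q13 coefficients to 32-bit Q28. */
static void widen_q13_to_q28(int32_t out_Q28[SHAPE_ORDER_MAX],
                             const int16_t in_Q13[SHAPE_ORDER_MAX]) {
    assert(out_Q28 != NULL && in_Q13 != NULL);
    for (int i = 0; i < SHAPE_ORDER_MAX; i += BLOCK) {
        /* The inner loop stands in for one 8-lane load, widen and store. */
        for (int k = 0; k < BLOCK; k++) {
            /* Multiply by 2^15 instead of shifting, to avoid shifting
             * a negative value; the result always fits in int32_t. */
            out_Q28[i + k] = (int32_t)in_Q13[i + k] * 32768;
        }
    }
}
```

The compile-time assertion documents the divisibility assumption, so a future change to the maximum order would fail to build instead of silently skipping elements.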

silk/fixed/arm/warped_autocorrelation_FIX_neon_intr.c — Fix precision loss in autocorrelation

The original code computed corr_QC[orderT] (the zero-lag autocorrelation) from the raw int16 input using vmull_s16, then left-shifted the result by QC. This lost precision because it operated on the un-warped source data instead of the warped signal that the rest of the function processes.
The fix computes corr_QC[orderT] from the already-warped input_QS (int32) data using vmull_s32 + vsraq_n_s64 with the correct right shift 2*QS - QC, consistent with how all other corr_QC[] entries are accumulated. This ensures the autocorrelation is derived from the same warped signal path, eliminating the precision discrepancy.
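
A scalar model of the accumulation described in the fix might look like the sketch below. It is not the NEON code from the patch. The function name is hypothetical, and the values `QS = 13` and `QC = 10` are assumptions made for illustration. Each 64-bit product is shifted right by `2*QS - QC` before being added, which mirrors a widening multiply followed by a shift-right-and-accumulate.

```c
#include <assert.h>
#include <stdint.h>

#define QS 13 /* assumed Q format of the input samples */
#define QC 10 /* assumed Q format of the correlation output */

/* Hypothetical helper: zero-lag correlation (energy) of Q(QS) samples,
 * returned in Q(QC). */
static int64_t zero_lag_corr_QC(const int32_t *input_QS, int len) {
    int64_t corr_QC = 0;
    assert(input_QS != 0 && len >= 0);
    for (int n = 0; n < len; n++) {
        /* Widening multiply: the square of a 32-bit value fits in 64 bits
         * and is never negative, so the right shift is well defined. */
        int64_t prod_Q2QS = (int64_t)input_QS[n] * input_QS[n];
        /* Shift from Q(2*QS) down to Q(QC), then accumulate. */
        corr_QC += prod_Q2QS >> (2 * QS - QC);
    }
    return corr_QC;
}
```

With these assumed formats, the samples 1.0, -0.5 and 1.5 (8192, -4096 and 12288 in Q13) accumulate to 3584, which is 3.5 in Q10.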

Performance

Metric	Improvement
Encoding speed	~25% faster
Dynamic instruction count	~30% fewer

Signed-off-by: wangzihao <wangzihao18@huawei.com>

feat: optimize silk encoding performance

5f9910a


wangzihao3 changed the title ~~feat: optimize silk encoding performance~~ Optimize Silk encoder NEON paths and fix warped autocorrelation precision bug May 6, 2026

wangzihao3 wants to merge 1 commit into xiph:main from wangzihao3:main
