Optimize Silk encoder NEON paths and fix warped autocorrelation precision bug #473
Open
wangzihao3 wants to merge 1 commit into xiph:main from
Signed-off-by: wangzihao <wangzihao18@huawei.com>
Summary
This PR applies targeted ARM NEON optimizations to two Silk encoder hot paths (~25% speedup, ~30% fewer instructions) and fixes a precision-loss bug in the warped autocorrelation computation.
Changes
silk/LPC_analysis_filter.c — New NEON implementation
- silk_LPC_analysis_filter() that processes 8 filter taps per iteration using paired vmlal_s16 accumulators (int32x4_t).
- silk_LPC_analysis_filter_c() for reference and OPUS_CHECK_ASM validation.
silk/arm/NSQ_del_dec_neon_intr.c — Multiple optimizations
- Replace vld1q_s32 loads with ldp (load-pair) inline assembly, loading 8 data vectors + 2 coefficient vectors per block. Use vqdmulhq_laneq_s32 to index directly into 128-bit coefficient vectors, eliminating vget_low/high_s32 split overhead.
- shapingLPCOrder, enabling precise register allocation and removing loop overhead.
- Tilt_Q14_Q16 and LF_shp_Q14_Q15 outside the per-sample loop; cache LF_AR_Q14 to avoid redundant vector loads.
- MAX_SHAPE_LPC_ORDER (24) is exactly divisible by 8.
- Unroll the numOthers copy loop 16× with __builtin_prefetch to hide memory latency during the candidate state swap.
silk/fixed/arm/warped_autocorrelation_FIX_neon_intr.c — Fix precision loss in autocorrelation
- Previously, corr_QC[orderT] (the zero-lag autocorrelation) was computed from the raw int16 input using vmull_s16, and the result was then left-shifted by QC. This lost precision because it operated on the un-warped source data instead of the warped signal that the rest of the function processes.
- corr_QC[orderT] is now computed from the already-warped input_QS (int32) data using vmull_s32 + vsraq_n_s64 with the correct right shift 2*QS - QC, consistent with how all other corr_QC[] entries are accumulated. This ensures the autocorrelation is derived from the same warped signal path, eliminating the precision discrepancy.
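The corrected zero-lag accumulation can be illustrated with a minimal scalar sketch. This is not the code from this PR: zero_lag_energy_qc is a hypothetical helper, and the values QS = 13 and QC = 10 in the usage note are assumptions chosen for illustration. Samples are assumed to be int32 values in Q(QS); each 64-bit product is right-shifted by 2*QS - QC before it is added, which is the scalar analogue of a widening multiply followed by a shift-right-accumulate.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of Opus): zero-lag autocorrelation term.
 * x_qs : samples in Q(qs) fixed point, stored as int32
 * len  : number of samples
 * qs   : Q format of the input samples (assumed value in the usage note)
 * qc   : Q format of the returned correlation (assumed value in the usage note)
 * Returns sum((x[i] * x[i]) >> (2*qs - qc)) as a 64-bit value in Q(qc). */
int64_t zero_lag_energy_qc(const int32_t *x_qs, int len, int qs, int qc)
{
    const int rshift = 2 * qs - qc;
    int64_t acc = 0;

    /* The product is in Q(2*qs); the shift must bring it down to Q(qc). */
    assert(rshift >= 0 && rshift < 63);

    for (int i = 0; i < len; i++) {
        /* Widen before multiplying so the square cannot overflow 32 bits. */
        int64_t prod = (int64_t)x_qs[i] * (int64_t)x_qs[i];
        /* Squares are non-negative, so the right shift is well defined. */
        acc += prod >> rshift;
    }
    return acc;
}
```

For example, with QS = 13 and QC = 10, the samples {8192, 16384, -24576} (1, 2 and -3 in Q13) give 14336, which is 14 in Q10.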
Performance