Add Conv-based STFT variants for the ONNX preprocessors #124
Open
intexcor wants to merge 3 commits into
op.STFT has no kernel in the onnxruntime CUDA execution provider: a preprocessor graph that uses it gets split, and the STFT node runs on CPU with host/device copies around it. Accelerators such as CoreML do not support it either, and for non-power-of-2 FFT sizes it is slow on CPU. Add a shared preprocessors/stft.py helper that expresses the windowed DFT as a 1d convolution with a fixed kernel, plus Conv-based variants of every STFT-using preprocessor: gigaam_v2/v3, nemo80/128, whisper80/128 and kaldi. The new graphs use only operators with kernels on every execution provider, so they run fully on GPU; they are numerically equivalent to the STFT graphs.
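The "windowed DFT as a 1d convolution" idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in `preprocessors/stft.py`: the function names, the periodic Hann window, and the absence of padding are all hypothetical choices made for the example. Each output channel of the convolution is one frequency bin, the kernel width is the FFT size, and the stride is the hop length.

```python
import cmath
import math


def hann(n_fft):
    # Periodic Hann window (an assumption; the real preprocessors may use another window).
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]


def stft_conv_kernels(n_fft, window):
    """Return (real, imag) kernels, each shaped [n_fft // 2 + 1][n_fft]."""
    n_bins = n_fft // 2 + 1
    real = [[window[n] * math.cos(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
            for k in range(n_bins)]
    imag = [[-window[n] * math.sin(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
            for k in range(n_bins)]
    return real, imag


def conv1d(signal, kernels, stride):
    """Valid, strided cross-correlation (no kernel flip), as ONNX Conv computes it."""
    width = len(kernels[0])
    n_frames = (len(signal) - width) // stride + 1
    return [[sum(w * signal[t * stride + n] for n, w in enumerate(kernel))
             for t in range(n_frames)]
            for kernel in kernels]


def power_spectrogram_conv(signal, n_fft, hop):
    """Power spectrogram [bins][frames] from two convolutions with fixed kernels."""
    real_k, imag_k = stft_conv_kernels(n_fft, hann(n_fft))
    re = conv1d(signal, real_k, hop)
    im = conv1d(signal, imag_k, hop)
    return [[r * r + i * i for r, i in zip(re_row, im_row)]
            for re_row, im_row in zip(re, im)]


def power_spectrogram_dft(signal, n_fft, hop):
    """Reference: window each frame and evaluate the DFT definition directly."""
    window = hann(n_fft)
    n_frames = (len(signal) - n_fft) // hop + 1
    out = []
    for k in range(n_fft // 2 + 1):
        row = []
        for t in range(n_frames):
            acc = sum(window[n] * signal[t * hop + n]
                      * cmath.exp(-2j * math.pi * k * n / n_fft)
                      for n in range(n_fft))
            row.append(abs(acc) ** 2)
        out.append(row)
    return out
```

Because the kernel is just a fixed weight tensor, the FFT size does not need to be a power of two, and the whole operation maps onto a single strided convolution.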
PreprocessorRuntimeConfig gains a use_conv_preprocessors flag that selects the Conv-based ONNX preprocessor variants. It defaults to auto: enabled when a CUDA or TensorRT execution provider is used, disabled otherwise. When the Conv preprocessors are used the CUDA provider is no longer excluded from the preprocessor session (op.STFT has no CUDA kernel, the Conv graph does), so preprocessing runs on the GPU instead of falling back to a NumPy/CPU implementation.
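The auto behaviour described above could look roughly like the sketch below. It is hypothetical: the helper names, the use of `None` to represent "auto", and the `<name>.onnx` file name for the STFT variant are assumptions for illustration; only the `<name>_conv.onnx` naming and the provider rule come from this PR. The provider strings are the standard onnxruntime names.

```python
from typing import Optional, Sequence, Tuple, Union

Provider = Union[str, Tuple[str, dict]]

# Providers for which the Conv variants are auto-enabled.
GPU_PROVIDERS = {"CUDAExecutionProvider", "TensorrtExecutionProvider"}


def resolve_use_conv(use_conv_preprocessors: Optional[bool],
                     providers: Sequence[Provider]) -> bool:
    """Resolve the flag; None stands for 'auto' in this sketch."""
    if use_conv_preprocessors is not None:
        return use_conv_preprocessors  # an explicit setting always wins
    # onnxruntime accepts providers as names or (name, options) tuples.
    names = [p if isinstance(p, str) else p[0] for p in providers]
    return any(name in GPU_PROVIDERS for name in names)


def preprocessor_filename(name: str, use_conv: bool) -> str:
    """Pick the graph file; '<name>.onnx' for the STFT variant is an assumption."""
    return f"{name}_conv.onnx" if use_conv else f"{name}.onnx"
```

For example, `resolve_use_conv(None, ["CPUExecutionProvider"])` leaves the Conv variants disabled, while adding `"CUDAExecutionProvider"` to the list enables them.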
Parametrize the preprocessor tests over the Conv variants, update the build file counts, and cover use_conv_preprocessors selection in the Manager and preprocessor-option tests.
ONNX Runtime has no provider-agnostic implementation of `op.STFT`. TensorRT and DirectML each ship their own lowering (applied at model load), but the CUDA execution provider has no STFT kernel, so the `STFT` node falls back to CPU with host/device copies around it, and the CPU implementation is slow for non-power-of-2 FFT sizes.

This PR does that lowering once, at the ONNX-graph level: the windowed DFT is expressed as a 1d convolution (the cos/sin Fourier basis multiplied by the analysis window, as a fixed `Conv` kernel). The resulting graph uses only operators that have kernels on every execution provider, so a single ONNX preprocessor runs natively on CPU, CUDA, TensorRT, CoreML and DirectML.

## Changes

- `preprocessors/stft.py`: shared helper. `stft_conv_weights()` builds the Conv kernel; `conv_power_spectrogram()` is the Conv-based STFT subgraph.
- Conv variants of `gigaam_v2/v3`, `nemo80/128`, `whisper80/128` and `kaldi`, built as `<name>_conv.onnx`.
- `use_conv_preprocessors` flag in `PreprocessorRuntimeConfig`: selects the Conv variants; auto-enabled for CUDA / TensorRT providers. When enabled, the CUDA provider is no longer excluded from the preprocessor session.
- `Manager` tests cover the new flag.

## Numerical equivalence

The Conv graphs are numerically equivalent to the STFT graphs: the existing `tests/preprocessors` checks pass for the Conv variants against the torchaudio / kaldi-native-fbank references, with the existing tolerances.

## Benchmark (16 s audio, batch 1)

CPU — preprocessor latency:
On CPU the existing NumPy preprocessors are the fastest option and remain the default — this PR does not change that.
CUDA (RTX 3090 Ti, onnxruntime-gpu 1.26):
On the CUDA EP the STFT graph runs the `STFT` node on CPU with memcpy nodes around it; the Conv graph runs entirely on the GPU. This lets the preprocessor stay on-device in a GPU pipeline instead of using the NumPy/CPU fallback.

## Scope

- TensorRT lowers `op.STFT` internally, so the Conv variant is performance-neutral there; it is auto-enabled for CUDA/TensorRT so that a single ONNX graph runs across every provider.
- `wespeaker` is left unchanged: it uses `op.DFT` + `op.Scan` (the slow/accurate preprocessor path), which is a separate case.