Add Conv-based STFT variants for the ONNX preprocessors #124

Open
intexcor wants to merge 3 commits into istupakov:main from intexcor:conv-stft-preprocessors

intexcor commented May 17, 2026

ONNX Runtime has no provider-agnostic implementation of op.STFT. TensorRT and DirectML each ship their own lowering (applied at model load), but the CUDA execution provider has no STFT kernel — the STFT node falls back to CPU with host/device copies around it — and the CPU implementation is slow for non-power-of-2 FFT sizes.

This PR does that lowering once, at the ONNX-graph level: the windowed DFT is expressed as a 1-D convolution whose fixed Conv kernel is the cos/sin Fourier basis multiplied by the analysis window. The resulting graph uses only operators that have kernels on every execution provider, so a single ONNX preprocessor runs natively on CPU, CUDA, TensorRT, CoreML and DirectML.
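For reviewers who want to see the idea in isolation, here is a minimal pure-Python sketch of a windowed DFT written as a strided convolution. It is not the code in preprocessors/stft.py: the `sketch_*` names, the absence of padding/centering, and the periodic Hann window are assumptions made for illustration only.

```python
import cmath
import math


def sketch_stft_conv_kernel(n_fft, window):
    """Build one length-n_fft filter per frequency bin (hypothetical helper).

    Real filters are window * cos, imaginary filters are -window * sin.
    """
    n_bins = n_fft // 2 + 1
    real_rows, imag_rows = [], []
    for k in range(n_bins):
        angles = [2.0 * math.pi * k * n / n_fft for n in range(n_fft)]
        real_rows.append([window[n] * math.cos(a) for n, a in enumerate(angles)])
        imag_rows.append([-window[n] * math.sin(a) for n, a in enumerate(angles)])
    return real_rows, imag_rows


def sketch_conv1d(signal, filters, stride):
    """Cross-correlation without padding, i.e. what a Conv node computes."""
    width = len(filters[0])
    n_frames = (len(signal) - width) // stride + 1
    return [
        [sum(f[n] * signal[t * stride + n] for n in range(width))
         for t in range(n_frames)]
        for f in filters
    ]


def sketch_conv_power_spectrogram(signal, n_fft, hop, window):
    """Power spectrogram [n_bins][n_frames] from two strided convolutions."""
    real_rows, imag_rows = sketch_stft_conv_kernel(n_fft, window)
    re = sketch_conv1d(signal, real_rows, hop)
    im = sketch_conv1d(signal, imag_rows, hop)
    return [[r * r + i * i for r, i in zip(rr, ii)] for rr, ii in zip(re, im)]


def reference_power_spectrogram(signal, n_fft, hop, window):
    """Direct per-frame windowed DFT, used only to check the Conv version."""
    n_bins = n_fft // 2 + 1
    n_frames = (len(signal) - n_fft) // hop + 1
    out = [[0.0] * n_frames for _ in range(n_bins)]
    for t in range(n_frames):
        frame = [signal[t * hop + n] * window[n] for n in range(n_fft)]
        for k in range(n_bins):
            x = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n in range(n_fft))
            out[k][t] = abs(x) ** 2
    return out


if __name__ == "__main__":
    n_fft, hop = 12, 5  # deliberately not a power of two
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft) for n in range(n_fft)]
    signal = [math.sin(0.3 * i) + 0.5 * math.cos(1.1 * i) for i in range(40)]
    conv = sketch_conv_power_spectrogram(signal, n_fft, hop, window)
    ref = reference_power_spectrogram(signal, n_fft, hop, window)
    print(max(abs(a - b) for ra, rb in zip(conv, ref) for a, b in zip(ra, rb)))
```

In an ONNX graph the two filter banks would presumably be stacked into a single Conv weight tensor with the hop length as the stride; the sketch keeps them separate for readability.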

Changes

  • preprocessors/stft.py — shared helper: stft_conv_weights() builds the Conv kernel, conv_power_spectrogram() is the Conv-based STFT subgraph.
  • Conv-based variants of every STFT preprocessor: gigaam_v2/v3, nemo80/128, whisper80/128, kaldi, built as <name>_conv.onnx.
  • use_conv_preprocessors flag in PreprocessorRuntimeConfig — selects the Conv variants; auto-enabled for CUDA / TensorRT providers. When enabled, the CUDA provider is no longer excluded from the preprocessor session.
  • Preprocessor tests parametrized over the Conv variants; Manager tests cover the new flag.
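The auto-enable behaviour of the flag can be sketched roughly as follows. This is an illustration, not the code in this PR: `resolve_use_conv` and `preprocessor_model_name` are hypothetical helpers, treating `None` as "auto" is an assumption, and the `<name>.onnx` filename for the STFT variant is assumed. Only the provider names are the real ONNX Runtime identifiers.

```python
from typing import Optional, Sequence, Tuple, Union

# Real ONNX Runtime provider names; everything else here is hypothetical.
CONV_AUTO_PROVIDERS = {"CUDAExecutionProvider", "TensorrtExecutionProvider"}

Provider = Union[str, Tuple[str, dict]]


def resolve_use_conv(use_conv_preprocessors: Optional[bool],
                     providers: Sequence[Provider]) -> bool:
    """Return the effective flag; None is assumed to mean 'auto'."""
    if use_conv_preprocessors is not None:
        return use_conv_preprocessors  # an explicit setting always wins
    # Providers may be plain names or (name, options) tuples.
    names = {p if isinstance(p, str) else p[0] for p in providers}
    return bool(names & CONV_AUTO_PROVIDERS)


def preprocessor_model_name(name: str, use_conv: bool) -> str:
    """Pick the graph file; '<name>.onnx' for the STFT variant is assumed."""
    return f"{name}_conv.onnx" if use_conv else f"{name}.onnx"


if __name__ == "__main__":
    providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    use_conv = resolve_use_conv(None, providers)
    print(preprocessor_model_name("nemo80", use_conv))
```

Keeping an explicit `True`/`False` override separate from the auto case lets a CPU-only user opt in to the Conv graphs, or a CUDA user opt out, without touching the provider list.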

Numerical equivalence

The Conv graphs are numerically equivalent to the STFT graphs: the existing tests/preprocessors checks pass for the Conv variants against the torchaudio / kaldi-native-fbank references, with unchanged tolerances.
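For context, the equivalence follows from writing out the windowed DFT. The notation below is an assumption for illustration (frame index $t$, bin $k$, FFT size $N$, hop $H$, analysis window $w$, no padding), not taken from the code:

```math
\begin{aligned}
X_t[k] &= \sum_{n=0}^{N-1} w[n]\, x[tH+n]\, e^{-2\pi i k n / N} \\
       &= \sum_{n=0}^{N-1} c_k[n]\, x[tH+n] \;-\; i \sum_{n=0}^{N-1} s_k[n]\, x[tH+n], \\
c_k[n] &= w[n] \cos\!\left(\tfrac{2\pi k n}{N}\right), \qquad
s_k[n] = w[n] \sin\!\left(\tfrac{2\pi k n}{N}\right), \\
|X_t[k]|^2 &= \Big(\sum_{n} c_k[n]\, x[tH+n]\Big)^2 + \Big(\sum_{n} s_k[n]\, x[tH+n]\Big)^2 .
\end{aligned}
```

Each sum is a cross-correlation of $x$ with a fixed filter at stride $H$, which is what a Conv node computes, so the two graphs agree exactly in exact arithmetic and differ only by floating-point rounding in practice.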

Benchmark (16 s audio, batch 1)

CPU — preprocessor latency:

| preprocessor | NumPy | Conv ONNX | STFT ONNX |
| --- | --- | --- | --- |
| gigaam_v2 | 1.5 ms | 2.9 ms | 39 ms |
| whisper80 | 3.3 ms | 5.4 ms | 74 ms |
| nemo80 | 2.9 ms | 4.7 ms | 11 ms |

On CPU the existing NumPy preprocessors are the fastest option and remain the default — this PR does not change that.

CUDA (RTX 3090 Ti, onnxruntime-gpu 1.26):

| preprocessor | STFT ONNX | Conv ONNX |
| --- | --- | --- |
| gigaam_v2 | 156 ms | 0.96 ms |
| whisper80 | 292 ms | 1.38 ms |
| nemo80 | 35 ms | 2.74 ms |

On the CUDA EP the STFT graph runs the STFT node on CPU with memcpy nodes around it; the Conv graph runs entirely on the GPU. This lets the preprocessor stay on-device in a GPU pipeline instead of using the NumPy/CPU fallback.

Scope

  • The value of the Conv variant is running the preprocessor on an accelerator (CUDA EP) — on CPU the NumPy preprocessors are faster, and this PR keeps them as the CPU default.
  • TensorRT and DirectML already lower op.STFT internally, so the Conv variant is performance-neutral there; it is auto-enabled for CUDA/TensorRT so that a single ONNX graph runs across every provider.
  • wespeaker is left unchanged — it uses op.DFT + op.Scan (the slow/accurate preprocessor path), which is a separate case.

intexcor added 3 commits May 17, 2026 15:25
op.STFT has no kernel in the onnxruntime CUDA execution provider: a
preprocessor graph that uses it gets split, and the STFT node runs on CPU
with host/device copies around it. Accelerators such as CoreML do not
support it either, and for non-power-of-2 FFT sizes it is slow on CPU.

Add a shared preprocessors/stft.py helper that expresses the windowed DFT
as a 1d convolution with a fixed kernel, plus Conv-based variants of every
STFT-using preprocessor: gigaam_v2/v3, nemo80/128, whisper80/128 and kaldi.
The new graphs use only operators with kernels on every execution provider,
so they run fully on GPU; they are numerically equivalent to the STFT graphs.

PreprocessorRuntimeConfig gains a use_conv_preprocessors flag that selects
the Conv-based ONNX preprocessor variants. It defaults to auto: enabled
when a CUDA or TensorRT execution provider is used, disabled otherwise.

When the Conv preprocessors are used the CUDA provider is no longer
excluded from the preprocessor session (op.STFT has no CUDA kernel, the
Conv graph does), so preprocessing runs on the GPU instead of falling
back to a NumPy/CPU implementation.

Parametrize the preprocessor tests over the Conv variants, update the
build file counts, and cover use_conv_preprocessors selection in the
Manager and preprocessor-option tests.