Python bindings for the WebRTC audio processing module. Echo cancellation, noise suppression, automatic gain control, voice activity detection, and high-pass filtering - the same algorithms that run in Chrome, Edge, and every WebRTC-based application.

    from pywebrtc_audio import AudioProcessor

    ap = AudioProcessor(
        sample_rate=16000,
        noise_suppression=True,
        echo_cancellation=True,
        auto_gain_control=True,
        stream_delay_ms=40,
    )
    ap.stream_delay_ms = 50  # adjustable at runtime

    # near = what the mic picked up (speech + echo + noise)
    # far = what you played through the speaker (reference signal)
    # accepts int16 or float32 numpy arrays, returns the same dtype
    clean = ap.process(near, far)

    # speech probability from the noise suppressor's spectral analysis
    print(ap.speech_probability)  # 0.0-1.0
    print(ap.gain_db)  # current AGC gain in dB

    pip install pywebrtc-audio

Pre-built wheels for Linux (x86_64, aarch64), macOS (x86_64, arm64), and Windows (x86_64). Python 3.10-3.14.

See the examples/ directory:

- basic.py - Minimal usage
- strands_agents_bidi.py - Strands BidiAgent with live echo cancellation
- stereo.py - Stereo (multi-channel) processing with interleaved layout
- agc.py - Automatic gain control
- vad_realtime.py - Real-time voice activity detection from the mic
- wav_file.py - Process wav files offline
- pyaudio_realtime.py - Real-time echo cancellation with PyAudio
- e2e_verify.py - Record from mic + speakers, compare raw vs AEC output
- e2e_speech.py - Talk while a tone plays, verify speech is preserved

- Voice agents and assistants - When an AI agent speaks through a speaker and listens through a mic on the same device, it hears its own output as echo. AEC removes the agent's voice from the mic capture so it only hears the user. See examples/strands_agents_bidi.py for a working Strands BidiAgent integration.
- Speech-to-text preprocessing - Clean up mic audio before sending it to a transcription service. Noise suppression removes background noise (fans, traffic, keyboard), AGC normalizes volume across speakers, and the high-pass filter removes low-frequency rumble. Reduces word error rates without any model changes.
- Telephony and VoIP - The same processing pipeline that runs in Chrome for WebRTC calls, available as a Python library. Process audio from SIP trunks, WebSocket streams, or any other audio source that needs echo cancellation and noise reduction.
- Voice activity detection - Use VoiceDetector or speech_probability to detect when someone is speaking. Useful for turn-taking in conversational AI, silence trimming in recordings, or triggering wake-word pipelines only when speech is present. Runs in ~2µs per 10ms frame.
- Robotics - Robots with speakers and microphones face the same echo problem as voice assistants, often worse due to motor noise and reverberant environments. The full pipeline (AEC + NS + AGC) handles all of this in a single process() call.
- Audio recording and podcasting - Clean up recordings after the fact with examples/wav_file.py. Remove background noise from interview recordings, normalize volume levels across multiple speakers, or batch-process audio files through the pipeline.
- Real-time audio monitoring - Build live audio meters, speech detectors, or noise level monitors. All processing runs in C++ with the GIL released, so it won't block your Python event loop or UI thread.

All processing runs in C++ with the GIL released. At 16kHz mono (the most common voice configuration), processing 100ms of int16 audio on an Apple M3 Pro:
| Pipeline | Time | Realtime factor |
|---|---|---|
| VoiceDetector | 21 µs | 4,665x |
| NoiseSuppressor | 32 µs | 3,089x |
| GainController | 103 µs | 970x |
| EchoCanceller | 622 µs | 161x |
| AudioProcessor (AEC+NS+AGC) | 649 µs | 154x |
| AudioProcessor (all features) | 686 µs | 146x |
The full pipeline processes 1 second of audio in ~7ms. Even at 48kHz stereo with all features, it runs at 82x real-time. int16 and float32 perform nearly identically.
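
The realtime factor in the table is the audio duration divided by the processing time. The sketch below redoes that arithmetic for the "all features" row; the helper name `realtime_factor` is made up for illustration, and because the table's times are rounded to whole microseconds, recomputed factors for the fastest rows can differ slightly from the published ones.

```python
def realtime_factor(audio_us: float, processing_us: float) -> float:
    """Return how many times faster than real time the processing ran."""
    return audio_us / processing_us


CHUNK_US = 100_000  # 100 ms of audio, the chunk size used in the table

# AudioProcessor (all features): 686 µs per 100 ms chunk
full = realtime_factor(CHUNK_US, 686)

# One second of audio is ten such chunks
per_second_ms = 686 * 10 / 1000

print(f"{full:.0f}x real time, {per_second_ms:.2f} ms per second of audio")
```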
See benchmarks/BENCHMARK.md for detailed results across all sample rates, dtypes, chunk sizes, and stereo.
Five classes, each with a process() method that accepts int16 or float32 numpy arrays of any length. Internally, each call splits the input into 10ms frames and processes them in a single GIL-released loop. If the input isn't a multiple of the frame size, the last frame is zero-padded and the output is truncated to match the original input length. VoiceDetector returns a speech probability instead of audio. AudioProcessor combines them into a single pipeline.
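
The framing behaviour described above (split into 10ms frames, zero-pad the last one, truncate the output) can be sketched in plain Python. This is only an illustration of the idea, not the library's C++ implementation; `process_any_length` and the identity `process_frame` callback are hypothetical.

```python
def process_any_length(samples, frame_size, process_frame):
    """Run process_frame over fixed-size frames of an arbitrary-length input."""
    original_len = len(samples)
    # Zero-pad so the length becomes a multiple of frame_size
    padding = (-original_len) % frame_size
    padded = list(samples) + [0] * padding

    out = []
    for start in range(0, len(padded), frame_size):
        out.extend(process_frame(padded[start:start + frame_size]))

    # Drop the samples that correspond to the padding
    return out[:original_len]


frame_size = 16000 // 100   # 160 samples per 10 ms at 16 kHz mono
audio = list(range(400))    # 25 ms: two full frames plus half a frame

frame_lengths = []

def identity(frame):
    frame_lengths.append(len(frame))  # record what each call received
    return frame

result = process_any_length(audio, frame_size, identity)
```

Every call to the per-frame function sees a full 160-sample frame, and the caller still gets 400 samples back.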
Multi-channel audio uses interleaved layout: [L0, R0, L1, R1, ...]. A 10ms stereo frame at 16kHz is 320 samples (160 per channel × 2 channels). Mono is the default and most common for voice processing.
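
As a rough illustration of the interleaved layout, the following stdlib-only sketch builds and splits a stereo buffer and recomputes the frame size quoted above. The helper names are hypothetical; with numpy arrays the same result is usually obtained with slicing or reshaping.

```python
def interleave(left, right):
    """Merge two equal-length channels into [L0, R0, L1, R1, ...]."""
    out = []
    for l, r in zip(left, right):
        out.extend((l, r))
    return out


def deinterleave(samples, num_channels=2):
    """Split an interleaved buffer back into one list per channel."""
    return [samples[c::num_channels] for c in range(num_channels)]


def samples_per_frame(sample_rate, num_channels, frame_ms=10):
    """Total samples (all channels) in one frame of frame_ms milliseconds."""
    return sample_rate * frame_ms // 1000 * num_channels


stereo = interleave([1, 2, 3], [10, 20, 30])
left, right = deinterleave(stereo)
stereo_frame = samples_per_frame(16000, 2)
```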
Instances are not thread-safe. Use one per thread or synchronize externally.
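
One common way to get "one instance per thread" is `threading.local`. The sketch below uses a hypothetical `StandInProcessor` class in place of a real processor so that it runs without the library installed; in real code the factory would construct one of the classes documented here.

```python
import threading


class StandInProcessor:
    """Hypothetical placeholder for a stateful, non-thread-safe processor."""

    def __init__(self):
        self.frames_seen = 0

    def process(self, frame):
        self.frames_seen += 1
        return frame


_local = threading.local()


def get_processor():
    """Return this thread's own processor, creating it on first use."""
    if not hasattr(_local, "processor"):
        _local.processor = StandInProcessor()
    return _local.processor


def worker(results, index):
    proc = get_processor()
    proc.process([0] * 160)
    results[index] = proc  # keep a reference so the test can inspect it


results = [None, None]
threads = [threading.Thread(target=worker, args=(results, i)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The alternative mentioned above, external synchronization, would instead share one instance and guard every call with a lock.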

    AudioProcessor(
        sample_rate=16000,
        num_channels=1,
        echo_cancellation=False,
        noise_suppression=False,
        high_pass_filter=False,
        auto_gain_control=False,
        ns_level=1,
        agc_gain_db=0.0,
        agc_max_gain_db=50.0,
        stream_delay_ms=0,
    )

Combined audio processing pipeline. Runs echo cancellation, noise suppression, automatic gain control, and high-pass filtering in a single optimized pass over shared audio buffers - avoids the overhead of copying frames between separate processors. Processing order: HP filter -> AEC -> NS -> AGC.

- echo_cancellation: Enable AEC3 echo cancellation.
- noise_suppression: Enable noise suppression.
- high_pass_filter: Enable high-pass filter (also enabled automatically with AEC).
- auto_gain_control: Enable AGC2 automatic gain control. Uses speech probability from NS if enabled, otherwise runs its own internal RNN VAD.
- ns_level: Noise suppression level 0-3 (6dB, 12dB, 18dB, 21dB).
- agc_gain_db: Fixed gain in dB applied after adaptive gain. Default 0.
- agc_max_gain_db: Maximum adaptive gain in dB. Default 50.
- stream_delay_ms: Audio buffer delay hint in milliseconds for AEC. Also available as a read/write property. This is the delay between writing audio to the speaker buffer and the corresponding echo appearing in the mic capture. Most audio APIs report their buffer size - for PyAudio it's frames_per_buffer / sample_rate * 1000. Default 0 lets AEC3's internal delay estimator figure it out, but providing a hint helps it converge faster.

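
The PyAudio formula for the delay hint is simple arithmetic. A small sketch, assuming a buffer of 1024 frames at 16kHz (both values chosen only for illustration; `buffer_delay_ms` is a hypothetical helper):

```python
def buffer_delay_ms(frames_per_buffer: int, sample_rate: int) -> float:
    """Duration of one audio buffer in milliseconds."""
    return frames_per_buffer / sample_rate * 1000


delay = buffer_delay_ms(1024, 16000)  # roughly 64 ms
hint = round(delay)                   # whole milliseconds, like the values used above
print(hint)
```

The rounded value is the kind of number that would be passed as stream_delay_ms. The real acoustic delay may also include driver and hardware latency that the buffer size does not capture, which is presumably why the value is described as a hint.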
Note: When echo_cancellation is enabled, a high-pass filter is always applied to the capture signal before echo cancellation, regardless of the high_pass_filter setting. This matches Chrome's behavior - the HP filter removes DC offset that would otherwise degrade AEC performance.
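
To show what "removing DC offset" means, here is a textbook one-pole DC blocker. It is not the filter WebRTC uses (its design is not described here); the function name and the coefficient `r=0.995` are illustrative assumptions.

```python
def dc_block(samples, r=0.995):
    """Simple DC blocker: y[n] = x[n] - x[n-1] + r * y[n-1]."""
    out = []
    prev_x = 0.0
    prev_y = 0.0
    for x in samples:
        y = x - prev_x + r * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out


# A signal that is nothing but a constant offset
offset_only = [0.5] * 2000
filtered = dc_block(offset_only)
```

After the initial step, the output decays toward zero, so a constant offset in the capture signal no longer reaches the stages that follow.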

    AudioProcessor.process(near, far=None) -> np.ndarray

Process audio of any length.

- near: Microphone capture signal (int16 or float32 numpy array, any length).
- far: Speaker reference signal (required when echo_cancellation=True, same length as near).
- Returns: Processed audio (same dtype and length as input).

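
Because far must have the same length as near, callers whose playback and capture buffers drift apart need to reconcile the two before calling process(). One possible approach, sketched with plain lists and a hypothetical helper `match_length`, is to truncate or zero-pad the reference. Padding with silence is an assumption that only makes sense if nothing was actually played during the missing samples.

```python
def match_length(far, target_len):
    """Truncate or zero-pad the reference so it has exactly target_len samples."""
    far = list(far)
    if len(far) >= target_len:
        return far[:target_len]
    return far + [0] * (target_len - len(far))


near = [3, 1, 4, 1, 5, 9]
short_far = match_length([2, 7], len(near))
long_far = match_length(range(10), len(near))
```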

    AudioProcessor.reset()

Reset all internal DSP state (AEC filter coefficients, noise estimates, high-pass filter, AGC gain state) while keeping the original configuration. Useful between conversations or after interruptions to avoid stale state affecting the next audio stream.

    AudioProcessor.speech_probability

Read-only property. Speech probability (0.0-1.0) from the most recent process() call. Always available. Priority: noise suppressor's spectral estimate (when noise_suppression=True), then AGC's internal RNN VAD estimate (when auto_gain_control=True), then a lightweight spectral analysis (same as VoiceDetector).

    AudioProcessor.gain_db

Read-only property. Current applied gain in dB from the most recent process() call. Only available when auto_gain_control=True; raises RuntimeError otherwise.

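
The priority order for speech_probability amounts to a short decision chain. The function below is a hypothetical restatement of that rule for readers who prefer code; it only returns a label and does not call the library.

```python
def speech_probability_source(noise_suppression: bool, auto_gain_control: bool) -> str:
    """Name the estimator that speech_probability is documented to come from."""
    if noise_suppression:
        return "noise suppressor spectral estimate"
    if auto_gain_control:
        return "AGC internal VAD estimate"
    return "lightweight spectral analysis"


for ns, agc in [(True, True), (False, True), (False, False)]:
    print(ns, agc, "->", speech_probability_source(ns, agc))
```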

    GainController(
        sample_rate=16000,
        num_channels=1,
        fixed_gain_db=0.0,
        adaptive_digital=True,
        max_gain_db=50.0,
        headroom_db=5.0,
        max_gain_change_db_per_second=6.0,
        max_output_noise_level_dbfs=-50.0,
    )

Standalone automatic gain control using the AGC2 algorithm. Combines adaptive digital gain, fixed digital gain, and a limiter. Uses an internal VAD (same spectral analysis as NoiseSuppressor) unless speech_probability is provided to process().

- sample_rate: Audio sample rate in Hz. Supported: 16000, 32000, 48000.
- num_channels: Number of audio channels (1 for mono, 2 for stereo).
- fixed_gain_db: Constant gain in dB applied after adaptive gain. Default 0.
- adaptive_digital: Enable adaptive digital gain. Default True.
- max_gain_db: Maximum adaptive gain in dB. Default 50.
- headroom_db: Safety margin below 0 dBFS. Default 5.
- max_gain_change_db_per_second: Gain slew rate. Default 6.
- max_output_noise_level_dbfs: Limits gain to avoid amplifying noise. Default -50.

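
To give a feel for what a slew rate of 6 dB per second and a 50 dB ceiling mean, the sketch below steps a gain value toward a target once per 10ms frame. It is a simplified illustration and not the AGC2 algorithm; `step_gain` and `db_to_linear` are hypothetical helpers.

```python
def db_to_linear(gain_db: float) -> float:
    """Convert a gain in dB to a linear amplitude factor."""
    return 10 ** (gain_db / 20)


def step_gain(current_db, target_db, max_change_db_per_second=6.0,
              frame_ms=10, max_gain_db=50.0):
    """Move the gain toward target_db, limited by the slew rate and ceiling."""
    max_step = max_change_db_per_second * frame_ms / 1000  # dB per frame
    delta = max(-max_step, min(max_step, target_db - current_db))
    return min(current_db + delta, max_gain_db)


gain = 0.0
for _ in range(100):  # 100 frames of 10 ms = 1 second
    gain = step_gain(gain, target_db=20.0)

print(f"gain after 1 s: {gain:.2f} dB")
```

With the default slew rate, a 20 dB correction takes a little over three seconds to apply in this model, which is why gain changes sound gradual rather than abrupt.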

    GainController.process(audio, speech_probability=None) -> np.ndarray

Process audio of any length.

- audio: Input audio signal (int16 or float32 numpy array, any length).
- speech_probability: Float 0.0-1.0, optional. If not provided, uses internal VAD.
- Returns: Gained audio (same dtype and length as input).

    GainController.reset()

Reset internal state (gain estimates, noise/speech levels) while keeping the original configuration.

    GainController.gain_db

Read-only property. Current applied gain in dB from the most recent process() call.


    EchoCanceller(
        sample_rate=16000,
        num_channels=1,
        stream_delay_ms=0,
    )

Create an echo canceller. A high-pass filter is always applied to the capture signal before echo cancellation to remove DC offset (matching Chrome's behavior).

- sample_rate: Audio sample rate in Hz. Supported: 16000, 32000, 48000.
- num_channels: Number of audio channels (1 for mono, 2 for stereo).
- stream_delay_ms: Audio buffer delay hint (see AudioProcessor above). Also available as a read/write property.

    EchoCanceller.process(near, far) -> np.ndarray

Process audio of any length.

- near: Microphone capture signal (int16 or float32 numpy array, any length).
- far: Speaker reference signal (int16 or float32 numpy array, same length as near).
- Returns: Cleaned audio with echo removed (same dtype and length as input).

    EchoCanceller.reset()

Reset internal AEC state while keeping the original configuration.


    NoiseSuppressor(
        sample_rate=16000,
        num_channels=1,
        level=1,
    )

Create a noise suppressor.

- sample_rate: Audio sample rate in Hz. Supported: 16000, 32000, 48000.
- num_channels: Number of audio channels (1 for mono, 2 for stereo).
- level: Suppression level 0-3 (6dB, 12dB, 18dB, 21dB). Default: 1 (12dB).

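
For intuition about the four levels, the sketch below maps each level to its dB figure and converts that to a linear amplitude factor. It assumes the figures are amplitude attenuation of the noise in the usual 20·log10 sense, which is not stated explicitly here; the names `NS_LEVEL_DB` and `attenuation_factor` are hypothetical.

```python
# Level -> nominal suppression in dB, as listed above
NS_LEVEL_DB = {0: 6, 1: 12, 2: 18, 3: 21}


def attenuation_factor(level: int) -> float:
    """Linear amplitude factor for a level, assuming dB = 20*log10(ratio)."""
    return 10 ** (-NS_LEVEL_DB[level] / 20)


for level, db in NS_LEVEL_DB.items():
    print(f"level {level}: {db} dB -> noise amplitude x {attenuation_factor(level):.3f}")
```

Under that assumption the default level 1 would leave noise at roughly a quarter of its original amplitude, and level 3 at under a tenth.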

    NoiseSuppressor.process(audio) -> np.ndarray

Process audio of any length.

- audio: Input audio signal (int16 or float32 numpy array, any length).
- Returns: Audio with noise suppressed (same dtype and length as input).

    NoiseSuppressor.reset()

Reset internal noise suppression state while keeping the original configuration.

    NoiseSuppressor.speech_probability

Read-only property. Speech probability (0.0-1.0) from the most recent process() call.


    VoiceDetector(
        sample_rate=16000,
        num_channels=1,
    )

Lightweight voice activity detector. Runs the same spectral analysis as NoiseSuppressor to compute speech probability, but skips the Wiener filter - no noise suppression is applied to the audio. Use this when you only need VAD.

- sample_rate: Audio sample rate in Hz. Supported: 16000, 32000, 48000.
- num_channels: Number of audio channels (1 for mono, 2 for stereo).

    VoiceDetector.process(audio) -> float

Analyze audio and return speech probability.

- audio: Input audio signal (int16 or float32 numpy array, any length).
- Returns: Speech probability (0.0-1.0).

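
A stream of per-chunk probabilities usually needs some smoothing before it can drive turn-taking or silence trimming. The following is a minimal sketch of one common approach, a threshold plus a "hangover" that tolerates short dips. The function name, the 0.5 threshold and the hangover length are illustrative assumptions, and the input list stands in for values returned by successive process() calls.

```python
def speech_segments(probabilities, threshold=0.5, hangover_chunks=3):
    """Group chunk indices into (start, end) speech segments, end exclusive."""
    segments = []
    start = None      # index where the current segment began
    silent_run = 0    # consecutive below-threshold chunks inside a segment
    for i, p in enumerate(probabilities):
        if p >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run > hangover_chunks:
                # Close the segment right after the last speech chunk
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        segments.append((start, len(probabilities) - silent_run))
    return segments


probs = [0.1, 0.8, 0.9, 0.2, 0.7, 0.1, 0.1, 0.1, 0.1, 0.9]
segments = speech_segments(probs)
```

The single low value at index 3 is bridged by the hangover, while the longer run of low values closes the first segment.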

    VoiceDetector.reset()

Reset internal state while keeping the original configuration.