.... .- .. --.. . .-.. .- -... ... ✨🎤 pip install spoken 🎤✨ .. - .----. ... / .- / -... .- -.. / -.. .- -.--
spoken provides a single abstraction for a variety of audio foundation models. It is primarily designed for large-scale evaluation/benchmarking of realtime speech-to-speech models, but it can also be used as a drop-in inference library.
# os.environ['LOG_LEVEL'] = 'DEBUG' # detailed client/server state management logging
import spoken
model = spoken("gpt-4o-realtime-preview-2024-12-17", "examples/scooby.wav")
input_asr, output_asr, output_audio = await model.run()
output_asr # "That's quite the story..."
len(output_audio) # 8549ms
model.output_audio_tokens # 254Large audio models operate on audio tokens rather than transcribed text. This enables low-latency streaming conversational audio agents that directly generate audio end-to-end. Although promising and exciting, using these models requires non-trivial configuration and state management, due to major providers differing significantly in interface.
(AFAWK,) spoken supports all provider speech-to-speech models.
- OpenAI Realtime
- gpt-4o-realtime-preview-2024-12-17
- gpt-4o-mini-audio-preview-2024-12-17 [coming soon, not part of realtime API]
- Gemini Multimodal Live
- gemini-2.5-flash-preview-native-audio-dialog
- gemini-2.5-flash-exp-native-audio-thinking-dialog
- Amazon Nova Sonic (
pip install spoken[nova])- amazon.nova-sonic-v1:0
- Benchmarking TTFT (Time-To-First-Token) Latency
- OpenAI System Prompt
- more interesting things coming soon...
- Simply run
pip install spoken- Python 3.12+ required +
pip install spoken[nova]+portaudio.h(+ OS X:brew install portaudio) for Amazon Nova Sonic support
- Python 3.12+ required +
