@@ -17,6 +17,7 @@ On-device speech recognition, synthesis, and understanding for Mac and iOS. Runs
 - **[Qwen3-TTS](https://soniqo.audio/guides/speak)** — Text-to-speech (highest quality, streaming, custom speakers, 10 languages)
 - **[CosyVoice TTS](https://soniqo.audio/guides/cosyvoice)** — Streaming TTS with voice cloning, multi-speaker dialogue, emotion tags (9 languages)
 - **[Kokoro TTS](https://soniqo.audio/guides/kokoro)** — On-device TTS (82M, CoreML/Neural Engine, 54 voices, iOS-ready, 10 languages)
+- **[VibeVoice TTS](https://soniqo.audio/guides/vibevoice)** — Long-form / multi-speaker TTS (Microsoft VibeVoice Realtime-0.5B + 1.5B, MLX, up to 90-min podcast/audiobook synthesis, EN/ZH)
 - **[Qwen3.5-Chat](https://soniqo.audio/guides/chat)** — On-device LLM chat (0.8B, MLX INT4 + CoreML INT8, DeltaNet hybrid, streaming tokens)
 - **[PersonaPlex](https://soniqo.audio/guides/respond)** — Full-duplex speech-to-speech (7B, audio in → audio out, 18 voice presets)
 - **[DeepFilterNet3](https://soniqo.audio/guides/denoise)** — Real-time noise suppression (2.1M params, 48 kHz)
@@ -94,7 +95,7 @@ struct DictateView: View {
 
 `SpeechUI` ships only `TranscriptionView` (finals + partials) and `TranscriptionStore` (streaming ASR adapter). Use AVFoundation for audio visualization and playback.
 
-Available SPM products: `Qwen3ASR`, `Qwen3TTS`, `Qwen3TTSCoreML`, `ParakeetASR`, `ParakeetStreamingASR`, `NemotronStreamingASR`, `OmnilingualASR`, `KokoroTTS`, `CosyVoiceTTS`, `PersonaPlex`, `SpeechVAD`, `SpeechEnhancement`, `SourceSeparation`, `Qwen3Chat`, `SpeechCore`, `SpeechUI`, `AudioCommon`.
+Available SPM products: `Qwen3ASR`, `Qwen3TTS`, `Qwen3TTSCoreML`, `ParakeetASR`, `ParakeetStreamingASR`, `NemotronStreamingASR`, `OmnilingualASR`, `KokoroTTS`, `VibeVoiceTTS`, `CosyVoiceTTS`, `PersonaPlex`, `SpeechVAD`, `SpeechEnhancement`, `SourceSeparation`, `Qwen3Chat`, `SpeechCore`, `SpeechUI`, `AudioCommon`.
 
 ## Models
 
@@ -111,6 +112,8 @@ Compact view below. **[Full model catalogue with sizes, quantisations, download
 | [Qwen3-TTS](https://soniqo.audio/guides/speak) | Text → Speech | MLX, CoreML | 0.6B, 1.7B | 10 |
 | [CosyVoice3](https://soniqo.audio/guides/cosyvoice) | Text → Speech | MLX | 0.5B | 9 |
 | [Kokoro-82M](https://soniqo.audio/guides/kokoro) | Text → Speech | CoreML (ANE) | 82M | 10 |
+| [VibeVoice Realtime-0.5B](https://soniqo.audio/guides/vibevoice) | Text → Speech (long-form, multi-speaker) | MLX | 0.5B | EN/ZH |
+| [VibeVoice 1.5B](https://soniqo.audio/guides/vibevoice) | Text → Speech (up to 90-min podcast) | MLX | 1.5B | EN/ZH |
 | [Qwen3.5-Chat](https://soniqo.audio/guides/chat) | Text → Text (LLM) | MLX, CoreML | 0.8B | Multi |
 | [PersonaPlex](https://soniqo.audio/guides/respond) | Speech → Speech | MLX | 7B | EN |
 | [Silero VAD](https://soniqo.audio/guides/vad) | Voice Activity Detection | MLX, CoreML | 309K | Agnostic |
@@ -161,6 +164,7 @@ import OmnilingualASR // 1,672 languages (CoreML + MLX)
 import Qwen3TTS // Text-to-speech
 import CosyVoiceTTS // Text-to-speech with voice cloning
 import KokoroTTS // Text-to-speech (iOS-ready)
+import VibeVoiceTTS // Long-form / multi-speaker TTS (EN/ZH)
 import Qwen3Chat // On-device LLM chat
 import PersonaPlex // Full-duplex speech-to-speech
 import SpeechVAD // VAD + speaker diarization + embeddings
@@ -240,7 +244,7 @@ let audio = model.synthesize(text: "Hello world", language: "english")
 try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)
 ```
 
-Alternative TTS engines: [CosyVoice3](https://soniqo.audio/guides/cosyvoice) (streaming + voice cloning + emotion tags), [Kokoro-82M](https://soniqo.audio/guides/kokoro) (iOS-ready, 54 voices), [Voice cloning](https://soniqo.audio/guides/voice-cloning).
+Alternative TTS engines: [CosyVoice3](https://soniqo.audio/guides/cosyvoice) (streaming + voice cloning + emotion tags), [Kokoro-82M](https://soniqo.audio/guides/kokoro) (iOS-ready, 54 voices), [VibeVoice](https://soniqo.audio/guides/vibevoice) (long-form podcast / multi-speaker, EN/ZH), [Voice cloning](https://soniqo.audio/guides/voice-cloning).
 
 ### Speech-to-Speech — [full guide →](https://soniqo.audio/guides/respond)
 
@@ -325,8 +329,8 @@ speech-swift is split into one SPM target per model so consumers only pay for wh
 **[Full architecture diagram with backends, memory tables, and module map → soniqo.audio/architecture](https://soniqo.audio/architecture)** · **[API reference → soniqo.audio/api](https://soniqo.audio/api)** · **[Benchmarks → soniqo.audio/benchmarks](https://soniqo.audio/benchmarks)**
 
 Local docs (repo):
-- **Models:** [Qwen3-ASR](docs/models/asr-model.md) · [Qwen3-TTS](docs/models/tts-model.md) · [CosyVoice](docs/models/cosyvoice-tts.md) · [Kokoro](docs/models/kokoro-tts.md) · [Parakeet TDT](docs/models/parakeet-asr.md) · [Parakeet Streaming](docs/models/parakeet-streaming-asr.md) · [Nemotron Streaming](docs/models/nemotron-streaming.md) · [Omnilingual ASR](docs/models/omnilingual-asr.md) · [PersonaPlex](docs/models/personaplex.md) · [FireRedVAD](docs/models/fireredvad.md) · [Source Separation](docs/models/source-separation.md)
-- **Inference:** [Qwen3-ASR](docs/inference/qwen3-asr-inference.md) · [Parakeet TDT](docs/inference/parakeet-asr-inference.md) · [Parakeet Streaming](docs/inference/parakeet-streaming-asr-inference.md) · [Nemotron Streaming](docs/inference/nemotron-streaming-inference.md) · [Omnilingual ASR](docs/inference/omnilingual-asr-inference.md) · [TTS](docs/inference/qwen3-tts-inference.md) · [Forced Aligner](docs/inference/forced-aligner.md) · [Silero VAD](docs/inference/silero-vad.md) · [Speaker Diarization](docs/inference/speaker-diarization.md) · [Speech Enhancement](docs/inference/speech-enhancement.md)
+- **Models:** [Qwen3-ASR](docs/models/asr-model.md) · [Qwen3-TTS](docs/models/tts-model.md) · [CosyVoice](docs/models/cosyvoice-tts.md) · [Kokoro](docs/models/kokoro-tts.md) · [VibeVoice](docs/models/vibevoice.md) · [Parakeet TDT](docs/models/parakeet-asr.md) · [Parakeet Streaming](docs/models/parakeet-streaming-asr.md) · [Nemotron Streaming](docs/models/nemotron-streaming.md) · [Omnilingual ASR](docs/models/omnilingual-asr.md) · [PersonaPlex](docs/models/personaplex.md) · [FireRedVAD](docs/models/fireredvad.md) · [Source Separation](docs/models/source-separation.md)
+- **Inference:** [Qwen3-ASR](docs/inference/qwen3-asr-inference.md) · [Parakeet TDT](docs/inference/parakeet-asr-inference.md) · [Parakeet Streaming](docs/inference/parakeet-streaming-asr-inference.md) · [Nemotron Streaming](docs/inference/nemotron-streaming-inference.md) · [Omnilingual ASR](docs/inference/omnilingual-asr-inference.md) · [TTS](docs/inference/qwen3-tts-inference.md) · [VibeVoice](docs/inference/vibevoice-inference.md) · [Forced Aligner](docs/inference/forced-aligner.md) · [Silero VAD](docs/inference/silero-vad.md) · [Speaker Diarization](docs/inference/speaker-diarization.md) · [Speech Enhancement](docs/inference/speech-enhancement.md)
 - **Reference:** [Shared Protocols](docs/shared-protocols.md)
 
 ## Cache configuration