Qwen3-TTS and Qwen3-ASR TensorRT backend for NVIDIA Jetson platforms.

Qwen3-TTS: 0.6B text-to-speech model with TensorRT optimization
- RTF (real-time factor) ~0.9-1.0 on Jetson Orin NX
- TTFT ~45ms (first audio chunk when streaming)
- Voice cloning via speaker embedding
- Support for 52 languages
- Streaming synthesis

Qwen3-ASR: 0.6B automatic speech recognition model
- RTF ~0.12-0.22 on Jetson Orin NX
- Support for 52 languages
- Streaming transcription (accumulate-then-transcribe)
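
The RTF figures quoted in the two lists above follow the usual definition of real-time factor: processing time divided by the duration of the audio produced (TTS) or consumed (ASR), so a value below 1.0 means faster than real time. A minimal sketch of that arithmetic follows; the helper name `real_time_factor` and the sample numbers are illustrative assumptions, not part of this package and not measurements from it.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Illustrative numbers only: 10 s of audio handled in 2 s of compute.
rtf = real_time_factor(processing_seconds=2.0, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}")
print("faster than real time:", rtf < 1.0)
```
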
```bash
# Install from the package index
pip install jetson-qwen3-speech

# Or install directly from the Git repository
pip install git+https://github.com/harvest/jetson-qwen3-speech.git
```

To build the C++ engine from source, ensure you have:
- CUDA 12.x
- TensorRT 8.6+
- CMake 3.22+
- pybind11 2.10+
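
The minimum versions above can be checked before starting a build. The following is a rough sketch, not a script shipped with this package: the helper names (`parse_version`, `meets_minimum`, `cmake_version`, `pybind11_version`) are made up for illustration, and only CMake and pybind11 are probed. The same comparison could presumably be applied to the output of `nvcc --version` or to `tensorrt.__version__`.

```python
import re
import shutil
import subprocess
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Extract the first dotted version in text, e.g. 'cmake version 3.22.1' -> (3, 22, 1)."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(found: str, minimum: str) -> bool:
    """True if the version found in `found` is at least `minimum`."""
    return parse_version(found) >= parse_version(minimum)


def cmake_version():
    """Return the first line of `cmake --version`, or None if cmake is not on PATH."""
    if shutil.which("cmake") is None:
        return None
    out = subprocess.run(["cmake", "--version"], capture_output=True, text=True)
    return out.stdout.splitlines()[0] if out.stdout else None


def pybind11_version():
    """Return the installed pybind11 version string, or None if it is not installed."""
    try:
        return metadata.version("pybind11")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name, found, minimum in [
        ("CMake", cmake_version(), "3.22"),
        ("pybind11", pybind11_version(), "2.10"),
    ]:
        if found is None:
            print(f"{name}: not found (need {minimum}+)")
        elif meets_minimum(found, minimum):
            print(f"{name}: {found} (ok)")
        else:
            print(f"{name}: {found} (too old, need {minimum}+)")
```

Versions are compared as integer tuples rather than strings, so that `3.9` correctly sorts below `3.22`.
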
```python
from jetson_qwen3_speech import Qwen3TRTBackend

backend = Qwen3TRTBackend()
backend.preload()

# Basic synthesis
wav_bytes = backend.synthesize("Hello world", language="english")

# Voice cloning
import base64

# Extract speaker embedding from reference audio
with open("reference.wav", "rb") as f:
    ref_audio = f.read()
embedding = backend.extract_speaker_embedding(ref_audio)

# Synthesize with cloned voice
wav_bytes = backend.clone_voice(
    "Hello world",
    speaker_embedding=embedding,
    language="english",
)

# Streaming synthesis
for chunk in backend.generate_streaming("Hello world", language="english"):
    # chunk is int16 PCM bytes
    process_audio(chunk)  # process_audio is a placeholder for your own handler
```

```python
from jetson_qwen3_speech import Qwen3ASRBackend

backend = Qwen3ASRBackend()
backend.preload()

# Offline transcription
with open("audio.wav", "rb") as f:
    result = backend.transcribe(f.read(), language="auto")
print(result.text)  # "Hello world"

# Streaming transcription
# audio_chunk1 and audio_chunk2 are placeholders for successive audio buffers
stream = backend.create_stream(language="auto")
stream.accept_waveform(16000, audio_chunk1)
stream.accept_waveform(16000, audio_chunk2)
text = stream.finalize()
```

```python
# In jetson-voice app/tts_backend.py
def create_backend(backend_name=None):
    if backend_name == "qwen3_trt":
        from jetson_qwen3_speech import Qwen3TRTBackend
        return Qwen3TRTBackend()
    # ...
```

Models should be placed at `/opt/models/qwen3-tts/` and `/opt/models/qwen3-asr-v2/`:
```
/opt/models/qwen3-tts/
├── config.json
├── tokenizer/
│   ├── tokenizer.json
│   └── tokenizer.model
├── onnx/
│   ├── speaker_encoder.onnx
│   └── ...
└── engines/
    ├── talker_decode_bf16.engine
    ├── cp_bf16.engine
    └── vocoder.engine

/opt/models/qwen3-asr-v2/
├── encoder.onnx
├── embed_tokens.bin
├── tokenizer.json
└── asr_decoder_bf16.engine
```
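
A quick way to confirm that both directories match the listing above before calling `preload()` is to check for each expected file. The sketch below is an illustration, not a utility provided by this package; `missing_files`, `TTS_FILES` and `ASR_FILES` are hypothetical names, and the file lists simply mirror the listing (the elided `...` entry under `onnx/` is not checked).

```python
from pathlib import Path

# Relative paths copied from the directory listing; the "..." entry under
# onnx/ is elided there, so only speaker_encoder.onnx is checked.
TTS_FILES = [
    "config.json",
    "tokenizer/tokenizer.json",
    "tokenizer/tokenizer.model",
    "onnx/speaker_encoder.onnx",
    "engines/talker_decode_bf16.engine",
    "engines/cp_bf16.engine",
    "engines/vocoder.engine",
]
ASR_FILES = [
    "encoder.onnx",
    "embed_tokens.bin",
    "tokenizer.json",
    "asr_decoder_bf16.engine",
]


def missing_files(root, expected):
    """Return the entries of `expected` that are not regular files under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    for root, expected in [
        ("/opt/models/qwen3-tts", TTS_FILES),
        ("/opt/models/qwen3-asr-v2", ASR_FILES),
    ]:
        missing = missing_files(root, expected)
        print(root, "OK" if not missing else f"missing: {missing}")
```
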
Export models using the included scripts:

```bash
# Export ASR ONNX
python -m jetson_qwen3_speech.export.export_asr_onnx --output /opt/models/qwen3-asr-v2

# Build TTS TensorRT engines
python -m jetson_qwen3_speech.export.build_cp_engine --model-dir /opt/models/qwen3-tts
```

| Capability | TTS | ASR |
|---|---|---|
| Basic/Offline | ✓ | ✓ |
| Streaming | ✓ | ✓ |
| Multi-language | ✓ (52) | ✓ (52) |
| Voice Clone | ✓ | - |
| Language ID | - | ✓ |

Measured on Jetson Orin NX (100 TOPS, JetPack 6.x):

| TTS component | Latency | Notes |
|---|---|---|
| Prefill | 16-43ms | Token count dependent |
| Talker decode | ~20ms/step | TRT BF16 |
| Code predictor | ~53ms/step | TRT BF16 |
| Vocoder | ~98ms | TRT FP16 |
| TTFT (streaming) | ~45ms | First audio chunk |
| RTF (batch) | 0.9-1.0 | Real-time factor |

| TTS test | Audio duration | Processing time | RTF |
|---|---|---|---|
| Chinese short | 1.20s | 1.88s | 1.41 |
| Chinese medium | 5.28s | 5.47s | 1.00 |
| Chinese long | 10.80s | 10.36s | 0.94 |
| English short | 1.04s | 1.09s | 0.92 |
| English medium | 5.04s | 4.81s | 0.92 |

| ASR component | Latency |
|---|---|
| Encoder (ORT CUDA) | ~50-100ms |
| Decode (TRT BF16) | ~5ms/token |
| RTF | ~0.1-0.2 |

Voice-to-voice (ASR + TTS) end-to-end latency varies with text length. A typical conversational response takes 1-3 seconds in total.
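
To see where voice-to-voice time goes on a given device, each stage can be timed separately. The sketch below assumes only the `transcribe(...).text` and `synthesize(...)` calls shown in the usage examples; `voice_to_voice`, the `respond` hook (for example a dialogue model between ASR and TTS) and the two stand-in classes are hypothetical and exist only so the example runs without a Jetson.

```python
import time
from types import SimpleNamespace


def voice_to_voice(asr, tts, audio_bytes, respond=lambda text: text, language="english"):
    """Run ASR -> respond -> TTS once and return (wav_bytes, per-stage seconds)."""
    timings = {}

    start = time.perf_counter()
    transcript = asr.transcribe(audio_bytes, language="auto").text
    timings["asr"] = time.perf_counter() - start

    start = time.perf_counter()
    reply = respond(transcript)  # hypothetical stage; defaults to echoing the transcript
    timings["respond"] = time.perf_counter() - start

    start = time.perf_counter()
    wav_bytes = tts.synthesize(reply, language=language)
    timings["tts"] = time.perf_counter() - start

    # Sum of the three stages recorded so far.
    timings["total"] = sum(timings.values())
    return wav_bytes, timings


# Stand-in backends (assumptions) so the sketch runs anywhere; on device, pass
# preloaded Qwen3ASRBackend and Qwen3TRTBackend instances instead.
class FakeASR:
    def transcribe(self, audio_bytes, language="auto"):
        return SimpleNamespace(text="hello")


class FakeTTS:
    def synthesize(self, text, language="english"):
        return text.encode("utf-8")


wav, timings = voice_to_voice(FakeASR(), FakeTTS(), b"\x00\x00", respond=str.upper)
for stage, seconds in timings.items():
    print(f"{stage}: {seconds * 1000:.3f} ms")
```

Timing each stage on its own makes it easier to tell whether a slow response comes from transcription, synthesis, or whatever sits between them.
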
License: MIT