suharvest/jetson-qwen3-speech


jetson-qwen3-speech

Qwen3-TTS and Qwen3-ASR TensorRT backend for NVIDIA Jetson platforms.

Features

  • Qwen3-TTS: 0.6B text-to-speech model with TensorRT optimization

    • RTF ~0.9-1.0 (real-time factor) on Jetson Orin NX
    • TTFT ~45ms (streaming first chunk)
    • Voice cloning support via speaker embedding
    • Support for 52 languages
    • Streaming synthesis
  • Qwen3-ASR: 0.6B automatic speech recognition model

    • RTF ~0.12-0.22 on Jetson Orin NX
    • Support for 52 languages
    • Streaming transcription (accumulate-then-transcribe)
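
The RTF figures above follow the usual definition of real-time factor: processing time divided by the duration of the audio processed, so values below 1.0 mean faster than real time. A minimal sketch of that arithmetic; the function name and the sample numbers are illustrative and not part of this package:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Illustrative numbers only: 4.5 s of compute for 5.0 s of audio.
rtf = real_time_factor(4.5, 5.0)
print(f"RTF = {rtf:.2f}")  # prints "RTF = 0.90"
```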

Installation

Prebuilt Wheel (Recommended)

pip install jetson-qwen3-speech

From Source

pip install git+https://github.com/harvest/jetson-qwen3-speech.git

To build the C++ engine from source, ensure you have:

  • CUDA 12.x
  • TensorRT 8.6+
  • CMake 3.22+
  • pybind11 2.10+

Usage

TTS Backend

from jetson_qwen3_speech import Qwen3TRTBackend

backend = Qwen3TRTBackend()
backend.preload()

# Basic synthesis
wav_bytes = backend.synthesize("Hello world", language="english")

# Voice cloning: extract a speaker embedding from reference audio
with open("reference.wav", "rb") as f:
    ref_audio = f.read()
embedding = backend.extract_speaker_embedding(ref_audio)

# Synthesize with the cloned voice
wav_bytes = backend.clone_voice(
    "Hello world",
    speaker_embedding=embedding,
    language="english",
)

# Streaming synthesis
for chunk in backend.generate_streaming("Hello world", language="english"):
    # Each chunk is int16 PCM bytes; process_audio is a placeholder for your own handler
    process_audio(chunk)
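
Since the streaming generator yields raw int16 PCM bytes, one common way to consume it is to wrap the chunks in a WAV container. The sketch below uses only the standard library. The helper name is hypothetical, and the 24 kHz mono format is an assumption (this README does not state the output sample rate), so verify it against the backend before relying on it:

```python
import io
import wave
from typing import Iterable

# Assumption: 24 kHz mono output. The README does not state the sample rate.
ASSUMED_SAMPLE_RATE = 24000


def pcm_chunks_to_wav(chunks: Iterable[bytes],
                      sample_rate: int = ASSUMED_SAMPLE_RATE) -> bytes:
    """Wrap int16 mono PCM chunks in a WAV container and return the bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono (assumed)
        wav_file.setsampwidth(2)   # int16 = 2 bytes per sample
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            wav_file.writeframes(chunk)
    return buffer.getvalue()
```

With a loaded backend this could be called as `pcm_chunks_to_wav(backend.generate_streaming("Hello world", language="english"))`.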

ASR Backend

from jetson_qwen3_speech import Qwen3ASRBackend

backend = Qwen3ASRBackend()
backend.preload()

# Offline transcription
with open("audio.wav", "rb") as f:
    result = backend.transcribe(f.read(), language="auto")
print(result.text)  # "Hello world"

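# Streaming transcription (audio is accumulated, then transcribed on finalize)
stream = backend.create_stream(language="auto")
# First argument is the sample rate in Hz; audio_chunk1/2 are placeholders for your own audio buffers
stream.accept_waveform(16000, audio_chunk1)
stream.accept_waveform(16000, audio_chunk2)
text = stream.finalize()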
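
The streaming example leaves open where `audio_chunk1` and `audio_chunk2` come from. One possible way to produce such chunks from a 16-bit mono WAV file is sketched below. It assumes that `accept_waveform` takes a sequence of float samples in [-1, 1], which this README does not specify; the helper names are hypothetical:

```python
import io
import sys
import wave
from array import array
from typing import Iterator, List, Tuple


def iter_wav_chunks(wav_bytes: bytes,
                    chunk_seconds: float = 0.5) -> Iterator[Tuple[int, List[float]]]:
    """Yield (sample_rate, float_samples) chunks from a 16-bit mono WAV."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        if wav_file.getnchannels() != 1 or wav_file.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono WAV input")
        rate = wav_file.getframerate()
        frames_per_chunk = max(1, int(rate * chunk_seconds))
        while True:
            raw = wav_file.readframes(frames_per_chunk)
            if not raw:
                break
            samples = array("h")
            samples.frombytes(raw)
            if sys.byteorder == "big":
                samples.byteswap()  # WAV data is little-endian
            # Assumed input format: floats scaled to [-1, 1]
            yield rate, [s / 32768.0 for s in samples]


def feed_stream(stream, wav_bytes: bytes, chunk_seconds: float = 0.5) -> str:
    """Push every chunk into the stream, then return the final text."""
    for rate, samples in iter_wav_chunks(wav_bytes, chunk_seconds):
        stream.accept_waveform(rate, samples)
    return stream.finalize()
```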

Integration with jetson-voice

# In jetson-voice app/tts_backend.py
def create_backend(backend_name=None):
    if backend_name == "qwen3_trt":
        from jetson_qwen3_speech import Qwen3TRTBackend
        return Qwen3TRTBackend()
    # ...
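
The factory above imports the backend lazily, so the Jetson-only dependency is loaded only when that backend is requested. If more backends are added, the same idea can be expressed as a small registry. This is a sketch, not the actual jetson-voice code: the registry, the default backend name, and the error handling are assumptions.

```python
import importlib
from typing import Any, Dict, Optional

# Hypothetical registry: backend name -> "module:ClassName".
# Only the qwen3_trt entry comes from this README.
BACKENDS: Dict[str, str] = {
    "qwen3_trt": "jetson_qwen3_speech:Qwen3TRTBackend",
}
DEFAULT_BACKEND = "qwen3_trt"  # assumed default, not confirmed by jetson-voice


def create_backend(backend_name: Optional[str] = None,
                   registry: Optional[Dict[str, str]] = None) -> Any:
    """Import the backend module on demand and return a new instance."""
    registry = BACKENDS if registry is None else registry
    name = backend_name or DEFAULT_BACKEND
    if name not in registry:
        raise ValueError(f"Unknown TTS backend: {name!r}")
    module_name, class_name = registry[name].split(":")
    module = importlib.import_module(module_name)  # deferred import
    return getattr(module, class_name)()
```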

Model Files

Models should be placed at /opt/models/qwen3-tts/ and /opt/models/qwen3-asr-v2/:

/opt/models/qwen3-tts/
├── config.json
├── tokenizer/
│   ├── tokenizer.json
│   └── tokenizer.model
├── onnx/
│   ├── speaker_encoder.onnx
│   └── ...
└── engines/
    ├── talker_decode_bf16.engine
    ├── cp_bf16.engine
    └── vocoder.engine

/opt/models/qwen3-asr-v2/
├── encoder.onnx
├── embed_tokens.bin
├── tokenizer.json
└── asr_decoder_bf16.engine
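
Before calling `preload()`, it can help to confirm that the layout above is in place. The following sketch checks only the files named explicitly in the tree (the elided `...` entries are not checked); the script and its function name are illustrative and not part of the package:

```python
from pathlib import Path
from typing import Dict, List

# Files named in the layout above; elided entries ("...") are not checked.
EXPECTED_FILES: Dict[str, List[str]] = {
    "/opt/models/qwen3-tts": [
        "config.json",
        "tokenizer/tokenizer.json",
        "tokenizer/tokenizer.model",
        "onnx/speaker_encoder.onnx",
        "engines/talker_decode_bf16.engine",
        "engines/cp_bf16.engine",
        "engines/vocoder.engine",
    ],
    "/opt/models/qwen3-asr-v2": [
        "encoder.onnx",
        "embed_tokens.bin",
        "tokenizer.json",
        "asr_decoder_bf16.engine",
    ],
}


def find_missing_files(root: str, relative_paths: List[str]) -> List[str]:
    """Return the relative paths that do not exist as files under root."""
    base = Path(root)
    return [rel for rel in relative_paths if not (base / rel).is_file()]


if __name__ == "__main__":
    for root, files in EXPECTED_FILES.items():
        missing = find_missing_files(root, files)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{root}: {status}")
```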

Model Export

Export models using the included scripts:

# Export ASR ONNX
python -m jetson_qwen3_speech.export.export_asr_onnx --output /opt/models/qwen3-asr-v2

# Build TTS TensorRT engines
python -m jetson_qwen3_speech.export.build_cp_engine --model-dir /opt/models/qwen3-tts

Capabilities

| Capability | TTS | ASR |
| --- | --- | --- |
| Basic/Offline | ✓ | ✓ |
| Streaming | ✓ | ✓ |
| Multi-language | ✓ (52) | ✓ (52) |
| Voice Clone | ✓ | - |
| Language ID | - | ✓ |

Performance

Measured on Jetson Orin NX (100 TOPS, JetPack 6.x):

TTS Performance

| Component | Latency | Notes |
| --- | --- | --- |
| Prefill | 16-43 ms | Token count dependent |
| Talker decode | ~20 ms/step | TRT BF16 |
| Code predictor | ~53 ms/step | TRT BF16 |
| Vocoder | ~98 ms | TRT FP16 |
| TTFT (streaming) | ~45 ms | First audio chunk |
| RTF (batch) | 0.9-1.0 | Real-time factor |
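
The per-step figures give a rough sense of why batch RTF sits near 1.0. The back-of-envelope sketch below assumes that each autoregressive step runs the talker decode and the code predictor once and yields one codec frame of 80 ms (12.5 frames per second). Neither the frame duration nor the one-frame-per-step pairing is stated in this README, and prefill and vocoder time are ignored, so treat the result as an estimate only:

```python
TALKER_DECODE_MS = 20.0   # from the table: ~20 ms/step
CODE_PREDICTOR_MS = 53.0  # from the table: ~53 ms/step
ASSUMED_FRAME_MS = 80.0   # assumption: 12.5 codec frames per second


def steady_state_rtf(step_costs_ms, frame_ms: float) -> float:
    """Compute time per step divided by audio produced per step."""
    return sum(step_costs_ms) / frame_ms


rtf = steady_state_rtf([TALKER_DECODE_MS, CODE_PREDICTOR_MS], ASSUMED_FRAME_MS)
print(f"Estimated steady-state RTF: {rtf:.2f}")  # prints 0.91
```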

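Benchmark Results (measured)

| Test | Audio duration | Processing time | RTF |
| --- | --- | --- | --- |
| Chinese short | 1.20s | 1.88s | 1.41 |
| Chinese medium | 5.28s | 5.47s | 1.00 |
| Chinese long | 10.80s | 10.36s | 0.94 |
| English short | 1.04s | 1.09s | 0.92 |
| English medium | 5.04s | 4.81s | 0.92 |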

ASR Performance

| Component | Latency |
| --- | --- |
| Encoder (ORT CUDA) | ~50-100 ms |
| Decode (TRT BF16) | ~5 ms/token |
| RTF | ~0.1-0.2 |

V2V Latency

Voice-to-Voice (ASR + TTS) end-to-end latency varies by text length. Typical conversational response: 1-3 seconds total.
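
To see where that end-to-end time goes on a given device, each stage can be timed separately. A minimal sketch using only the standard library; `time_stages` is a hypothetical helper, and the stages are plain callables so any step between ASR and TTS can be added as another entry:

```python
import time
from typing import Any, Callable, Dict, List, Tuple


def time_stages(stages: List[Tuple[str, Callable[[Any], Any]]],
                payload: Any) -> Tuple[Any, Dict[str, float]]:
    """Run stages in order, passing each output on, and record seconds per stage."""
    timings: Dict[str, float] = {}
    for name, stage in stages:
        start = time.perf_counter()
        payload = stage(payload)
        timings[name] = time.perf_counter() - start
    timings["total"] = sum(timings.values())
    return payload, timings


# Usage with the backends from the Usage section (asr and tts already preloaded):
# stages = [
#     ("asr", lambda audio: asr.transcribe(audio, language="auto").text),
#     ("tts", lambda text: tts.synthesize(text, language="english")),
# ]
# wav_bytes, timings = time_stages(stages, audio_bytes)
```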

License

MIT
