Zero-shot TTS allows you to clone any voice using a short audio sample (5+ seconds). This feature uses the Magpie Zero-shot model. Apply for access at the NVIDIA RIVA TTS Zero-shot models page.
-
Prepare your audio sample to meet the following requirements.
- Audio format is WAV (16-bit PCM recommended).
- Audio duration is at least 5 seconds of clean speech.
- Audio quality is clear, no background noise.
- Language should match your target language.
- Speaker is single speaker only.
-
Create the
audio_promptsdirectory and add your voice sample.mkdir -p audio_prompts cp <your_audio_file>.wav audio_prompts/<your_audio_file>.wav
-
Run the Magpie zero-shot NIM microservice following the instructions in the NVIDIA NIM RIVA TTS documentation.
-
Set the environment variables in the
.envfile as follows.# Comment out standard TTS configuration #TTS_DOCKER_IMAGE=nvcr.io/nim/nvidia/magpie-tts-multilingual:1.6.0 #TTS_VOICE_ID=Magpie-Multilingual.EN-US.Aria #TTS_MODEL_NAME=magpie_tts_ensemble-Magpie-Multilingual #TTS_NIM_TAGS=name=magpie-tts-multilingual,batch_size=32 # Enable Zero-shot TTS TTS_DOCKER_IMAGE=<ZEROSHOT_NIM_MICROSERVICE_IMAGE> # Use your version TTS_VOICE_ID=Magpie-ZeroShot.Female-1 TTS_MODEL_NAME=magpie_tts_ensemble-Magpie-ZeroShot TTS_NIM_TAGS=name=magpie-tts-zeroshot,batch_size=32 ZERO_SHOT_AUDIO_PROMPT=audio_prompts/custom_voice.wav
-
Update the
python-appservice in docker-compose.yml to mount audio prompts:python-app: # ... existing configuration ... volumes: - ./audio_dumps:/app/audio_dumps - ./config/:/app/config/ - ./audio_prompts:/app/audio_prompts # Add this line
-
Deploy the services:
docker compose up -d