This guide covers deploying the Nemotron Voice Agent on Jetson Thor using Docker Compose.
- Jetson Thor flashed with JetPack 7.0 using NVIDIA SDK Manager (with CUDA, CUDA-X, TensorRT, and NVIDIA Container Runtime components installed)
- NGC CLI installed and configured
- Docker Engine and Docker Compose
- HuggingFace API token for downloading LLM models
- Network connectivity
The configuration files for this deployment are the following:
./
├── docker-compose.jetson.yml # Jetson-specific deployment
└── config
└── env.jetson.example # Template for .env
| File | Purpose |
|---|---|
| docker-compose.jetson.yml | Jetson-specific Docker Compose with vLLM |
| env.jetson.example | Environment template for Jetson deployment |
Note: This deployment uses vLLM for LLM inference instead of NVIDIA NIM. LLM NIM microservices for Jetson Thor are not yet available, so this guide uses vLLM as a flexible alternative to load Hugging Face models directly.
All models are deployed on the local Jetson device. Default models used:
| Component | Default model / identifier |
|---|---|
| ASR | parakeet-1.1b-en-US-asr-streaming |
| TTS | magpie_tts_ensemble-Magpie-Multilingual |
| LLM | RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 |
-
Clone the repository and navigate to the root directory.
git clone git@github.com:NVIDIA-AI-Blueprints/nemotron-voice-agent.git cd nemotron-voice-agent git submodule update --init -
Configure the environment. Copy the example environment file env.jetson.example to the root directory:
cp config/env.jetson.example .env
-
Set your API keys as environment variables:
# Required export NVIDIA_API_KEY=<your-nvidia-api-key> export HF_TOKEN=<your-huggingface-token>
Jetson-specific defaults (differ from main deployment)
ENABLE_SPECULATIVE_SPEECH=false— Disabled for resource optimization.WORKERS=1— Single worker to reduce memory usage.
-
Deploy Nemotron Speech ASR and TTS models.
a. Ensure you meet the prerequisites before proceeding.
b. Configure NGC CLI with your API key:
ngc config setc. Download the Nemotron Speech ASR and TTS Quick Start scripts:
ngc registry resource download-version nvidia/riva/riva_quickstart_arm64:2.24.0 cd riva_quickstart_arm64_v2.24.0d. [Optional] Enable Silero VAD for improved ASR performance:
Tip: Enabling Silero VAD can help improve End-of-Utterance (EOU) detection and performance in noisy environments.
-
Edit the
config.shin Quick Start directoryriva_quickstart_arm64_v2.24.0:# Use Silero Diarizer as accessory model for ASR asr_accessory_model=("silero_diarizer")
-
Update your
.envfile with the ASR model name:ASR_MODEL_NAME=parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer
e. Deploy Nemotron Speech ASR and TTS models:
bash riva_init.sh bash riva_start.sh
Note: Initialization may take 30-60 minutes on first run.
-
-
Start LLM Service and Voice Agent Application. Start services from the root directory:
docker compose -f docker-compose.jetson.yml up -d
Note: Deployment may take 15-20 minutes on first run.
-
Access the application at
http://<jetson-ip>:8081on your browser.Tip: For the best experience, we recommend using a headset (preferably wired) instead of your laptop's built-in microphone.
Note: To enable microphone access in Chrome, go to
chrome://flags/, enable "Insecure origins treated as secure", addhttp://<jetson-ip>:8081to the list, and restart Chrome. If you need to access the application from remote locations or deploy on cloud platforms, configure a TURN server. Refer to Optional: Deploy TURN Server for Remote Access.
The Jetson deployment uses vLLM to load HuggingFace models. Update these variables in your .env file:
| Variable | Description |
|---|---|
NVIDIA_LLM_MODEL |
HuggingFace model identifier |
GPU_MEMORY_UTILIZATION |
GPU memory fraction (0.0-1.0). Adjust based on model size. |
SYSTEM_PROMPT_SELECTOR |
Prompt path from config/prompt.yaml |
| Model | Size | GPU_MEMORY_UTILIZATION |
|---|---|---|
RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 |
8B (4-bit) | 0.15 |
nvidia/Nemotron-Mini-4B-Instruct |
4B | 0.10 |
nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8 |
9B (FP8) | 0.20 |
Qwen/Qwen3-4B-Instruct-2507 |
4B | 0.10 |
-
Edit your
.envfile:NVIDIA_LLM_MODEL=nvidia/Nemotron-Mini-4B-Instruct GPU_MEMORY_UTILIZATION=0.10 SYSTEM_PROMPT_SELECTOR=llama/flowershop
-
Restart the services:
docker compose -f docker-compose.jetson.yml down docker compose -f docker-compose.jetson.yml up -d
-
Verify the model is loading:
docker compose -f docker-compose.jetson.yml logs -f llm-nvidia-jetson
Note: The first model download may take several minutes depending on model size and network speed.
# View logs
docker compose -f docker-compose.jetson.yml logs -f python-app
# Stop all services
docker compose -f docker-compose.jetson.yml down
# Rebuild after code changes
docker compose -f docker-compose.jetson.yml up --build -d python-app