In this example, we showcase how to build a patient front desk voice agent pipeline using WebRTC with real-time transcripts. It uses a Pipecat pipeline with FastAPI on the backend and React on the frontend. The pipeline uses the WebRTC-based SmallWebRTCTransport, Riva ASR and TTS models, and an agent service. An example recording of the agent interaction, with audio, can be found here.
For the hardware required when choosing to self-deploy the Riva ASR and TTS NIMs, see the details on the Riva NIMs in the ambient-patient/README.
- NVIDIA AI Enterprise developer license, required to locally host NVIDIA NIM Microservices.
- NVIDIA API Key for access to hosted NVIDIA NIM Microservices on the public NVIDIA AI Endpoints. See NVIDIA API Keys for detailed steps.
- NGC API Key for NGC container downloads and resources.
- Linux operating system (Ubuntu 22.04 or later recommended)
- Docker
- Docker Compose
Open the ace_controller.env file, and configure the following environment variables:
| Environment Variable | Description | Required/Optional |
|---|---|---|
| `NGC_API_KEY` | Required for downloading the Riva ASR and TTS NIM containers from NGC for self-hosting. See the NGC documentation on generating one. | Required |
| `NVIDIA_API_KEY` | Required to access the NVIDIA API catalog public endpoints (build.nvidia.com) if utilizing the public endpoints for the Riva ASR and TTS NIMs. See NVIDIA API Keys for generating one. | Required |
| `CONFIG_PATH` | Configuration file for the speech pipeline. Option 1: when utilizing the public NVIDIA AI Endpoints for the Riva ASR and TTS NIMs, set `CONFIG_PATH=./configs/config_riva_public_endpoints.yaml`. Option 2: when self-hosting the Riva ASR NIM, set `CONFIG_PATH=./configs/config_riva_self_hosting.yaml`. See Bot Pipeline Customizations for more info. | Required |
| `REQUEST_TIMEOUT` | Timeout threshold in seconds when waiting for a response from the agent backend. | Required. Suggest leaving as the default. |
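Putting the table together, a minimal ace_controller.env might look like the following. The key values and the timeout shown are placeholders, not shipped defaults, and the `CONFIG_PATH` shown assumes the public-endpoint option:

```
# ace_controller.env -- illustrative values, replace the keys with your own
NGC_API_KEY=nvapi-xxxxxxxxxxxxxxxx        # from NGC, for pulling NIM containers
NVIDIA_API_KEY=nvapi-yyyyyyyyyyyyyyyy     # from build.nvidia.com, for public endpoints
CONFIG_PATH=./configs/config_riva_public_endpoints.yaml
REQUEST_TIMEOUT=30                        # seconds; placeholder, keep your default
```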
There are multiple service components in the agentic patient front desk assistant application, all of which are deployed through Docker Compose.
- Riva Parakeet ASR Service (optional for local self hosting)
- Riva Magpie TTS Service (optional for local self hosting)
- Agentic Patient Front Desk Service
- The Speech Service
- The UI Service
Key components of the services are configured as environment variables in the ace_controller.env file and as configuration parameters under configs/config_*.yaml before deploying the application.
If you decided to utilize the public endpoints for the Riva ASR and TTS NIMs, and have set CONFIG_PATH=./configs/config_riva_public_endpoints.yaml in your ace_controller.env file, skip this step.
If you decided to self-host the Riva ASR and TTS NIMs, and have set CONFIG_PATH=./configs/config_riva_self_hosting.yaml in your ace_controller.env file, stand up the Riva NIMs now with the commands:
```shell
docker compose up -d riva-asr-parakeet
docker compose up -d riva-tts-magpie
```

Note: It could take up to 6 minutes for the status of the NIM containers to change from `starting` to `healthy`.
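If you script the bring-up, you can poll the container health status before starting the rest of the stack. A minimal sketch using `docker inspect` (the container names follow the compose services above; the helper and its timeout are illustrative, not part of the shipped tooling):

```python
import subprocess
import time

def wait_healthy(name: str, timeout_s: int = 360, poll_s: int = 5) -> bool:
    """Poll `docker inspect` until the named container reports a healthy status."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = subprocess.run(
            ["docker", "inspect", "--format", "{{.State.Health.Status}}", name],
            capture_output=True, text=True,
        )
        if result.stdout.strip() == "healthy":
            return True
        time.sleep(poll_s)
    return False

# Usage (requires Docker and the containers started above):
# assert wait_healthy("riva-asr-parakeet")
# assert wait_healthy("riva-tts-magpie")
```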
```shell
docker compose --profile ace-controller up --build -d
```

Note: To enable microphone access in Chrome, go to chrome://flags/, enable "Insecure origins treated as secure", add http://<machine-ip>:4400 to the list, and restart Chrome.
If you want to share the application widely or deploy it on cloud platforms, you will need to set up a Coturn server. Follow the instructions in section 5, Set up Coturn Server, in the ace-controller/examples/voice_agent_webrtc README for the modifications required to use Coturn.
You may customize ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) services by configuring the config_*.yaml files in configs/.
The choice between the two .yaml files as-is allows you to switch between NIM public endpoint hosted models and locally deployed models:

- When `CONFIG_PATH=./configs/config_riva_public_endpoints.yaml` is specified, the yaml file as-is sets the following:
  - `RivaASRService.server="grpc.nvcf.nvidia.com:443"`
  - `RivaASRService.function_id="1598d209-5e27-4d3c-8079-4751568b1081"`
  - `RivaTTSService.server="grpc.nvcf.nvidia.com:443"`
  - `RivaTTSService.function_id="877104f7-e885-42b9-8de8-f6e4c6303969"`
- When `CONFIG_PATH=./configs/config_riva_self_hosting.yaml` is specified, the yaml file as-is sets the following:
  - `RivaASRService.server="riva-asr-parakeet:50052"`
  - `RivaASRService.model="parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer"`
  - `RivaTTSService.server="riva-tts-magpie:50051"`
  - `RivaTTSService.model="magpie_tts_ensemble-Magpie-Multilingual"`
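The switch between the two modes reduces to which yaml file CONFIG_PATH points at. The sketch below simply restates the two sets of values as Python dictionaries to make the mapping explicit; the helper function is hypothetical and not part of the application:

```python
# Hypothetical helper restating the two config variants listed above.
CONFIGS = {
    "./configs/config_riva_public_endpoints.yaml": {
        "RivaASRService": {
            "server": "grpc.nvcf.nvidia.com:443",
            "function_id": "1598d209-5e27-4d3c-8079-4751568b1081",
        },
        "RivaTTSService": {
            "server": "grpc.nvcf.nvidia.com:443",
            "function_id": "877104f7-e885-42b9-8de8-f6e4c6303969",
        },
    },
    "./configs/config_riva_self_hosting.yaml": {
        "RivaASRService": {
            "server": "riva-asr-parakeet:50052",
            "model": "parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer",
        },
        "RivaTTSService": {
            "server": "riva-tts-magpie:50051",
            "model": "magpie_tts_ensemble-Magpie-Multilingual",
        },
    },
}

def resolve_services(config_path: str) -> dict:
    """Return the ASR/TTS service settings implied by a CONFIG_PATH value."""
    return CONFIGS[config_path]
```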
For more details on configuration options, such as language options for ASR/TTS, sample_rate for ASR, and voice_id options for TTS, refer to the NIM NVIDIA Magpie and NIM NVIDIA Parakeet documentation.
You can change the ASR/TTS models to different ones if you'd like. Browse build.nvidia.com for the available ASR/TTS NIMs. You can find the function_id for each model in the example requests under the Try API tab on build.nvidia.com, for example https://build.nvidia.com/nvidia/magpie-tts-multilingual/api.
In ipa.json, you can add custom word-pronunciation pairs to the dictionary, written in the International Phonetic Alphabet (IPA) standard.
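For example, a hypothetical entry in ipa.json might look like the following (the word and its transcription are illustrative, not shipped defaults):

```json
{
  "ibuprofen": "ˌaɪbjuːˈproʊfən"
}
```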
Speculative speech processing is a feature of Riva ASR that reduces bot response latency by acting directly on early interim user transcripts instead of waiting for final transcripts. In our application, we disable this feature by default because our healthcare agent backend in LangGraph retains memory of every interaction between the user and the agent. With speculative speech enabled, two similar requests are sent to the agent backend, one for the interim ASR transcript and one for the final ASR transcript, and each receives a response. With both interactions logged in the agent memory, this can cause significant confusion. If connected to a backend that does not save memory per request/response, speculative speech can reduce latency without causing any issues.
If you are customizing for an agentic system that can tolerate one request for the interim ASR transcript and one request for the final ASR transcript, such as a simple LLM backend, you can add the ENABLE_SPECULATIVE_SPEECH environment variable and set it to true in the ace_controller.env file to enable speculative speech processing.
See the documentation on Speculative Speech Processing for more details.
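To see why the duplicate requests matter, consider a toy model of a memory-backed agent versus a stateless one. This is purely illustrative, not the actual LangGraph backend:

```python
class MemoryAgent:
    """Toy agent that, like a memory-backed backend, logs every request."""
    def __init__(self):
        self.memory = []

    def respond(self, transcript: str) -> str:
        self.memory.append(transcript)
        return f"reply to: {transcript}"

class StatelessAgent:
    """Toy agent with no per-request memory, e.g. a simple LLM call."""
    def respond(self, transcript: str) -> str:
        return f"reply to: {transcript}"

# With speculative speech, both the interim and the final transcript
# reach the backend as separate requests.
interim, final = "book an appoint", "book an appointment"

agent = MemoryAgent()
agent.respond(interim)
agent.respond(final)
# The memory now holds two near-duplicate user turns, which can confuse
# later reasoning over the conversation history; a stateless backend
# answers each request independently and keeps nothing.
assert len(agent.memory) == 2
```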
By default, our application does not save any audio files from the user-agent interaction. If you would like to enable audio file saving, add the environment variable DUMP_AUDIO_FILES and set it to true in the ace_controller.env file.

