diff --git a/INSTALL.md b/INSTALL.md index 60e2053..00fecc5 100644 --- a/INSTALL.md +++ b/INSTALL.md @@ -361,15 +361,18 @@ Ollama serves on `http://localhost:11434/v1` by default. In the UI, set: vLLM provides high-throughput serving with GPU acceleration. Example with Cosmos-Reason2 for vision: -**[Cosmos-Reason2-8B on Jetson AI Lab](https://www.jetson-ai-lab.com/models/cosmos-reason2-8b/)** — full setup including model download and platform-specific Docker images. +**[Cosmos-Reason2-8B on Jetson AI Lab](https://www.jetson-ai-lab.com/models/cosmos-reason2-8b/)** — full setup including model download and platform-specific Docker images. The FP8 model is downloaded from NGC; you need an NGC account with access to the **nim** org (and often the **nvidia** team). If NGC download fails, see [NGC Cosmos model download fails](#ngc-cosmos-model-download-fails-completed-0-failed-n) below. Quick reference for Jetson Thor (after downloading the FP8 model per the link above): ```bash export MODEL_PATH="${HOME}/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8" +mkdir -p ~/.cache/vllm +sudo sysctl -w vm.drop_caches=3 sudo docker run -it --rm --runtime=nvidia --network host \ -v $MODEL_PATH:/models/cosmos-reason2-8b:ro \ + -v ${HOME}/.cache/vllm:/root/.cache/vllm \ ghcr.io/nvidia-ai-iot/vllm:0.14.0-r38.3-arm64-sbsa-cu130-24.04 \ vllm serve /models/cosmos-reason2-8b \ --served-model-name nvidia/cosmos-reason2-8b-fp8 \ @@ -378,13 +381,152 @@ sudo docker run -it --rm --runtime=nvidia --network host \ --reasoning-parser qwen3 \ --media-io-kwargs '{"video": {"num_frames": -1}}' \ --enable-prefix-caching \ - --port 8000 + --port 8010 ``` +The second volume `-v ${HOME}/.cache/vllm:/root/.cache/vllm` persists vLLM’s **torch.compile cache** on the host. The first run compiles kernels and writes them there; later runs reuse the cache and start faster. 
Create `~/.cache/vllm` **before** the first run (as in the example above) so it is owned by your user; otherwise the container may create it as root and you can hit permission issues later. + +`vm.drop_caches=3` frees **system (CPU) memory** (page cache, etc.); it does **not** free **GPU VRAM**. If you start vLLM a second time while the first container is still running, the GPU has no free VRAM and vLLM will fail with "Free memory on device cuda:0 (...) is less than desired". **Stop the first vLLM container** (e.g. Ctrl+C or `docker stop`) so the driver releases GPU memory, then start again. + +> **Port conflict with Riva**: The **Riva container** exposes ports **8000–8002** (and 8888, 50051). If you run both Riva and vLLM on the same machine, use a different vLLM port so they don't clash. The example above uses `--port 8010`; in the app set **LLM API Base** to `http://localhost:8010/v1`. If Riva is not running, `--port 8000` is fine. +> > **Memory tuning**: On shared-memory systems (Jetson), lower `--gpu-memory-utilization` to leave room for the OS, Riva, and the application. On discrete GPUs with dedicated VRAM, `0.8` is safe. > > **Desktop GPU / x86_64**: Use `vllm/vllm-openai:latest` or `nvcr.io/nvidia/vllm:latest` instead of the Jetson image. +### vLLM troubleshooting + +#### `OSError: [Errno 98] Address already in use` + +vLLM fails at startup with `sock.bind(addr) OSError: [Errno 98] Address already in use` when the API port (default **8000**) is already taken—for example by a previous vLLM run, another container, or another service. + +**1. Find what is using the port** + +```bash +# Default vLLM port is 8000; use your --port if different +lsof -i :8000 +# or +ss -tlnp | grep 8000 +# or +fuser 8000/tcp +``` + +If **`ss` shows port 8000 in LISTEN but `lsof` and `fuser` show no PID**, the process is usually **inside a Docker container**. 
List containers and look for one that has port 8000: + +```bash +docker ps -a +# Look for a container with 0.0.0.0:8000->8000/tcp or similar in PORTS +``` + +**2. Free the port or use another** + +- **Riva is using 8000** (container `riva-speech` exposes 8000–8002): Don't stop Riva. Start vLLM on a different port and point the app to it: + ```bash + # In the vllm serve command, use e.g.: + --port 8010 + + # In Multi-modal AI Studio, set LLM API Base to: + # http://localhost:8010/v1 + ``` +- **Another Docker container** (e.g. leftover vLLM): Stop and remove it if you don't need it: + ```bash + docker ps -a + docker stop + docker rm + # or: docker rm -f + ``` +- **Process on the host** (when lsof/fuser show a PID): Kill it: + ```bash + kill + # or: fuser -k 8000/tcp + ``` + +**3. Use a different port** + +If you need to keep whatever is on 8000, start vLLM with `--port 8010` (or another free port) and set the app's **LLM API Base** to `http://localhost:8010/v1`. + +#### `ValueError: Free memory on device cuda:0 (...) is less than desired GPU memory utilization` + +Another process (often a **previous vLLM container**) is still using the GPU, so there isn’t enough free VRAM. Stop the other process: if the first vLLM was started in another terminal, press **Ctrl+C** there, or run `docker ps` and `docker stop `. The driver may take **30–60 seconds** to release VRAM after the container exits; run `nvidia-smi` and wait until free memory is back to normal before starting vLLM again. `vm.drop_caches=3` only frees system RAM, not GPU VRAM. + +#### `ValidationError: Invalid repository ID or local directory specified: '/models/...'` + +vLLM fails during startup with a message like **Invalid repository ID or local directory specified: '/models/cosmos-reason2-8b'** when the model path inside the container is missing, wrong, or doesn't contain the expected config files. + +**1. 
Check the model directory on the host** + +Ensure `MODEL_PATH` points to the directory that contains the model files (e.g. `config.json` for Hugging Face–style models): + +```bash +echo $MODEL_PATH +ls -la "$MODEL_PATH" +# Must contain at least: config.json (and usually model weights, tokenizer files, etc.) +``` + +If the directory is missing or empty, download the model first (see [Cosmos-Reason2-8B on Jetson AI Lab](https://www.jetson-ai-lab.com/models/cosmos-reason2-8b/) or your model’s instructions). + +**2. Check the volume mount** + +The `docker run` command must mount that host path into the container path vLLM uses: + +```bash +# Example: host path -> container path /models/cosmos-reason2-8b +-v $MODEL_PATH:/models/cosmos-reason2-8b:ro +``` + +- Use an **absolute path** for `MODEL_PATH` (e.g. `$HOME/.cache/huggingface/hub/...`), not a relative one, so the mount is correct from any working directory. +- The path after the colon must match the path you pass to `vllm serve` (e.g. `vllm serve /models/cosmos-reason2-8b`). + +**3. Verify the container sees the files** + +Run a quick check that the mounted directory exists and has a config inside the container: + +```bash +docker run --rm -v "$MODEL_PATH:/models/cosmos-reason2-8b:ro" \ + ghcr.io/nvidia-ai-iot/vllm:0.14.0-r38.3-arm64-sbsa-cu130-24.04 \ + ls -la /models/cosmos-reason2-8b +``` + +You should see `config.json` and other model files. If the list is empty or "No such file or directory", fix `MODEL_PATH` or the mount path and try again. + +#### Fix Hugging Face cache permissions (root-owned) + +If `~/.cache/huggingface` or `~/.cache/huggingface/hub` is owned by **root** (e.g. created by [jetson-containers](https://github.com/dusty-nv/jetson-containers) or another tool running with `sudo`), commands run as your user (NGC CLI, Python, Hugging Face libraries) will get **Permission denied** when writing there. 
+ +**Fix:** make the cache tree owned by the current user: + +```bash +sudo chown -R $USER:$USER ~/.cache/huggingface +``` + +Then retry the download or command that was failing. To avoid the issue in the future, create the directory as your user before any tool that might run as root: `mkdir -p ~/.cache/huggingface/hub`. + +#### NGC Cosmos model download fails (Completed: 0, Failed: N) + +If `ngc registry model download-version "nim/nvidia/cosmos-reason2-8b:1208-fp8-static-kv8" --dest ~/.cache/huggingface/hub` fails with **Completed: 0, Failed: 14**: + +- **Permission denied when writing files:** If the debug log shows `[Errno 13] Permission denied` for paths under `~/.cache/huggingface/hub/`, the destination is likely **root-owned**. See [Fix Hugging Face cache permissions (root-owned)](#fix-hugging-face-cache-permissions-root-owned) above: run `sudo chown -R $USER:$USER ~/.cache/huggingface`, then retry. If you only need to fix the model subdirectory: `sudo chown -R $USER:$USER ~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8` (and ensure the parent `hub` is writable). Alternatively remove the partial dir and re-download: `rm -rf ~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8` then run the same `ngc registry model download-version ...` again, or use a different `--dest` you can write to. + +- **403 or auth/org errors:** If the debug log shows **403** or org/entitlement errors (rather than Permission denied), try setting the effective org to **nvidia**. The NGC CLI uses **`NGC_CLI_ORG`** from the environment. 
Example for `~/.bashrc`: + ```bash + export NGC_CLI_ORG=nvidia + # optional: export NGC_CLI_API_KEY= + ``` + Then in the same shell (or a new terminal after `source ~/.bashrc`): + ```bash + ngc config current # check effective org + ngc registry model download-version "nim/nvidia/cosmos-reason2-8b:1208-fp8-static-kv8" --dest ~/.cache/huggingface/hub + export MODEL_PATH="${HOME}/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8" + ``` + If the env var is not picked up, use `--org nvidia` on the command (next bullet). The download often succeeds with the default org (e.g. with `NGC_CLI_ORG` unset); only try `nvidia` if you see 403 or org/entitlement errors. + +- **Explicit org/team:** `ngc registry model download-version "nim/nvidia/cosmos-reason2-8b:1208-fp8-static-kv8" --org nim --team nvidia --dest ~/.cache/huggingface/hub` + +- **Browser:** If you can download from the [catalog page](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/models/cosmos-reason2-8b?version=1208-fp8-static-kv8) in the browser, save the files into `~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8/` and set `MODEL_PATH` to that directory. + +- **Different machine:** If the same API key and `NGC_CLI_ORG=nvidia` work on one host but not another, the failing host may differ by network, NGC CLI version, or backend. Run with **`--debug`** to see the underlying error: `ngc --debug registry model download-version "nim/nvidia/cosmos-reason2-8b:1208-fp8-static-kv8" --dest ~/.cache/huggingface/hub`. Reliable workaround: **copy the model from the working machine** (e.g. from jat03): `rsync -avz jetson@jat03-iso384:~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8/ ~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8/` then set `MODEL_PATH` to that directory. + + ### Option C: OpenAI API No local setup needed. Set **API Base** to `https://api.openai.com/v1`, provide your API key, and choose a model (`gpt-4o` for vision, `gpt-4o-mini` for text). 
@@ -395,9 +537,9 @@ No local setup needed. Set **API Base** to `https://api.openai.com/v1`, provide # Ollama curl -s http://localhost:11434/v1/models | python3 -m json.tool -# vLLM -curl -s http://localhost:8000/v1/models | python3 -m json.tool -curl -s http://localhost:8000/health && echo "READY" || echo "NOT READY" +# vLLM (use your port if different, e.g. 8010 when Riva uses 8000) +curl -s http://localhost:8010/v1/models | python3 -m json.tool +curl -s http://localhost:8010/health && echo "READY" || echo "NOT READY" ``` ### Using Vision diff --git a/README.md b/README.md index 04adad3..29b09b3 100644 --- a/README.md +++ b/README.md @@ -2,50 +2,47 @@ ![](./docs/images/screenshot_example_2.png) -**Voice, Text, and Video AI Interface with Advanced Performance Analysis** +**Voice, text, and video conversational AI with session analysis and latency metrics** -Multi-modal AI Studio is a next-generation conversational AI interface designed for analyzing and optimizing voice AI systems. Built on NVIDIA Riva, OpenAI APIs, and other backends, it features sophisticated session management, real-time timeline visualization, and comprehensive latency metrics. +Multi-modal AI Studio is a conversational AI interface for building and tuning voice AI systems. It supports NVIDIA Riva, OpenAI, and other backends; records sessions with full config snapshots; and provides a real-time timeline and latency analysis (TTFA, turn-taking) to compare and optimize setups. 
## 🌟 Key Features ### Multi-modal Support -- **Voice Input/Output**: Streaming ASR and TTS via Riva or OpenAI -- **Text Chat**: Traditional text-based conversation -- **Video**: Camera feed for vision-enabled models (future) -- **Mixed Modes**: Voice-to-text, text-to-voice, or text-only +- **Voice**: Streaming ASR and TTS (Riva, OpenAI, or other backends) +- **Text**: Chat-only mode or combined with voice +- **Video**: Camera feed for vision-language models (VLM); browser WebRTC or server USB webcam +- **Mixed modes**: Voice-to-text, text-to-voice, voice-to-voice, or text-only ### Multi-backend Architecture -- **NVIDIA Riva**: gRPC streaming ASR/TTS -- **OpenAI**: REST API (Whisper, TTS) and Realtime API -- **Azure Speech**: Coming soon -- **Custom backends**: Extensible plugin system +- Speech + - **NVIDIA Riva**: gRPC streaming ASR/TTS (Jetson/ARM64) + - **OpenAI-compatible Realtime API**: Realtime API +- LLM: **OpenAI-compatible** REST API that works with many inference engines for various LLM/VLM models +- **Extensible**: Plugin-style backends; Azure Speech and others can be added ### Session Management -- **Configuration Snapshots**: Every session saves ASR/LLM/TTS configs -- **Timeline Recording**: Store performance data for offline analysis -- **Preset System**: Save and load configuration presets -- **Export/Import**: Generate CLI commands or YAML configs from WebUI +- **Config snapshots**: Every session stores ASR/LLM/TTS and device settings +- **Timeline recording**: Performance data for offline analysis +- **Presets**: Save and load configuration presets ### Performance Analysis -- **Real-time Timeline**: Multi-lane visualization (Audio, Speech, LLM, TTS) -- **Latency Metrics**: TTFA (Time to First Audio), turn-taking analysis -- **Comparison Mode**: Compare multiple sessions to optimize configs -- **Session Replay**: Analyze recorded timeline data +- **Real-time timeline**: Multi-lane view (Audio, Speech, LLM, TTS) +- **Latency metrics**: TTFA (Time to 
First Audio), turn-taking -### Flexible Deployment -- **WebUI Mode**: Rich browser interface (default) -- **Headless Mode**: CLI-only for production/automation (not yet implemented) -- **Audio/Video devices**: **Currently supported:** browser devices via WebRTC (mic, speaker, camera through the browser). **Not yet supported:** local USB microphone, USB speaker, or USB webcam attached to the server machine. +### UI & Devices +- **Chat-style UI**: Familiar layout, video full-screen mode, keyboard shortcuts. Most settings are exposed in the UI (ASR/LLM/TTS, models, devices) so you can tweak and switch backends without editing config files or code. +- **Devices**: Client-side (browser WebRTC) and server-side (Linux USB mic, USB speaker, USB webcam); choose in the Devices tab. +- **Headless** (experimental, not well tested): CLI with config file or args; see [INSTALL.md](INSTALL.md). ## 🚀 Quick Start ### Prerequisites -- Python 3.8+ -- **Audio/video**: Use the app in a browser; mic, speaker, and camera are accessed via WebRTC (browser devices). Local USB mic/speaker/webcam on the server are not supported yet. -- NVIDIA Riva (for Riva backend) - see [INSTALL.md](INSTALL.md#nvidia-riva-setup-for-voice-asrtts) -- OpenAI API key (for OpenAI backend) - optional -- **Optional**: `jq` (e.g. `apt install jq` or `brew install jq`) for pretty-formatted LLM request/response logs in the server console; without it, logs use plain JSON +- **Python 3.8+** +- **Audio/video**: Browser (WebRTC) for mic, speaker, and camera. On Linux, server **USB microphone**, **USB speaker**, and **USB webcam** are also supported; see [INSTALL.md](INSTALL.md). +- **Backends (as needed)**: [NVIDIA Riva](INSTALL.md#nvidia-riva-setup-for-voice-asrtts) for ASR/TTS; OpenAI API key for OpenAI/Realtime backends (optional). +- **Optional**: `jq` for pretty-printed LLM logs in the console (`apt install jq` or `brew install jq`). 
### Installation @@ -72,27 +69,9 @@ Full steps and troubleshooting: [INSTALL.md](INSTALL.md) ```bash # View sessions and timeline (no backend required) python -m multi_modal_ai_studio --port 8092 - -# With Riva ASR/TTS (use --asr-server and --tts-server) -python -m multi_modal_ai_studio \ - --port 8092 \ - --asr-server localhost:50051 \ - --tts-server localhost:50051 \ - --llm-api-base http://localhost:11434/v1 \ - --llm-model llama3.2:3b - -# With OpenAI Realtime API -python -m multi_modal_ai_studio \ - --port 8092 \ - --asr-scheme openai-realtime \ - --tts-scheme openai-realtime \ - --llm-api-key sk-... - -# With preset -python -m multi_modal_ai_studio --preset low-latency ``` -Open **http://localhost:8092** in your browser. +Open **http://localhost:8092** in your browser. For voice (Riva, OpenAI, etc.) and other options, see [INSTALL.md](INSTALL.md). ### Kill a Running Server @@ -107,30 +86,31 @@ lsof -i :8092 kill ``` -**Sessions and sample data** -By default the app loads and saves sessions in `sessions/`. To view or use the sample/mock session JSONs (e.g. in `mock_sessions/`), run with `--session-dir mock_sessions`. Open the app, then click a session in the sidebar to view its config and timeline. +### Sessions and sample data + +Sessions are stored in `sessions/` by default. To try sample timelines, run with `--session-dir mock_sessions` and open a session from the sidebar. -### Run Headless +### Run headless (experimental) + +CLI-only mode for automation or local audio devices. Requires the `[audio]` extra and device setup; see [INSTALL.md](INSTALL.md). ```bash -# From config file python -m multi_modal_ai_studio --mode headless --config my-config.yaml -# From CLI args -python -m multi_modal_ai_studio \ - --mode headless \ - --audio-input alsa:hw:0,0 \ - --audio-output alsa:hw:1,0 \ - --asr-scheme riva \ - --llm-model llama3.2:3b +# Or with CLI args (e.g. 
ALSA devices) +python -m multi_modal_ai_studio --mode headless \ + --audio-input alsa:hw:0,0 --audio-output alsa:hw:1,0 \ + --asr-scheme riva --llm-model llama3.2:3b ``` ## 📖 Documentation -- [VLM Guide](docs/vlm_guide.md) — Vision-Language Model setup, input modes, frame capture, and tuning -- [Riva Setup](docs/setup_riva.md) — NVIDIA Riva ASR/TTS installation and configuration -- [Architecture](docs/architecture.md) — System design and component overview -- [Installation](INSTALL.md) — Full installation steps and troubleshooting +| Doc | Description | +|-----|-------------| +| [INSTALL.md](INSTALL.md) | Installation, backends, and troubleshooting | +| [Riva Setup](docs/setup_riva.md) | NVIDIA Riva ASR/TTS (Jetson/ARM64) | +| [VLM Guide](docs/vlm_guide.md) | Vision-language models, frame capture, tuning | +| [Architecture](docs/architecture.md) | System design and components | ## 🤝 Contributing @@ -140,6 +120,3 @@ This project is under active development. Issues, pull requests, and feedback ar Apache License 2.0 - See [LICENSE](LICENSE) file for details. -## 🙏 Acknowledgments - -Built on top of proven concepts from [Live RIVA WebUI](https://github.com/yourusername/live-riva-webui). diff --git a/docs/setup_riva.md b/docs/setup_riva.md index 391b3ff..74252d8 100644 --- a/docs/setup_riva.md +++ b/docs/setup_riva.md @@ -21,13 +21,57 @@ This guide walks through setting up NVIDIA Riva locally for voice (ASR/TTS). It - **Jetson Platform**: Jetson Orin, Thor, AGX Xavier, or newer (ARM64/L4T) - **JetPack**: Recent JetPack version (6.0+ recommended) -- **Docker + NVIDIA Container Toolkit**: Pre-installed on JetPack +- **Docker + NVIDIA Container Toolkit**: Pre-installed on JetPack. The ARM64 quickstart uses **plain Docker only** (`docker run`, `docker exec`, etc.) — **Docker Compose is not required**. 
- **NGC account with Riva access**: Required for downloading Riva resources - Try your account that has Riva entitlements (company or personal) - NVIDIA employees: Internal access may require specific team membership - External users: May need AI Enterprise trial or proper entitlements - **Tip**: If one account doesn't work, try another you have access to +## Configure Docker for GPU (Jetson) — do this first + +Riva runs in a container that needs GPU access. On Jetson, Docker must use the **NVIDIA Container Runtime** and be **restarted** after any config change. Doing this once at the start avoids the "container stays Created" / "use --runtime=nvidia instead" errors. + +1. **Ensure `/etc/docker/daemon.json` has the NVIDIA runtime and default** + + If the file doesn't exist or is empty, create it. Otherwise merge the `runtimes` and `default-runtime` into your existing config: + + ```json + { + "runtimes": { + "nvidia": { + "path": "nvidia-container-runtime", + "runtimeArgs": [] + } + }, + "default-runtime": "nvidia" + } + ``` + + Example (create or edit with sudo): + + ```bash + sudo nano /etc/docker/daemon.json + ``` + +2. **Restart Docker so the config is applied** + + **This step is required.** Changes to `daemon.json` do not apply until Docker is restarted. + + ```bash + sudo systemctl restart docker + ``` + +3. **Optional: verify GPU access in a container** + + ```bash + docker run --rm --runtime=nvidia nvcr.io/nvidia/cuda:13.0.0-runtime-ubuntu24.04 nvidia-smi + ``` + + You should see your GPU; if not, check NVIDIA Container Toolkit and JetPack install. + +Then continue with Part 1 (NGC CLI) below. + ## Part 1: Install NGC CLI The NGC CLI is required to download Riva's quickstart bundle from NVIDIA's catalog. @@ -285,30 +329,28 @@ Preparing model repository... 
## Part 6: Start Riva Services (Jetson) +Run from inside the quickstart directory (so `config.sh` is found): + ```bash cd riva_quickstart_arm64_v2.24.0 bash riva_start.sh ``` -This launches the Riva server via Docker Compose. Services: -- **riva-speech**: gRPC server on port `50051` (ASR/TTS) -- **riva-client**: Client container with sample scripts and test files +This launches the Riva server via **Docker** (the script uses `docker run`; no Docker Compose). One container is started: +- **riva-speech**: gRPC server on port `50051` (ASR/TTS). A client shell or sample scripts are available separately via `riva_start_client.sh` (see Part 7). **Note for USB audio**: If using USB microphone/speaker, connect it **before** running `riva_start.sh`. The script will automatically mount it into the container. ### Verify Deployment ```bash -# Check container status -docker compose ps - -# Expected output: -# NAME STATUS -# riva-speech Up X minutes -# riva-client Up X minutes +# Check that the riva-speech container is running (name comes from config.sh) +docker ps -f "name=riva-speech" -# Check logs -docker compose logs -f riva-speech +# Check logs if anything looks wrong +docker logs riva-speech +# Follow logs in real time: +docker logs -f riva-speech ``` Look for successful startup message: @@ -340,7 +382,7 @@ riva_streaming_asr_client --list_models # 'en-US': 'parakeet-1.1b-en-us-asr-streaming' ``` -**Note**: Riva 2.24.0 on Jetson defaults to **Parakeet 1.1b**, which is optimized for low-latency streaming ASR. This is the recommended model for real-time voice applications like Live RIVA WebUI. +**Note**: Riva 2.24.0 on Jetson defaults to **Parakeet 1.1b**, which is optimized for low-latency streaming ASR. This is the recommended model for real-time voice applications (e.g. Multi-modal AI Studio, Live RIVA WebUI). 
### Test Streaming ASR (Primary mode for Parakeet 1.1b) @@ -384,7 +426,7 @@ Throughput: 8.3569e+00 RTFX **Streaming ASR is the primary mode for Riva 2.24.0 on Jetson**: - Low latency (~100-200ms) - Real-time interim results -- Optimized for conversational AI applications like Live RIVA WebUI +- Optimized for conversational AI applications (e.g. Multi-modal AI Studio, Live RIVA WebUI) ### Test with Opus file (WebRTC codec) @@ -456,7 +498,7 @@ This stops and removes containers while preserving downloaded models in the `riv ``` ┌─────────────────────────────────────────────────────────────┐ -│ Live RIVA WebUI │ +│ Voice apps (Multi-modal AI Studio / Live RIVA WebUI) │ │ ┌──────────────┐ ┌──────────────┐ │ │ │ Browser │◄───────►│ WebUI │ │ │ │ (WebRTC) │ WS/RTC │ Server │ │ @@ -465,7 +507,7 @@ This stops and removes containers while preserving downloaded models in the `riv ├───────────────────────────────────┼─────────────────────────┤ │ Docker │ │ │ ┌────────────────────────────────▼───────────────────┐ │ -│ │ riva-speech-api (port 50051) │ │ +│ │ riva-speech (port 50051) │ │ │ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │ │ │ │ ASR │ │ TTS │ │ NMT │ │ │ │ │ │ StreamingR │ │ Synthesize │ │ Translate │ │ │ @@ -484,11 +526,11 @@ This stops and removes containers while preserving downloaded models in the `riv ## Part 9: WebRTC and Opus Audio -### Why Opus matters for Live RIVA WebUI +### Why Opus matters for voice applications -**Opus is WebRTC's standard audio codec** - all modern browsers encode microphone audio as Opus by default. Riva's inclusion of Opus sample files (`/opt/riva/wav/en-US_sample.opus`) confirms it can handle this codec natively. +**Opus is WebRTC's standard audio codec** — all modern browsers encode microphone audio as Opus by default. Riva's inclusion of Opus sample files (`/opt/riva/wav/en-US_sample.opus`) confirms it can handle this codec natively. 
-For Live RIVA WebUI, the audio flow will be: +For Multi-modal AI Studio and Live RIVA WebUI, the audio flow is: ``` Browser (WebRTC) → Opus audio → WebSocket → Bridge → PCM → Riva gRPC → Transcripts ``` @@ -506,13 +548,13 @@ NVIDIA provides an **open-source WebSocket ↔ Riva bridge**: [nvidia-riva/webso **Implementation**: JavaScript/Node.js -For Live RIVA WebUI, we can either: +For Multi-modal AI Studio and Live RIVA WebUI, options include: 1. Use the nvidia-riva/websocket-bridge as-is -2. Build a Python version integrated into our existing async server (reusing Live VLM WebUI's WebRTC scaffolding) +2. Build a Python version integrated into the existing async server (e.g. reusing Live VLM WebUI's WebRTC scaffolding) -## Next Steps for Live RIVA WebUI +## Next Steps (Multi-modal AI Studio / Live RIVA WebUI) -1. **Audio Bridge**: Build WebSocket/WebRTC → gRPC adapter +1. **Audio Bridge**: WebSocket/WebRTC → gRPC adapter - Accept Opus audio from browser - Decode Opus → PCM (or use Riva's native Opus support) - Stream to `riva_asr.StreamingRecognize` gRPC @@ -526,7 +568,7 @@ For Live RIVA WebUI, we can either: - `riva_tts.SynthesizeOnline` gRPC - Send audio back to browser via WebRTC -4. **UI**: React/TypeScript frontend +4. **UI**: Web frontend (e.g. React/TypeScript) - Mic capture (WebRTC audio) - Live captions overlay - Chat transcript panel @@ -536,6 +578,195 @@ For Live RIVA WebUI, we can either: ## Troubleshooting +### "Waiting for Riva server to load all models... retrying in 10 seconds" (never finishes) + +The Riva server container is not becoming healthy within the timeout. The quickstart uses **plain Docker** (no Compose); troubleshoot as follows. + +1. **Run from the quickstart directory** (so `config.sh` is loaded) + ```bash + cd /path/to/riva_quickstart_arm64_v2.24.0 + bash riva_start.sh + ``` + +2. 
**Check container status** + ```bash + docker ps -a -f "name=riva-speech" + ``` + - If **riva-speech** is missing or status is **Exited**: the container failed. Check logs (step 3). + - If it is **Up** but the script still retries: health check may be slow (first load can take several minutes), or the server may be failing internally — check logs. + +3. **Inspect riva-speech logs** + The script suggests: `docker logs riva-speech`. If that shows nothing, see [Health ready check failed and empty logs](#health-ready-check-failed-and-empty-docker-logs-riva-speech). + ```bash + docker logs riva-speech + docker logs --tail=200 riva-speech + ``` + Look for: + - **GPU / CUDA errors**: Ensure `nvidia-smi` works and NVIDIA Container Toolkit is installed. + - **Model not found / path errors**: Re-run `bash riva_init.sh` and ensure it completed without errors. + - **Out of memory**: Jetson may need more swap or fewer models; disable TTS or NLP in `config.sh` to reduce memory. + +4. **Restart cleanly** + ```bash + bash riva_stop.sh + bash riva_start.sh + ``` + In another terminal, run `docker logs -f riva-speech` to watch startup output. + +### "Health ready check failed" and empty `docker logs riva-speech` + +The script suggests `docker logs riva-speech` (the container name is set in `config.sh` as `riva_daemon_speech="riva-speech"`). If that command prints **nothing**, check the container **STATUS** with `docker ps -a -f "name=riva-speech"`: + +- **STATUS = Created** → The container was created but **never started** (main process never ran). See [Container stuck in Created](#container-stuck-in-created-never-started) below. +- **STATUS = Exited** → The process ran then exited; see step 2 below. +- **STATUS = Up** → Container is running; logs may appear after a short delay, or try `docker logs -f riva-speech`. + +1. **Confirm the container exists and its name** + ```bash + docker ps -a | grep -i riva + ``` + The quickstart creates a container named **riva-speech**. 
If you see a different name (e.g. from an older run or custom config), use that: + ```bash + docker logs + ``` + +2. **Container exited immediately** + If the container is **Exited**, it may have crashed before writing much. You can still try: + ```bash + docker logs riva-speech + docker logs --tail=200 riva-speech + ``` + Exited containers often keep stdout/stderr; if logs are still empty, the process may have died before any output. Run again and watch in real time: + ```bash + bash riva_stop.sh + bash riva_start.sh + ``` + In a second terminal, as soon as the container starts: + ```bash + docker logs -f riva-speech + ``` + Look for GPU/CUDA, model path, or OOM errors in the first lines. + +### Container stuck in **Created** (never started) + +If `docker ps -a -f "name=riva-speech"` shows **STATUS = Created** (and no "Up" time), the container was created by `docker run -d` but the main process never started. There are no logs because the entrypoint hasn't run. Common causes: missing or inaccessible device (e.g. GPU, USB/sound), volume mount failure, or Docker/runtime blocking start. + +**Do this:** + +1. **Remove the stuck container and try again from the quickstart directory** + ```bash + docker rm -f riva-speech + cd /path/to/riva_quickstart_arm64_v2.24.0 + bash riva_start.sh + ``` + In a second terminal, watch for the container to go from Created → Up and then stream logs: + ```bash + watch -n 1 'docker ps -a -f "name=riva-speech"' + # When STATUS becomes "Up", run: + docker logs -f riva-speech + ``` + +2. **If it stays in Created again**, try starting it manually to see the error: + ```bash + docker start riva-speech + docker logs -f riva-speech + ``` + If `docker start` fails or logs show nothing, inspect the container: + ```bash + docker inspect riva-speech + ``` + Check **`State.Error`** for the exact message. A very common one on Jetson is below. + +3. **If you see: "invoking the NVIDIA Container Runtime Hook directly ... 
use the NVIDIA Container Runtime (--runtime=nvidia) instead"** + See [Riva container stays "Created": use NVIDIA Container Runtime](#riva-container-stays-created-use-nvidia-container-runtime) below. + +4. **Verify GPU and devices** + The Riva start script passes `--gpus` and on Tegra also `--device /dev/bus/usb --device /dev/snd`. Ensure: + - `nvidia-smi` works and NVIDIA Container Toolkit is installed. + - No security profile (e.g. AppArmor) is blocking device access. + - If you don't need USB/sound for the server, you could temporarily comment out the extra `--device` flags in `riva_start.sh` to see if the container then starts (for debugging only). + +### Container starts then exits immediately (or stays "Created") + +If the container goes **Created** and never shows **Up**, or it exits so quickly that `docker logs riva-speech` is empty, the script is hiding the error: it runs `docker run -d ... &> /dev/null`, so all output is discarded. Run the same container **in the foreground** so you see the real error (CUDA, model path, OOM, etc.): + +```bash +cd /path/to/riva_quickstart_arm64_v2.24.0 +source config.sh + +# Remove any existing container so we can use the same name +docker rm -f riva-speech 2>/dev/null + +# Same as riva_start.sh but -it (foreground) and no -d; output goes to your terminal +docker run -it --rm \ + --init --ipc=host \ + --gpus "$gpus_to_use" \ + -p $riva_speech_api_port:$riva_speech_api_port \ + -p $riva_speech_api_http_port:$riva_speech_api_http_port \ + -e RIVA_SERVER_HTTP_PORT=$riva_speech_api_http_port \ + -e "LD_PRELOAD=$ld_preload" \ + -e "RIVA_API_KEY=$RIVA_API_KEY" \ + -e "RIVA_API_NGC_ORG=$RIVA_API_NGC_ORG" \ + -e "RIVA_EULA=$RIVA_EULA" \ + -v $riva_model_loc:/data \ + --ulimit memlock=-1 --ulimit stack=67108864 \ + -p 8000:8000 -p 8001:8001 -p 8002:8002 -p 8888:8888 \ + $image_speech_api \ + start-riva --riva-uri=0.0.0.0:$riva_speech_api_port \ + --asr_service=$service_enabled_asr \ + --tts_service=$service_enabled_tts \ + 
--nlp_service=$service_enabled_nlp +``` + +(On Tegra the script also adds `--device /dev/bus/usb --device /dev/snd`; if the command above runs and you need those, add them before `$image_speech_api`.) + +- **What you see** is the real failure (e.g. "could not load model", "CUDA error", "No such file", OOM). Fix that and then use `bash riva_start.sh` again. +- **If it stays in Created** even with this foreground run, the failure is before the process starts (e.g. device or runtime); check `docker events` in another terminal and run the `docker run` above to see the event error. + +### Riva container stays "Created": use NVIDIA Container Runtime + +If `docker inspect riva-speech` shows in **State.Error** something like: + +```text +invoking the NVIDIA Container Runtime Hook directly (e.g. specifying the docker --gpus flag) is not supported. +Please use the NVIDIA Container Runtime (e.g. specify the --runtime=nvidia flag) instead: unknown +``` + +then Docker on this host is set up to use the **NVIDIA Container Runtime** (full runtime), not the hook used by `--gpus`. The container never starts because the runtime rejects the `--gpus`-based GPU setup. + +**Fix A — Configure Docker to use the NVIDIA runtime by default (recommended)** +Ensure `/etc/docker/daemon.json` has the nvidia runtime and set it as default: + +```json +{ + "runtimes": { + "nvidia": { + "path": "nvidia-container-runtime", + "runtimeArgs": [] + } + }, + "default-runtime": "nvidia" +} +``` + +If the file already has `"runtimes": { "nvidia": ... }` but no `"default-runtime": "nvidia"`, add that. Then: + +```bash +sudo systemctl restart docker +``` + +After that, run `bash riva_start.sh` again. + +**Fix B — Workaround: use `--runtime=nvidia` in the start script** +If you prefer not to change the default runtime, patch `riva_start.sh` so the container uses the nvidia runtime on Tegra instead of `--gpus`: + +1. Open `riva_start.sh` in your quickstart directory. +2. 
Find the line: `--gpus '"'$gpus_to_use'"' \` +3. Replace it with: `--runtime=nvidia \` + (This is safe for Jetson/Tegra; the nvidia runtime gives the container GPU access.) + +Then run `bash riva_start.sh` again. + ### "403 Forbidden" when downloading quickstart - **Cause**: NGC account lacks Riva entitlement @@ -552,7 +783,7 @@ For Live RIVA WebUI, we can either: - **Verify GPU**: `nvidia-smi` should show your GPU - **Check toolkit**: `docker run --rm --gpus all ubuntu nvidia-smi` -- **Review logs**: `docker compose logs riva-speech-api` +- **Review logs**: `docker logs riva-speech` (container name from `config.sh`: `riva_daemon_speech`) ### Models downloading very slowly @@ -570,7 +801,7 @@ For Live RIVA WebUI, we can either: --- **Document Status**: Updated for Jetson ARM64 deployment (x86 support discontinued) -**Last Updated**: January 2025 +**Last Updated**: March 2025 **Riva Version**: 2.24.0 (ARM64) **Platform**: NVIDIA Jetson Thor (JAT03) diff --git a/presets/cosmos-reason.yaml b/presets/cosmos-reason.yaml index 2d2966f..99c4850 100644 --- a/presets/cosmos-reason.yaml +++ b/presets/cosmos-reason.yaml @@ -15,7 +15,7 @@ asr: llm: scheme: openai - api_base: http://localhost:8003/v1 + api_base: http://localhost:8010/v1 model: /model temperature: 0.3 # Low temp — critical for precise, consistent vision responses max_tokens: 512 # Hard cap on reasoning+answer combined; model uses ~150-275 total diff --git a/pyproject.toml b/pyproject.toml index e6118a2..ddbd68a 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -3,6 +3,7 @@ requires = ["setuptools>=61.0", "wheel"] build-backend = "setuptools.build_meta" [project] +# Distribution name (hyphens); Python package is multi_modal_ai_studio (underscores) name = "multi-modal-ai-studio" version = "0.1.0" description = "Multi-modal AI interface with voice, text, and video support for analyzing conversational AI systems" @@ -10,7 +11,9 @@ readme = "README.md" requires-python = ">=3.8" license = {text = "Apache-2.0"} 
authors = [ - {name = "Your Name", email = "your.email@example.com"} + {name = "Chitoku YATO (tokk-nv)", email = "cyato@nvidia.com"}, + {name = "Aditya Sahu (adsahu-nv)", email = "adsahu@nvidia.com"}, + {name = "kbenkhaled", email = "kbenkhaled@nvidia.com"}, ] keywords = ["ai", "voice", "multimodal", "riva", "nvidia", "openai", "conversational-ai"] classifiers = [ @@ -61,10 +64,10 @@ webrtc-camera = [ multi-modal-ai-studio = "multi_modal_ai_studio.cli.main:main" [project.urls] -Homepage = "https://github.com/yourusername/multi-modal-ai-studio" -Documentation = "https://github.com/yourusername/multi-modal-ai-studio/tree/main/docs" -Repository = "https://github.com/yourusername/multi-modal-ai-studio" -Issues = "https://github.com/yourusername/multi-modal-ai-studio/issues" +Homepage = "https://github.com/NVIDIA-AI-IOT/multi_modal_ai_studio" +Documentation = "https://github.com/NVIDIA-AI-IOT/multi_modal_ai_studio/tree/main/docs" +Repository = "https://github.com/NVIDIA-AI-IOT/multi_modal_ai_studio" +Issues = "https://github.com/NVIDIA-AI-IOT/multi_modal_ai_studio/issues" [tool.setuptools.packages.find] where = ["src"] diff --git a/src/multi_modal_ai_studio/backends/llm/openai.py b/src/multi_modal_ai_studio/backends/llm/openai.py index ff821be..8389660 100644 --- a/src/multi_modal_ai_studio/backends/llm/openai.py +++ b/src/multi_modal_ai_studio/backends/llm/openai.py @@ -346,6 +346,7 @@ async def list_available_models(self) -> List[str]: """List available models from the LLM API. Attempts to detect models from Ollama's native API or OpenAI endpoint. + We probe /api/tags first (Ollama can run on any port), then fall back to /v1/models. 
Returns: List of model names, or empty list if detection fails @@ -396,6 +397,13 @@ async def list_available_models(self) -> List[str]: # Vision content formatting — one method per API format # ----------------------------------------------------------------- + def _api_supports_video_url(self) -> bool: + """True if the API is known to support video_url. Ollama only supports image_url.""" + base = (self.api_base or "").lower() + if ":11434" in base or "ollama" in base: + return False + return True + def _build_vision_content( self, image_data_urls: List[str], @@ -404,15 +412,14 @@ def _build_vision_content( ) -> list: """Build the multimodal ``content`` list for a user message. - Two paths controlled purely by the user's ``vision_video_encode`` config: - - True → encode frames as MP4 video (``video_url``), fall back to images on failure - - False → send individual images (``image_url``) - - No backend detection, no sub-sampling, no hardcoded caps. + Two paths: video (video_url) when enabled and API supports it, else images (image_url). + Ollama does not support video_url and returns 400; we send image_url only for it. """ content: list = [{"type": "text", "text": prompt}] - use_video = bool(getattr(self.config, "vision_video_encode", False)) + use_video = bool(getattr(self.config, "vision_video_encode", False)) and self._api_supports_video_url() + if getattr(self.config, "vision_video_encode", False) and not self._api_supports_video_url(): + self.logger.debug("Vision: API does not support video_url (e.g. Ollama); using image_url only") if use_video and len(image_data_urls) >= 2: video_url = _encode_images_to_video_base64( diff --git a/src/multi_modal_ai_studio/webui/static/app.js b/src/multi_modal_ai_studio/webui/static/app.js index f2bf910..4fb7fe3 100644 --- a/src/multi_modal_ai_studio/webui/static/app.js +++ b/src/multi_modal_ai_studio/webui/static/app.js @@ -1295,7 +1295,7 @@ function renderLLMConfig(config, readonly = false) {