Merged
152 changes: 147 additions & 5 deletions INSTALL.md
@@ -361,15 +361,18 @@ Ollama serves on `http://localhost:11434/v1` by default. In the UI, set:

vLLM provides high-throughput serving with GPU acceleration. Example with Cosmos-Reason2 for vision:

**[Cosmos-Reason2-8B on Jetson AI Lab](https://www.jetson-ai-lab.com/models/cosmos-reason2-8b/)** — full setup including model download and platform-specific Docker images.
**[Cosmos-Reason2-8B on Jetson AI Lab](https://www.jetson-ai-lab.com/models/cosmos-reason2-8b/)** — full setup including model download and platform-specific Docker images. The FP8 model is downloaded from NGC; you need an NGC account with access to the **nim** org (and often the **nvidia** team). If NGC download fails, see [NGC Cosmos model download fails](#ngc-cosmos-model-download-fails-completed-0-failed-n) below.

Quick reference for Jetson Thor (after downloading the FP8 model per the link above):

```bash
export MODEL_PATH="${HOME}/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8"

mkdir -p ~/.cache/vllm
sudo sysctl -w vm.drop_caches=3
sudo docker run -it --rm --runtime=nvidia --network host \
-v $MODEL_PATH:/models/cosmos-reason2-8b:ro \
-v ${HOME}/.cache/vllm:/root/.cache/vllm \
ghcr.io/nvidia-ai-iot/vllm:0.14.0-r38.3-arm64-sbsa-cu130-24.04 \
vllm serve /models/cosmos-reason2-8b \
--served-model-name nvidia/cosmos-reason2-8b-fp8 \
@@ -378,13 +381,152 @@ sudo docker run -it --rm --runtime=nvidia --network host \
--reasoning-parser qwen3 \
--media-io-kwargs '{"video": {"num_frames": -1}}' \
--enable-prefix-caching \
--port 8000
--port 8010
```

The second volume `-v ${HOME}/.cache/vllm:/root/.cache/vllm` persists vLLM’s **torch.compile cache** on the host. The first run compiles kernels and writes them there; later runs reuse the cache and start faster. Create `~/.cache/vllm` **before** the first run (as in the example above) so it is owned by your user; otherwise the container may create it as root and you can hit permission issues later.

`vm.drop_caches=3` frees **system (CPU) memory** (page cache, etc.); it does **not** free **GPU VRAM**. If you start vLLM a second time while the first container is still running, the GPU has no free VRAM and vLLM will fail with "Free memory on device cuda:0 (...) is less than desired". **Stop the first vLLM container** (e.g. Ctrl+C or `docker stop`) so the driver releases GPU memory, then start again.

> **Port conflict with Riva**: The **Riva container** exposes ports **8000–8002** (and 8888, 50051). If you run both Riva and vLLM on the same machine, use a different vLLM port so they don't clash. The example above uses `--port 8010`; in the app set **LLM API Base** to `http://localhost:8010/v1`. If Riva is not running, `--port 8000` is fine.
>
> **Memory tuning**: On shared-memory systems (Jetson), lower `--gpu-memory-utilization` to leave room for the OS, Riva, and the application. On discrete GPUs with dedicated VRAM, `0.8` is safe.
>
> **Desktop GPU / x86_64**: Use `vllm/vllm-openai:latest` or `nvcr.io/nvidia/vllm:latest` instead of the Jetson image.

### vLLM troubleshooting

#### `OSError: [Errno 98] Address already in use`

vLLM fails at startup with `sock.bind(addr) OSError: [Errno 98] Address already in use` when the API port (default **8000**) is already taken—for example by a previous vLLM run, another container, or another service.

**1. Find what is using the port**

```bash
# Default vLLM port is 8000; use your --port if different
lsof -i :8000
# or
ss -tlnp | grep ':8000'
# or
fuser 8000/tcp
```

If **`ss` shows port 8000 in LISTEN but `lsof` and `fuser` show no PID**, the process is usually **inside a Docker container**. List containers and look for one that has port 8000:

```bash
docker ps -a
# Look for a container with 0.0.0.0:8000->8000/tcp or similar in PORTS
```

**2. Free the port or use another**

- **Riva is using 8000** (container `riva-speech` exposes 8000–8002): Don't stop Riva. Start vLLM on a different port and point the app to it:
```bash
# In the vllm serve command, use e.g.:
--port 8010

# In Multi-modal AI Studio, set LLM API Base to:
# http://localhost:8010/v1
```
- **Another Docker container** (e.g. leftover vLLM): Stop and remove it if you don't need it:
```bash
docker ps -a
docker stop <container_id_or_name>
docker rm <container_id_or_name>
# or: docker rm -f <container_id_or_name>
```
- **Process on the host** (when lsof/fuser show a PID): Kill it:
```bash
kill <PID>
# or: fuser -k 8000/tcp
```

**3. Use a different port**

If you need to keep whatever is on 8000, start vLLM with `--port 8010` (or another free port) and set the app's **LLM API Base** to `http://localhost:8010/v1`.
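
The port checks above can be wrapped in a small helper. This is only a sketch: `port_in_use` and `first_free_port` are hypothetical names, not part of vLLM or the app, and the probe relies on bash's `/dev/tcp` pseudo-device, so it only detects listeners that accept connections on `127.0.0.1`.

```bash
# port_in_use PORT: succeeds if something accepts TCP connections on 127.0.0.1:PORT.
# Uses bash's /dev/tcp pseudo-device, so run it under bash rather than plain sh.
port_in_use() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# first_free_port PORT...: print the first candidate that nothing is listening on.
# Returns 1 if every candidate is taken (or no candidates were given).
first_free_port() {
  local port
  for port in "$@"; do
    if ! port_in_use "$port"; then
      echo "$port"
      return 0
    fi
  done
  return 1
}

# Example: prefer 8000, fall back to 8010 when Riva already holds 8000.
# VLLM_PORT=$(first_free_port 8000 8010 8011) && echo "use --port $VLLM_PORT"
```

Whatever port comes back has to be used in both places: the `--port` argument of `vllm serve` and the app's **LLM API Base**.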

#### `ValueError: Free memory on device cuda:0 (...) is less than desired GPU memory utilization`

Another process (often a **previous vLLM container**) is still using the GPU, so there isn’t enough free VRAM. Stop the other process: if the first vLLM was started in another terminal, press **Ctrl+C** there, or run `docker ps` and `docker stop <container_id>`. The driver may take **30–60 seconds** to release VRAM after the container exits; run `nvidia-smi` and wait until free memory is back to normal before starting vLLM again. `vm.drop_caches=3` only frees system RAM, not GPU VRAM.
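
Rather than re-running `nvidia-smi` by hand, the wait can be scripted. The sketch below assumes `nvidia-smi` reports a numeric `memory.free` value on your platform (some boards print `[N/A]`); `gpu_free_mib` and `wait_for_vram` are hypothetical helper names, and the threshold you pass is your own estimate of what vLLM needs.

```bash
# gpu_free_mib: print the free memory of the first GPU in MiB.
gpu_free_mib() {
  nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n 1
}

# wait_for_vram MIN_MIB [TIMEOUT_S] [INTERVAL_S]
# Returns 0 once at least MIN_MIB is free, 1 on timeout,
# and 2 if the reading is not a plain number (e.g. "[N/A]").
wait_for_vram() {
  local min_mib=$1 timeout=${2:-90} interval=${3:-5} waited=0 free
  while :; do
    free=$(gpu_free_mib | tr -d '[:space:]')
    case $free in
      ''|*[!0-9]*)
        echo "cannot read free GPU memory: '$free'" >&2
        return 2
        ;;
    esac
    if [ "$free" -ge "$min_mib" ]; then
      echo "free GPU memory: ${free} MiB"
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Example: wait up to 90 s for 20000 MiB before restarting vLLM.
# wait_for_vram 20000 90 && echo "safe to start vLLM"
```

The default 90-second timeout is an assumption chosen to sit above the 30–60 second release window mentioned above.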

#### `ValidationError: Invalid repository ID or local directory specified: '/models/...'`

vLLM fails during startup with a message like **Invalid repository ID or local directory specified: '/models/cosmos-reason2-8b'** when the model path inside the container is missing, wrong, or doesn't contain the expected config files.

**1. Check the model directory on the host**

Ensure `MODEL_PATH` points to the directory that contains the model files (e.g. `config.json` for Hugging Face–style models):

```bash
echo $MODEL_PATH
ls -la "$MODEL_PATH"
# Must contain at least: config.json (and usually model weights, tokenizer files, etc.)
```

If the directory is missing or empty, download the model first (see [Cosmos-Reason2-8B on Jetson AI Lab](https://www.jetson-ai-lab.com/models/cosmos-reason2-8b/) or your model’s instructions).

**2. Check the volume mount**

The `docker run` command must mount that host path into the container path vLLM uses:

```bash
# Example: host path -> container path /models/cosmos-reason2-8b
-v $MODEL_PATH:/models/cosmos-reason2-8b:ro
```

- Use an **absolute path** for `MODEL_PATH` (e.g. `$HOME/.cache/huggingface/hub/...`), not a relative one, so the mount is correct from any working directory.
- The path after the colon must match the path you pass to `vllm serve` (e.g. `vllm serve /models/cosmos-reason2-8b`).

**3. Verify the container sees the files**

Run a quick check that the mounted directory exists and has a config inside the container:

```bash
docker run --rm -v "$MODEL_PATH:/models/cosmos-reason2-8b:ro" \
ghcr.io/nvidia-ai-iot/vllm:0.14.0-r38.3-arm64-sbsa-cu130-24.04 \
ls -la /models/cosmos-reason2-8b
```

You should see `config.json` and other model files. If the list is empty or "No such file or directory", fix `MODEL_PATH` or the mount path and try again.
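
The host-side checks from steps 1 and 2 can be combined into one pre-flight function. This is a minimal sketch; `check_model_dir` is a hypothetical helper name, and it only tests for `config.json`, not for weights or tokenizer files.

```bash
# check_model_dir DIR: succeed only if DIR is an absolute path to a
# directory that contains config.json.
check_model_dir() {
  local dir=$1
  case $dir in
    /*) ;;
    *)
      echo "not an absolute path: '$dir'" >&2
      return 1
      ;;
  esac
  if [ ! -d "$dir" ]; then
    echo "no such directory: $dir" >&2
    return 1
  fi
  if [ ! -f "$dir/config.json" ]; then
    echo "config.json missing in $dir" >&2
    return 1
  fi
  echo "model directory looks usable: $dir"
}

# Example: only start the container when the check passes.
# check_model_dir "$MODEL_PATH" && echo "ok to run docker"
```

Running it before `docker run` catches an unset or relative `MODEL_PATH` without waiting for vLLM to start and fail.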

#### Fix Hugging Face cache permissions (root-owned)

If `~/.cache/huggingface` or `~/.cache/huggingface/hub` is owned by **root** (e.g. created by [jetson-containers](https://github.com/dusty-nv/jetson-containers) or another tool running with `sudo`), commands run as your user (NGC CLI, Python, Hugging Face libraries) will get **Permission denied** when writing there.

**Fix:** make the cache tree owned by the current user:

```bash
sudo chown -R $USER:$USER ~/.cache/huggingface
```

Then retry the download or command that was failing. To avoid the issue in the future, create the directory as your user before any tool that might run as root: `mkdir -p ~/.cache/huggingface/hub`.
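
To spot the problem before a download fails, you can check ownership up front. The sketch below uses hypothetical helper names (`not_owned_by_me`, `ensure_user_dir`) and GNU `find`'s `-quit`, which should be available on the Linux hosts this guide targets; it only reports the issue and leaves the `chown` to you.

```bash
# not_owned_by_me DIR: print the first entry under DIR that the current user
# does not own; prints nothing when the whole tree is yours.
not_owned_by_me() {
  find "$1" ! -user "$(id -u)" -print -quit 2>/dev/null
}

# ensure_user_dir DIR: create DIR as the current user if it is missing,
# then fail with a hint if anything inside is owned by someone else.
ensure_user_dir() {
  local dir=$1 offender
  mkdir -p "$dir" || return 1
  offender=$(not_owned_by_me "$dir")
  if [ -n "$offender" ]; then
    echo "not owned by the current user: $offender" >&2
    echo "fix with: sudo chown -R \$USER:\$USER $dir" >&2
    return 1
  fi
}

# Example: check both caches used in this guide before the first run.
# ensure_user_dir "$HOME/.cache/huggingface/hub" && ensure_user_dir "$HOME/.cache/vllm"
```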

#### NGC Cosmos model download fails (Completed: 0, Failed: N)

If `ngc registry model download-version "nim/nvidia/cosmos-reason2-8b:1208-fp8-static-kv8" --dest ~/.cache/huggingface/hub` fails with **Completed: 0, Failed: 14**:

- **Permission denied when writing files:** If the debug log shows `[Errno 13] Permission denied` for paths under `~/.cache/huggingface/hub/`, the destination is likely **root-owned**. See [Fix Hugging Face cache permissions (root-owned)](#fix-hugging-face-cache-permissions-root-owned) above: run `sudo chown -R $USER:$USER ~/.cache/huggingface`, then retry. If you only need to fix the model subdirectory: `sudo chown -R $USER:$USER ~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8` (and ensure the parent `hub` is writable). Alternatively remove the partial dir and re-download: `rm -rf ~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8` then run the same `ngc registry model download-version ...` again, or use a different `--dest` you can write to.

- **403 or auth/org errors:** If the debug log shows **403** or org/entitlement errors (rather than Permission denied), try setting the effective org to **nvidia**. The NGC CLI uses **`NGC_CLI_ORG`** from the environment. Example for `~/.bashrc`:
```bash
export NGC_CLI_ORG=nvidia
# optional: export NGC_CLI_API_KEY=<your-key>
```
Then in the same shell (or a new terminal after `source ~/.bashrc`):
```bash
ngc config current # check effective org
ngc registry model download-version "nim/nvidia/cosmos-reason2-8b:1208-fp8-static-kv8" --dest ~/.cache/huggingface/hub
export MODEL_PATH="${HOME}/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8"
```
If the env var is not picked up, use `--org nvidia` on the command (next bullet). The download often succeeds with the default org (e.g. with `NGC_CLI_ORG` unset); only try `nvidia` if you see 403 or org/entitlement errors.

- **Explicit org/team:** `ngc registry model download-version "nim/nvidia/cosmos-reason2-8b:1208-fp8-static-kv8" --org nim --team nvidia --dest ~/.cache/huggingface/hub`

- **Browser:** If you can download from the [catalog page](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/models/cosmos-reason2-8b?version=1208-fp8-static-kv8) in the browser, save the files into `~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8/` and set `MODEL_PATH` to that directory.

- **Different machine:** If the same API key and `NGC_CLI_ORG=nvidia` work on one host but not another, the failing host may differ by network, NGC CLI version, or backend. Run with **`--debug`** to see the underlying error: `ngc --debug registry model download-version "nim/nvidia/cosmos-reason2-8b:1208-fp8-static-kv8" --dest ~/.cache/huggingface/hub`. Reliable workaround: **copy the model from the working machine** (e.g. from jat03): `rsync -avz jetson@jat03-iso384:~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8/ ~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8/` then set `MODEL_PATH` to that directory.


### Option C: OpenAI API

No local setup needed. Set **API Base** to `https://api.openai.com/v1`, provide your API key, and choose a model (`gpt-4o` for vision, `gpt-4o-mini` for text).
@@ -395,9 +537,9 @@ No local setup needed. Set **API Base** to `https://api.openai.com/v1`, provide
# Ollama
curl -s http://localhost:11434/v1/models | python3 -m json.tool

# vLLM
curl -s http://localhost:8000/v1/models | python3 -m json.tool
curl -s http://localhost:8000/health && echo "READY" || echo "NOT READY"
# vLLM (use your port if different, e.g. 8010 when Riva uses 8000)
curl -s http://localhost:8010/v1/models | python3 -m json.tool
curl -s http://localhost:8010/health && echo "READY" || echo "NOT READY"
```
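
Because the first vLLM start compiles kernels, the endpoint may take a while to come up. A polling loop like the sketch below can stand in for repeating the `curl` by hand. `wait_for_http` is a hypothetical helper name, the 300-second default is an assumption, and the URL should use whatever port you passed to `--port`.

```bash
# wait_for_http URL [TIMEOUT_S] [INTERVAL_S]: poll until URL answers with a 2xx status.
# curl -f turns HTTP errors (4xx/5xx) into a non-zero exit status; with plain
# `curl -s`, any HTTP response counts as success.
wait_for_http() {
  local url=$1 timeout=${2:-300} interval=${3:-5} waited=0
  until curl -sf -o /dev/null --max-time 5 "$url"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "NOT READY after ${waited}s: $url" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "READY: $url"
}

# Example: wait for vLLM on port 8010, then list the served models.
# wait_for_http http://localhost:8010/health && curl -s http://localhost:8010/v1/models
```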

### Using Vision
105 changes: 41 additions & 64 deletions README.md
@@ -2,50 +2,47 @@

![](./docs/images/screenshot_example_2.png)

**Voice, Text, and Video AI Interface with Advanced Performance Analysis**
**Voice, text, and video conversational AI with session analysis and latency metrics**

Multi-modal AI Studio is a next-generation conversational AI interface designed for analyzing and optimizing voice AI systems. Built on NVIDIA Riva, OpenAI APIs, and other backends, it features sophisticated session management, real-time timeline visualization, and comprehensive latency metrics.
Multi-modal AI Studio is a conversational AI interface for building and tuning voice AI systems. It supports NVIDIA Riva, OpenAI, and other backends; records sessions with full config snapshots; and provides a real-time timeline and latency analysis (TTFA, turn-taking) to compare and optimize setups.

## 🌟 Key Features

### Multi-modal Support
- **Voice Input/Output**: Streaming ASR and TTS via Riva or OpenAI
- **Text Chat**: Traditional text-based conversation
- **Video**: Camera feed for vision-enabled models (future)
- **Mixed Modes**: Voice-to-text, text-to-voice, or text-only
- **Voice**: Streaming ASR and TTS (Riva, OpenAI, or other backends)
- **Text**: Chat-only mode or combined with voice
- **Video**: Camera feed for vision-language models (VLM); browser WebRTC or server USB webcam
- **Mixed modes**: Voice-to-text, text-to-voice, voice-to-voice, or text-only

### Multi-backend Architecture
- **NVIDIA Riva**: gRPC streaming ASR/TTS
- **OpenAI**: REST API (Whisper, TTS) and Realtime API
- **Azure Speech**: Coming soon
- **Custom backends**: Extensible plugin system
- Speech
- **NVIDIA Riva**: gRPC streaming ASR/TTS (Jetson/ARM64)
- **OpenAI-compatible Realtime API**: Realtime API
- LLM: **OpenAI-compatible** REST API, which works with many inference engines and a wide range of LLM/VLM models
- **Extensible**: Plugin-style backends; Azure Speech and others can be added

### Session Management
- **Configuration Snapshots**: Every session saves ASR/LLM/TTS configs
- **Timeline Recording**: Store performance data for offline analysis
- **Preset System**: Save and load configuration presets
- **Export/Import**: Generate CLI commands or YAML configs from WebUI
- **Config snapshots**: Every session stores ASR/LLM/TTS and device settings
- **Timeline recording**: Performance data for offline analysis
- **Presets**: Save and load configuration presets

### Performance Analysis
- **Real-time Timeline**: Multi-lane visualization (Audio, Speech, LLM, TTS)
- **Latency Metrics**: TTFA (Time to First Audio), turn-taking analysis
- **Comparison Mode**: Compare multiple sessions to optimize configs
- **Session Replay**: Analyze recorded timeline data
- **Real-time timeline**: Multi-lane view (Audio, Speech, LLM, TTS)
- **Latency metrics**: TTFA (Time to First Audio), turn-taking

### Flexible Deployment
- **WebUI Mode**: Rich browser interface (default)
- **Headless Mode**: CLI-only for production/automation (not yet implemented)
- **Audio/Video devices**: **Currently supported:** browser devices via WebRTC (mic, speaker, camera through the browser). **Not yet supported:** local USB microphone, USB speaker, or USB webcam attached to the server machine.
### UI & Devices
- **Chat-style UI**: Familiar layout, video full-screen mode, keyboard shortcuts. Most settings are exposed in the UI (ASR/LLM/TTS, models, devices) so you can tweak and switch backends without editing config files or code.
- **Devices**: Client-side (browser WebRTC) and server-side (Linux USB mic, USB speaker, USB webcam); choose in the Devices tab.
- **Headless** (experimental, not well tested): CLI with config file or args; see [INSTALL.md](INSTALL.md).

## 🚀 Quick Start

### Prerequisites

- Python 3.8+
- **Audio/video**: Use the app in a browser; mic, speaker, and camera are accessed via WebRTC (browser devices). Local USB mic/speaker/webcam on the server are not supported yet.
- NVIDIA Riva (for Riva backend) - see [INSTALL.md](INSTALL.md#nvidia-riva-setup-for-voice-asrtts)
- OpenAI API key (for OpenAI backend) - optional
- **Optional**: `jq` (e.g. `apt install jq` or `brew install jq`) for pretty-formatted LLM request/response logs in the server console; without it, logs use plain JSON
- **Python 3.8+**
- **Audio/video**: Browser (WebRTC) for mic, speaker, and camera. On Linux, server **USB microphone**, **USB speaker**, and **USB webcam** are also supported; see [INSTALL.md](INSTALL.md).
- **Backends (as needed)**: [NVIDIA Riva](INSTALL.md#nvidia-riva-setup-for-voice-asrtts) for ASR/TTS; OpenAI API key for OpenAI/Realtime backends (optional).
- **Optional**: `jq` for pretty-printed LLM logs in the console (`apt install jq` or `brew install jq`).

### Installation

@@ -72,27 +69,9 @@ Full steps and troubleshooting: [INSTALL.md](INSTALL.md)
```bash
# View sessions and timeline (no backend required)
python -m multi_modal_ai_studio --port 8092

# With Riva ASR/TTS (use --asr-server and --tts-server)
python -m multi_modal_ai_studio \
--port 8092 \
--asr-server localhost:50051 \
--tts-server localhost:50051 \
--llm-api-base http://localhost:11434/v1 \
--llm-model llama3.2:3b

# With OpenAI Realtime API
python -m multi_modal_ai_studio \
--port 8092 \
--asr-scheme openai-realtime \
--tts-scheme openai-realtime \
--llm-api-key sk-...

# With preset
python -m multi_modal_ai_studio --preset low-latency
```

Open **http://localhost:8092** in your browser.
Open **http://localhost:8092** in your browser. For voice (Riva, OpenAI, etc.) and other options, see [INSTALL.md](INSTALL.md).

### Kill a Running Server

@@ -107,30 +86,31 @@ lsof -i :8092
kill <PID>
```

**Sessions and sample data**
By default the app loads and saves sessions in `sessions/`. To view or use the sample/mock session JSONs (e.g. in `mock_sessions/`), run with `--session-dir mock_sessions`. Open the app, then click a session in the sidebar to view its config and timeline.
### Sessions and sample data

Sessions are stored in `sessions/` by default. To try sample timelines, run with `--session-dir mock_sessions` and open a session from the sidebar.

### Run Headless
### Run headless (experimental)

CLI-only mode for automation or local audio devices. Requires the `[audio]` extra and device setup; see [INSTALL.md](INSTALL.md).

```bash
# From config file
python -m multi_modal_ai_studio --mode headless --config my-config.yaml

# From CLI args
python -m multi_modal_ai_studio \
--mode headless \
--audio-input alsa:hw:0,0 \
--audio-output alsa:hw:1,0 \
--asr-scheme riva \
--llm-model llama3.2:3b
# Or with CLI args (e.g. ALSA devices)
python -m multi_modal_ai_studio --mode headless \
--audio-input alsa:hw:0,0 --audio-output alsa:hw:1,0 \
--asr-scheme riva --llm-model llama3.2:3b
```

## 📖 Documentation

- [VLM Guide](docs/vlm_guide.md) — Vision-Language Model setup, input modes, frame capture, and tuning
- [Riva Setup](docs/setup_riva.md) — NVIDIA Riva ASR/TTS installation and configuration
- [Architecture](docs/architecture.md) — System design and component overview
- [Installation](INSTALL.md) — Full installation steps and troubleshooting
| Doc | Description |
|-----|-------------|
| [INSTALL.md](INSTALL.md) | Installation, backends, and troubleshooting |
| [Riva Setup](docs/setup_riva.md) | NVIDIA Riva ASR/TTS (Jetson/ARM64) |
| [VLM Guide](docs/vlm_guide.md) | Vision-language models, frame capture, tuning |
| [Architecture](docs/architecture.md) | System design and components |

## 🤝 Contributing

@@ -140,6 +120,3 @@ This project is under active development. Issues, pull requests, and feedback ar

Apache License 2.0 - See [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

Built on top of proven concepts from [Live RIVA WebUI](https://github.com/yourusername/live-riva-webui).