NVIDIA-AI-IOT · tokk-nv · Mar 14, 2026
diff --git a/INSTALL.md b/INSTALL.md
@@ -361,15 +361,18 @@ Ollama serves on `http://localhost:11434/v1` by default. In the UI, set:
 
 vLLM provides high-throughput serving with GPU acceleration. Example with Cosmos-Reason2 for vision:
 
-**[Cosmos-Reason2-8B on Jetson AI Lab](https://www.jetson-ai-lab.com/models/cosmos-reason2-8b/)** — full setup including model download and platform-specific Docker images.
+**[Cosmos-Reason2-8B on Jetson AI Lab](https://www.jetson-ai-lab.com/models/cosmos-reason2-8b/)** — full setup including model download and platform-specific Docker images. The FP8 model is downloaded from NGC; you need an NGC account with access to the **nim** org (and often the **nvidia** team). If NGC download fails, see [NGC Cosmos model download fails](#ngc-cosmos-model-download-fails-completed-0-failed-n) below.
 
 Quick reference for Jetson Thor (after downloading the FP8 model per the link above):
 
 ```bash
 export MODEL_PATH="${HOME}/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8"
 
+mkdir -p ~/.cache/vllm
+sudo sysctl -w vm.drop_caches=3
 sudo docker run -it --rm --runtime=nvidia --network host \
   -v $MODEL_PATH:/models/cosmos-reason2-8b:ro \
+  -v ${HOME}/.cache/vllm:/root/.cache/vllm \
   ghcr.io/nvidia-ai-iot/vllm:0.14.0-r38.3-arm64-sbsa-cu130-24.04 \
   vllm serve /models/cosmos-reason2-8b \
     --served-model-name nvidia/cosmos-reason2-8b-fp8 \
@@ -378,13 +381,152 @@ sudo docker run -it --rm --runtime=nvidia --network host \
     --reasoning-parser qwen3 \
     --media-io-kwargs '{"video": {"num_frames": -1}}' \
     --enable-prefix-caching \
-    --port 8000
+    --port 8010
 ```
 
+The second volume `-v ${HOME}/.cache/vllm:/root/.cache/vllm` persists vLLM’s **torch.compile cache** on the host. The first run compiles kernels and writes them there; later runs reuse the cache and start faster. Create `~/.cache/vllm` **before** the first run (as in the example above) so it is owned by your user; otherwise the container may create it as root and you can hit permission issues later.
+
+`vm.drop_caches=3` frees reclaimable **system (CPU) memory** (page cache, etc.); it does **not** free **GPU VRAM** that a running process still holds. If you start vLLM a second time while the first container is still running, the GPU does not have enough free VRAM and vLLM fails with "Free memory on device cuda:0 (...) is less than desired". **Stop the first vLLM container** (e.g. Ctrl+C or `docker stop`) so the driver releases GPU memory, then start again.
+
+> **Port conflict with Riva**: The **Riva container** exposes ports **8000–8002** (and 8888, 50051). If you run both Riva and vLLM on the same machine, use a different vLLM port so they don't clash. The example above uses `--port 8010`; in the app set **LLM API Base** to `http://localhost:8010/v1`. If Riva is not running, `--port 8000` is fine.
+>
 > **Memory tuning**: On shared-memory systems (Jetson), lower `--gpu-memory-utilization` to leave room for the OS, Riva, and the application. On discrete GPUs with dedicated VRAM, `0.8` is safe.
 >
 > **Desktop GPU / x86_64**: Use `vllm/vllm-openai:latest` or `nvcr.io/nvidia/vllm:latest` instead of the Jetson image.
 
+### vLLM troubleshooting
+
+#### `OSError: [Errno 98] Address already in use`
+
+vLLM fails at startup with `sock.bind(addr) OSError: [Errno 98] Address already in use` when the API port (default **8000**) is already taken—for example by a previous vLLM run, another container, or another service.
+
+**1. Find what is using the port**
+
+```bash
+# Default vLLM port is 8000; use your --port if different
+lsof -i :8000
+# or
+ss -tlnp | grep ':8000'
+# or
+fuser 8000/tcp
+```
+
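+If **`ss` shows port 8000 in LISTEN but `lsof` and `fuser` show no PID**, the listener usually belongs to another user (typically root), most often a process **inside a Docker container**. Rerun the commands with `sudo` to see the PID, or list containers and look for one that uses port 8000: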
+
+```bash
+docker ps -a
+# Look for a container with 0.0.0.0:8000->8000/tcp or similar in PORTS (PORTS is empty for containers started with --network host)
+```
+
+**2. Free the port or use another**
+
+- **Riva is using 8000** (container `riva-speech` exposes 8000–8002): Don't stop Riva. Start vLLM on a different port and point the app to it:
+  ```bash
+  # In the vllm serve command, use e.g.:
+  --port 8010
+
+  # In Multi-modal AI Studio, set LLM API Base to:
+  # http://localhost:8010/v1
+  ```
+- **Another Docker container** (e.g. leftover vLLM): Stop and remove it if you don't need it:
+  ```bash
+  docker ps -a
+  docker stop <container_id_or_name>
+  docker rm <container_id_or_name>
+  # or: docker rm -f <container_id_or_name>
+  ```
+- **Process on the host** (when lsof/fuser show a PID): Kill it:
+  ```bash
+  kill <PID>
+  # or: fuser -k 8000/tcp
+  ```
+
+**3. Use a different port**
+
+If you need to keep whatever is on 8000, start vLLM with `--port 8010` (or another free port) and set the app's **LLM API Base** to `http://localhost:8010/v1`.
+
+#### `ValueError: Free memory on device cuda:0 (...) is less than desired GPU memory utilization`
+
+Another process (often a **previous vLLM container**) is still using the GPU, so there isn’t enough free VRAM. Stop the other process: if the first vLLM was started in another terminal, press **Ctrl+C** there, or run `docker ps` and `docker stop <container_id>`. The driver may take **30–60 seconds** to release VRAM after the container exits; run `nvidia-smi` and wait until free memory is back to normal before starting vLLM again. `vm.drop_caches=3` only frees system RAM, not GPU VRAM.
+
+#### `ValidationError: Invalid repository ID or local directory specified: '/models/...'`
+
+vLLM fails during startup with a message like **Invalid repository ID or local directory specified: '/models/cosmos-reason2-8b'** when the model path inside the container is missing, wrong, or doesn't contain the expected config files.
+
+**1. Check the model directory on the host**
+
+Ensure `MODEL_PATH` points to the directory that contains the model files (e.g. `config.json` for Hugging Face–style models):
+
+```bash
+echo "$MODEL_PATH"
+ls -la "$MODEL_PATH"
+# Must contain at least: config.json (and usually model weights, tokenizer files, etc.)
+```
+
+If the directory is missing or empty, download the model first (see [Cosmos-Reason2-8B on Jetson AI Lab](https://www.jetson-ai-lab.com/models/cosmos-reason2-8b/) or your model’s instructions).
+
+**2. Check the volume mount**
+
+The `docker run` command must mount that host path into the container path vLLM uses:
+
+```bash
+# Example: host path -> container path /models/cosmos-reason2-8b
+-v "$MODEL_PATH:/models/cosmos-reason2-8b:ro"
+```
+
+- Use an **absolute path** for `MODEL_PATH` (e.g. `$HOME/.cache/huggingface/hub/...`), not a relative one, so the mount is correct from any working directory.
+- The path after the colon must match the path you pass to `vllm serve` (e.g. `vllm serve /models/cosmos-reason2-8b`).
+
+**3. Verify the container sees the files**
+
+Run a quick check that the mounted directory exists and has a config inside the container:
+
+```bash
+docker run --rm -v "$MODEL_PATH:/models/cosmos-reason2-8b:ro" \
+  ghcr.io/nvidia-ai-iot/vllm:0.14.0-r38.3-arm64-sbsa-cu130-24.04 \
+  ls -la /models/cosmos-reason2-8b
+```
+
+You should see `config.json` and other model files. If the listing is empty or you get "No such file or directory", fix `MODEL_PATH` or the mount path and try again.
+
+#### Fix Hugging Face cache permissions (root-owned)
+
+If `~/.cache/huggingface` or `~/.cache/huggingface/hub` is owned by **root** (e.g. created by [jetson-containers](https://github.com/dusty-nv/jetson-containers) or another tool running with `sudo`), commands run as your user (NGC CLI, Python, Hugging Face libraries) will get **Permission denied** when writing there.
+
+**Fix:** make the cache tree owned by the current user:
+
+```bash
+sudo chown -R $USER:$USER ~/.cache/huggingface
+```
+
+Then retry the download or command that was failing. To avoid the issue in the future, create the directory as your user before running any tool that might run as root: `mkdir -p ~/.cache/huggingface/hub`.
+
+#### NGC Cosmos model download fails (Completed: 0, Failed: N)
+
+If `ngc registry model download-version "nim/nvidia/cosmos-reason2-8b:1208-fp8-static-kv8" --dest ~/.cache/huggingface/hub` fails with **Completed: 0, Failed: 14**:
+
+- **Permission denied when writing files:** If the debug log shows `[Errno 13] Permission denied` for paths under `~/.cache/huggingface/hub/`, the destination is likely **root-owned**. Run `sudo chown -R $USER:$USER ~/.cache/huggingface` (see [Fix Hugging Face cache permissions (root-owned)](#fix-hugging-face-cache-permissions-root-owned) above), then retry. To fix only the model subdirectory, run `sudo chown -R $USER:$USER ~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8` and make sure the parent `hub` directory is writable. Alternatively, remove the partial directory with `rm -rf ~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8` and rerun the same `ngc registry model download-version ...` command, or use a different `--dest` that you can write to.
+
+- **403 or auth/org errors:** If the debug log shows **403** or org/entitlement errors (rather than Permission denied), try setting the effective org to **nvidia**. The NGC CLI uses **`NGC_CLI_ORG`** from the environment. Example for `~/.bashrc`:
+  ```bash
+  export NGC_CLI_ORG=nvidia
+  # optional: export NGC_CLI_API_KEY=<your-key>
+  ```
+  Then in the same shell (or a new terminal after `source ~/.bashrc`):
+  ```bash
+  ngc config current   # check effective org
+  ngc registry model download-version "nim/nvidia/cosmos-reason2-8b:1208-fp8-static-kv8" --dest ~/.cache/huggingface/hub
+  export MODEL_PATH="${HOME}/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8"
+  ```
+  If the env var is not picked up, pass the org explicitly on the command line (see the next bullet). The download often succeeds with the default org (e.g. with `NGC_CLI_ORG` unset); only try `nvidia` if you see 403 or org/entitlement errors.
+
+- **Explicit org/team:** `ngc registry model download-version "nim/nvidia/cosmos-reason2-8b:1208-fp8-static-kv8" --org nim --team nvidia --dest ~/.cache/huggingface/hub`
+
+- **Browser:** If you can download from the [catalog page](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/models/cosmos-reason2-8b?version=1208-fp8-static-kv8) in the browser, save the files into `~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8/` and set `MODEL_PATH` to that directory.
+
+- **Different machine:** If the same API key and `NGC_CLI_ORG=nvidia` work on one host but not another, the failing host may differ in network, NGC CLI version, or backend. Run with **`--debug`** to see the underlying error: `ngc --debug registry model download-version "nim/nvidia/cosmos-reason2-8b:1208-fp8-static-kv8" --dest ~/.cache/huggingface/hub`. A reliable workaround is to **copy the model from the working machine**: `rsync -avz <user>@<working-host>:~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8/ ~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8/`, then set `MODEL_PATH` to that directory.
+
+
 ### Option C: OpenAI API
 
 No local setup needed. Set **API Base** to `https://api.openai.com/v1`, provide your API key, and choose a model (`gpt-4o` for vision, `gpt-4o-mini` for text).
@@ -395,9 +537,9 @@ No local setup needed. Set **API Base** to `https://api.openai.com/v1`, provide
 # Ollama
 curl -s http://localhost:11434/v1/models | python3 -m json.tool
 
-# vLLM
-curl -s http://localhost:8000/v1/models | python3 -m json.tool
-curl -s http://localhost:8000/health && echo "READY" || echo "NOT READY"
+# vLLM (use your port if different, e.g. 8010 when Riva uses 8000)
+curl -s http://localhost:8010/v1/models | python3 -m json.tool
+curl -s http://localhost:8010/health && echo "READY" || echo "NOT READY"
 ```
 
 ### Using Vision
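
Note on the INSTALL.md changes above: the checks for a missing `config.json` and a root-owned cache directory could also be run as a preflight step before `docker run`. The following is a minimal sketch, assuming bash; the function names `check_model_dir` and `check_cache_dir` are hypothetical and are not part of this repository.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helpers (illustrative names, not part of the repo).

# check_model_dir DIR
# Succeeds only if DIR exists and contains config.json.
check_model_dir() {
  local dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing model directory: $dir" >&2
    return 1
  fi
  if [ ! -f "$dir/config.json" ]; then
    echo "no config.json in: $dir" >&2
    return 1
  fi
  return 0
}

# check_cache_dir DIR
# Creates DIR if needed, then succeeds only if the current user
# owns it and can write to it.
check_cache_dir() {
  local dir="$1"
  if ! mkdir -p "$dir" 2>/dev/null; then
    echo "cannot create: $dir" >&2
    return 1
  fi
  if [ ! -O "$dir" ] || [ ! -w "$dir" ]; then
    echo "not owned or not writable by $(id -un): $dir" >&2
    return 1
  fi
  return 0
}
```

One possible use is `check_model_dir "$MODEL_PATH" && check_cache_dir "$HOME/.cache/vllm"` right before the `docker run` command, so a wrong path or a root-owned cache is reported on the host instead of surfacing as a vLLM startup error.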

diff --git a/README.md b/README.md
@@ -2,50 +2,47 @@
 
 ![](./docs/images/screenshot_example_2.png)
 
-**Voice, Text, and Video AI Interface with Advanced Performance Analysis**
+**Voice, text, and video conversational AI with session analysis and latency metrics**
 
-Multi-modal AI Studio is a next-generation conversational AI interface designed for analyzing and optimizing voice AI systems. Built on NVIDIA Riva, OpenAI APIs, and other backends, it features sophisticated session management, real-time timeline visualization, and comprehensive latency metrics.
+Multi-modal AI Studio is a conversational AI interface for building and tuning voice AI systems. It supports NVIDIA Riva, OpenAI, and other backends; records sessions with full config snapshots; and provides a real-time timeline and latency analysis (TTFA, turn-taking) to compare and optimize setups.
 
 ## 🌟 Key Features
 
 ### Multi-modal Support
-- **Voice Input/Output**: Streaming ASR and TTS via Riva or OpenAI
-- **Text Chat**: Traditional text-based conversation
-- **Video**: Camera feed for vision-enabled models (future)
-- **Mixed Modes**: Voice-to-text, text-to-voice, or text-only
+- **Voice**: Streaming ASR and TTS (Riva, OpenAI, or other backends)
+- **Text**: Chat-only mode or combined with voice
+- **Video**: Camera feed for vision-language models (VLM); browser WebRTC or server USB webcam
+- **Mixed modes**: Voice-to-text, text-to-voice, voice-to-voice, or text-only
 
 ### Multi-backend Architecture
-- **NVIDIA Riva**: gRPC streaming ASR/TTS
-- **OpenAI**: REST API (Whisper, TTS) and Realtime API
-- **Azure Speech**: Coming soon
-- **Custom backends**: Extensible plugin system
+- Speech
+  - **NVIDIA Riva**: gRPC streaming ASR/TTS (Jetson/ARM64)
+  - **OpenAI Realtime API**: ASR/TTS through OpenAI-compatible Realtime endpoints
+- LLM: **OpenAI-compatible** REST API, which works with many inference engines and a wide range of LLM/VLM models
+- **Extensible**: Plugin-style backends; Azure Speech and others can be added
 
 ### Session Management
-- **Configuration Snapshots**: Every session saves ASR/LLM/TTS configs
-- **Timeline Recording**: Store performance data for offline analysis
-- **Preset System**: Save and load configuration presets
-- **Export/Import**: Generate CLI commands or YAML configs from WebUI
+- **Config snapshots**: Every session stores ASR/LLM/TTS and device settings
+- **Timeline recording**: Performance data for offline analysis
+- **Presets**: Save and load configuration presets
 
 ### Performance Analysis
-- **Real-time Timeline**: Multi-lane visualization (Audio, Speech, LLM, TTS)
-- **Latency Metrics**: TTFA (Time to First Audio), turn-taking analysis
-- **Comparison Mode**: Compare multiple sessions to optimize configs
-- **Session Replay**: Analyze recorded timeline data
+- **Real-time timeline**: Multi-lane view (Audio, Speech, LLM, TTS)
+- **Latency metrics**: TTFA (Time to First Audio), turn-taking
 
-### Flexible Deployment
-- **WebUI Mode**: Rich browser interface (default)
-- **Headless Mode**: CLI-only for production/automation (not yet implemented)
-- **Audio/Video devices**: **Currently supported:** browser devices via WebRTC (mic, speaker, camera through the browser). **Not yet supported:** local USB microphone, USB speaker, or USB webcam attached to the server machine.
+### UI & Devices
+- **Chat-style UI**: Familiar layout, video full-screen mode, keyboard shortcuts. Most settings are exposed in the UI (ASR/LLM/TTS, models, devices) so you can tweak and switch backends without editing config files or code.
+- **Devices**: Client-side (browser WebRTC) and server-side (Linux USB mic, USB speaker, USB webcam); choose in the Devices tab.
+- **Headless** (experimental, not well tested): CLI with config file or args; see [INSTALL.md](INSTALL.md).
 
 ## 🚀 Quick Start
 
 ### Prerequisites
 
-- Python 3.8+
-- **Audio/video**: Use the app in a browser; mic, speaker, and camera are accessed via WebRTC (browser devices). Local USB mic/speaker/webcam on the server are not supported yet.
-- NVIDIA Riva (for Riva backend) - see [INSTALL.md](INSTALL.md#nvidia-riva-setup-for-voice-asrtts)
-- OpenAI API key (for OpenAI backend) - optional
-- **Optional**: `jq` (e.g. `apt install jq` or `brew install jq`) for pretty-formatted LLM request/response logs in the server console; without it, logs use plain JSON
+- **Python 3.8+**
+- **Audio/video**: Browser (WebRTC) for mic, speaker, and camera. On Linux, server **USB microphone**, **USB speaker**, and **USB webcam** are also supported; see [INSTALL.md](INSTALL.md).
+- **Backends (as needed)**: [NVIDIA Riva](INSTALL.md#nvidia-riva-setup-for-voice-asrtts) for ASR/TTS; OpenAI API key for OpenAI/Realtime backends (optional).
+- **Optional**: `jq` for pretty-printed LLM logs in the console (`apt install jq` or `brew install jq`).
 
 ### Installation
 
@@ -72,27 +69,9 @@ Full steps and troubleshooting: [INSTALL.md](INSTALL.md)
 ```bash
 # View sessions and timeline (no backend required)
 python -m multi_modal_ai_studio --port 8092
-
-# With Riva ASR/TTS (use --asr-server and --tts-server)
-python -m multi_modal_ai_studio \
-  --port 8092 \
-  --asr-server localhost:50051 \
-  --tts-server localhost:50051 \
-  --llm-api-base http://localhost:11434/v1 \
-  --llm-model llama3.2:3b
-
-# With OpenAI Realtime API
-python -m multi_modal_ai_studio \
-  --port 8092 \
-  --asr-scheme openai-realtime \
-  --tts-scheme openai-realtime \
-  --llm-api-key sk-...
-
-# With preset
-python -m multi_modal_ai_studio --preset low-latency
 ```
 
-Open **http://localhost:8092** in your browser.
+Open **http://localhost:8092** in your browser. For voice (Riva, OpenAI, etc.) and other options, see [INSTALL.md](INSTALL.md).
 
 ### Kill a Running Server
 
@@ -107,30 +86,31 @@ lsof -i :8092
 kill <PID>
 ```
 
-**Sessions and sample data**
-By default the app loads and saves sessions in `sessions/`. To view or use the sample/mock session JSONs (e.g. in `mock_sessions/`), run with `--session-dir mock_sessions`. Open the app, then click a session in the sidebar to view its config and timeline.
+### Sessions and sample data
+
+Sessions are stored in `sessions/` by default. To try sample timelines, run with `--session-dir mock_sessions` and open a session from the sidebar.
 
-### Run Headless
+### Run headless (experimental)
+
+CLI-only mode for automation or local audio devices. Requires the `[audio]` extra and device setup; see [INSTALL.md](INSTALL.md).
 
 ```bash
-# From config file
 python -m multi_modal_ai_studio --mode headless --config my-config.yaml
 
-# From CLI args
-python -m multi_modal_ai_studio \
-  --mode headless \
-  --audio-input alsa:hw:0,0 \
-  --audio-output alsa:hw:1,0 \
-  --asr-scheme riva \
-  --llm-model llama3.2:3b
+# Or with CLI args (e.g. ALSA devices)
+python -m multi_modal_ai_studio --mode headless \
+  --audio-input alsa:hw:0,0 --audio-output alsa:hw:1,0 \
+  --asr-scheme riva --llm-model llama3.2:3b
 ```
 
 ## 📖 Documentation
 
-- [VLM Guide](docs/vlm_guide.md) — Vision-Language Model setup, input modes, frame capture, and tuning
-- [Riva Setup](docs/setup_riva.md) — NVIDIA Riva ASR/TTS installation and configuration
-- [Architecture](docs/architecture.md) — System design and component overview
-- [Installation](INSTALL.md) — Full installation steps and troubleshooting
+| Doc | Description |
+|-----|-------------|
+| [INSTALL.md](INSTALL.md) | Installation, backends, and troubleshooting |
+| [Riva Setup](docs/setup_riva.md) | NVIDIA Riva ASR/TTS (Jetson/ARM64) |
+| [VLM Guide](docs/vlm_guide.md) | Vision-language models, frame capture, tuning |
+| [Architecture](docs/architecture.md) | System design and components |
 
 ## 🤝 Contributing
 
@@ -140,6 +120,3 @@ This project is under active development. Issues, pull requests, and feedback ar
 
 Apache License 2.0 - See [LICENSE](LICENSE) file for details.
 
-## 🙏 Acknowledgments
-
-Built on top of proven concepts from [Live RIVA WebUI](https://github.com/yourusername/live-riva-webui).