diff --git a/INSTALL.md b/INSTALL.md
index 60e2053..00fecc5 100644
--- a/INSTALL.md
+++ b/INSTALL.md
@@ -361,15 +361,18 @@ Ollama serves on `http://localhost:11434/v1` by default. In the UI, set:
 
 vLLM provides high-throughput serving with GPU acceleration. Example with Cosmos-Reason2 for vision:
 
-**[Cosmos-Reason2-8B on Jetson AI Lab](https://www.jetson-ai-lab.com/models/cosmos-reason2-8b/)** — full setup including model download and platform-specific Docker images.
+**[Cosmos-Reason2-8B on Jetson AI Lab](https://www.jetson-ai-lab.com/models/cosmos-reason2-8b/)** — full setup including model download and platform-specific Docker images. The FP8 model is downloaded from NGC; you need an NGC account with access to the **nim** org (and often the **nvidia** team). If NGC download fails, see [NGC Cosmos model download fails](#ngc-cosmos-model-download-fails-completed-0-failed-n) below.
 
 Quick reference for Jetson Thor (after downloading the FP8 model per the link above):
 
 ```bash
 export MODEL_PATH="${HOME}/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8"
 
+mkdir -p ~/.cache/vllm
+sudo sysctl -w vm.drop_caches=3
 sudo docker run -it --rm --runtime=nvidia --network host \
   -v $MODEL_PATH:/models/cosmos-reason2-8b:ro \
+  -v ${HOME}/.cache/vllm:/root/.cache/vllm \
   ghcr.io/nvidia-ai-iot/vllm:0.14.0-r38.3-arm64-sbsa-cu130-24.04 \
   vllm serve /models/cosmos-reason2-8b \
     --served-model-name nvidia/cosmos-reason2-8b-fp8 \
@@ -378,13 +381,152 @@ sudo docker run -it --rm --runtime=nvidia --network host \
     --reasoning-parser qwen3 \
     --media-io-kwargs '{"video": {"num_frames": -1}}' \
     --enable-prefix-caching \
-    --port 8000
+    --port 8010
 ```
 
+The second volume `-v ${HOME}/.cache/vllm:/root/.cache/vllm` persists vLLM’s **torch.compile cache** on the host. The first run compiles kernels and writes them there; later runs reuse the cache and start faster. Create `~/.cache/vllm` **before** the first run (as in the example above) so it is owned by your user; otherwise the container may create it as root and you can hit permission issues later.
+
+`vm.drop_caches=3` frees **system (CPU) memory** (page cache, etc.); it does **not** free **GPU VRAM**. If you start vLLM a second time while the first container is still running, the GPU has no free VRAM and vLLM will fail with "Free memory on device cuda:0 (...) is less than desired". **Stop the first vLLM container** (e.g. Ctrl+C or `docker stop`) so the driver releases GPU memory, then start again.
+
+> **Port conflict with Riva**: The **Riva container** exposes ports **8000–8002** (and 8888, 50051). If you run both Riva and vLLM on the same machine, use a different vLLM port so they don't clash. The example above uses `--port 8010`; in the app set **LLM API Base** to `http://localhost:8010/v1`. If Riva is not running, `--port 8000` is fine.
+>
 > **Memory tuning**: On shared-memory systems (Jetson), lower `--gpu-memory-utilization` to leave room for the OS, Riva, and the application. On discrete GPUs with dedicated VRAM, `0.8` is safe.
 >
 > **Desktop GPU / x86_64**: Use `vllm/vllm-openai:latest` or `nvcr.io/nvidia/vllm:latest` instead of the Jetson image.
 
+### vLLM troubleshooting
+
+#### `OSError: [Errno 98] Address already in use`
+
+vLLM fails at startup with `sock.bind(addr) OSError: [Errno 98] Address already in use` when the API port (default **8000**) is already taken—for example by a previous vLLM run, another container, or another service.
+
+**1. Find what is using the port**
+
+```bash
+# Default vLLM port is 8000; use your --port if different
+lsof -i :8000
+# or
+ss -tlnp | grep ':8000'
+# or
+fuser 8000/tcp
+```
+
+If **`ss` shows port 8000 in LISTEN but `lsof` and `fuser` show no PID**, the listener belongs to another user, usually root, and is most often **a Docker container**. Re-run the commands above with `sudo` to see the PID, or list containers and look for one that has port 8000:
+
+```bash
+docker ps -a
+# Look for 0.0.0.0:8000->8000/tcp or similar in PORTS (PORTS is empty for --network host containers; check IMAGE and COMMAND instead)
+```
+
+**2. Free the port or use another**
+
+- **Riva is using 8000** (container `riva-speech` exposes 8000–8002): Don't stop Riva. Start vLLM on a different port and point the app to it:
+  ```bash
+  # In the vllm serve command, use e.g.:
+  --port 8010
+
+  # In Multi-modal AI Studio, set LLM API Base to:
+  # http://localhost:8010/v1
+  ```
+- **Another Docker container** (e.g. leftover vLLM): Stop and remove it if you don't need it:
+  ```bash
+  docker ps -a
+  docker stop <container_id_or_name>
+  docker rm <container_id_or_name>
+  # or: docker rm -f <container_id_or_name>
+  ```
+- **Process on the host** (when lsof/fuser show a PID): Kill it:
+  ```bash
+  kill <PID>
+  # or: fuser -k 8000/tcp
+  ```
+
+**3. Use a different port**
+
+If you need to keep whatever is on 8000, start vLLM with `--port 8010` (or another free port) and set the app's **LLM API Base** to `http://localhost:8010/v1`.
+
+#### `ValueError: Free memory on device cuda:0 (...) is less than desired GPU memory utilization`
+
+Another process (often a **previous vLLM container**) is still using the GPU, so there isn’t enough free VRAM. Stop the other process: if the first vLLM was started in another terminal, press **Ctrl+C** there, or run `docker ps` and `docker stop <container_id>`. The driver may take **30–60 seconds** to release VRAM after the container exits; run `nvidia-smi` and wait until free memory is back to normal before starting vLLM again. `vm.drop_caches=3` only frees system RAM, not GPU VRAM.
+
+#### `ValidationError: Invalid repository ID or local directory specified: '/models/...'`
+
+vLLM fails during startup with a message like **Invalid repository ID or local directory specified: '/models/cosmos-reason2-8b'** when the model path inside the container is missing, wrong, or doesn't contain the expected config files.
+
+**1. Check the model directory on the host**
+
+Ensure `MODEL_PATH` points to the directory that contains the model files (e.g. `config.json` for Hugging Face–style models):
+
+```bash
+echo "$MODEL_PATH"
+ls -la "$MODEL_PATH"
+# Must contain at least: config.json (and usually model weights, tokenizer files, etc.)
+```
+
+If the directory is missing or empty, download the model first (see [Cosmos-Reason2-8B on Jetson AI Lab](https://www.jetson-ai-lab.com/models/cosmos-reason2-8b/) or your model’s instructions).
+
+**2. Check the volume mount**
+
+The `docker run` command must mount that host path into the container path vLLM uses:
+
+```bash
+# Example: host path -> container path /models/cosmos-reason2-8b
+-v $MODEL_PATH:/models/cosmos-reason2-8b:ro
+```
+
+- Use an **absolute path** for `MODEL_PATH` (e.g. `$HOME/.cache/huggingface/hub/...`), not a relative one, so the mount is correct from any working directory.
+- The path after the colon must match the path you pass to `vllm serve` (e.g. `vllm serve /models/cosmos-reason2-8b`).
+
+**3. Verify the container sees the files**
+
+Run a quick check that the mounted directory exists and has a config inside the container:
+
+```bash
+docker run --rm -v "$MODEL_PATH:/models/cosmos-reason2-8b:ro" \
+  ghcr.io/nvidia-ai-iot/vllm:0.14.0-r38.3-arm64-sbsa-cu130-24.04 \
+  ls -la /models/cosmos-reason2-8b
+```
+
+You should see `config.json` and other model files. If the list is empty or "No such file or directory", fix `MODEL_PATH` or the mount path and try again.
+
+#### Fix Hugging Face cache permissions (root-owned)
+
+If `~/.cache/huggingface` or `~/.cache/huggingface/hub` is owned by **root** (e.g. created by [jetson-containers](https://github.com/dusty-nv/jetson-containers) or another tool running with `sudo`), commands run as your user (NGC CLI, Python, Hugging Face libraries) will get **Permission denied** when writing there.
+
+**Fix:** make the cache tree owned by the current user:
+
+```bash
+sudo chown -R $USER:$USER ~/.cache/huggingface
+```
+
+Then retry the download or command that was failing. To avoid the issue in the future, create the directory as your user before any tool that might run as root: `mkdir -p ~/.cache/huggingface/hub`.
+
+#### NGC Cosmos model download fails (Completed: 0, Failed: N)
+
+If `ngc registry model download-version "nim/nvidia/cosmos-reason2-8b:1208-fp8-static-kv8" --dest ~/.cache/huggingface/hub` fails with **Completed: 0, Failed: 14**:
+
+- **Permission denied when writing files:** If the debug log shows `[Errno 13] Permission denied` for paths under `~/.cache/huggingface/hub/`, the destination is likely **root-owned**. See [Fix Hugging Face cache permissions (root-owned)](#fix-hugging-face-cache-permissions-root-owned) above: run `sudo chown -R $USER:$USER ~/.cache/huggingface`, then retry. If you only need to fix the model subdirectory: `sudo chown -R $USER:$USER ~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8` (and ensure the parent `hub` is writable). Alternatively remove the partial dir and re-download: `rm -rf ~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8` then run the same `ngc registry model download-version ...` again, or use a different `--dest` you can write to.
+
+- **403 or auth/org errors:** If the debug log shows **403** or org/entitlement errors (rather than Permission denied), try setting the effective org to **nvidia**. The NGC CLI uses **`NGC_CLI_ORG`** from the environment. Example for `~/.bashrc`:
+  ```bash
+  export NGC_CLI_ORG=nvidia
+  # optional: export NGC_CLI_API_KEY=<your-key>
+  ```
+  Then in the same shell (or a new terminal after `source ~/.bashrc`):
+  ```bash
+  ngc config current   # check effective org
+  ngc registry model download-version "nim/nvidia/cosmos-reason2-8b:1208-fp8-static-kv8" --dest ~/.cache/huggingface/hub
+  export MODEL_PATH="${HOME}/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8"
+  ```
+  If the env var is not picked up, pass the org and team explicitly with `--org` and `--team` (next bullet). The download often succeeds with the default org (e.g. with `NGC_CLI_ORG` unset); only set `NGC_CLI_ORG=nvidia` if you see 403 or org/entitlement errors.
+
+- **Explicit org/team:** `ngc registry model download-version "nim/nvidia/cosmos-reason2-8b:1208-fp8-static-kv8" --org nim --team nvidia --dest ~/.cache/huggingface/hub`
+
+- **Browser:** If you can download from the [catalog page](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/models/cosmos-reason2-8b?version=1208-fp8-static-kv8) in the browser, save the files into `~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8/` and set `MODEL_PATH` to that directory.
+
+- **Different machine:** If the same API key and `NGC_CLI_ORG=nvidia` work on one host but not another, the failing host may differ by network, NGC CLI version, or backend. Run with **`--debug`** to see the underlying error: `ngc --debug registry model download-version "nim/nvidia/cosmos-reason2-8b:1208-fp8-static-kv8" --dest ~/.cache/huggingface/hub`. Reliable workaround: **copy the model from the working machine** (e.g. from jat03): `rsync -avz jetson@jat03-iso384:~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8/ ~/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8/` then set `MODEL_PATH` to that directory.
+
+
 ### Option C: OpenAI API
 
 No local setup needed. Set **API Base** to `https://api.openai.com/v1`, provide your API key, and choose a model (`gpt-4o` for vision, `gpt-4o-mini` for text).
@@ -395,9 +537,9 @@ No local setup needed. Set **API Base** to `https://api.openai.com/v1`, provide
 # Ollama
 curl -s http://localhost:11434/v1/models | python3 -m json.tool
 
-# vLLM
-curl -s http://localhost:8000/v1/models | python3 -m json.tool
-curl -s http://localhost:8000/health && echo "READY" || echo "NOT READY"
+# vLLM (use your port if different, e.g. 8010 when Riva uses 8000)
+curl -s http://localhost:8010/v1/models | python3 -m json.tool
+curl -s http://localhost:8010/health && echo "READY" || echo "NOT READY"
 ```
 
 ### Using Vision
diff --git a/README.md b/README.md
index 04adad3..29b09b3 100644
--- a/README.md
+++ b/README.md
@@ -2,50 +2,47 @@
 
 ![](./docs/images/screenshot_example_2.png)
 
-**Voice, Text, and Video AI Interface with Advanced Performance Analysis**
+**Voice, text, and video conversational AI with session analysis and latency metrics**
 
-Multi-modal AI Studio is a next-generation conversational AI interface designed for analyzing and optimizing voice AI systems. Built on NVIDIA Riva, OpenAI APIs, and other backends, it features sophisticated session management, real-time timeline visualization, and comprehensive latency metrics.
+Multi-modal AI Studio is a conversational AI interface for building and tuning voice AI systems. It supports NVIDIA Riva, OpenAI, and other backends; records sessions with full config snapshots; and provides a real-time timeline and latency analysis (TTFA, turn-taking) to compare and optimize setups.
 
 ## 🌟 Key Features
 
 ### Multi-modal Support
-- **Voice Input/Output**: Streaming ASR and TTS via Riva or OpenAI
-- **Text Chat**: Traditional text-based conversation
-- **Video**: Camera feed for vision-enabled models (future)
-- **Mixed Modes**: Voice-to-text, text-to-voice, or text-only
+- **Voice**: Streaming ASR and TTS (Riva, OpenAI, or other backends)
+- **Text**: Chat-only mode or combined with voice
+- **Video**: Camera feed for vision-language models (VLM); browser WebRTC or server USB webcam
+- **Mixed modes**: Voice-to-text, text-to-voice, voice-to-voice, or text-only
 
 ### Multi-backend Architecture
-- **NVIDIA Riva**: gRPC streaming ASR/TTS
-- **OpenAI**: REST API (Whisper, TTS) and Realtime API
-- **Azure Speech**: Coming soon
-- **Custom backends**: Extensible plugin system
+- Speech
+  - **NVIDIA Riva**: gRPC streaming ASR/TTS (Jetson/ARM64)
+  - **OpenAI Realtime API**: OpenAI-compatible endpoints
+- LLM: **OpenAI-compatible** REST API, which works with many inference engines serving LLM/VLM models
+- **Extensible**: Plugin-style backends; Azure Speech and others can be added
 
 ### Session Management
-- **Configuration Snapshots**: Every session saves ASR/LLM/TTS configs
-- **Timeline Recording**: Store performance data for offline analysis
-- **Preset System**: Save and load configuration presets
-- **Export/Import**: Generate CLI commands or YAML configs from WebUI
+- **Config snapshots**: Every session stores ASR/LLM/TTS and device settings
+- **Timeline recording**: Performance data for offline analysis
+- **Presets**: Save and load configuration presets
 
 ### Performance Analysis
-- **Real-time Timeline**: Multi-lane visualization (Audio, Speech, LLM, TTS)
-- **Latency Metrics**: TTFA (Time to First Audio), turn-taking analysis
-- **Comparison Mode**: Compare multiple sessions to optimize configs
-- **Session Replay**: Analyze recorded timeline data
+- **Real-time timeline**: Multi-lane view (Audio, Speech, LLM, TTS)
+- **Latency metrics**: TTFA (Time to First Audio), turn-taking
 
-### Flexible Deployment
-- **WebUI Mode**: Rich browser interface (default)
-- **Headless Mode**: CLI-only for production/automation (not yet implemented)
-- **Audio/Video devices**: **Currently supported:** browser devices via WebRTC (mic, speaker, camera through the browser). **Not yet supported:** local USB microphone, USB speaker, or USB webcam attached to the server machine.
+### UI & Devices
+- **Chat-style UI**: Familiar layout, video full-screen mode, keyboard shortcuts. Most settings are exposed in the UI (ASR/LLM/TTS, models, devices) so you can tweak and switch backends without editing config files or code.
+- **Devices**: Client-side (browser WebRTC) and server-side (Linux USB mic, USB speaker, USB webcam); choose in the Devices tab.
+- **Headless** (experimental, not well tested): CLI with config file or args; see [INSTALL.md](INSTALL.md).
 
 ## 🚀 Quick Start
 
 ### Prerequisites
 
-- Python 3.8+
-- **Audio/video**: Use the app in a browser; mic, speaker, and camera are accessed via WebRTC (browser devices). Local USB mic/speaker/webcam on the server are not supported yet.
-- NVIDIA Riva (for Riva backend) - see [INSTALL.md](INSTALL.md#nvidia-riva-setup-for-voice-asrtts)
-- OpenAI API key (for OpenAI backend) - optional
-- **Optional**: `jq` (e.g. `apt install jq` or `brew install jq`) for pretty-formatted LLM request/response logs in the server console; without it, logs use plain JSON
+- **Python 3.8+**
+- **Audio/video**: Browser (WebRTC) for mic, speaker, and camera. On Linux, server **USB microphone**, **USB speaker**, and **USB webcam** are also supported; see [INSTALL.md](INSTALL.md).
+- **Backends (as needed)**: [NVIDIA Riva](INSTALL.md#nvidia-riva-setup-for-voice-asrtts) for ASR/TTS; OpenAI API key for OpenAI/Realtime backends (optional).
+- **Optional**: `jq` for pretty-printed LLM logs in the console (`apt install jq` or `brew install jq`).
 
 ### Installation
 
@@ -72,27 +69,9 @@ Full steps and troubleshooting: [INSTALL.md](INSTALL.md)
 ```bash
 # View sessions and timeline (no backend required)
 python -m multi_modal_ai_studio --port 8092
-
-# With Riva ASR/TTS (use --asr-server and --tts-server)
-python -m multi_modal_ai_studio \
-  --port 8092 \
-  --asr-server localhost:50051 \
-  --tts-server localhost:50051 \
-  --llm-api-base http://localhost:11434/v1 \
-  --llm-model llama3.2:3b
-
-# With OpenAI Realtime API
-python -m multi_modal_ai_studio \
-  --port 8092 \
-  --asr-scheme openai-realtime \
-  --tts-scheme openai-realtime \
-  --llm-api-key sk-...
-
-# With preset
-python -m multi_modal_ai_studio --preset low-latency
 ```
 
-Open **http://localhost:8092** in your browser.
+Open **http://localhost:8092** in your browser. For voice (Riva, OpenAI, etc.) and other options, see [INSTALL.md](INSTALL.md).
 
 ### Kill a Running Server
 
@@ -107,30 +86,31 @@ lsof -i :8092
 kill <PID>
 ```
 
-**Sessions and sample data**
-By default the app loads and saves sessions in `sessions/`. To view or use the sample/mock session JSONs (e.g. in `mock_sessions/`), run with `--session-dir mock_sessions`. Open the app, then click a session in the sidebar to view its config and timeline.
+### Sessions and sample data
+
+Sessions are stored in `sessions/` by default. To try sample timelines, run with `--session-dir mock_sessions` and open a session from the sidebar.
 
-### Run Headless
+### Run headless (experimental)
+
+CLI-only mode for automation or local audio devices. Requires the `[audio]` extra and device setup; see [INSTALL.md](INSTALL.md).
 
 ```bash
-# From config file
 python -m multi_modal_ai_studio --mode headless --config my-config.yaml
 
-# From CLI args
-python -m multi_modal_ai_studio \
-  --mode headless \
-  --audio-input alsa:hw:0,0 \
-  --audio-output alsa:hw:1,0 \
-  --asr-scheme riva \
-  --llm-model llama3.2:3b
+# Or with CLI args (e.g. ALSA devices)
+python -m multi_modal_ai_studio --mode headless \
+  --audio-input alsa:hw:0,0 --audio-output alsa:hw:1,0 \
+  --asr-scheme riva --llm-model llama3.2:3b
 ```
 
 ## 📖 Documentation
 
-- [VLM Guide](docs/vlm_guide.md) — Vision-Language Model setup, input modes, frame capture, and tuning
-- [Riva Setup](docs/setup_riva.md) — NVIDIA Riva ASR/TTS installation and configuration
-- [Architecture](docs/architecture.md) — System design and component overview
-- [Installation](INSTALL.md) — Full installation steps and troubleshooting
+| Doc | Description |
+|-----|-------------|
+| [INSTALL.md](INSTALL.md) | Installation, backends, and troubleshooting |
+| [Riva Setup](docs/setup_riva.md) | NVIDIA Riva ASR/TTS (Jetson/ARM64) |
+| [VLM Guide](docs/vlm_guide.md) | Vision-language models, frame capture, tuning |
+| [Architecture](docs/architecture.md) | System design and components |
 
 ## 🤝 Contributing
 
@@ -140,6 +120,3 @@ This project is under active development. Issues, pull requests, and feedback ar
 
 Apache License 2.0 - See [LICENSE](LICENSE) file for details.
 
-## 🙏 Acknowledgments
-
-Built on top of proven concepts from [Live RIVA WebUI](https://github.com/yourusername/live-riva-webui).
diff --git a/docs/setup_riva.md b/docs/setup_riva.md
index 391b3ff..74252d8 100644
--- a/docs/setup_riva.md
+++ b/docs/setup_riva.md
@@ -21,13 +21,57 @@ This guide walks through setting up NVIDIA Riva locally for voice (ASR/TTS). It
 
 - **Jetson Platform**: Jetson Orin, Thor, AGX Xavier, or newer (ARM64/L4T)
 - **JetPack**: Recent JetPack version (6.0+ recommended)
-- **Docker + NVIDIA Container Toolkit**: Pre-installed on JetPack
+- **Docker + NVIDIA Container Toolkit**: Pre-installed on JetPack. The ARM64 quickstart uses **plain Docker only** (`docker run`, `docker exec`, etc.) — **Docker Compose is not required**.
 - **NGC account with Riva access**: Required for downloading Riva resources
   - Try your account that has Riva entitlements (company or personal)
   - NVIDIA employees: Internal access may require specific team membership
   - External users: May need AI Enterprise trial or proper entitlements
   - **Tip**: If one account doesn't work, try another you have access to
 
+## Configure Docker for GPU (Jetson) — do this first
+
+Riva runs in a container that needs GPU access. On Jetson, Docker must use the **NVIDIA Container Runtime** and be **restarted** after any config change. Doing this once at the start avoids the "container stays Created" / "use --runtime=nvidia instead" errors.
+
+1. **Ensure `/etc/docker/daemon.json` has the NVIDIA runtime and default**
+
+   If the file doesn't exist or is empty, create it. Otherwise merge the `runtimes` and `default-runtime` keys into your existing config:
+
+   ```json
+   {
+     "runtimes": {
+       "nvidia": {
+         "path": "nvidia-container-runtime",
+         "runtimeArgs": []
+       }
+     },
+     "default-runtime": "nvidia"
+   }
+   ```
+
+   Example (create or edit with sudo):
+
+   ```bash
+   sudo nano /etc/docker/daemon.json
+   ```
+
+2. **Restart Docker so the config is applied**
+
+   **This step is required.** Changes to `daemon.json` do not apply until Docker is restarted.
+
+   ```bash
+   sudo systemctl restart docker
+   ```
+
+3. **Optional: verify GPU access in a container**
+
+   ```bash
+   docker run --rm --runtime=nvidia nvcr.io/nvidia/cuda:13.0.0-runtime-ubuntu24.04 nvidia-smi
+   ```
+
+   You should see your GPU; if not, check the NVIDIA Container Toolkit and JetPack installation.
+
+Then continue with Part 1 (NGC CLI) below.
+
 ## Part 1: Install NGC CLI
 
 The NGC CLI is required to download Riva's quickstart bundle from NVIDIA's catalog.
@@ -285,30 +329,28 @@ Preparing model repository...
 
 ## Part 6: Start Riva Services (Jetson)
 
+Run from inside the quickstart directory (so `config.sh` is found):
+
 ```bash
 cd riva_quickstart_arm64_v2.24.0
 bash riva_start.sh
 ```
 
-This launches the Riva server via Docker Compose. Services:
-- **riva-speech**: gRPC server on port `50051` (ASR/TTS)
-- **riva-client**: Client container with sample scripts and test files
+This launches the Riva server via **Docker** (the script uses `docker run`; no Docker Compose). One container is started:
+- **riva-speech**: gRPC server on port `50051` (ASR/TTS). A client shell or sample scripts are available separately via `riva_start_client.sh` (see Part 7).
 
 **Note for USB audio**: If using USB microphone/speaker, connect it **before** running `riva_start.sh`. The script will automatically mount it into the container.
 
 ### Verify Deployment
 
 ```bash
-# Check container status
-docker compose ps
-
-# Expected output:
-# NAME                  STATUS
-# riva-speech           Up X minutes
-# riva-client           Up X minutes
+# Check that the riva-speech container is running (name comes from config.sh)
+docker ps -f "name=riva-speech"
 
-# Check logs
-docker compose logs -f riva-speech
+# Check logs if anything looks wrong
+docker logs riva-speech
+# Follow logs in real time:
+docker logs -f riva-speech
 ```
 
 Look for successful startup message:
@@ -340,7 +382,7 @@ riva_streaming_asr_client --list_models
 # 'en-US': 'parakeet-1.1b-en-us-asr-streaming'
 ```
 
-**Note**: Riva 2.24.0 on Jetson defaults to **Parakeet 1.1b**, which is optimized for low-latency streaming ASR. This is the recommended model for real-time voice applications like Live RIVA WebUI.
+**Note**: Riva 2.24.0 on Jetson defaults to **Parakeet 1.1b**, which is optimized for low-latency streaming ASR. This is the recommended model for real-time voice applications (e.g. Multi-modal AI Studio, Live RIVA WebUI).
 
 ### Test Streaming ASR (Primary mode for Parakeet 1.1b)
 
@@ -384,7 +426,7 @@ Throughput: 8.3569e+00 RTFX
 **Streaming ASR is the primary mode for Riva 2.24.0 on Jetson**:
 - Low latency (~100-200ms)
 - Real-time interim results
-- Optimized for conversational AI applications like Live RIVA WebUI
+- Optimized for conversational AI applications (e.g. Multi-modal AI Studio, Live RIVA WebUI)
 
 ### Test with Opus file (WebRTC codec)
 
@@ -456,7 +498,7 @@ This stops and removes containers while preserving downloaded models in the `riv
 
 ```
 ┌─────────────────────────────────────────────────────────────┐
-│                     Live RIVA WebUI                         │
+│       Voice apps (Multi-modal AI Studio / Live RIVA WebUI)  │
 │  ┌──────────────┐         ┌──────────────┐                  │
 │  │   Browser    │◄───────►│  WebUI       │                  │
 │  │  (WebRTC)    │  WS/RTC │  Server      │                  │
@@ -465,7 +507,7 @@ This stops and removes containers while preserving downloaded models in the `riv
 ├───────────────────────────────────┼─────────────────────────┤
 │                    Docker         │                         │
 │  ┌────────────────────────────────▼───────────────────┐     │
-│  │         riva-speech-api (port 50051)               │     │
+│  │         riva-speech (port 50051)                   │     │
 │  │  ┌────────────┐  ┌────────────┐  ┌────────────┐    │     │
 │  │  │    ASR     │  │    TTS     │  │    NMT     │    │     │
 │  │  │ StreamingR │  │ Synthesize │  │ Translate  │    │     │
@@ -484,11 +526,11 @@ This stops and removes containers while preserving downloaded models in the `riv
 
 ## Part 9: WebRTC and Opus Audio
 
-### Why Opus matters for Live RIVA WebUI
+### Why Opus matters for voice applications
 
-**Opus is WebRTC's standard audio codec** - all modern browsers encode microphone audio as Opus by default. Riva's inclusion of Opus sample files (`/opt/riva/wav/en-US_sample.opus`) confirms it can handle this codec natively.
+**Opus is WebRTC's standard audio codec** — all modern browsers encode microphone audio as Opus by default. Riva's inclusion of Opus sample files (`/opt/riva/wav/en-US_sample.opus`) confirms it can handle this codec natively.
 
-For Live RIVA WebUI, the audio flow will be:
+For Multi-modal AI Studio and Live RIVA WebUI, the audio flow is:
 ```
 Browser (WebRTC) → Opus audio → WebSocket → Bridge → PCM → Riva gRPC → Transcripts
 ```
@@ -506,13 +548,13 @@ NVIDIA provides an **open-source WebSocket ↔ Riva bridge**: [nvidia-riva/webso
 
 **Implementation**: JavaScript/Node.js
 
-For Live RIVA WebUI, we can either:
+For Multi-modal AI Studio and Live RIVA WebUI, options include:
 1. Use the nvidia-riva/websocket-bridge as-is
-2. Build a Python version integrated into our existing async server (reusing Live VLM WebUI's WebRTC scaffolding)
+2. Build a Python version integrated into the existing async server (e.g. reusing Live VLM WebUI's WebRTC scaffolding)
 
-## Next Steps for Live RIVA WebUI
+## Next Steps (Multi-modal AI Studio / Live RIVA WebUI)
 
-1. **Audio Bridge**: Build WebSocket/WebRTC → gRPC adapter
+1. **Audio Bridge**: WebSocket/WebRTC → gRPC adapter
    - Accept Opus audio from browser
    - Decode Opus → PCM (or use Riva's native Opus support)
    - Stream to `riva_asr.StreamingRecognize` gRPC
@@ -526,7 +568,7 @@ For Live RIVA WebUI, we can either:
    - `riva_tts.SynthesizeOnline` gRPC
    - Send audio back to browser via WebRTC
 
-4. **UI**: React/TypeScript frontend
+4. **UI**: Web frontend (e.g. React/TypeScript)
    - Mic capture (WebRTC audio)
    - Live captions overlay
    - Chat transcript panel
@@ -536,6 +578,195 @@ For Live RIVA WebUI, we can either:
 
 ## Troubleshooting
 
+### "Waiting for Riva server to load all models... retrying in 10 seconds" (never finishes)
+
+The Riva server container is not becoming healthy within the timeout. The quickstart uses **plain Docker** (no Compose); troubleshoot as follows.
+
+1. **Run from the quickstart directory** (so `config.sh` is loaded)
+   ```bash
+   cd /path/to/riva_quickstart_arm64_v2.24.0
+   bash riva_start.sh
+   ```
+
+2. **Check container status**
+   ```bash
+   docker ps -a -f "name=riva-speech"
+   ```
+   - If **riva-speech** is missing or status is **Exited**: the container failed. Check logs (step 3).
+   - If it is **Up** but the script still retries: health check may be slow (first load can take several minutes), or the server may be failing internally — check logs.
+
+3. **Inspect riva-speech logs**
+   The script suggests: `docker logs riva-speech`. If that shows nothing, see [Health ready check failed and empty logs](#health-ready-check-failed-and-empty-docker-logs-riva-speech).
+   ```bash
+   docker logs riva-speech
+   docker logs --tail=200 riva-speech
+   ```
+   Look for:
+   - **GPU / CUDA errors**: Ensure `nvidia-smi` works and NVIDIA Container Toolkit is installed.
+   - **Model not found / path errors**: Re-run `bash riva_init.sh` and ensure it completed without errors.
+   - **Out of memory**: Jetson may need more swap or fewer models; disable TTS or NLP in `config.sh` to reduce memory.
+
+4. **Restart cleanly**
+   ```bash
+   bash riva_stop.sh
+   bash riva_start.sh
+   ```
+   In another terminal, run `docker logs -f riva-speech` to watch startup output.
+
+### "Health ready check failed" and empty `docker logs riva-speech`
+
+The script suggests `docker logs riva-speech` (the container name is set in `config.sh` as `riva_daemon_speech="riva-speech"`). If that command prints **nothing**, check the container **STATUS** with `docker ps -a -f "name=riva-speech"`:
+
+- **STATUS = Created** → The container was created but **never started** (main process never ran). See [Container stuck in Created](#container-stuck-in-created-never-started) below.
+- **STATUS = Exited** → The process ran then exited; see step 2 below.
+- **STATUS = Up** → Container is running; logs may appear after a short delay, or try `docker logs -f riva-speech`.
+
+1. **Confirm the container exists and its name**
+   ```bash
+   docker ps -a | grep -i riva
+   ```
+   The quickstart creates a container named **riva-speech**. If you see a different name (e.g. from an older run or custom config), use that:
+   ```bash
+   docker logs <container_name_or_id>
+   ```
+
+2. **Container exited immediately**
+   If the container is **Exited**, it may have crashed before writing much. You can still try:
+   ```bash
+   docker logs riva-speech
+   docker logs --tail=200 riva-speech
+   ```
+   Exited containers usually keep their stdout/stderr; if the logs are still empty, the process may have died before producing any output. Run again and watch in real time:
+   ```bash
+   bash riva_stop.sh
+   bash riva_start.sh
+   ```
+   In a second terminal, as soon as the container starts:
+   ```bash
+   docker logs -f riva-speech
+   ```
+   Look for GPU/CUDA, model path, or OOM errors in the first lines.
+
+### Container stuck in **Created** (never started)
+
+If `docker ps -a -f "name=riva-speech"` shows **STATUS = Created** (and no "Up" time), the container was created by `docker run -d` but the main process never started. There are no logs because the entrypoint hasn't run. Common causes: missing or inaccessible device (e.g. GPU, USB/sound), volume mount failure, or Docker/runtime blocking start.
+
+**Do this:**
+
+1. **Remove the stuck container and try again from the quickstart directory**
+   ```bash
+   docker rm -f riva-speech
+   cd /path/to/riva_quickstart_arm64_v2.24.0
+   bash riva_start.sh
+   ```
+   In a second terminal, watch for the container to go from Created → Up and then stream logs:
+   ```bash
+   watch -n 1 'docker ps -a -f "name=riva-speech"'
+   # When STATUS becomes "Up", run:
+   docker logs -f riva-speech
+   ```
+
+2. **If it stays in Created again**, try starting it manually to see the error:
+   ```bash
+   docker start riva-speech
+   docker logs -f riva-speech
+   ```
+   If `docker start` fails or logs show nothing, inspect the container:
+   ```bash
+   docker inspect riva-speech
+   ```
+   Check **`State.Error`** for the exact message. A very common one on Jetson is below.
+
+3. **If you see: "invoking the NVIDIA Container Runtime Hook directly ... use the NVIDIA Container Runtime (--runtime=nvidia) instead"**
+   See [Riva container stays "Created": use NVIDIA Container Runtime](#riva-container-stays-created-use-nvidia-container-runtime) below.
+
+4. **Verify GPU and devices**
+   The Riva start script passes `--gpus` and, on Tegra, also `--device /dev/bus/usb --device /dev/snd`. Ensure:
+   - `nvidia-smi` works and NVIDIA Container Toolkit is installed.
+   - No security profile (e.g. AppArmor) is blocking device access.
+   - If you don't need USB/sound for the server, you could temporarily comment out the extra `--device` flags in `riva_start.sh` to see if the container then starts (for debugging only).
+
+### Container starts then exits immediately (or stays "Created")
+
+If the container stays in **Created** and never shows **Up**, or it exits so quickly that `docker logs riva-speech` is empty, the script is hiding the error: it runs `docker run -d ... &> /dev/null`, so all output is discarded. Run the same container **in the foreground** so you see the real error (CUDA, model path, OOM, etc.):
+
+```bash
+cd /path/to/riva_quickstart_arm64_v2.24.0
+source config.sh
+
+# Remove any existing container so we can use the same name
+docker rm -f riva-speech 2>/dev/null
+
+# Same as riva_start.sh but -it (foreground) and no -d; output goes to your terminal
+docker run -it --rm \
+  --init --ipc=host \
+  --gpus "$gpus_to_use" \
+  -p $riva_speech_api_port:$riva_speech_api_port \
+  -p $riva_speech_api_http_port:$riva_speech_api_http_port \
+  -e RIVA_SERVER_HTTP_PORT=$riva_speech_api_http_port \
+  -e "LD_PRELOAD=$ld_preload" \
+  -e "RIVA_API_KEY=$RIVA_API_KEY" \
+  -e "RIVA_API_NGC_ORG=$RIVA_API_NGC_ORG" \
+  -e "RIVA_EULA=$RIVA_EULA" \
+  -v $riva_model_loc:/data \
+  --ulimit memlock=-1 --ulimit stack=67108864 \
+  -p 8000:8000 -p 8001:8001 -p 8002:8002 -p 8888:8888 \
+  $image_speech_api \
+  start-riva --riva-uri=0.0.0.0:$riva_speech_api_port \
+  --asr_service=$service_enabled_asr \
+  --tts_service=$service_enabled_tts \
+  --nlp_service=$service_enabled_nlp
+```
+
+(On Tegra the script also adds `--device /dev/bus/usb --device /dev/snd`; if the command above runs and you need those, add them before `$image_speech_api`.)
+
+- **What you see** is the real failure (e.g. "could not load model", "CUDA error", "No such file", OOM). Fix that and then use `bash riva_start.sh` again.
+- **If it stays in Created** even with this foreground run, the failure happens before the process starts (e.g. device or runtime setup); run `docker events` in another terminal, then rerun the `docker run` command above and look for the error in the event stream.
+
+### Riva container stays "Created": use NVIDIA Container Runtime
+
+If `docker inspect riva-speech` shows in **State.Error** something like:
+
+```text
+invoking the NVIDIA Container Runtime Hook directly (e.g. specifying the docker --gpus flag) is not supported.
+Please use the NVIDIA Container Runtime (e.g. specify the --runtime=nvidia flag) instead: unknown
+```
+
+then Docker on this host is set up to use the **NVIDIA Container Runtime** (full runtime), not the hook used by `--gpus`. The container never starts because the runtime rejects the `--gpus`-based GPU setup.
+
+**Fix A — Configure Docker to use the NVIDIA runtime by default (recommended)**
+Ensure `/etc/docker/daemon.json` registers the nvidia runtime and sets it as the default:
+
+```json
+{
+  "runtimes": {
+    "nvidia": {
+      "path": "nvidia-container-runtime",
+      "runtimeArgs": []
+    }
+  },
+  "default-runtime": "nvidia"
+}
+```
+
+If the file already has `"runtimes": { "nvidia": ... }` but no `"default-runtime": "nvidia"`, add that. Then:
+
+```bash
+sudo systemctl restart docker
+```
+
+After that, run `bash riva_start.sh` again.
+
+**Fix B — Workaround: use `--runtime=nvidia` in the start script**
+If you prefer not to change the default runtime, patch `riva_start.sh` so the container uses the nvidia runtime on Tegra instead of `--gpus`:
+
+1. Open `riva_start.sh` in your quickstart directory.
+2. Find the line: `--gpus '"'$gpus_to_use'"' \`
+3. Replace it with: `--runtime=nvidia \`
+   (This is safe for Jetson/Tegra; the nvidia runtime gives the container GPU access.)
+
+Then run `bash riva_start.sh` again.
+
 ### "403 Forbidden" when downloading quickstart
 
 - **Cause**: NGC account lacks Riva entitlement
@@ -552,7 +783,7 @@ For Live RIVA WebUI, we can either:
 
 - **Verify GPU**: `nvidia-smi` should show your GPU
 - **Check toolkit**: `docker run --rm --gpus all ubuntu nvidia-smi`
-- **Review logs**: `docker compose logs riva-speech-api`
+- **Review logs**: `docker logs riva-speech` (the container name is set by `riva_daemon_speech` in `config.sh`)
 
 ### Models downloading very slowly
 
@@ -570,7 +801,7 @@ For Live RIVA WebUI, we can either:
 ---
 
 **Document Status**: Updated for Jetson ARM64 deployment (x86 support discontinued)
-**Last Updated**: January 2025
+**Last Updated**: March 2025
 **Riva Version**: 2.24.0 (ARM64)
 **Platform**: NVIDIA Jetson Thor (JAT03)
 
diff --git a/presets/cosmos-reason.yaml b/presets/cosmos-reason.yaml
index 2d2966f..99c4850 100644
--- a/presets/cosmos-reason.yaml
+++ b/presets/cosmos-reason.yaml
@@ -15,7 +15,7 @@ asr:
 
 llm:
   scheme: openai
-  api_base: http://localhost:8003/v1
+  api_base: http://localhost:8010/v1
   model: /model
   temperature: 0.3          # Low temp — critical for precise, consistent vision responses
   max_tokens: 512           # Hard cap on reasoning+answer combined; model uses ~150-275 total
diff --git a/pyproject.toml b/pyproject.toml
index e6118a2..ddbd68a 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -3,6 +3,7 @@ requires = ["setuptools>=61.0", "wheel"]
 build-backend = "setuptools.build_meta"
 
 [project]
+# Distribution name (hyphens); Python package is multi_modal_ai_studio (underscores)
 name = "multi-modal-ai-studio"
 version = "0.1.0"
 description = "Multi-modal AI interface with voice, text, and video support for analyzing conversational AI systems"
@@ -10,7 +11,9 @@ readme = "README.md"
 requires-python = ">=3.8"
 license = {text = "Apache-2.0"}
 authors = [
-    {name = "Your Name", email = "your.email@example.com"}
+    {name = "Chitoku YATO (tokk-nv)", email = "cyato@nvidia.com"},
+    {name = "Aditya Sahu (adsahu-nv)", email = "adsahu@nvidia.com"},
+    {name = "kbenkhaled", email = "kbenkhaled@nvidia.com"},
 ]
 keywords = ["ai", "voice", "multimodal", "riva", "nvidia", "openai", "conversational-ai"]
 classifiers = [
@@ -61,10 +64,10 @@ webrtc-camera = [
 multi-modal-ai-studio = "multi_modal_ai_studio.cli.main:main"
 
 [project.urls]
-Homepage = "https://github.com/yourusername/multi-modal-ai-studio"
-Documentation = "https://github.com/yourusername/multi-modal-ai-studio/tree/main/docs"
-Repository = "https://github.com/yourusername/multi-modal-ai-studio"
-Issues = "https://github.com/yourusername/multi-modal-ai-studio/issues"
+Homepage = "https://github.com/NVIDIA-AI-IOT/multi_modal_ai_studio"
+Documentation = "https://github.com/NVIDIA-AI-IOT/multi_modal_ai_studio/tree/main/docs"
+Repository = "https://github.com/NVIDIA-AI-IOT/multi_modal_ai_studio"
+Issues = "https://github.com/NVIDIA-AI-IOT/multi_modal_ai_studio/issues"
 
 [tool.setuptools.packages.find]
 where = ["src"]
diff --git a/src/multi_modal_ai_studio/backends/llm/openai.py b/src/multi_modal_ai_studio/backends/llm/openai.py
index ff821be..8389660 100644
--- a/src/multi_modal_ai_studio/backends/llm/openai.py
+++ b/src/multi_modal_ai_studio/backends/llm/openai.py
@@ -346,6 +346,7 @@ async def list_available_models(self) -> List[str]:
         """List available models from the LLM API.
 
         Attempts to detect models from Ollama's native API or OpenAI endpoint.
+        We probe /api/tags first (Ollama can run on any port), then fall back to /v1/models.
 
         Returns:
             List of model names, or empty list if detection fails
@@ -396,6 +397,13 @@ async def list_available_models(self) -> List[str]:
     # Vision content formatting — one method per API format
     # -----------------------------------------------------------------
 
+    def _api_supports_video_url(self) -> bool:
+        """Return False if api_base looks like Ollama (image_url only); otherwise assume video_url is supported."""
+        base = (self.api_base or "").lower()
+        if ":11434" in base or "ollama" in base:
+            return False
+        return True
+
     def _build_vision_content(
         self,
         image_data_urls: List[str],
@@ -404,15 +412,14 @@ def _build_vision_content(
     ) -> list:
         """Build the multimodal ``content`` list for a user message.
 
-        Two paths controlled purely by the user's ``vision_video_encode`` config:
-        - True  → encode frames as MP4 video (``video_url``), fall back to images on failure
-        - False → send individual images (``image_url``)
-
-        No backend detection, no sub-sampling, no hardcoded caps.
+        Two paths: video (video_url) when enabled and API supports it, else images (image_url).
+        Ollama does not support video_url and returns 400; we send image_url only for it.
         """
         content: list = [{"type": "text", "text": prompt}]
 
-        use_video = bool(getattr(self.config, "vision_video_encode", False))
+        use_video = bool(getattr(self.config, "vision_video_encode", False)) and self._api_supports_video_url()
+        if getattr(self.config, "vision_video_encode", False) and not self._api_supports_video_url():
+            self.logger.debug("Vision: API does not support video_url (e.g. Ollama); using image_url only")
 
         if use_video and len(image_data_urls) >= 2:
             video_url = _encode_images_to_video_base64(
diff --git a/src/multi_modal_ai_studio/webui/static/app.js b/src/multi_modal_ai_studio/webui/static/app.js
index f2bf910..4fb7fe3 100644
--- a/src/multi_modal_ai_studio/webui/static/app.js
+++ b/src/multi_modal_ai_studio/webui/static/app.js
@@ -1295,7 +1295,7 @@ function renderLLMConfig(config, readonly = false) {
                         <div class="api-presets-menu" id="presetsMenu" style="display: none;">
                             <div class="api-preset-item" onclick="selectLLMPreset('http://localhost:11434/v1', 'Ollama')"><strong>Ollama</strong><span>http://localhost:11434/v1</span></div>
                             <div class="api-preset-item" onclick="selectLLMPreset('http://localhost:8000/v1', 'vLLM')"><strong>vLLM</strong><span>http://localhost:8000/v1</span></div>
-                            <div class="api-preset-item" onclick="selectLLMPreset('http://localhost:8003/v1', 'vLLM (8003)')"><strong>vLLM (8003)</strong><span>http://localhost:8003/v1</span></div>
+                            <div class="api-preset-item" onclick="selectLLMPreset('http://localhost:8010/v1', 'vLLM (8010)')"><strong>vLLM (8010)</strong><span>http://localhost:8010/v1</span></div>
                             <div class="api-preset-item" onclick="selectLLMPreset('http://localhost:30000/v1', 'SGLang')"><strong>SGLang</strong><span>http://localhost:30000/v1</span></div>
                             <div class="api-preset-item" onclick="selectLLMPreset('http://localhost:58010/v1', 'TensorRT Edge LLM')"><strong>TensorRT Edge LLM</strong><span>http://localhost:58010/v1</span></div>
                             <div class="api-preset-item" onclick="selectLLMPreset('http://10.110.51.30:8801/v1', 'LLM Router')"><strong>LLM Router</strong><span>http://10.110.51.30:8801/v1</span></div>