Commit aa55415

Merge pull request #33 from NVIDIA-AI-IOT/feat/demo
Update installation instructions and refine system prompts across presets
2 parents 04e5b84 + 642bdc2 commit aa55415

12 files changed

Lines changed: 131 additions & 35 deletions

INSTALL.md

Lines changed: 10 additions & 0 deletions
@@ -65,6 +65,11 @@ sudo apt-get install -y portaudio19-dev

 ### 2. Virtual Environment

+> **Note:** If `python3 -m venv` fails with "No module named venv", install it first:
+> ```bash
+> sudo apt install python3.12-venv
+> ```
+
 ```bash
 # Create venv
 python3 -m venv .venv
@@ -392,6 +397,11 @@ The second volume `-v ${HOME}/.cache/vllm:/root/.cache/vllm` persists vLLM’s *
 >
 > **Memory tuning**: On shared-memory systems (Jetson), lower `--gpu-memory-utilization` to leave room for the OS, Riva, and the application. On discrete GPUs with dedicated VRAM, `0.8` is safe.
 >
+> **GPU memory cleanup**: If vLLM fails to start with an OOM error after stopping another GPU container, free cached memory first:
+> ```bash
+> sudo sysctl -w vm.drop_caches=3
+> ```
+>
 > **Desktop GPU / x86_64**: Use `vllm/vllm-openai:latest` or `nvcr.io/nvidia/vllm:latest` instead of the Jetson image.

 ### vLLM troubleshooting
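The note added to section 2 covers a failing `python3 -m venv`. A preflight check along the following lines could detect that condition before the user reaches the step. This is only a sketch and not part of the commit: the helper names `venv_available` and `suggested_package` are hypothetical, and the package name pattern `pythonX.Y-venv` is assumed from the Debian/Ubuntu convention that the note's `python3.12-venv` follows.

```python
import importlib.util
import sys


def venv_available() -> bool:
    """Return True if both `venv` and `ensurepip` can be located.

    Assumption: on Debian/Ubuntu a missing venv package usually shows up
    as one of these two standard-library modules being absent.
    """
    return all(
        importlib.util.find_spec(name) is not None
        for name in ("venv", "ensurepip")
    )


def suggested_package() -> str:
    """Build an apt package name for the running interpreter (assumed pattern)."""
    major, minor = sys.version_info[:2]
    return f"python{major}.{minor}-venv"


if __name__ == "__main__":
    if venv_available():
        print("venv is available")
    else:
        print(f"venv is missing; try: sudo apt install {suggested_package()}")
```

Deriving the package name from `sys.version_info` avoids hard-coding `3.12` on systems that ship a different Python 3 minor version.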

presets/cosmos-reason.yaml

Lines changed: 3 additions & 3 deletions
@@ -20,10 +20,10 @@ llm:
   temperature: 0.3 # Low temp — critical for precise, consistent vision responses
   max_tokens: 512 # Hard cap on reasoning+answer combined; model uses ~150-275 total
   history_turns: 0 # Disabled — text-only history anchors VLM to prior answers
-  system_prompt: "You analyze live video from the user's camera. Answer based on what you see. Be precise and concise — 1-2 short sentences only."
+  system_prompt: "You are a helpful voice AI assistant. Plain text only, no markdown, no bullet points, no emojis."
   enable_vision: true
-  vision_system_prompt: "You analyze live video from the user's camera. Answer based on what you see. Be precise and concise — 1-2 short sentences only."
-  vision_frames: 100 # Video mode: request many frames for temporal video encoding
+  vision_system_prompt: "You are a helpful voice and vision assistant. Give ONE short sentence answers only. Be direct. Plain text only, no markdown, no bullet points, no emojis."
+  vision_frames: 30 # Video mode: request many frames for temporal video encoding
   vision_detail: auto
   vision_quality: 0.8
   vision_max_width: 768

presets/default.yaml

Lines changed: 2 additions & 3 deletions
@@ -27,12 +27,11 @@ llm:
   max_tokens: 512
   minimal_output: false
   stream: true
-  system_prompt: You are a helpful voice assistant.
+  system_prompt: "You are a helpful voice AI assistant. Plain text only, no markdown, no bullet points, no emojis."
   extra_request_body: ''
   cheap_model: nvidia/cosmos-reason2-8b-fp8
   enable_vision: true
-  vision_system_prompt: You are a vision assistant. Give ONE short sentence answers
-    only. Be direct. No explanations. Use plain text only — no markdown or formatting.
+  vision_system_prompt: "You are a helpful voice and vision assistant. Give ONE short sentence answers only. Be direct. Plain text only, no markdown, no bullet points, no emojis."
   vision_detail: auto
   vision_frames: 10
   vision_quality: 0.8

presets/high-accuracy.yaml

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ llm:
   model: llama3.1:8b # Larger, more capable model
   temperature: 0.5 # Lower temperature for more consistent responses
   max_tokens: 1024 # Allow longer, detailed responses
-  system_prompt: "You are a knowledgeable voice assistant. Provide thorough, accurate responses."
+  system_prompt: "You are a helpful voice AI assistant. Plain text only, no markdown, no bullet points, no emojis."

 tts:
   scheme: riva

presets/llm-router.yaml

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+name: "LLM Router (MoM + Local VLM)"
+description: "Remote MoM model via LLM router with local Edge LLM as utility model for titles and vision"
+
+asr:
+  scheme: riva
+  server: localhost:50051
+  model: parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer
+  language: en-US
+  vad_start_threshold: 0.5
+  vad_stop_threshold: 0.3
+  speech_pad_ms: 600
+  speech_timeout_ms: 1200
+
+llm:
+  scheme: openai
+  api_base: http://10.110.51.30:8801/v1
+  model: MoM
+  temperature: 0.3
+  max_tokens: 512
+  history_turns: 0
+  system_prompt: "You are a helpful voice AI assistant. Plain text only, no markdown, no bullet points, no emojis."
+  enable_vision: true
+  vision_system_prompt: "You are a helpful voice and vision assistant. Give ONE short sentence answers only. Be direct. Plain text only, no markdown, no bullet points, no emojis."
+  vision_frames: 30
+  vision_detail: auto
+  vision_quality: 0.8
+  vision_max_width: 768
+  vision_buffer_fps: 5.0
+  vision_video_encode: true
+  enable_reasoning: true
+
+tts:
+  scheme: riva
+  server: localhost:50051
+  voice: ""
+  sample_rate: 22050
+  stream_tts: true
+
+devices:
+  audio_input_source: browser
+  audio_output_source: browser
+
+app:
+  barge_in_enabled: true
+  timeline_position: right
+  session_auto_save: true
+  session_output_dir: ./sessions
+  theme: dark
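The new preset pairs `vision_frames: 30` with `vision_buffer_fps: 5.0`. Assuming, and the diff does not confirm this, that the frames sent per turn are drawn from a buffer filled at `vision_buffer_fps`, the two values together imply how many seconds of video each request could cover. The helper `clip_window_seconds` below is hypothetical and only illustrates that arithmetic.

```python
def clip_window_seconds(vision_frames: int, vision_buffer_fps: float) -> float:
    """Seconds of video spanned by `vision_frames` frames buffered at the given rate.

    Assumption: frames are consecutive samples from a buffer filled at
    `vision_buffer_fps`; the application's real sampling may differ.
    """
    if vision_buffer_fps <= 0:
        raise ValueError("vision_buffer_fps must be positive")
    return vision_frames / vision_buffer_fps


# Values from presets/llm-router.yaml
print(clip_window_seconds(30, 5.0))  # 6.0
```

Under that assumption, raising `vision_frames` without raising the buffer rate lengthens the window rather than increasing temporal resolution.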

presets/low-latency.yaml

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ llm:
   model: llama3.2:3b # Fast small model
   temperature: 0.7
   max_tokens: 256 # Shorter responses
-  system_prompt: "You are a helpful voice assistant. Keep responses concise."
+  system_prompt: "You are a helpful voice AI assistant. Plain text only, no markdown, no bullet points, no emojis."

 tts:
   scheme: riva

presets/openai-realtime.yaml

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ llm:
   model: gpt-4o-realtime-preview
   temperature: 0.7
   max_tokens: 512
-  system_prompt: "You are a helpful voice assistant."
+  system_prompt: "You are a helpful voice AI assistant. Plain text only, no markdown, no bullet points, no emojis."

 tts:
   scheme: openai-realtime

presets/tensorrt-edge-cosmos.yaml

Lines changed: 14 additions & 12 deletions
@@ -1,33 +1,36 @@
 # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 name: "Cosmos-Reason2 (TensorRT Edge LLM)"
-description: "Cosmos-Reason2 on TensorRT Edge LLM backend — optimized edge inference with image input"
+description: "Cosmos-Reason2 on TensorRT Edge LLM backend — optimized edge inference with video input"

 asr:
   scheme: riva
   server: localhost:50051
-  model: conformer
+  model: parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer
   language: en-US
   vad_start_threshold: 0.5
   vad_stop_threshold: 0.3
-  speech_timeout_ms: 500
+  speech_pad_ms: 600
+  speech_timeout_ms: 1200

 llm:
   scheme: openai
   api_base: http://localhost:58010/v1
-  model: qwen3-vl
+  model: /workspace/cosmos_onnx/visual-fp16
   temperature: 0.3
   max_tokens: 512
+  history_turns: 0
   enable_reasoning: false
-  system_prompt: "You are a vision assistant observing the user through a live camera. Answer directly in one short sentence. Do not think step-by-step or explain your reasoning."
+  extra_request_body: '{"chat_template_kwargs": {"enable_thinking": false}}'
+  system_prompt: "You are a helpful voice AI assistant. Plain text only, no markdown, no bullet points, no emojis."
   enable_vision: true
-  vision_system_prompt: "You are a vision assistant observing the user through a live camera. Answer directly in one short sentence. Do not think step-by-step or explain your reasoning."
-  vision_frames: 3
+  vision_system_prompt: "You are a helpful voice and vision assistant. Give ONE short sentence answers only. Be direct. Plain text only, no markdown, no bullet points, no emojis."
+  vision_frames: 30
   vision_detail: auto
-  vision_quality: 0.7
-  vision_max_width: 640
-  vision_buffer_fps: 3.0
-  vision_video_encode: false
+  vision_quality: 0.8
+  vision_max_width: 768
+  vision_buffer_fps: 5.0
+  vision_video_encode: true

 tts:
   scheme: riva
@@ -37,7 +40,6 @@ tts:
   stream_tts: true

 devices:
-  video_source: browser
   audio_input_source: browser
   audio_output_source: browser
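This preset now carries `extra_request_body` as a JSON string that disables thinking through `chat_template_kwargs`. The diff does not show how the application consumes that field. One plausible reading is that the string is parsed and merged into the chat-completions request body, which the sketch below illustrates. The function `build_request_body` is hypothetical and the merge strategy (a shallow top-level update) is an assumption.

```python
import json
from typing import Any, Optional


def build_request_body(
    base: dict[str, Any], extra_request_body: Optional[str]
) -> dict[str, Any]:
    """Return a copy of `base` with the parsed extra JSON object merged on top.

    Assumption: extra keys override base keys at the top level only.
    """
    body = dict(base)
    if extra_request_body:  # None and '' both mean "nothing to merge"
        extra = json.loads(extra_request_body)
        if not isinstance(extra, dict):
            raise ValueError("extra_request_body must be a JSON object")
        body.update(extra)
    return body


# Values taken from presets/tensorrt-edge-cosmos.yaml
base = {
    "model": "/workspace/cosmos_onnx/visual-fp16",
    "temperature": 0.3,
    "max_tokens": 512,
}
body = build_request_body(
    base, '{"chat_template_kwargs": {"enable_thinking": false}}'
)
print(body["chat_template_kwargs"])  # {'enable_thinking': False}
```

Treating the empty string like `None` matches the `extra_request_body: ''` value in `presets/default.yaml`, which would otherwise fail to parse as JSON.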

presets/text-only.yaml

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ llm:
   model: llama3.2:3b
   temperature: 0.7
   max_tokens: 512
-  system_prompt: "You are a helpful AI assistant."
+  system_prompt: "You are a helpful voice AI assistant. Plain text only, no markdown, no bullet points, no emojis."

 tts:
   scheme: none
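Every preset touched by this commit ends up with the same `system_prompt` string. A small check like the one below could flag presets that later drift from it. This is a sketch under assumptions: `drifted_presets` is a hypothetical helper, and it takes an already-loaded mapping of preset name to prompt rather than reading the YAML files, since the repository's loader is not shown in the diff.

```python
SHARED_SYSTEM_PROMPT = (
    "You are a helpful voice AI assistant. "
    "Plain text only, no markdown, no bullet points, no emojis."
)


def drifted_presets(
    prompts: dict[str, str], expected: str = SHARED_SYSTEM_PROMPT
) -> list[str]:
    """Return the sorted names of presets whose prompt differs from `expected`."""
    return sorted(name for name, prompt in prompts.items() if prompt != expected)


# One post-commit value and one pre-commit value, both taken from the diff
prompts = {
    "text-only": SHARED_SYSTEM_PROMPT,
    "low-latency": "You are a helpful voice assistant. Keep responses concise.",
}
print(drifted_presets(prompts))  # ['low-latency']
```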

src/multi_modal_ai_studio/config/schema.py

Lines changed: 2 additions & 2 deletions
@@ -139,7 +139,7 @@ class LLMConfig:
     temperature: float = 0.7
     max_tokens: int = 512
     minimal_output: bool = False
-    system_prompt: str = "You are a helpful voice assistant."
+    system_prompt: str = "You are a helpful voice AI assistant. Plain text only, no markdown, no bullet points, no emojis."
     extra_request_body: Optional[str] = None
     top_p: float = 1.0
     frequency_penalty: float = 0
@@ -153,7 +153,7 @@ class LLMConfig:
     # When enable_vision=True, camera frames are captured and sent with each prompt
     # -------------------------------------------------------------------------
     enable_vision: bool = False # Set True for VLM models (Cosmos-Reason, LLaVA, GPT-4V, etc.)
-    vision_system_prompt: str = "You are a vision assistant. Give ONE short sentence answers only. Be direct. No explanations."
+    vision_system_prompt: str = "You are a helpful voice and vision assistant. Give ONE short sentence answers only. Be direct. Plain text only, no markdown, no bullet points, no emojis."
     vision_detail: Literal["low", "high", "auto"] = "auto" # OpenAI vision detail level
     vision_frames: int = 4 # Frames per turn (1=single at speech end, 2-10=during speech)
     vision_quality: float = 0.7 # JPEG quality (0.3=fast/small, 1.0=high quality)
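The schema change means a preset that omits `system_prompt` or `vision_system_prompt` now inherits the plain-text wording by default. The sketch below shows that fallback with a trimmed stand-in dataclass. `LLMConfigSketch` and `apply_preset` are hypothetical names and the class reproduces only a few fields visible in the diff. The real `LLMConfig` has more fields, and how it is constructed from YAML is not shown.

```python
from dataclasses import dataclass, fields, replace
from typing import Any


@dataclass
class LLMConfigSketch:
    """Stand-in for a subset of LLMConfig; defaults copied from the diff."""

    temperature: float = 0.7
    max_tokens: int = 512
    system_prompt: str = (
        "You are a helpful voice AI assistant. "
        "Plain text only, no markdown, no bullet points, no emojis."
    )
    enable_vision: bool = False
    vision_frames: int = 4


def apply_preset(
    defaults: LLMConfigSketch, overrides: dict[str, Any]
) -> LLMConfigSketch:
    """Return a new config with preset values layered over the defaults."""
    known = {f.name for f in fields(defaults)}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return replace(defaults, **overrides)


# A preset that sets only a few keys keeps the default system prompt.
cfg = apply_preset(
    LLMConfigSketch(),
    {"temperature": 0.3, "enable_vision": True, "vision_frames": 30},
)
print(cfg.temperature, cfg.vision_frames)  # 0.3 30
print("no emojis" in cfg.system_prompt)    # True
```

Rejecting unknown keys is a design choice of the sketch, so that a misspelled preset key fails loudly instead of being silently ignored.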
