
Surgical Agentic Framework Demo

The Surgical Agentic Framework Demo is a multimodal agentic AI framework tailored for surgical procedures. It supports:

  • Speech-to-Text: Real-time audio is captured and transcribed by Whisper.
  • VLM/LLM-based Conversational Agents: A selector agent decides which specialized agent to invoke:
    • ChatAgent for general Q&A,
    • NotetakerAgent to record specific notes,
    • AnnotationAgent to automatically annotate progress in the background,
    • PostOpNoteAgent to summarize all data into a final post-operative note.
  • Text-to-Speech: The system can speak back the AI's response if you enable TTS. Options include local TTS models (Coqui) and the ElevenLabs API.
  • Computer Vision: Visual and multimodal features are supported via a fine-tuned VLM (Vision Language Model) served by vLLM.
  • Video Upload and Processing: Support for uploading and analyzing surgical videos.
  • Live Streaming (WebRTC): Real-time analysis of live surgical streams via WebRTC with seamless mode switching between uploaded videos and live streams.
  • Post-Operation Note Generation: Automatic generation of structured post-operative notes based on the procedure data.

System Flow and Agent Overview

  1. Microphone: The user clicks "Start Mic" in the web UI, or types a question.
  2. Whisper ASR: Transcribes speech into text (via servers/whisper_online_server.py).
  3. SelectorAgent: Receives text from the UI, corrects it (if needed), and decides whether to direct it to:
    • ChatAgent (general Q&A about the procedure)
    • NotetakerAgent (records a note with timestamp + optional image frame)
    • In the background, AnnotationAgent also generates structured "annotations" every 10 seconds.
  4. NotetakerAgent: If chosen, logs the note in a JSON file.
  5. AnnotationAgent: Runs automatically, storing procedure annotations in procedure_..._annotations.json.
  6. PostOpNoteAgent (optional final step): Summarizes the entire procedure, reading from both the annotation JSON and the notetaker JSON, producing a final structured post-op note.
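
For orientation, the sketch below strings these steps together in plain Python. The selector and agent call signatures follow the examples later in this README; the surrounding wiring is an assumption for illustration, not the actual servers/app.py implementation.

# Conceptual sketch only - the real orchestration lives in servers/app.py.
def handle_utterance(user_text, selector, agents, chat_history):
    # SelectorAgent corrects the transcript and picks a specialized agent by name
    selected_agent_name, corrected_text, selector_context = selector.process_request(
        user_text, chat_history.to_list()
    )
    agent = agents[selected_agent_name]  # e.g. ChatAgent or NotetakerAgent
    result = agent.process_request(corrected_text, chat_history.to_list())
    return result["response"]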

Dynamic Agent System

The framework now uses a dynamic agent loading system that automatically discovers and loads agents from configuration files:

Adding New Agents

To add a new agent to the system:

  1. Create the agent class in agents/your_agent.py:
from agents.base_agent import Agent

class YourAgent(Agent):
    def __init__(self, settings_path, response_handler):
        super().__init__(settings_path, response_handler)

    def process_request(self, text, chat_history):
        # Your agent logic here
        return {"name": "YourAgent", "response": "..."}
  2. Create the configuration file in configs/your_agent.yaml:
agent_metadata:
  name: "YourAgent"
  class_name: "YourAgent"
  module: "agents.your_agent"
  enabled: true
  category: "analysis"  # conversational, analysis, control, etc.
  priority: 10
  requires_llm: true
  requires_visual: false
  dependencies: []
  lifecycle: "singleton"  # or "background" for continuous agents

description: "Your agent's purpose"
agent_prompt: |
  Your agent's system prompt

ctx_length: 512
max_prompt_tokens: 3000
  3. Restart the application - your agent will be automatically discovered and loaded!

The agent will be:

  • Automatically registered with the selector agent
  • Available for user queries
  • Properly initialized with all required dependencies
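
For illustration, a config-driven loader of this kind boils down to reading the agent_metadata block and importing the named class. This is a minimal sketch based on the fields above, not the actual utils/agent_registry.py implementation.

# Minimal sketch of config-driven loading; the real logic is in utils/agent_registry.py.
import importlib
import yaml

def instantiate_agent(config_path, response_handler):
    with open(config_path) as f:
        meta = yaml.safe_load(f)["agent_metadata"]
    if not meta.get("enabled", True):
        return None                                   # disabled agents are skipped
    module = importlib.import_module(meta["module"])  # e.g. agents.your_agent
    agent_cls = getattr(module, meta["class_name"])   # e.g. YourAgent
    return agent_cls(config_path, response_handler)   # matches Agent(settings_path, response_handler)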

Agent Configuration Reference

  • name: Unique identifier for the agent instance
  • class_name: Python class name to instantiate
  • module: Python module path (e.g., agents.your_agent)
  • enabled: Set to false to disable without deleting
  • category: Used for grouping and filtering agents
  • priority: Lower numbers = higher priority for routing
  • requires_llm: Whether agent needs LLM access
  • requires_visual: Whether agent processes images/video
  • dependencies: External services the agent needs (frame queues, callbacks, etc.)
  • lifecycle: singleton (one instance) or background (continuous operation)

Plugin Directories (External Agents)

You can also load agents from external directories (e.g., custom workflows or proprietary agents):

Expected Structure:

~/my-custom-agents/
├── agents/
│   └── my_agent.py        # Python agent file
└── configs/
    └── my_agent.yaml      # Agent configuration

Configuration:

You can specify plugin directories in two ways:

  1. Via global.yaml (located at configs/global.yaml in the root):
plugin_directories:
  - /home/user/my-custom-agents
  - ./local-plugins

Note that the system reads only from a single configs/global.yaml file in the root of this repository - you do not need a global.yaml in your plugin folder.

  2. Via environment variable:
export AGENT_PLUGIN_DIRS="/home/user/my-custom-agents:/path/to/other-agents"

Behavior:

  • Plugin agents are loaded after core agents
  • If a plugin agent has the same name as a core agent, the plugin version overrides it
  • Only loads agents that have both .py and .yaml files with matching names
  • Invalid plugin directories are logged but don't stop the system
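
As a rough sketch of the name-matching rule above (assuming the expected agents/ and configs/ layout shown earlier; the real loader is in utils/agent_registry.py):

# Sketch of the ".py + .yaml with matching names" rule.
from pathlib import Path

def find_plugin_pairs(plugin_dir):
    root = Path(plugin_dir).expanduser()
    configs = {p.stem: p for p in (root / "configs").glob("*.yaml")}
    pairs = []
    for py_file in (root / "agents").glob("*.py"):
        if py_file.stem in configs:  # only agents with a matching config are loaded
            pairs.append((py_file, configs[py_file.stem]))
    return pairs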

Example Use Case:

# Load custom workflow agents
export AGENT_PLUGIN_DIRS="~/pr3/i4h-workflows-internal/workflows/surgical_Agents"
./scripts/start_app.sh

The system will automatically discover and load agents from the plugin directory!

Generic Video Source System

The framework includes a configuration-driven video source management system that allows you to add and manage multiple video sources without modifying code.

Key Features

  • Configuration-Driven: Define video sources in YAML - no code changes needed
  • Auto-Detection: Automatically detect video source from WebSocket messages
  • Dynamic Routing: Generic selector lookup and context-aware frame fetching
  • Multi-Source Support: Handle unlimited video sources (surgical cameras, OR webcams, microscopes, etc.)
  • Priority-Based: Configure detection priority for multiple sources
  • Plugin Compatible: Works seamlessly with the plugin system

Quick Start

The system is already configured with two default video sources:

# configs/video_sources.yaml
video_sources:
  surgical:
    enabled: true
    display_name: "Surgical Camera"
    context_name: "procedure"
    source_type: "uploaded"
    auto_detect:
      websocket_flag: "auto_frame"
      frame_data_key: "frame_data"
    priority: 10

  operating_room:
    enabled: true
    display_name: "Operating Room Webcam"
    context_name: "operating_room"
    source_type: "livestream"
    auto_detect:
      websocket_flag: "operating_room_auto_frame"
      frame_data_key: "operating_room_frame_data"
    priority: 5

Adding a New Video Source

To add a new video source (e.g., a surgical microscope):

  1. Add to configs/video_sources.yaml:
microscope:
  enabled: true
  display_name: "Surgical Microscope"
  description: "High-magnification microscope feed"
  source_type: "livestream"
  selector_config: "configs/selector.yaml"
  plugin_selector_pattern: "configs/microscope_selector.yaml"
  frame_queue_name: "microscope_frame_queue"
  context_name: "microscope"
  auto_detect:
    websocket_flag: "microscope_auto_frame"
    frame_data_key: "microscope_frame_data"
  priority: 7
  2. (Optional) Create custom selector at configs/microscope_selector.yaml if you need specialized routing

  3. Restart the application - that's it!

The system automatically:

  • Creates the frame queue
  • Finds and loads the appropriate selector
  • Routes requests correctly
  • Enables auto-detection

Usage in Code

The video source registry is automatically initialized and integrated:

# Get selector for current mode
video_mode = web.video_source_mode
selector = video_source_registry.get_selector(video_mode)
context = video_source_registry.get_context(video_mode)

# Process with appropriate selector
selected_agent_name, corrected_text, selector_context = selector.process_request(
    user_text, chat_history.to_list()
)

# Fetch frame for current mode
frame_data = _fetch_frame_for_mode(video_mode)

Switching Video Sources

Auto-Detection (recommended): The system automatically detects the video source based on WebSocket message flags configured in video_sources.yaml.
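
Conceptually, auto-detection is a priority-ordered check of each source's websocket_flag against the incoming message. The sketch below is an illustration using the field names from video_sources.yaml, not the actual utils/video_source_registry.py code.

# Illustrative flag-based detection; higher-priority sources are checked first.
def detect_video_source(message, video_sources):
    ordered = sorted(video_sources.items(),
                     key=lambda kv: kv[1].get("priority", 0), reverse=True)
    for name, cfg in ordered:
        flag = cfg.get("auto_detect", {}).get("websocket_flag")
        if flag and flag in message:  # message is the parsed WebSocket payload (a dict)
            return name
    return None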

Manual Switching: Send a WebSocket message from the frontend:

socket.send(JSON.stringify({
  video_source_mode: "microscope"
}));

Benefits

  • ✅ Zero-code extension: Add unlimited video sources via YAML only
  • ✅ Auto-discovery: Automatically finds configurations and selectors
  • ✅ Flexible routing: Each source can have its own specialized selector
  • ✅ Context-aware: Proper context management for multi-source scenarios
  • ✅ Well-tested: Comprehensive test suite included

Configuration Reference

  • enabled: Enable/disable the source
  • display_name: User-friendly name for the UI
  • description: Source description
  • source_type: Type of source: "uploaded" for video files or "livestream" for WebRTC/live feeds
  • selector_config: Path to the base selector config
  • plugin_selector_pattern: Path to the plugin selector config
  • frame_queue_name: Queue identifier for frames
  • context_name: Context name for agent processing
  • auto_detect.websocket_flag: WebSocket message flag used for detection
  • auto_detect.frame_data_key: Key for the frame data in the message
  • priority: Detection priority (higher = checked first)

Testing

Run the comprehensive test suite:

python -m pytest tests/test_video_source_registry.py -v

Video Session Tracking

Agents can detect when video sources reconnect or change via a video_source_session_id counter that auto-increments on:

  • WebRTC connections (frontend signals when stream starts)
  • Video uploads (new file uploaded)
  • Video selections (existing file selected)
  • Mode switches (different video source detected)

This prevents agents from maintaining stale state when reconnecting to the same source.

Usage: Add get_session_id to agent dependencies, then check for changes:

def __init__(self, settings_path, response_handler, get_session_id=None):
    super().__init__(settings_path, response_handler, get_session_id=get_session_id)
    self._last_session_id = None

def process_request(self, text, chat_history, visual_info=None):
    # A changed session id means the video source reconnected or switched
    if self.get_session_id and self._last_session_id != self.get_session_id():
        self._last_session_id = self.get_session_id()
        # Reset agent state here

System Requirements

  • Python 3.12 or higher
  • Node.js 14.x or higher
  • CUDA-compatible GPU (recommended) for model inference
  • Microphone for voice input (optional)
  • 16GB+ VRAM recommended

Installation

  1. Clone or Download this repository:
git clone https://github.com/project-monai/vlm-surgical-agent-framework.git
cd VLM-Surgical-Agent-Framework
  2. Setup vLLM (Optional)

vLLM is already configured in the project scripts. If you need to set up a custom vLLM server, see https://docs.vllm.ai/en/latest/getting_started/installation.html

  3. Install Dependencies:
conda create -n surgical_agent_framework python=3.12
conda activate surgical_agent_framework
pip install -r requirements.txt

Note for Linux (PyAudio build): If pip install pyaudio fails with a missing header error like portaudio.h: No such file or directory, install the PortAudio development package first, then rerun pip install:

sudo apt-get update && sudo apt-get install -y portaudio19-dev
pip install -r requirements.txt
  4. Install Node.js dependencies (for UI development):

Before installing, verify your Node/npm versions (Node ≥14; 18 LTS recommended):

node -v && npm -v
npm install
  5. Models Folder:
  • Where to put things

    • LLM checkpoints live in models/llm/
    • Whisper (speech‑to‑text) checkpoints live in models/whisper/ (they will be downloaded automatically at runtime the first time you invoke Whisper).
  • Default LLM

    • This repository is pre‑configured for NVIDIA Qwen2.5-VL-7B-Surg-CholecT50, a surgical‑domain fine‑tuned variant of Qwen2.5-VL-7B. You may replace it with a fine‑tuned VLM of your choice.

Download the default model from Hugging Face using the Hugging Face CLI:

# Download the checkpoint into the expected folder
hf download nvidia/Qwen2.5-VL-7B-Surg-CholecT50 \
  --local-dir models/llm/Qwen2.5-VL-7B-Surg-CholecT50 \
  --local-dir-use-symlinks False
  • Serving engine

    • All LLMs are served through vLLM for streaming. Change the model path once in configs/global.yaml under model_name — both the agents and scripts/run_vllm_server.sh read this. You can override at runtime with VLLM_MODEL_NAME. To enable auto‑download when the folder is missing, set model_repo in configs/global.yaml (or export MODEL_REPO).
  • Resulting folder layout

models/
  ├── llm/
  │   └── Qwen2.5-VL-7B-Surg-CholecT50/   ← LLM model files
  └── whisper/                            ← Whisper models (auto‑downloaded)

Fine‑Tuning Your Own Surgical Model

If you want to adapt the framework to a different procedure (e.g., appendectomy, colectomy), you can fine‑tune a VLM and plug it into this stack with only config file changes. See:

  • FINETUNE.md — end‑to‑end guide covering:
    • Data curation and scene metadata
    • Visual‑instruction data generation (teacher–student)
    • Packing data in LLaVA‑style format
    • Training (LoRA/QLoRA) and validation
    • Exporting and serving with vLLM, and updating configs
  6. Setup:
  • Edit scripts/start_app.sh if you need to change ports.
  • Edit scripts/run_vllm_server.sh if you need to change quantization or VRAM utilization (4bit requires ~10GB VRAM). Model selection is controlled via configs/global.yaml.
  7. Create necessary directories:
mkdir -p annotations uploaded_videos

Alternative: Docker Deployment

For easier deployment and isolation, you can use Docker containers instead of the traditional installation:

cd docker
./run-surgical-agents.sh

This will automatically download models, build all necessary containers, and start the services. See docker/README.md for detailed Docker deployment instructions.

Running the Surgical Agentic Framework Demo

Production Mode

  1. Run the full stack with all services:
npm start

Or using the script directly:

./scripts/start_app.sh

What it does:

  • Builds the CSS with Tailwind
  • Starts vLLM server with the model on port 8000
  • Waits 45 seconds for the model to load
  • Starts Whisper (servers/whisper_online_server.py) on port 43001 (for ASR)
  • Waits 5 seconds
  • Launches python servers/app.py (the main Flask + WebSockets application)
  • Waits for all processes to complete

Development Mode

For UI development with hot-reloading CSS changes:

npm run dev:web

This starts:

  • The CSS watch process for automatic Tailwind compilation
  • The web server only (no LLM or Whisper)

For full stack development:

npm run dev:full

This is the same as production mode but also watches for CSS changes.

You can also use the development script for faster startup during development:

./scripts/dev.sh
  1. Open your browser at http://127.0.0.1:8050. You should see the Surgical Agentic Framework Demo interface:

    • A video sample (sample_video.mp4)
    • Chat console
    • A "Start Mic" button to begin ASR.
  2. Try speaking or typing:

    • If you say "Take a note: The gallbladder is severely inflamed," the system routes you to NotetakerAgent.
    • If you say "What are the next steps after dissecting the cystic duct?" it routes you to ChatAgent.
    • If you ask record-specific questions like "What meds is the patient on?" or "Any abnormal labs?", it routes you to EHRAgent (after you build the EHR index; see below).
  3. Background Annotations:

    • Meanwhile, AnnotationAgent writes a file like procedure_2025_01_18__10_25_03_annotations.json to the annotations folder every 10 seconds with structured timeline data.

Uploading and Processing Videos

The framework supports two video source modes:

Uploaded Videos

  1. Click on the "Upload Video" button to add your own surgical videos
  2. Browse the video library by clicking "Video Library"
  3. Select a video to analyze
  4. Use the chat interface to ask questions about the video or create annotations

Live Streaming (WebRTC)

The framework now supports real-time analysis of live surgical streams via WebRTC:

  1. Toggle to Live Stream Mode: Select the "Live Stream" radio button in the video controls
  2. Configure Server URL: Enter your WebRTC server URL (default: http://localhost:8080)
  3. Connect: Click the "Connect" button to establish the WebRTC connection
  4. Monitor Status: The connection status indicator will show:
    • Yellow: Connecting...
    • Green: Connected
    • Red: Error
    • Gray: Disconnected
  5. Auto Frame Capture: The system automatically captures frames from the live stream for analysis
  6. Disconnect: Click "Disconnect" when finished to cleanly close the connection

WebRTC Server Requirements:

  • The WebRTC server must provide the following API endpoints:
    • /iceServers - Returns ICE server configuration
    • /offer - Accepts WebRTC offer and returns answer
  • Compatible with the Holohub live video server application or any server implementing the same API
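
As a rough illustration of that two-endpoint contract, the Python sketch below (using aiortc and aiohttp) posts an SDP offer to /offer and applies the returned answer. The JSON field names and the receive-only setup are assumptions and may need adjusting for your particular server.

# Hypothetical signaling client; the /offer payload shape is an assumption.
import aiohttp
from aiortc import RTCPeerConnection, RTCSessionDescription

async def connect(server_url="http://localhost:8080"):
    pc = RTCPeerConnection()
    pc.addTransceiver("video", direction="recvonly")  # we only receive the live stream
    await pc.setLocalDescription(await pc.createOffer())
    async with aiohttp.ClientSession() as http:
        async with http.post(f"{server_url}/offer",
                             json={"sdp": pc.localDescription.sdp,
                                   "type": pc.localDescription.type}) as resp:
            answer = await resp.json()
    await pc.setRemoteDescription(
        RTCSessionDescription(sdp=answer["sdp"], type=answer["type"]))
    return pc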

Features:

  • Seamless switching between uploaded videos and live streams
  • Automatic ICE server configuration with fallback STUN server
  • Proper connection state management and cleanup
  • Support for fullscreen and frame capture in both modes
  • Real-time video analysis capabilities

Generating Post-Operation Notes

After accumulating annotations and notes during a procedure:

  1. Click the "Generate Post-Op Note" button
  2. The system will analyze all annotations and notes
  3. A structured post-operation note will be generated with:
    • Procedure information
    • Key findings
    • Procedure timeline
    • Complications

EHR Q&A (Vector DB)

This repository includes a lightweight EHR retrieval pipeline:

  • Build an EHR vector index from text/JSON files
  • Query the index via an EHRAgent with the same vLLM backend
  • A sample synthetic patient record is included at ehr/patient_history.txt to get you started

Steps:

  1. Build the index from a directory of .txt, .md, or .json files
python scripts/ehr_build_index.py /path/to/ehr_docs ehr_index \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --chunk_tokens 256 --overlap_tokens 32
  2. Point the agent at the index by editing configs/ehr_agent.yaml:
  • ehr_index_dir: set to ehr_index (or your output path)
  • Optionally adjust retrieval_top_k, context_max_chars
  3. You can test by querying via CLI (uses the same vLLM server):
python scripts/ehr_query.py --question "What medications is the patient on?"
  4. Integration in app selection:
  • If the user asks about EHR/records (e.g., "labs", "medications", "allergies"), the request is routed to EHRAgent automatically.
  • Make sure vLLM is running (./scripts/run_vllm_server.sh) and the EHR index exists.
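
For intuition, querying the index follows the usual embed-and-search pattern sketched below. The actual loading and retrieval logic lives in ehr/store.py and EHRAgent, so treat the index file name and the chunks argument here as assumptions.

# Illustrative retrieval sketch; ehr/store.py implements the real query path.
import faiss
from sentence_transformers import SentenceTransformer

def retrieve(question, chunks, index_path="ehr_index/index.faiss", top_k=4):
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    query_vec = model.encode([question], normalize_embeddings=True)
    index = faiss.read_index(index_path)     # index built by scripts/ehr_build_index.py
    _, ids = index.search(query_vec, top_k)  # nearest chunks by embedding similarity
    return [chunks[i] for i in ids[0]]       # chunks: the stored text passages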

Troubleshooting

Common issues and solutions:

  1. WebSocket Connection Errors:

    • Check firewall settings to ensure ports 49000 and 49001 are open
    • Ensure no other applications are using these ports
    • If you experience frequent timeouts, adjust the WebSocket configuration in servers/web_server.py
  2. Model Loading Errors:

    • Verify model paths are correct in configuration files
    • Ensure you have sufficient GPU memory for the models
    • Check the log files for specific error messages
  3. Audio Transcription Issues:

    • Verify your microphone is working correctly
    • Check that the Whisper server is running
    • Adjust microphone settings in your browser
  4. WebRTC Connection Issues:

    • Ensure the WebRTC server is running and accessible at the configured URL
    • Check that the server implements the required /iceServers and /offer endpoints
    • Verify network connectivity and firewall settings for WebRTC ports
    • Check browser console for detailed WebRTC connection errors
    • Ensure the video element has autoplay and playsinline attributes for proper stream playback

Text-to-Speech (TTS)

The framework supports both local and cloud-based TTS options:

Local TTS Service (Recommended)

Benefits: Private, GPU-accelerated, Offline-capable

The TTS service uses a high-quality English VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) model (tts_models/en/ljspeech/vits) that automatically downloads on first use. The model is stored persistently in ./tts-service/models/ and will be available across container restarts.
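
If you want to call the same model directly from Python (outside the bundled TTS service), a minimal Coqui TTS sketch looks like this; the output file name is just an example.

# Minimal Coqui TTS usage with the same VITS model the service downloads.
from TTS.api import TTS

tts = TTS("tts_models/en/ljspeech/vits")  # downloaded to the local model cache on first use
tts.tts_to_file(text="The gallbladder is severely inflamed.", file_path="note_audio.wav")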

ElevenLabs TTS (Alternative)

For cloud-based premium quality TTS:

  • Configure your ElevenLabs API key in the web interface
  • No local storage or GPU resources required

File Structure

A brief overview:

surgical_agentic_framework/
├── agents/                 <-- Agent implementations
│   ├── annotation_agent.py
│   ├── base_agent.py
│   ├── chat_agent.py
│   ├── dynamic_selector_agent.py  <-- Dynamic agent routing
│   ├── ehr_agent.py
│   ├── notetaker_agent.py
│   ├── operating_room_agent.py
│   ├── post_op_note_agent.py
│   ├── robot_control_agent.py
│   └── selector_agent.py (legacy)
├── ehr/                    <-- Retrieval components for EHR
│   ├── builder.py          <-- Builds FAISS index from text/JSON
│   └── store.py            <-- Loads/queries the index
├── configs/                <-- Configuration files
│   ├── annotation_agent.yaml
│   ├── chat_agent.yaml
│   ├── notetaker_agent.yaml
│   ├── post_op_note_agent.yaml
│   └── selector.yaml
├── models/                 <-- Model files
│   ├── llm/                <-- LLM model files
│   │   └── Qwen2.5-VL-7B-Surg-CholecT50/
│   └── whisper/            <-- Whisper models (downloaded at runtime)
├── scripts/                <-- Shell scripts for starting services
│   ├── dev.sh              <-- Development script for quick startup
│   ├── run_vllm_server.sh
│   ├── start_app.sh        <-- Main script to launch everything
│   ├── start_web_dev.sh    <-- Web UI development script
│   ├── ehr_build_index.py  <-- Build EHR vector index
│   └── ehr_query.py        <-- Query EHRAgent via CLI
├── servers/                <-- Server implementations
│   ├── app.py              <-- Main application server
│   ├── uploaded_videos/    <-- Storage for uploaded videos
│   ├── web_server.py       <-- Web interface server
│   └── whisper_online_server.py <-- Whisper ASR server
├── utils/                  <-- Utility classes and functions
│   ├── agent_registry.py   <-- Dynamic agent discovery and loading
│   ├── video_source_registry.py <-- Video source management system
│   ├── chat_history.py
│   ├── logging_utils.py
│   └── response_handler.py
├── web/                    <-- Web interface assets
│   ├── static/             <-- CSS, JS, and other static assets
│   │   ├── audio.js
│   │   ├── bootstrap.bundle.min.js
│   │   ├── bootstrap.css
│   │   ├── chat.css
│   │   ├── jquery-3.6.3.min.js
│   │   ├── main.js
│   │   ├── nvidia-logo.png
│   │   ├── styles.css
│   │   ├── tailwind-custom.css
│   │   └── websocket.js
│   └── templates/
│       └── index.html
├── annotations/            <-- Stored procedure annotations
├── uploaded_videos/        <-- Uploaded video storage
├── README.md               <-- This file
├── package.json            <-- Node.js dependencies and scripts
├── postcss.config.js       <-- PostCSS configuration for Tailwind
├── tailwind.config.js      <-- Tailwind CSS configuration
├── vite.config.js          <-- Vite build configuration
└── requirements.txt        <-- Python dependencies
