Welcome, contributor! This guide provides a straightforward path to creating custom Speech-to-Text (STT), Large Language Model (LLM), and Text-to-Speech (TTS) plugins for the VideoSDK Agent Framework.
- Fork the repository on GitHub.
- Study the base classes in `videosdk-agents/videosdk/agents/`.
- Review existing plugins (e.g., `videosdk-plugins-openai`) for practical examples.
- Create your plugin using the directory structure and guidance below.
- Submit a Pull Request for our team to review.
Your plugin must follow this exact structure to be compatible with the framework.
```
videosdk-plugins/
└── videosdk-plugins-{your-service}/
    ├── pyproject.toml
    ├── README.md
    └── videosdk/
        └── plugins/
            └── {your-service}/
                ├── __init__.py
                ├── version.py
                ├── stt.py   (optional)
                ├── llm.py   (optional)
                └── tts.py   (optional)
```
These files are essential for your plugin's packaging and initialization.

**`pyproject.toml`** — This file manages your plugin's dependencies and packaging.
```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "videosdk-plugins-{your-service}"
dynamic = ["version"]
description = "VideoSDK Agent Framework plugin for {Your Service}"
readme = "README.md"
license = "Apache-2.0"
requires-python = ">=3.11"
authors = [{ name = "videosdk" }]
dependencies = [
    "videosdk-agents>=0.0.35",
    # Add any other required dependencies here, e.g., "openai>=1.0.0"
]

[tool.hatch.version]
path = "videosdk/plugins/{your-service}/version.py"

[tool.hatch.build.targets.wheel]
packages = ["videosdk"]
```

**`README.md`** — A brief but informative README is crucial.
```markdown
# VideoSDK - {Your Service} Plugin

This plugin integrates {Your Service} with the VideoSDK Agent Framework.

**Implemented Features:**

- [x] STT
- [ ] LLM
- [ ] TTS

**API Key Setup:**

Set the `{YOUR_SERVICE}_API_KEY` environment variable.
```

**`version.py`** — Hatch reads the package version from this file.

```python
__version__ = "0.0.1"
```

**`__init__.py`** — This file makes your plugin's classes importable.
```python
# Import and expose only the classes you have implemented
from .stt import YourServiceSTT
from .llm import YourServiceLLM
from .tts import YourServiceTTS

__all__ = [
    "YourServiceSTT",
    "YourServiceLLM",
    "YourServiceTTS",
]
```

Inherit from the base classes and implement the required abstract methods.
**`stt.py`**

```python
from typing import Optional

from videosdk.agents import STT as BaseSTT, STTResponse, SpeechEventType, SpeechData


class YourServiceSTT(BaseSTT):
    def __init__(self, api_key: str, **kwargs):
        super().__init__()
        self.api_key = api_key
        # Initialize your WebSocket client and other resources here

    async def process_audio(self, audio_frames: bytes, language: Optional[str] = None, **kwargs):
        """
        Process incoming audio frames. The framework provides audio at 48kHz.
        Resample if your provider requires a different sample rate.
        """
        # 1. Send audio to your service's WebSocket endpoint.
        # 2. Receive transcription data.
        # 3. Call self._transcript_callback(response) with STTResponse objects.
        pass

    async def aclose(self):
        """Clean up all resources, like closing WebSocket connections."""
        # Your cleanup logic here
        pass
```
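Because the framework delivers 48kHz audio, an STT plugin often has to resample before sending frames upstream. The following is a minimal, dependency-free sketch of linear-interpolation resampling for 16-bit mono PCM; `resample_pcm16` is a hypothetical helper name, and a production plugin would typically use a proper DSP library or a provider SDK utility instead.

```python
import array


def resample_pcm16(pcm: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Naive linear-interpolation resampler for 16-bit mono PCM (native byte order)."""
    samples = array.array("h")
    samples.frombytes(pcm)
    n = len(samples)
    if src_rate == dst_rate or n < 2:
        return pcm
    out_len = max(1, int(n * dst_rate / src_rate))
    step = (n - 1) / (out_len - 1) if out_len > 1 else 0.0
    out = array.array("h")
    for i in range(out_len):
        pos = i * step
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, n - 1)]
        # Interpolate between neighbouring samples; result stays in int16 range.
        out.append(int(a + (b - a) * frac))
    return out.tobytes()
```

For example, one millisecond of 48kHz audio (48 samples) resampled to 16kHz yields 16 samples.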
**`llm.py`**

```python
from typing import Any, AsyncIterator, List, Optional

from videosdk.agents import LLM as BaseLLM, LLMResponse, ChatContext, ChatRole, FunctionTool


class YourServiceLLM(BaseLLM):
    def __init__(self, api_key: str, **kwargs):
        super().__init__()
        self.api_key = api_key
        # Initialize your HTTP client here

    async def chat(
        self,
        messages: ChatContext,
        tools: Optional[List[FunctionTool]] = None,
        **kwargs: Any,
    ) -> AsyncIterator[LLMResponse]:
        """
        Process messages and yield responses in a stream.
        """
        # 1. Format the 'messages' and 'tools' for your provider's API.
        # 2. Make a streaming API request.
        # 3. As you receive response chunks, yield LLMResponse objects.
        yield LLMResponse(content="Hello, world!", role=ChatRole.ASSISTANT)
```
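The streaming contract for `chat()` can be exercised without the framework installed. The sketch below uses a local dataclass as a stand-in for `LLMResponse` and a fake provider stream; every name here is hypothetical, chosen only to illustrate the async-generator pattern.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeLLMResponse:
    # Stand-in for videosdk.agents.LLMResponse.
    content: str
    role: str


async def fake_provider_stream():
    # Simulates a provider SDK yielding text deltas one chunk at a time.
    for chunk in ["Hel", "lo, ", "world!"]:
        await asyncio.sleep(0)
        yield chunk


async def chat():
    # Yield one response object per received chunk, as a streaming chat() should.
    async for delta in fake_provider_stream():
        yield FakeLLMResponse(content=delta, role="assistant")


async def main() -> str:
    parts = []
    async for resp in chat():
        parts.append(resp.content)
    return "".join(parts)


result = asyncio.run(main())
```

Consumers reassemble the deltas (`result` is `"Hello, world!"` here), which is why `chat()` must `yield` chunks rather than return one final string.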
**`tts.py`**

```python
import asyncio
from typing import Any, AsyncIterator, Optional, Union

from videosdk.agents import TTS as BaseTTS


class YourServiceTTS(BaseTTS):
    def __init__(self, api_key: str, **kwargs):
        # Set the sample rate and channels your provider uses.
        super().__init__(sample_rate=24000, num_channels=1)
        self.api_key = api_key
        self.audio_track = None  # This is set by the framework

    async def synthesize(
        self,
        text: Union[AsyncIterator[str], str],
        voice_id: Optional[str] = None,
        **kwargs: Any,
    ):
        """
        Convert text to speech and stream the audio data.
        """
        # 1. Make a streaming API request to your provider with the text.
        # 2. As you receive audio chunks, push them to the audio track:
        #    if self.audio_track:
        #        asyncio.create_task(self.audio_track.add_new_bytes(chunk))
        pass

    async def interrupt(self):
        """Interrupt any ongoing audio synthesis."""
        if self.audio_track:
            self.audio_track.interrupt()
```

Before submitting, check your implementation against this list:

**General**

- Inherit from the correct base class (`STT`, `LLM`, or `TTS`).
- Implement all abstract methods defined in the base class.
- Emit errors consistently using `self.emit("error", message)`.
- Clean up all resources (clients, connections) in the `aclose()` method.
- Use async patterns correctly for all I/O operations.

**STT**

- Handle WebSocket connections gracefully (connect, disconnect, errors).
- The framework provides 48kHz audio. Resample audio if your provider requires a different sample rate.

**LLM**

- Ensure the `chat()` method is a streaming implementation (using `async for` and `yield`).
- Support function tools if the provider's API allows for it.

**TTS**

- Set the correct `sample_rate` and `num_channels` in the `super().__init__()` call.
- Push audio chunks to `self.audio_track` for playback.
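The TTS chunk-pushing and interrupt behaviour described above can be simulated with a stand-in audio track. `FakeAudioTrack` is hypothetical; it only mimics the two methods this guide mentions (`add_new_bytes` and `interrupt`).

```python
import asyncio


class FakeAudioTrack:
    """Stand-in for the framework-provided audio track."""

    def __init__(self):
        self.buffer = bytearray()
        self.interrupted = False

    async def add_new_bytes(self, chunk: bytes):
        # The real track would schedule these bytes for playback.
        self.buffer.extend(chunk)

    def interrupt(self):
        # Stop playback and drop any queued audio.
        self.interrupted = True
        self.buffer.clear()


async def synthesize(track: FakeAudioTrack, chunks):
    # Push each received audio chunk, stopping if the track was interrupted.
    for chunk in chunks:
        if track.interrupted:
            break
        await track.add_new_bytes(chunk)


track = FakeAudioTrack()
asyncio.run(synthesize(track, [b"\x00\x01", b"\x02\x03"]))
```

Checking the interrupt flag between chunks is what lets a barge-in cut synthesis off promptly instead of draining the whole provider stream.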
Before submitting, install and test your plugin locally.

```bash
# Install your plugin in editable mode from its root directory
pip install -e .

# Create a test script (e.g., test_my_plugin.py)
# and run it to verify functionality.
python test_my_plugin.py
```

We highly recommend running your plugin with the example scripts in the `examples/` directory to catch breaking changes or integration issues early.
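A minimal smoke test can simply construct the plugin and exercise its lifecycle. The class below is a stand-in so the sketch runs anywhere; in your actual `test_my_plugin.py` you would import your real class instead and supply a real API key.

```python
import asyncio


class FakeSTT:
    """Stand-in for a real plugin class such as YourServiceSTT."""

    def __init__(self, api_key: str):
        self.api_key = api_key
        self.closed = False

    async def aclose(self):
        # A real plugin closes WebSocket connections and clients here.
        self.closed = True


async def main() -> FakeSTT:
    stt = FakeSTT(api_key="test-key")
    try:
        # A real test would stream sample audio here and check the
        # transcripts delivered via the framework's callback.
        pass
    finally:
        # Always verify cleanup runs, even when the test body fails.
        await stt.aclose()
    return stt


stt = asyncio.run(main())
```

The `try`/`finally` mirrors how the framework tears plugins down, so a leaked connection shows up as a failed cleanup assertion rather than a hung process.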
Once your plugin is ready:
- Push your branch to your fork.
- Create a Pull Request against the main repository.
- Fill out the PR template with details about your service, implemented features, and any special configuration notes.
- Base Classes: `videosdk-agents/videosdk/agents/`
- Plugin Examples: `videosdk-plugins/videosdk-plugins-openai/`