This document provides an overview of the WebSocket-based real-time transcription feature in the SIPREC server.
The SIPREC server now supports streaming transcriptions over WebSockets. Clients receive transcription updates as the speech-to-text (STT) providers generate them.
- Real-time streaming: Both interim and final transcriptions are streamed as they become available
- Call-specific subscriptions: Clients can subscribe to transcriptions for specific calls by UUID
- Metadata enrichment: Transcriptions include metadata like confidence scores, provider info, and word counts
- Simple client interface: An HTML/JavaScript client is provided for easy testing and integration
- Publish-subscribe architecture: Modular design with a transcription service and WebSocket hub
The real-time transcription system consists of the following components:
- TranscriptionService: Central service that manages transcription events and notifies listeners
- TranscriptionListener: Interface for components that want to receive transcription updates
- WebSocketTranscriptionBridge: Bridge between the transcription service and WebSocket hub
- TranscriptionHub: Manages WebSocket connections and broadcasts messages to clients
- WebSocketHandler: HTTP handler for WebSocket connections
The flow is as follows:
- STT providers generate transcriptions as they process audio
- Providers publish transcriptions to the TranscriptionService
- The TranscriptionService notifies all registered listeners, including the WebSocketTranscriptionBridge
- The WebSocketTranscriptionBridge forwards messages to the TranscriptionHub
- The TranscriptionHub broadcasts messages to connected WebSocket clients
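The flow above can be sketched as a small publish-subscribe loop. This is only an illustration, not the server's actual code: the `OnTranscription` and `AddListener` names, the parameter types, and the `recordingListener` stand-in for the bridge are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// TranscriptionListener receives transcription updates.
// The method name and parameter types are assumptions for this sketch.
type TranscriptionListener interface {
	OnTranscription(callUUID, text string, isFinal bool, metadata map[string]interface{})
}

// TranscriptionService fans each published transcription out to its listeners.
type TranscriptionService struct {
	mu        sync.RWMutex
	listeners []TranscriptionListener
}

// AddListener registers a listener (hypothetical method name).
func (s *TranscriptionService) AddListener(l TranscriptionListener) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.listeners = append(s.listeners, l)
}

// PublishTranscription notifies every registered listener in registration order.
func (s *TranscriptionService) PublishTranscription(callUUID, text string, isFinal bool, metadata map[string]interface{}) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	for _, l := range s.listeners {
		l.OnTranscription(callUUID, text, isFinal, metadata)
	}
}

// recordingListener stands in for the WebSocketTranscriptionBridge: instead of
// forwarding to a hub, it records what it would have forwarded.
type recordingListener struct {
	forwarded []string
}

func (r *recordingListener) OnTranscription(callUUID, text string, isFinal bool, _ map[string]interface{}) {
	r.forwarded = append(r.forwarded, fmt.Sprintf("%s|%s|%t", callUUID, text, isFinal))
}

func main() {
	svc := &TranscriptionService{}
	bridge := &recordingListener{}
	svc.AddListener(bridge)

	// A provider publishes an interim result, then the final one.
	svc.PublishTranscription("call-1", "hello", false, nil)
	svc.PublishTranscription("call-1", "hello world", true, nil)

	fmt.Println(bridge.forwarded)
}
```

Keeping the service unaware of WebSockets means other listeners (logging, storage) could be registered the same way.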
ws://<server-host>:<server-port>/ws/transcriptions?call_uuid=<optional-call-uuid>
- call_uuid: Optional parameter to subscribe to transcriptions for a specific call only
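As a rough illustration, a Go client might assemble the endpoint URL as follows. The helper name `buildTranscriptionURL` and the `localhost:9090` address are assumptions for this example; the sketch also assumes that omitting `call_uuid` subscribes to all calls.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildTranscriptionURL returns the WebSocket endpoint for the given
// host ("host:port"). An empty callUUID omits the call_uuid parameter.
// The helper itself is hypothetical and not part of the server.
func buildTranscriptionURL(host, callUUID string) string {
	u := url.URL{Scheme: "ws", Host: host, Path: "/ws/transcriptions"}
	if callUUID != "" {
		q := url.Values{}
		q.Set("call_uuid", callUUID)
		u.RawQuery = q.Encode()
	}
	return u.String()
}

func main() {
	// Subscribe to every call.
	fmt.Println(buildTranscriptionURL("localhost:9090", ""))
	// Subscribe to a single call.
	fmt.Println(buildTranscriptionURL("localhost:9090", "a1b2c3d4-e5f6-7890-abcd-ef1234567890"))
}
```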
{
  "call_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "transcription": "This is the transcription text",
  "is_final": true,
  "timestamp": "2023-06-01T12:34:56.789Z",
  "metadata": {
    "provider": "google",
    "confidence": 0.95,
    "word_count": 5
  }
}
- call_uuid: Unique identifier for the call
- transcription: The transcribed text
- is_final: Whether this is a final (true) or interim (false) transcription
- timestamp: ISO 8601 timestamp indicating when the transcription was generated
- metadata: Additional information about the transcription
  - provider: The name of the STT provider that generated the transcription
  - confidence: Confidence score (0-1) for the transcription (final transcriptions only)
  - word_count: Number of words in the transcription (final transcriptions only)
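For Go consumers, the message format above could be decoded with structs along these lines. The type and function names are hypothetical; `confidence` and `word_count` are modeled as pointers on the assumption that they are simply absent from interim messages.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// TranscriptionMetadata mirrors the "metadata" object. Confidence and
// WordCount are pointers so that a missing field (interim results) is
// distinguishable from a zero value.
type TranscriptionMetadata struct {
	Provider   string   `json:"provider"`
	Confidence *float64 `json:"confidence,omitempty"`
	WordCount  *int     `json:"word_count,omitempty"`
}

// TranscriptionMessage mirrors one message sent over the WebSocket.
type TranscriptionMessage struct {
	CallUUID      string                `json:"call_uuid"`
	Transcription string                `json:"transcription"`
	IsFinal       bool                  `json:"is_final"`
	Timestamp     time.Time             `json:"timestamp"`
	Metadata      TranscriptionMetadata `json:"metadata"`
}

// parseMessage decodes a raw JSON payload into a TranscriptionMessage.
func parseMessage(raw []byte) (TranscriptionMessage, error) {
	var m TranscriptionMessage
	err := json.Unmarshal(raw, &m)
	return m, err
}

func main() {
	raw := []byte(`{
		"call_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
		"transcription": "This is the transcription text",
		"is_final": true,
		"timestamp": "2023-06-01T12:34:56.789Z",
		"metadata": {"provider": "google", "confidence": 0.95, "word_count": 5}
	}`)
	m, err := parseMessage(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(m.CallUUID, m.IsFinal, m.Metadata.Provider, m.Timestamp.Year())
}
```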
A simple HTML client is provided for testing the WebSocket functionality:
http://<server-host>:<server-port>/websocket-client
This client allows you to:
- Connect to the WebSocket endpoint
- Subscribe to transcriptions for a specific call or all calls
- See both interim and final transcriptions in real time
- View metadata for each transcription
For testing purposes, a mock STT provider is included that generates random transcriptions. To use it:
- Start the SIPREC server
- Run the WebSocket test script:
go run test_websocket.go
- Open the WebSocket client in your browser
- Use one of the call UUIDs displayed by the test script to subscribe to a specific call
To make a custom STT provider work with the real-time transcription system:
- Add a transcription service field to your provider struct:
type MyProvider struct {
	// ... existing fields
	transcriptionSvc *TranscriptionService
}
- Implement a method to set the transcription service:
func (p *MyProvider) SetTranscriptionService(svc *TranscriptionService) {
	p.transcriptionSvc = svc
}
- Publish transcriptions as they become available:
// For interim results
p.transcriptionSvc.PublishTranscription(callUUID, interim, false, metadata)
// For final results
p.transcriptionSvc.PublishTranscription(callUUID, transcription, true, metadata)
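The `metadata` argument in the calls above might be assembled as in the following sketch. The `buildMetadata` helper and the `map[string]interface{}` type are assumptions for illustration, since the actual metadata type is not shown here. The keys follow the message format described earlier, with `confidence` and `word_count` set only for final results.

```go
package main

import (
	"fmt"
	"strings"
)

// buildMetadata is a hypothetical helper that prepares the metadata for a
// transcription. The provider name is always included; confidence and
// word_count are added for final results only.
func buildMetadata(provider, text string, isFinal bool, confidence float64) map[string]interface{} {
	md := map[string]interface{}{"provider": provider}
	if isFinal {
		md["confidence"] = confidence
		// Count whitespace-separated words in the transcription.
		md["word_count"] = len(strings.Fields(text))
	}
	return md
}

func main() {
	interim := buildMetadata("google", "This is the", false, 0)
	final := buildMetadata("google", "This is the transcription text", true, 0.95)
	fmt.Println(interim)
	fmt.Println(final)
}
```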
To create a custom client that consumes the WebSocket stream:
- Establish a WebSocket connection to the endpoint
- Handle incoming JSON messages
- Parse and process the transcription data as needed
Example JavaScript code:
const socket = new WebSocket('ws://localhost:9090/ws/transcriptions');
socket.addEventListener('message', function(event) {
  const data = JSON.parse(event.data);
  console.log('Transcription:', data.transcription);
  console.log('Final:', data.is_final);
  console.log('Metadata:', data.metadata);
});
- The WebSocket hub sends on its channels without blocking, so broadcasting does not stall the main application
- Separate goroutines are used for writing to each client to prevent slow clients from affecting others
- The WebSocket hub is safe for concurrent use; its shared state is guarded by mutexes
- Regular ping messages maintain connection health
- Error handling with proper cleanup ensures resources are released when connections close
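The non-blocking send pattern mentioned above is commonly written in Go with `select` and a `default` case. The sketch below is an assumption about how such a hub could behave, not the server's actual implementation; the `client` type, its `send` channel, and the choice to drop messages for a full buffer are all hypothetical.

```go
package main

import "fmt"

// client is a hypothetical connected WebSocket client. In a real hub, a
// dedicated writer goroutine would drain send and write to the socket.
type client struct {
	send chan []byte
}

// broadcast queues msg for every client without blocking. If a client's
// buffer is full, the message is dropped for that client and counted.
func broadcast(clients []*client, msg []byte) (dropped int) {
	for _, c := range clients {
		select {
		case c.send <- msg:
			// Queued for this client's writer goroutine.
		default:
			// Buffer full: skip rather than block the other clients.
			dropped++
		}
	}
	return dropped
}

func main() {
	fast := &client{send: make(chan []byte, 2)}
	slow := &client{send: make(chan []byte, 1)}
	slow.send <- []byte("backlog") // the slow client's buffer is already full

	dropped := broadcast([]*client{fast, slow}, []byte("hello"))
	fmt.Println("dropped:", dropped, "queued for fast client:", len(fast.send))
}
```

Dropping is only one possible policy. A hub could instead disconnect a client whose buffer stays full.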
Potential future enhancements to the real-time transcription system:
- Authentication: Add token-based authentication for WebSocket connections
- Compression: Support WebSocket compression for reduced bandwidth
- Metrics: Add instrumentation for monitoring WebSocket connections and message throughput
- Filtering: Allow clients to filter transcriptions by additional criteria (e.g., confidence level)
- Batching: Optimize performance with client-side message batching for high-volume scenarios