A real-time audio transcription proof-of-concept using Faster-Whisper and WebSockets
Conventional transcription systems send long recordings (sometimes 90 s or more) for processing and make users wait tens of seconds for results.
This PoC prototype demonstrates how to stream audio continuously to a backend and receive incremental transcriptions in near real-time.
It was created as a small project for a German AI startup exploring the move from slow batch uploads to a fully streamed pipeline.
- Streams raw PCM
float32audio chunks from client → FastAPI backend through WebSockets - Backend transcribes chunks on the fly with Faster-Whisper
- Sends partial transcription text back to the client after each chunk
- Gives the user the feeling of live captioning instead of waiting for large batch results
| Layer | Technology |
|---|---|
| Backend Framework | FastAPI + WebSockets |
| Speech-to-Text Model | Faster-Whisper (base) |
| Language | Python 3.10+ |
| Client | Python WebSockets sender (replaceable with microphone input) |
| Audio Format | Mono 16 kHz float32 PCM |
git clone https://github.com/rohan9024/StreamTranscriber.git
cd StreamTranscriberpython -m venv venv
# Activate it
# Windows:
venv\Scripts\activate
# macOS / Linux:
source venv/bin/activatepip install -r requirements.txtuvicorn server:app --host 127.0.0.1 --port 8001 --reloadpython client.pyNote:
- Your
sample_16k.wavfile should be mono and 16 kHz. - If your file is stereo, the client automatically converts it to mono.
- Client loads a
.wavfile, breaks it into ~2 s chunks, and sends each chunk via WebSocket. - Server receives chunk → converts from bytes to numpy array → transcribes with Faster-Whisper.
- Server immediately returns partial text to the client.
- The process continues until the entire audio stream is processed.
Console snippet while running:
📦 Sent 32000 samples
📝 Partial transcription: Hello everyone and welcome to the meeting...
📦 Sent 32000 samples
📝 Partial transcription: Today we'll discuss progress on the new release...
| Approach | Delay Before Text | User Experience |
|---|---|---|
| 90 s file upload | ≈ 45 s | Feels batch-processed |
| Streamed PCM chunks | 1 – 2 s | Feels live and conversational |
By streaming smaller chunks, the user starts seeing text updates almost instantly instead of waiting for the full recording to complete.
- Real-time microphone input (continuous streaming)
- Better buffering / overlap for smoother context
- Frontend dashboard that displays text live (React / JS)
- Integration with summarization or meeting-notes generators
This PoC came from a common startup bottleneck:
"Recording 90 s of audio and waiting another 45 s for a transcript isn't practical for live use."
StreamTranscriber proves that even lightweight infrastructure can deliver real-time AI transcription using open-source tools.
Create a file requirements.txt (if you don't have one already) with:
fastapi==0.109.0
uvicorn==0.25.0
numpy==1.26.0
soundfile==0.12.1
websockets==12.0
faster-whisper==0.9.0
- Faster-Whisper for efficient ASR inference
- FastAPI for its excellent async & WebSocket support
- Open-source community for continued innovation in AI infrastructure
Rohan R. Wandre
ML Engineer
Feel free to fork, modify, or use this project as a reference for building real-time transcription pipelines.