A professional-grade full-stack application for recording audio in the browser and transcribing it locally using NVIDIA's Parakeet ASR model with advanced speaker diarization.
- 🎙️ Live Audio Recording: High-quality audio capture directly from your browser
- 🤖 Advanced ASR: Dual-engine support (NVIDIA Parakeet + OpenAI Whisper)
- 🔊 Speaker Diarization: Automatic identification of speakers using pyannote.audio (~90% accuracy)
- 📨 Live Transcription: Real-time message-based transcript display with smart pause detection
- 💾 Transcript History: Browse, search, and manage all saved transcripts
- 📥 Export Options: Download transcripts as text files or view as JSON
- 📋 Clipboard Support: Copy transcripts with one click
- 🌐 Multi-user: Support for 3+ concurrent transcription users
- 🚀 100% Local Processing: All transcription happens locally, no cloud APIs
- 🎯 Intelligent VAD: Voice Activity Detection with natural pause recognition
```
.
├── frontend/                 # Next.js frontend application
│   ├── app/
│   │   ├── components/
│   │   │   ├── AudioRecorder.tsx
│   │   │   └── TranscriptHistory.tsx
│   │   ├── layout.tsx
│   │   ├── page.tsx
│   │   └── globals.css
│   ├── package.json
│   ├── tsconfig.json
│   └── tailwind.config.js
└── backend/                  # Python FastAPI backend
    ├── main.py
    ├── database.py
    ├── schemas.py
    ├── requirements.txt
    └── .env.example
```
- Node.js 18+ (for frontend)
- Python 3.9+ (for backend)
- GPU (recommended for faster transcription, NVIDIA CUDA preferred)
```bash
cd backend
python -m venv venv

# On macOS/Linux
source venv/bin/activate

# On Windows
venv\Scripts\activate

pip install -r requirements.txt
```

Note: The first time you run the application, Parakeet will download the pre-trained model (~1.5GB), which may take a few minutes.

```bash
cp .env.example .env
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

The backend will be available at http://localhost:8000.
```bash
cd frontend
npm install
npm run dev
```

The frontend will be available at http://localhost:3000.
- Open http://localhost:3000 in your browser
- Click "Start Recording" to begin recording audio
- Speak into your microphone
- Click "Stop Recording" when done
- Wait for the transcription to complete
- View your transcript, copy it, or download it as a text file
- Your transcripts are saved in the "Transcript History" section
Upload an audio file for transcription
- Parameters:
  - `file`: Audio file (WAV, MP3, etc.)
  - `title`: Optional title for the transcript
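A sketch of a client call to the upload endpoint, using only the Python standard library. The route (`/transcribe`) and the multipart field names (`file`, `title`) are assumptions inferred from the parameter list above; check `backend/main.py` for the actual names.

```python
# Hypothetical client for the upload endpoint; route and field
# names are assumptions -- verify them against backend/main.py.
import json
import urllib.request
import uuid
from typing import Optional


def encode_multipart(filename: str, audio: bytes, title: Optional[str] = None):
    """Build a multipart/form-data body with a 'file' part and an
    optional 'title' part. Returns (content_type, body)."""
    boundary = uuid.uuid4().hex
    parts = []
    if title is not None:
        parts.append(
            (f"--{boundary}\r\n"
             'Content-Disposition: form-data; name="title"\r\n\r\n'
             f"{title}\r\n").encode()
        )
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: audio/wav\r\n\r\n").encode() + audio + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return f"multipart/form-data; boundary={boundary}", b"".join(parts)


def upload_for_transcription(path: str, title: Optional[str] = None,
                             base_url: str = "http://localhost:8000"):
    """POST an audio file and return the parsed JSON response."""
    with open(path, "rb") as f:
        content_type, body = encode_multipart(path, f.read(), title)
    req = urllib.request.Request(f"{base_url}/transcribe", data=body,
                                 headers={"Content-Type": content_type})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```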
Get all saved transcripts
Get a specific transcript by ID
Delete a transcript
WebSocket endpoint for streaming transcription (future enhancement)
- Make sure your browser has microphone permissions
- Check your browser's privacy settings
- Try using HTTPS (required for microphone access on non-localhost)
- Make sure you have internet connection for the first run (to download the model)
- Check that your GPU has enough memory (recommended: 4GB+)
- For CPU-only, transcription will be slower
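A quick way to confirm whether transcription will run on the GPU is to check CUDA visibility from Python. This assumes the backend's ASR stack uses PyTorch (which NVIDIA's NeMo/Parakeet models depend on):

```python
# Check whether a CUDA-capable GPU is visible to PyTorch.
# Falls back to False if torch is not installed.
def cuda_available() -> bool:
    try:
        import torch  # installed via the backend's requirements.txt
        return torch.cuda.is_available()
    except ImportError:
        return False


if __name__ == "__main__":
    print("CUDA available:", cuda_available())
```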
- Frontend: change the port with `npm run dev -- -p 3001`
- Backend: change the port with `uvicorn main:app --port 8001`
- Update the CORS origins in `backend/main.py` accordingly
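After moving ports, the backend's CORS allow-list must include the new frontend origin, or browser requests will be blocked. A sketch of what that typically looks like with FastAPI's `CORSMiddleware` (variable names here are illustrative; match them to the actual code in `backend/main.py`):

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# List every origin the frontend may be served from.
origins = [
    "http://localhost:3000",
    "http://localhost:3001",  # add this if you moved the frontend port
]

app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```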
- GPU: With NVIDIA GPU, transcription typically takes 2-5 seconds per minute of audio
- CPU: Without GPU, transcription may take 30+ seconds per minute of audio
- The Parakeet model requires about 1.5GB of disk space and ~2GB of RAM
```bash
./start.sh
```

Backend:

```bash
cd backend
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --reload
```

Frontend:

```bash
cd frontend
npm install
npm run dev
```

Then open http://localhost:3000 in your browser.
- CLAUDE.md - Detailed architecture, command reference, and implementation notes
- CONTRIBUTING.md - How to contribute to this project
- GETTING_STARTED.md - Detailed setup instructions
- API.md - API endpoint documentation
The system uses pyannote.audio to identify and track speakers with ~90% accuracy:
- Analyzes complete audio after recording
- Automatically detects number of speakers
- Labels segments with speaker attribution
- Graceful fallback if pyannote unavailable
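The graceful-fallback behaviour can be sketched as a guarded import: when pyannote.audio is missing (or the model fails to load), transcription proceeds without speaker labels. The function names and return shape below are illustrative, not the project's actual API:

```python
from typing import Optional

# Guarded import: diarization degrades to "no speaker labels"
# when pyannote.audio is not installed.
try:
    from pyannote.audio import Pipeline
except ImportError:
    Pipeline = None


def load_pipeline(hf_token: Optional[str] = None):
    """Return a diarization pipeline, or None if unavailable."""
    if Pipeline is None:
        return None
    try:
        return Pipeline.from_pretrained(
            "pyannote/speaker-diarization", use_auth_token=hf_token
        )
    except Exception:
        return None  # model download/auth failed -> fall back


def diarize(audio_path: str, pipeline) -> list:
    """Return [{'start', 'end', 'speaker'}, ...]; empty on fallback."""
    if pipeline is None:
        return []
    annotation = pipeline(audio_path)
    return [
        {"start": turn.start, "end": turn.end, "speaker": speaker}
        for turn, _, speaker in annotation.itertracks(yield_label=True)
    ]
```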
Smart pause detection breaks natural speech into manageable segments:
- 0.3-0.5s pause triggers transcription
- 1.5+ seconds confirms end of phrase
- 15+ second buffer forces send to prevent huge blocks
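Those thresholds can be sketched as a small state machine over VAD-labelled audio chunks. The constants mirror the numbers above; the class itself is illustrative, not the project's actual implementation:

```python
from typing import Optional

# Thresholds from the description above.
PAUSE_TRIGGER_S = 0.4   # 0.3-0.5 s of silence -> transcribe current buffer
PHRASE_END_S = 1.5      # 1.5+ s of silence -> confirmed end of phrase
MAX_BUFFER_S = 15.0     # force a flush so segments never grow unbounded


class PauseSegmenter:
    """Decide when a buffered stretch of speech should be sent for
    transcription, based on trailing silence and buffer length."""

    def __init__(self):
        self.buffered_s = 0.0   # seconds of audio accumulated
        self.silence_s = 0.0    # trailing silence in seconds

    def feed(self, chunk_s: float, is_speech: bool) -> Optional[str]:
        """Feed one VAD-labelled chunk; return a flush reason or None."""
        self.buffered_s += chunk_s
        self.silence_s = 0.0 if is_speech else self.silence_s + chunk_s

        if self.buffered_s >= MAX_BUFFER_S:
            return self._flush("max_buffer")
        if self.silence_s >= PHRASE_END_S:
            return self._flush("phrase_end")
        if self.silence_s >= PAUSE_TRIGGER_S and self.buffered_s > self.silence_s:
            return self._flush("pause")
        return None

    def _flush(self, reason: str) -> str:
        self.buffered_s = 0.0
        self.silence_s = 0.0
        return reason
```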
- Up to 3 concurrent transcriptions
- Configurable worker threads
- WebSocket support for real-time updates
- PostgreSQL or SQLite database backend
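The concurrency cap can be sketched with a semaphore around the transcription call. This is a minimal illustration of the pattern, not the backend's actual code:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 3  # matches "up to 3 concurrent transcriptions" above
_slots = threading.Semaphore(MAX_CONCURRENT)


def transcribe_limited(audio_path: str, transcribe_fn):
    """Run transcribe_fn under a global cap of MAX_CONCURRENT jobs;
    extra requests block until a slot frees up."""
    with _slots:
        return transcribe_fn(audio_path)


# Usage: submit more jobs than slots -- only three run at once,
# the rest queue on the semaphore.
def _fake_transcribe(path: str) -> str:
    return f"transcript of {path}"


with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(transcribe_limited, f"clip{i}.wav", _fake_transcribe)
               for i in range(6)]
    results = [f.result() for f in futures]
```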
| Scenario | Time | Hardware |
|---|---|---|
| 1 min audio (GPU) | 5-10s | NVIDIA GPU + 8GB RAM |
| 1 min audio (CPU) | 5-10 min | CPU-only |
| Multi-user (3 concurrent) | Sequential | ~3x slower per user |
For better speaker diarization model access:
```bash
# Create token at https://huggingface.co/settings/tokens
# Add to .env:
HUGGINGFACE_TOKEN=hf_your_token_here
```

Database configuration (in `.env`):

```bash
# Default: SQLite (auto-created)
# For PostgreSQL:
DATABASE_URL=postgresql://user:pass@localhost/transcriber_db
```

Planned enhancements:

- Support for multiple languages
- Confidence scores for transcribed text
- Batch transcription
- Audio file upload without recording
- Real-time speaker identification display
- Custom speaker name assignment
- Transcript search and filtering
- Timestamp-based playback
- Speaker demographics (experimental)
This project uses open-source components:
- Frontend: Next.js (MIT)
- Backend: FastAPI (MIT)
- ASR Model: NVIDIA Parakeet (Apache 2.0)
For issues or questions:
- Check the troubleshooting section
- Verify all prerequisites are installed
- Check that both backend and frontend servers are running
- View browser console and backend logs for error messages