A browser-based AI transcription playground powered by Whisper and Transformers.js. No installation, registration, or payment required.
This project is a client-side transcription web app built with React, TypeScript, and Vite.
It runs Whisper directly in the browser through @huggingface/transformers, so media files are processed locally instead of being uploaded to a backend for transcription.
The current implementation supports selecting a Whisper model in the UI, choosing a local media file, loading the selected model on demand, and displaying the recognized text in a read-only transcript area.
- Client-side speech-to-text: The React app calls the `automatic-speech-recognition` pipeline from `@huggingface/transformers` directly in the browser, so transcription runs entirely on the client.
- Simple 3-step workflow: The UI guides you through the following steps, with clear status messages for each one:
  - Loading the Whisper model.
  - Checking model status.
  - Uploading audio and running transcription.
- In-browser transcription with `@huggingface/transformers`
- Multilingual Whisper model selection in the UI
- Supported built-in model options:
  - `Xenova/whisper-tiny`
  - `Xenova/whisper-base`
  - `Xenova/whisper-small`
- Client-side audio decoding to 16 kHz via `AudioContext`
- Stereo-to-mono mixing before inference
- Chunked transcription settings for longer media:
  - `chunk_length_s: 20`
  - `stride_length_s: 5`
- File input accepts:
  - `audio/*`
  - `video/mp4`
  - `video/webm`
  - `video/ogg`
  - `.mp4`
  - `.webm`
  - `.ogv`
  - `.m4v`
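
Two of the features above, the mono mix-down and the chunk settings, can be illustrated with a small sketch. The helper names `mixToMono` and `chunkWindows` are hypothetical and not taken from the project. Plain channel averaging is only one common way to mix down, and the window arithmetic assumes a sliding-window scheme in which consecutive chunks start `chunk_length_s - 2 * stride_length_s` seconds apart; the library's actual chunking may differ in detail.

```typescript
// Average all channels into one mono channel (hypothetical helper).
// Each entry of `channels` is what AudioBuffer.getChannelData(i) would return;
// all channels are assumed to have the same length.
function mixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  if (channels.length === 1) return channels[0];
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) mono[i] += channel[i];
  }
  for (let i = 0; i < length; i++) mono[i] /= channels.length;
  return mono;
}

// List the [start, end] windows (in seconds) that cover `totalS` seconds of audio.
// Assumption: windows are `chunkS` long and start (chunkS - 2 * strideS) apart,
// so neighbouring windows overlap and the overlap can be reconciled afterwards.
function chunkWindows(totalS: number, chunkS = 20, strideS = 5): Array<[number, number]> {
  const step = chunkS - 2 * strideS;
  if (step <= 0) throw new Error("chunk length must exceed twice the stride");
  const windows: Array<[number, number]> = [];
  for (let start = 0; ; start += step) {
    windows.push([start, Math.min(start + chunkS, totalS)]);
    if (start + chunkS >= totalS) break;
  }
  return windows;
}
```

Under these assumptions, `chunkWindows(45)` yields `[[0, 20], [10, 30], [20, 40], [30, 45]]`: a 45-second clip is covered by four overlapping windows with the default 20 s / 5 s settings.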
- Frontend: React + TypeScript + Vite
- ML runtime: `@huggingface/transformers`
- Inference task: `automatic-speech-recognition`
- Browser audio handling: Web Audio API (`AudioContext`)
- Testing: Jest + Testing Library
- Container tooling: Docker + Docker Compose
App.tsx renders the app shell, title, subtitle, SettingsBar, and HomeScreen.
The settings bar currently displays the runtime summary: Transformers.js + Whisper
HomeScreen.tsx provides a 3-step UI:
- Choose a model and media file
- Check model status
- Read the transcription result
The screen includes:
- A Whisper model dropdown
- A hidden file input triggered by a button
- Status text and spinner while processing
- A transcript textarea
- A Clear button
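
The hidden file input relies on an `accept` list to filter what the file picker offers. As a rough sketch of how such a list behaves, the hypothetical helper below checks a file against the accept tokens listed earlier in this README. `matchesAccept` and `ACCEPT` are illustrative names, not part of the project, and the matching rules are a simplification of what browsers do.

```typescript
// Accept tokens as listed in this README.
const ACCEPT: string[] = [
  "audio/*", "video/mp4", "video/webm", "video/ogg",
  ".mp4", ".webm", ".ogv", ".m4v",
];

// Hypothetical helper: does a file match at least one accept token?
// Only `name` and `type` are needed, so a browser File object fits this shape.
function matchesAccept(file: { name: string; type: string }, accept: string[] = ACCEPT): boolean {
  const name = file.name.toLowerCase();
  const type = file.type.toLowerCase();
  return accept.some((token) => {
    if (token.startsWith(".")) return name.endsWith(token); // extension token
    if (token.endsWith("/*")) return type.startsWith(token.slice(0, -1)); // wildcard MIME
    return type === token; // exact MIME type
  });
}
```

`ACCEPT.join(",")` gives the string that would go into the input's `accept` attribute. Passing this filter only means the file can be picked; whether it can be decoded is a separate question.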
useTranscription.ts is the core implementation.
It exposes:
- `status`
- `error`
- `transcript`
- `availableModels`
- `selectedModelId`
- `setSelectedModelId(modelId)`
- `transcribeFile(file)`
- `reset()`
Behavior:
- The selected Whisper model is loaded lazily on first use
- The pipeline instance is cached and reused if the same model remains selected
- Browser-friendly ONNX WASM settings are applied before model loading
- The selected file is read as an `ArrayBuffer`
- Audio is decoded with `AudioContext({ sampleRate: 16000 })`
- Multi-channel audio is mixed down to mono
- Whisper runs with automatic language detection because `language` is intentionally left unset
- The recognized text is written to the transcript state
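
The lazy-load and reuse behavior described above can be sketched as a small cache. This is an illustration under assumptions, not the hook's actual code: `createModelCache` is a hypothetical name, and the loader is injected so the sketch runs without downloading a model. In the app, the loader's role would be played by something like `pipeline("automatic-speech-recognition", modelId)` from `@huggingface/transformers`.

```typescript
// A loader turns a model id into a ready-to-use transcriber.
type Loader<T> = (modelId: string) => Promise<T>;

// Hypothetical helper: load on first use, reuse while the same model is selected.
function createModelCache<T>(load: Loader<T>): (modelId: string) => Promise<T> {
  let current: { id: string; promise: Promise<T> } | null = null;

  return function get(modelId: string): Promise<T> {
    // Reuse the cached instance if the selection has not changed.
    if (current !== null && current.id === modelId) return current.promise;

    // Otherwise start a new load and remember it.
    const promise: Promise<T> = load(modelId).catch((err) => {
      // Forget a failed load so the next call can retry.
      if (current !== null && current.promise === promise) current = null;
      throw err;
    });
    current = { id: modelId, promise };
    return promise;
  };
}
```

Caching the promise rather than the resolved value means two calls made while the first load is still in flight share one download.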
The current UI reports user-facing states such as:
- idle: choose a model and a file
- loading: first model load may be slow
- ready: model loaded and ready
- transcribing: local browser transcription is running
- done: transcription finished
- error: failure message shown below the status block
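
One way to model these states is a string union with a message table. The sketch below is an assumption about shape only: the type name `TranscriptionStatus`, the helper `isBusy`, and the message wording are illustrative and paraphrase the list above rather than quote the app.

```typescript
// The six user-facing states listed above.
type TranscriptionStatus =
  | "idle"
  | "loading"
  | "ready"
  | "transcribing"
  | "done"
  | "error";

// Paraphrased messages; the app's exact wording may differ.
const STATUS_MESSAGES: Record<TranscriptionStatus, string> = {
  idle: "Choose a model and a file.",
  loading: "Loading the model. The first load may be slow.",
  ready: "Model loaded and ready.",
  transcribing: "Transcribing locally in the browser.",
  done: "Transcription finished.",
  error: "Transcription failed.",
};

// Hypothetical helper: true while a spinner would make sense.
function isBusy(status: TranscriptionStatus): boolean {
  return status === "loading" || status === "transcribing";
}
```

Using `Record<TranscriptionStatus, string>` makes the compiler flag any state that is added to the union without a matching message.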
The UI text says users can select audio or video files and that Whisper can detect speech from supported media such as MP3 or MP4 in the browser.
However, the actual implementation decodes the selected file using AudioContext.decodeAudioData(). In practice, successful decoding depends on browser codec support. That means supported behavior is ultimately constrained by what the user’s browser can decode from the selected media file.
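
Because decoding can fail for formats the browser does not support, it helps to turn that failure into a readable message. The sketch below is hypothetical and not taken from the project: `decodeOrExplain` is an illustrative name, and the `decode` parameter stands in for a call such as `(data) => audioContext.decodeAudioData(data)`, injected here so the example does not require a browser.

```typescript
// Hypothetical helper: run a decoder and replace a low-level failure
// with a message that names the file and hints at the likely cause.
async function decodeOrExplain<T>(
  decode: (data: ArrayBuffer) => Promise<T>,
  data: ArrayBuffer,
  fileName: string,
): Promise<T> {
  try {
    return await decode(data);
  } catch {
    throw new Error(
      `This browser could not decode "${fileName}". ` +
        "Its codec may be unsupported; try another format or browser.",
    );
  }
}
```

The resulting message is the kind of text that could be shown in the `error` state below the status block.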
- Node.js 20+ recommended
- npm
cd frontend/app
npm ci
npm run dev -- --host 0.0.0.0 --port 5173

docker compose build
docker compose up

This starts the frontend container and serves the Vite app on port 5173.
cd frontend/app
npm ci
npm test -- --ci --runInBand --coverage --verbose

# Build the image
docker compose build
# Run the container
docker compose up
docker compose \
-f docker-compose.test.yml up \
--build --exit-code-from \
frontend_test

- Model loading happens in the browser and may take time on first use
- Larger models use more memory
- Transcription speed depends on the browser and device
- Media decoding support depends on browser codec support
- The current app has no backend transcription service; transcription is performed client-side
- Apache License 2.0
