europanite/client_side_audio_transcription

A browser-based AI transcription playground powered by Whisper and Transformers.js. No installation, registration, or payment required.


🚀 Overview

This project is a client-side transcription web app built with React, TypeScript, and Vite. It runs Whisper directly in the browser through @huggingface/transformers, so media files are processed locally instead of being uploaded to a backend for transcription.

The current implementation supports selecting a Whisper model in the UI, choosing a local media file, loading the selected model on demand, and displaying the recognized text in a read-only transcript area.

✨ Features

  • Client-side speech-to-text
    The React app calls the automatic-speech-recognition pipeline from @huggingface/transformers directly in the browser, so transcription runs entirely on the client.

  • Simple 3-step workflow
    The UI guides you through three steps, with a clear status message for each:

    1. Choosing a Whisper model and a media file.
    2. Checking model status.
    3. Reading the transcription result.

  • Multilingual Whisper model selection in the UI

  • Supported built-in model options:

    • Xenova/whisper-tiny
    • Xenova/whisper-base
    • Xenova/whisper-small
  • Client-side audio decoding to 16 kHz via AudioContext

  • Stereo-to-mono mixing before inference

  • Chunked transcription settings for longer media:

    • chunk_length_s: 20
    • stride_length_s: 5
  • File input accepts:

    • audio/*
    • video/mp4
    • video/webm
    • video/ogg
    • .mp4
    • .webm
    • .ogv
    • .m4v
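
The accept list above can be written as a single accept string plus a matching predicate. The following is a minimal sketch, not code taken from the repository: the names ACCEPT, FileLike, and isAcceptedFile are hypothetical. Browsers treat the accept attribute only as a filter hint in the file picker, so a manual check like this is an optional extra.

```typescript
// Hypothetical constant: mirrors the accept list from the feature list above.
const ACCEPT = "audio/*,video/mp4,video/webm,video/ogg,.mp4,.webm,.ogv,.m4v";

// Minimal structural type, so the check works with a browser File or a plain object.
interface FileLike {
  name: string;
  type: string; // MIME type reported by the browser; may be an empty string
}

// Returns true when the file matches at least one token of the accept string.
function isAcceptedFile(file: FileLike, accept: string = ACCEPT): boolean {
  const name = file.name.toLowerCase();
  const mime = file.type.toLowerCase();
  return accept.split(",").some((raw) => {
    const token = raw.trim().toLowerCase();
    if (token === "") return false; // ignore empty tokens
    if (token.startsWith(".")) return name.endsWith(token); // extension token
    if (token.endsWith("/*")) return mime.startsWith(token.slice(0, -1)); // wildcard such as "audio/"
    return mime === token; // exact MIME type
  });
}
```

Passing this check does not guarantee that the browser can decode the file; it only mirrors what the file picker filters on.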

🧱 Tech stack

  • Frontend: React + TypeScript + Vite
  • ML runtime: @huggingface/transformers
  • Inference task: automatic-speech-recognition
  • Browser audio handling: Web Audio API (AudioContext)
  • Testing: Jest + Testing Library
  • Container tooling: Docker + Docker Compose

How it works

1. App layout

App.tsx renders the app shell, title, subtitle, SettingsBar, and HomeScreen.

The settings bar currently displays the runtime summary:

  • Transformers.js + Whisper

2. Model and file selection

HomeScreen.tsx provides a 3-step UI:

  1. Choose a model and media file
  2. Check model status
  3. Read the transcription result

The screen includes:

  • A Whisper model dropdown
  • A hidden file input triggered by a button
  • Status text and spinner while processing
  • A transcript textarea
  • A Clear button

3. Transcription hook

useTranscription.ts is the core implementation.

It exposes:

  • status
  • error
  • transcript
  • availableModels
  • selectedModelId
  • setSelectedModelId(modelId)
  • transcribeFile(file)
  • reset()
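
The members listed above could be typed roughly as follows. This is a sketch under assumptions: the member names come from the list, but the member types, the status union (taken from the status messages described in this README), and the names UseTranscriptionResult, MediaFileLike, and isBusy are all hypothetical.

```typescript
// Status values as listed in this README's status section.
type TranscriptionStatus = "idle" | "loading" | "ready" | "transcribing" | "done" | "error";

// Structural stand-in for a browser File (assumption; keeps the sketch DOM-free).
interface MediaFileLike {
  name: string;
  arrayBuffer(): Promise<ArrayBuffer>;
}

// Hypothetical shape of the hook's return value; member types are assumed.
interface UseTranscriptionResult {
  status: TranscriptionStatus;
  error: string | null;
  transcript: string;
  availableModels: readonly string[];
  selectedModelId: string;
  setSelectedModelId(modelId: string): void;
  transcribeFile(file: MediaFileLike): Promise<void>;
  reset(): void;
}

// Example consumer: decide whether the UI should show a spinner.
function isBusy(state: Pick<UseTranscriptionResult, "status">): boolean {
  return state.status === "loading" || state.status === "transcribing";
}
```

A component could use a helper like isBusy to disable the file button while a model is loading or a transcription is running.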

Behavior:

  • The selected Whisper model is loaded lazily on first use
  • The pipeline instance is cached and reused if the same model remains selected
  • Browser-friendly ONNX WASM settings are applied before model loading
  • The selected file is read as an ArrayBuffer
  • Audio is decoded with AudioContext({ sampleRate: 16000 })
  • Multi-channel audio is mixed down to mono
  • Whisper runs with automatic language detection because language is intentionally left unset
  • The recognized text is written to the transcript state
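
Two of the behaviors above, mono mixdown and lazy per-model caching, can be sketched as small pure functions. This is an illustration under assumptions, not the repository's code: mixToMono averages the channels (the actual hook may weight them differently), and createPipelineCache takes an injected load function instead of calling the library directly. Both names are hypothetical.

```typescript
// Average all channels into one mono Float32Array.
// Assumes every channel has the same length, as channels of one decoded buffer do.
function mixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  if (channels.length === 1) return channels[0];
  const mono = new Float32Array(channels[0].length);
  for (const channel of channels) {
    for (let i = 0; i < mono.length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}

// Lazy cache keyed by model id: nothing is loaded until the first call,
// and repeated calls with the same id reuse the same promise.
function createPipelineCache<T>(load: (modelId: string) => Promise<T>) {
  let cachedId: string | null = null;
  let cached: Promise<T> | null = null;
  return (modelId: string): Promise<T> => {
    if (cached !== null && cachedId === modelId) return cached;
    const pending = load(modelId);
    cachedId = modelId;
    cached = pending;
    // Drop a failed load so that a later call can retry.
    pending.catch(() => {
      if (cached === pending) {
        cached = null;
        cachedId = null;
      }
    });
    return pending;
  };
}
```

In the app, load would presumably wrap pipeline("automatic-speech-recognition", modelId) from @huggingface/transformers, and the mono samples would be passed to the resulting transcriber together with the chunk settings listed under Features (chunk_length_s: 20, stride_length_s: 5).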

4. Status messages

The current UI reports user-facing states such as:

  • idle: choose a model and a file
  • loading: first model load may be slow
  • ready: model loaded and ready
  • transcribing: local browser transcription is running
  • done: transcription finished
  • error: failure message shown below the status block
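
One way to keep such states and their messages in sync is a typed lookup table. The sketch below assumes the six status names listed above; the message wording is paraphrased from that list, and STATUS_MESSAGES and statusMessage are hypothetical names rather than identifiers from the repository.

```typescript
type TranscriptionStatus = "idle" | "loading" | "ready" | "transcribing" | "done" | "error";

// Record<...> makes the compiler reject the table if any status is missing.
// Wording is illustrative; the app's actual strings may differ.
const STATUS_MESSAGES: Record<TranscriptionStatus, string> = {
  idle: "Choose a model and a file.",
  loading: "Loading the model. The first load may be slow.",
  ready: "Model loaded and ready.",
  transcribing: "Transcribing locally in the browser.",
  done: "Transcription finished.",
  error: "Something went wrong.",
};

// Returns the user-facing message for a status.
// The detailed failure text is kept separate, since the UI shows it below the status block.
function statusMessage(status: TranscriptionStatus): string {
  return STATUS_MESSAGES[status];
}
```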

Supported media notes

The UI text says users can select audio or video files and that Whisper can detect speech from supported media such as MP3 or MP4 in the browser.

However, the implementation decodes the selected file with AudioContext.decodeAudioData(), so successful decoding depends on the browser's codec support. A file that the file input accepts can still fail to decode if the browser has no decoder for its audio track.
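
Because decoding can fail for reasons outside the app's control, it may help to turn a decode failure into a readable message. The following is a minimal sketch, assuming a decoder object with a promise-based decodeAudioData method; AudioDecoderLike, DecodeResult, and tryDecode are hypothetical names, and the structural interface is used so the example does not depend on browser globals.

```typescript
// Structural stand-in for AudioContext: anything with a promise-based decodeAudioData.
interface AudioDecoderLike<B> {
  decodeAudioData(data: ArrayBuffer): Promise<B>;
}

// Discriminated union: either the decoded buffer or a user-facing message.
type DecodeResult<B> =
  | { ok: true; buffer: B }
  | { ok: false; message: string };

// Attempts to decode and converts a rejection into an error result instead of throwing.
async function tryDecode<B>(
  decoder: AudioDecoderLike<B>,
  data: ArrayBuffer,
): Promise<DecodeResult<B>> {
  try {
    return { ok: true, buffer: await decoder.decodeAudioData(data) };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, message: `This browser could not decode the file: ${reason}` };
  }
}
```

In a browser, an AudioContext created with { sampleRate: 16000 } should satisfy AudioDecoderLike, so it could be passed in together with the ArrayBuffer read from the selected file.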


🚀 Getting Started

Local development

Prerequisites

  • Node.js 20+ recommended
  • npm

Run locally with npm

cd frontend/app
npm ci
npm run dev -- --host 0.0.0.0 --port 5173

Run locally with Docker Compose

docker compose build
docker compose up

This starts the frontend container and serves the Vite app on port 5173.

Testing

Run tests locally

cd frontend/app
npm ci
npm test -- --ci --runInBand --coverage --verbose

Run tests with Docker Compose

docker compose -f docker-compose.test.yml up \
  --build --exit-code-from frontend_test

Notes and limitations

  • Model loading happens in the browser and may take time on first use
  • Larger models use more memory
  • Transcription speed depends on the browser and device
  • Media decoding support depends on browser codec support
  • The current app has no backend transcription service; transcription is performed client-side

License

  • Apache License 2.0