A Web-Based Audio Recorder & Live Transcription Tool
InstantScribe is a lightweight, single-page web application that records audio and transcribes speech in real time, directly in the browser. It leverages the MediaRecorder API for audio recording and the Web Speech API for live transcription, all within a single HTML file and a few hundred lines of JavaScript. The entire workflow runs locally except for the optional OpenAI-powered refinement features, making it a fast, privacy-preserving way to capture conversations, interviews, meetings, or personal notes.
- **Open the App:** Go to https://cs-util-com.github.io/InstantScribe (or download `index.html` and open it locally on your computer).
- **Start Speaking:** Allow microphone access, then start speaking. Watch your words appear in the transcription area in real time.
- **Customize Your File:** Enter a custom title (optional) to personalize the file names.
- **Save Your Recording:** Click the Save Recording button. This stops the recording, downloads the audio file, and saves the complete transcription as a text file.
The current application lives entirely in `index.html`. A developer reimplementing it should recreate the behaviors in this section to maintain feature parity.
- Serve a single HTML document that includes Tailwind CSS from the CDN, the `lamejs` MP3 encoder, and a module import of `openai@~4.86.2` via esm.sh.
- Embed all JavaScript inline. The script registers global helpers and event handlers once `DOMContentLoaded` fires.
- Rely exclusively on browser-native APIs: no build tooling, bundlers, or background services.
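For orientation, here is a plausible skeleton of that single document. The Tailwind URL comes from the dependency list below; the `lamejs` CDN path and the esm.sh specifier are assumptions, not the app's verified URLs:

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8" />
  <script src="https://cdn.tailwindcss.com"></script>
  <!-- assumed CDN path; any browser build of [email protected] works -->
  <script src="https://unpkg.com/[email protected]/lame.min.js"></script>
</head>
<body class="bg-gray-900 text-gray-100">
  <!-- UI elements described under the layout section go here -->
  <script type="module">
    import OpenAI from 'https://esm.sh/openai@~4.86.2';

    document.addEventListener('DOMContentLoaded', () => {
      // register global helpers and event handlers here
    });
  </script>
</body>
</html>
```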
- Render a centered container (`max-w-3xl`) with a dark theme. Core elements, in order:
  - Filename input (`#custom-filename`) and a primary button (`#stop`) that starts labeled “Stop Recording” and switches to “Save Recording” after 10 seconds of capture time.
  - Language selector (`#language-select`) with predefined locale tags.
  - Primary transcription area (`#transcription-container`) containing an editable div (`#transcription`).
  - Collapsible high-quality transcription block (`#high-quality-container`) with contenteditable output (`#high-quality-transcription`). Hidden until needed.
  - Action buttons: a green Whisper trigger (`#whisper-transcribe`, which doubles as a drag-and-drop target) and a blue summarize button (`#summarize`, hidden until transcription text exists).
  - API key accordion exposing a password input (`#api-key`) and a save button (`#save-api-key`).
  - Summary heading (`#summary-heading`) and container (`#summary-container`) with editable output (`#summary-content`), both initially hidden.
  - “Download AI Texts” button (`#download-ai-texts`) inside `#download-ai-texts-container`, hidden until AI-generated text exists.
- On load, pull any stored `openai_api_key` from `localStorage` and populate `#api-key`.
- Register listeners to toggle the API key accordion, persist new keys, and update button visibility when transcription-related content changes.
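A minimal sketch of that wiring, assuming the element IDs above and the `showNotification` helper described under notifications:

```js
// Load a previously saved key on startup.
let apiKey = localStorage.getItem('openai_api_key') || '';
document.getElementById('api-key').value = apiKey;

// Persist new keys when the user clicks save.
document.getElementById('save-api-key').addEventListener('click', () => {
  apiKey = document.getElementById('api-key').value.trim();
  localStorage.setItem('openai_api_key', apiKey);
  showNotification('API key saved');
});
```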
- Immediately display a dismissible toast (“Recording & Transcription started…”) to prompt user interaction; once the user clicks anywhere, enable
beforeunloadwarnings (exitWarningEnabled). - Verify availability of
MediaRecorderand Web Speech API; if absent, show an alert blocking usage. - Request microphone access via
navigator.mediaDevices.getUserMedia({ audio: true })and kick off the recording/transcription flow when granted.
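The startup sequence might look like the following sketch; `startCapture` stands in for whatever function begins the recording/transcription flow:

```js
// Feature detection: both APIs must exist before anything else runs.
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
let exitWarningEnabled = false;

if (!window.MediaRecorder || !SpeechRecognition) {
  alert('This browser lacks MediaRecorder or Web Speech API support.');
} else {
  showNotification('Recording & Transcription started…');
  // First click anywhere arms the beforeunload warning.
  document.addEventListener('click', () => { exitWarningEnabled = true; }, { once: true });

  navigator.mediaDevices.getUserMedia({ audio: true })
    .then((stream) => startCapture(stream)) // hypothetical entry point
    .catch((err) => {
      console.error('Microphone unavailable:', err);
      alert('Microphone access is required for recording.');
    });
}
```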
- Establish global state flags: `isRecording`, `isInitialRecordingPeriod` (true for the first 10 seconds), and storage for audio chunks (`audioChunks`) plus MP3 encoder buffers (`mp3Data`).
- For browsers supporting MP3 in `MediaRecorder`, stream data into `audioChunks`; otherwise, instantiate `AudioContext` + `ScriptProcessorNode` and encode MP3 frames manually with `lamejs`.
- After 10 seconds, switch the stop button label to “Save Recording”. If the user stops earlier, simply end the session without saving files (toast: “Recording stopped without saving files”).
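A sketch of that branch, assuming `stream` comes from the `getUserMedia` call above (chunk interval and bitrate are illustrative):

```js
let isRecording = true;
let isInitialRecordingPeriod = true;
const audioChunks = [];
const mp3Data = [];

if (MediaRecorder.isTypeSupported('audio/mpeg')) {
  // Native MP3 path: let the browser produce encoded chunks directly.
  const recorder = new MediaRecorder(stream, { mimeType: 'audio/mpeg' });
  recorder.ondataavailable = (e) => audioChunks.push(e.data);
  recorder.start(1000); // one chunk per second
} else {
  // Fallback: tap raw PCM and encode MP3 frames with lamejs.
  const ctx = new AudioContext();
  const source = ctx.createMediaStreamSource(stream);
  const processor = ctx.createScriptProcessor(4096, 1, 1);
  const encoder = new lamejs.Mp3Encoder(1, ctx.sampleRate, 128); // mono, 128 kbps

  processor.onaudioprocess = (e) => {
    const pcm = e.inputBuffer.getChannelData(0);
    const samples = Int16Array.from(pcm, (s) => s * 0x7fff); // float [-1, 1] -> 16-bit
    const frame = encoder.encodeBuffer(samples);
    if (frame.length > 0) mp3Data.push(frame);
  };
  source.connect(processor);
  processor.connect(ctx.destination);
}

// After the initial period, the stop button becomes a save button.
setTimeout(() => {
  isInitialRecordingPeriod = false;
  document.getElementById('stop').textContent = 'Save Recording';
}, 10000);
```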
- On normal stop (post-initial period), perform these downloads:
  - Captured audio as `YYYY-MM-DD_HH-MM-SS [custom title] audio recording.mp3`.
  - Plain-text transcript derived from the editable transcription area (HTML converted to text).
  - Any available high-quality transcript and summary (file names mirror the same timestamp/title pattern).
- Always stop the speech recognizer, media recorder, MP3 encoder, and microphone tracks during shutdown.
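The downloads can share one Blob-based helper; a sketch (note that `toISOString()` yields UTC where the app may format local time, and the transcript file suffix is an assumption):

```js
// Trigger a browser download for a single generated file.
function downloadFile(filename, blob) {
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = filename;
  a.click();
  URL.revokeObjectURL(url);
}

// Build the shared "YYYY-MM-DD_HH-MM-SS [custom title]" prefix.
const stamp = new Date().toISOString().slice(0, 19).replace('T', '_').replaceAll(':', '-');
const title = document.getElementById('custom-filename').value.trim();
const base = title ? `${stamp} ${title}` : stamp;

downloadFile(`${base} audio recording.mp3`,
  new Blob(audioChunks.length ? audioChunks : mp3Data, { type: 'audio/mpeg' }));
downloadFile(`${base} transcription.txt`,
  new Blob([document.getElementById('transcription').innerText], { type: 'text/plain' }));
```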
- Instantiate `SpeechRecognition` with `continuous = true` and `interimResults = true`.
- Auto-select the recognition language by matching `navigator.language` with the dropdown options; update when the user changes the dropdown.
- Each interim result is rendered inside a semi-transparent `<span id="interim-transcript">` and replaced when final text arrives.
- On final results, prepend a timestamp line (`[MM:SS]`) derived from the elapsed time since recording start (value captured when the interim text first appeared), then append the recognized text node. Removing interim spans ensures clean output.
- Resume recognition automatically on `onend` while recording remains active; recover gracefully on `onerror` by restarting recognition when appropriate.
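A recognition sketch; `renderInterim` and `appendFinalText` are hypothetical helpers for the interim-span and timestamp behavior just described:

```js
const recognition = new SpeechRecognition();
recognition.continuous = true;
recognition.interimResults = true;
recognition.lang = document.getElementById('language-select').value;

recognition.onresult = (event) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const text = event.results[i][0].transcript;
    if (event.results[i].isFinal) {
      appendFinalText(text); // prepends the [MM:SS] line, removes the interim span
    } else {
      renderInterim(text);   // updates the #interim-transcript span
    }
  }
};

// Keep recognition alive for the whole session.
recognition.onend = () => { if (isRecording) recognition.start(); };
recognition.onerror = (e) => console.error('Recognition error:', e.error);
recognition.start();
```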
- `#transcription-container` accepts plain-text files (`.txt` or MIME `text/plain`). On drop:
  - Abort any active recording and transcription.
  - Replace the current transcription content with the file’s text.
  - Show a notification confirming the load.
- `#whisper-transcribe` accepts audio files with a MIME type containing “audio”, “mp3”, “mp4”, or “mpeg”, or an extension in `{mp3, m4a, mp4, wav, ogg, webm}`. On drop, stash the file in `droppedFile`, notify the user, and, if an API key is present, immediately trigger high-quality transcription.
- Visual feedback: add/remove dashed border and background classes during the drag lifecycle for both zones (see the sketch after this list).
- High-quality transcription (OpenAI 4o Transcribe):
  - Requires a non-empty `apiKey`; otherwise, show a toast and expand the API key panel.
  - Works on the most recent audio source: a dropped file takes priority, then native `audioChunks`, then the `mp3Data` fallback.
  - Uses `client.audio.transcriptions.create({ file, model: 'gpt-4o-transcribe', language })` by default (override with `TRANSCRIBE_MODEL` if needed), where the language is the dropdown’s two-letter prefix.
  - Displays progress text while awaiting the result, then populates `#high-quality-transcription`, reveals its container, resets `droppedFile`, updates AI-download visibility, and emits a success toast.
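A sketch of the transcription call, assuming the `OpenAI` client from the esm.sh import shown earlier:

```js
const client = new OpenAI({ apiKey, dangerouslyAllowBrowser: true });
const TRANSCRIBE_MODEL = 'gpt-4o-transcribe';

async function transcribeHighQuality() {
  // Source priority: dropped file, then native chunks, then manual MP3 frames.
  const blob = droppedFile
    ?? new Blob(audioChunks.length ? audioChunks : mp3Data, { type: 'audio/mpeg' });
  const file = new File([blob], 'recording.mp3', { type: 'audio/mpeg' });
  const language = document.getElementById('language-select').value.slice(0, 2);

  const result = await client.audio.transcriptions.create({
    file,
    model: TRANSCRIBE_MODEL,
    language,
  });

  document.getElementById('high-quality-transcription').innerText = result.text;
  document.getElementById('high-quality-container').classList.remove('hidden');
  droppedFile = null;
  showNotification('High-quality transcription complete');
}
```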
- Summarization:
  - Hidden until either the low-quality or the high-quality transcription contains non-whitespace text.
  - On click, ensure at least one text source exists and an API key is available.
  - Call `client.chat.completions.create` with model `gpt-4o-mini`, passing a prompt that instructs the assistant to produce bullet takeaways, a structured paragraph summary, and a full high-quality transcript (prioritizing the Whisper text when present), while the system message explicitly requires the summary to stay in the same language as the input text.
  - Write “Generating summary…” while awaiting the response, then inject the result, reveal the summary UI, and update download visibility. On failure, surface the error message in the summary area.
- The Download AI Texts button becomes visible when either high-quality transcription or summary content is populated. It downloads an individual `.txt` file for each available artifact using the same timestamp/title naming scheme and confirms via toast.
- `showNotification(message, duration = 3000)` renders a centered toast with a fade-out animation. Use this helper for all user feedback (saved key, missing key, invalid drops, errors, success messages).
- Avoid stacking logic: each notification is appended to `document.body` and self-destructs after fading.
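A sketch matching that signature; the Tailwind classes are assumptions about the visual style:

```js
function showNotification(message, duration = 3000) {
  const toast = document.createElement('div');
  toast.textContent = message;
  toast.className =
    'fixed top-4 left-1/2 -translate-x-1/2 bg-gray-700 text-white ' +
    'px-4 py-2 rounded shadow transition-opacity duration-500';
  document.body.appendChild(toast);

  setTimeout(() => {
    toast.classList.add('opacity-0'); // start the fade-out
    toast.addEventListener('transitionend', () => toast.remove());
  }, duration);
}
```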
- Persist the API key under `localStorage['openai_api_key']`. Never expose it elsewhere.
- After the first user interaction, add a `beforeunload` handler that, when `isRecording` is true, prompts the user before leaving the page to avoid accidental data loss.
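A sketch of the exit warning, using the `exitWarningEnabled` flag armed during startup:

```js
window.addEventListener('beforeunload', (e) => {
  if (exitWarningEnabled && isRecording) {
    e.preventDefault();
    e.returnValue = ''; // some browsers require this to show the prompt
  }
});
```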
- If media access is denied or unavailable, log the error and alert the user.
- Catch and log OpenAI failures in both the Whisper and summarization flows, resetting UI messages appropriately and clearing `droppedFile` on Whisper errors.
- Ensure recognition restarts automatically after recoverable errors while recording remains active.
- Tailwind CSS (cdn.tailwindcss.com): Provides utility classes for styling without build tooling.
- `[email protected]`: Supplies an MP3 encoder when browsers lack native MP3 support.
- OpenAI JS SDK (~4.86.2): Exposes the `OpenAI` client used for Whisper and summarization. The client must be instantiated with `{ apiKey, dangerouslyAllowBrowser: true }` to run in the browser.
- Browser APIs:
  - `navigator.mediaDevices.getUserMedia` for microphone capture.
  - `MediaRecorder` for audio chunks (with a fallback to the Web Audio API + `ScriptProcessorNode`).
  - Web Speech API (`SpeechRecognition` or `webkitSpeechRecognition`) for live transcription.
  - DOM drag-and-drop events for file ingestion.
While the current implementation succeeds as a single page, a maintainable evolution could:
- Split UI markup into HTML templates with components for the toolbar, transcription panes, AI actions, and notifications.
- Move business logic into ES modules under `src/` (e.g., `src/recording.js`, `src/transcription.js`, `src/openai.js`, `src/ui/notifications.js`).
- Replace global state with a lightweight state container that exposes observable properties for recording status, API key, and AI outputs (see the sketch after this list).
- Extract configuration (language list, OpenAI prompt text, toast durations) into dedicated modules or JSON files.
- Add unit tests alongside modules to cover edge cases (recording stop states, drag-drop validation, OpenAI request shaping).
- Consider a custom event system or message bus so UI elements only respond to relevant updates, enabling future multi-page or component-based expansions without rewriting core logic.
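As one illustration of the state-container idea above, a minimal observable store could look like this (all names are hypothetical):

```js
// Tiny observable store: get/set values, notify per-key subscribers.
function createStore(initial) {
  const state = { ...initial };
  const listeners = new Map(); // key -> Set of callbacks
  return {
    get: (key) => state[key],
    set(key, value) {
      state[key] = value;
      (listeners.get(key) ?? []).forEach((fn) => fn(value));
    },
    subscribe(key, fn) {
      if (!listeners.has(key)) listeners.set(key, new Set());
      listeners.get(key).add(fn);
    },
  };
}

// e.g. the summary container reveals itself only when summary text exists:
const store = createStore({ isRecording: false, apiKey: '', summary: '' });
store.subscribe('summary', (text) => {
  document.getElementById('summary-container').classList.toggle('hidden', !text);
});
```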
Contributions, feedback, or suggestions are welcome. To contribute:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Submit a pull request with a clear description of your changes.