A local, privacy-first meeting transcription tool for macOS. Captures system audio and microphone, provides live transcription with speaker identification, and generates polished post-meeting transcripts — all running entirely on your Mac with no cloud APIs.
- Live transcription — real-time speech-to-text displayed in your terminal as the meeting happens
- Speaker diarization — identifies up to 4 speakers with color-coded labels, stabilized across chunks
- System audio + mic capture — records both sides of the conversation (what you hear and what you say) using ScreenCaptureKit and AVAudioEngine
- Sentence-level output — each sentence appears on its own line with a precise timestamp
- Smart transcript cleanup — automatically merges fragmented lines into readable speaker-turn paragraphs (instant, no models)
- Auto-summary — generates meeting summary with action items using Claude Code
- Whisper prompt conditioning — feeds previous context to Whisper for better continuity and accuracy
- Hallucination filtering — detects and filters Whisper artifacts using pattern matching, repetition detection, and built-in thresholds
- Audio recording — saves the full meeting audio as a WAV file for reference
- Markdown transcripts — outputs clean, readable markdown files with timestamps and speaker labels
- 100% local — all processing happens on-device using Apple Silicon GPU acceleration. No data leaves your machine.
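To illustrate the hallucination filtering mentioned above: a small artifact pattern list plus a repetition check goes a long way. This is a hedged sketch, not the project's actual filter; the patterns and the 0.6 repetition ratio are invented for illustration:

```python
import re

# Common Whisper artifacts (illustrative list, not the project's actual patterns)
ARTIFACT_PATTERNS = [
    re.compile(r"^\s*(thanks for watching|subscribe to|you)\s*[.!]?\s*$", re.I),
]

def looks_hallucinated(text: str) -> bool:
    """Return True if a segment matches a known artifact pattern or
    repeats the same word suspiciously often."""
    stripped = text.strip()
    if not stripped:
        return True
    if any(p.match(stripped) for p in ARTIFACT_PATTERNS):
        return True
    # Repetition detection: a single token dominating the segment
    words = stripped.lower().split()
    if len(words) >= 4:
        most_common = max(words.count(w) for w in set(words))
        if most_common / len(words) > 0.6:  # e.g. "okay okay okay okay"
            return True
    return False
```

A real filter would also use Whisper's no-speech and log-probability scores, which is what the built-in thresholds refer to.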
```
┌──────────────────────────────────────────────┐
│           Swift CLI (AudioCapture)           │
│                                              │
│   ScreenCaptureKit (system audio)            │
│   + AVAudioEngine (mic)                      │
│   → Mixed PCM Float32 16kHz mono → stdout    │
└─────────────────────┬────────────────────────┘
                      │ raw audio pipe
                      ▼
┌──────────────────────────────────────────────┐
│        Python Pipeline (transcriber)         │
│                                              │
│  Audio Buffer → webrtcvad → mlx-whisper (ASR)│
│  (20s chunks)  → Sortformer (speakers)       │
│                → Prompt conditioning         │
│                → Speaker consistency         │
│                → Hallucination filtering     │
│                → Live terminal display       │
│                → Markdown transcript         │
│                → WAV file recording          │
└─────────────────────┬────────────────────────┘
                      │ on Ctrl+C
                      ▼
┌──────────────────────────────────────────────┐
│    Cleanup + Summary (text only, instant)    │
│                                              │
│  Merge speaker turns, filter hallucinations  │
│  → claude -p for meeting summary             │
└──────────────────────────────────────────────┘
```
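On the Python side, the raw audio pipe in the diagram is just a stream of little-endian float32 samples at 16 kHz. A minimal sketch of chunking it (the function name is illustrative and the 20 s chunk size mirrors the diagram; this is not the project's actual audio_reader.py):

```python
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_SECONDS = 20  # matches the 20 s chunks in the diagram

def iter_chunks(stream, chunk_seconds: int = CHUNK_SECONDS):
    """Yield mono float32 numpy arrays of chunk_seconds of audio
    read from a binary stream (e.g. sys.stdin.buffer)."""
    chunk_bytes = SAMPLE_RATE * chunk_seconds * 4  # 4 bytes per float32 sample
    while True:
        data = stream.read(chunk_bytes)
        if not data:
            return
        yield np.frombuffer(data, dtype=np.float32)
```

Hooking it up would look like `for chunk in iter_chunks(sys.stdin.buffer): ...`, with VAD, ASR, and diarization applied per chunk.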
- Mac with Apple Silicon (M1, M2, M3, M4 — any variant)
- 16GB RAM recommended (8GB may work with the smaller model)
Install these before running:

- Xcode Command Line Tools (for Swift compilation)

  ```sh
  xcode-select --install
  ```

- Homebrew (if not already installed)

  ```sh
  /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  ```

- uv (Python package manager)

  ```sh
  brew install uv
  ```

- ffmpeg (required by mlx-whisper)

  ```sh
  brew install ffmpeg
  ```
On first run, macOS will prompt for these permissions. You can also grant them in advance:

- Screen Recording — required for system audio capture
  - System Settings → Privacy & Security → Screen Recording
  - Add your terminal app (Terminal, iTerm2, Warp, etc.)
  - Restart your terminal after granting

- Microphone — required for mic capture
  - System Settings → Privacy & Security → Microphone
  - You'll be prompted automatically on first run
```sh
# Clone the repo
git clone <repo-url>
cd meeting-assistant

# Build the Swift audio capture CLI (first time only)
cd AudioCapture && swift build -c release && cd ..

# Run a meeting transcription
./run.sh my-meeting
```

That's it. The first run downloads the AI models (~1.5 GB for Whisper, ~200 MB for Sortformer), which takes a few minutes. Subsequent runs start in seconds.
```sh
# Start transcribing (creates a timestamped name automatically)
./run.sh

# Start with a custom meeting name
./run.sh "weekly-standup"
./run.sh "client-review-2024"
```
- Live phase — transcription appears in your terminal in real time:

  ```
  [00:00:05] Speaker 1: Let's start with the status update.
  [00:00:12] Speaker 2: Sure, the API work is on track.
  [00:00:18] Speaker 2: We should have it done by Friday.
  [00:00:24] Speaker 1: Great. What about the frontend?
  ```

- Ctrl+C — stops recording, saves the live transcript

- Cleanup — automatically merges fragmented lines into readable speaker-turn paragraphs (instant, no models)

- Summary — generates a meeting summary using `claude -p`

- Done — all files saved to `~/Documents/Transcripts/<meeting-name>/`
After a meeting, you'll find these files:

```
~/Documents/Transcripts/my-meeting/
├── transcript.md         # Live transcript (raw, per-segment output)
├── transcript_clean.md   # Cleaned transcript (merged speaker turns, filtered)
├── audio.wav             # Full meeting audio (~115 MB/hour)
└── summary.md            # AI-generated meeting summary
```

Use `transcript_clean.md` as your primary reference — it has merged speaker turns and filtered hallucinations.
Run the transcriber directly (without run.sh):

```sh
# Pipe audio capture to transcriber
./AudioCapture/.build/release/AudioCapture --output stdout 2>/dev/null | \
  uv run --project transcriber python transcriber/transcriber.py \
    --output /path/to/transcript.md \
    --meeting-name "My Meeting"
```

Re-run transcript cleanup:

```sh
uv run --project transcriber python transcriber/cleanup_transcript.py \
  --input ~/Documents/Transcripts/my-meeting/transcript.md \
  --output ~/Documents/Transcripts/my-meeting/transcript_clean.md
```

Transcribe a pre-recorded WAV file (no live capture):

```sh
uv run --project transcriber python transcriber/transcriber.py \
  --input-file recording.wav \
  --output transcript.md
```

Disable speaker diarization:

```sh
./AudioCapture/.build/release/AudioCapture --output stdout 2>/dev/null | \
  uv run --project transcriber python transcriber/transcriber.py \
    --no-diarize --output transcript.md
```

Don't save audio (saves disk space):

```sh
./AudioCapture/.build/release/AudioCapture --output stdout 2>/dev/null | \
  uv run --project transcriber python transcriber/transcriber.py \
    --no-save-audio --output transcript.md
```

List available microphone devices:

```sh
./AudioCapture/.build/release/AudioCapture --list-devices
```

Select a specific microphone:

```sh
./AudioCapture/.build/release/AudioCapture --mic-device "MacBook Pro Microphone" --output stdout
```

| Flag | Default | Description |
|---|---|---|
| `--output <mode>` | `stdout` | Output mode: `stdout` or `pipe` |
| `--sample-rate <hz>` | `16000` | Audio sample rate |
| `--list-devices` | — | List available input devices |
| `--mic-device <name>` | system default | Select microphone by name |
| `--help` | — | Show help |
| Flag | Default | Description |
|---|---|---|
| `--output <path>` | none | Transcript output file (.md) |
| `--meeting-name <name>` | `"Meeting"` | Meeting name for header |
| `--input-file <path>` | stdin | Read WAV file instead of stdin |
| `--model <id>` | `mlx-community/distil-whisper-large-v3` | Whisper model |
| `--energy-threshold <f>` | `0.01` | VAD sensitivity (lower = more sensitive) |
| `--no-diarize` | — | Disable speaker identification |
| `--no-save-audio` | — | Don't save audio WAV file |
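The `--energy-threshold` flag above suggests an RMS energy gate in front of the VAD. A minimal sketch of that idea (illustrative only; the project's actual vad.py uses webrtcvad, and this helper is an assumption, not its real code):

```python
import numpy as np

def has_speech_energy(samples: np.ndarray, threshold: float = 0.01) -> bool:
    """Return True if the chunk's RMS energy exceeds the threshold.
    Lower thresholds admit quieter audio (more sensitive)."""
    if samples.size == 0:
        return False
    rms = float(np.sqrt(np.mean(samples.astype(np.float64) ** 2)))
    return rms > threshold
```

An energy gate like this is cheap, so it can discard silent chunks before the heavier per-frame VAD and ASR run at all.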
| Flag | Default | Description |
|---|---|---|
| `--input <path>` | required | Path to live transcript (.md) |
| `--output <path>` | required | Output cleaned transcript path (.md) |
| `--pause-threshold <f>` | `15.0` | Seconds of pause before starting a new paragraph |
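The `--pause-threshold` above controls where the cleanup step breaks paragraphs. Its merge logic can be sketched roughly like this (illustrative, assuming `[HH:MM:SS] Speaker N: text` lines; not the project's actual cleanup_transcript.py):

```python
import re

LINE_RE = re.compile(r"\[(\d+):(\d+):(\d+)\]\s+(Speaker \d+):\s+(.*)")

def merge_turns(lines, pause_threshold: float = 15.0):
    """Merge consecutive same-speaker lines into paragraphs,
    starting a new paragraph after a long pause."""
    turns = []  # each entry: [timestamp_str, speaker, [sentences]]
    last_speaker, last_secs = None, None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip non-transcript lines (headers, blanks)
        h, mnt, s, speaker, text = m.groups()
        secs = int(h) * 3600 + int(mnt) * 60 + int(s)
        new_turn = (
            speaker != last_speaker
            or last_secs is None
            or secs - last_secs > pause_threshold
        )
        if new_turn:
            turns.append([f"[{h}:{mnt}:{s}]", speaker, [text]])
        else:
            turns[-1][2].append(text)
        last_speaker, last_secs = speaker, secs
    return [f"{ts} {sp}: {' '.join(txt)}" for ts, sp, txt in turns]
```

The real cleanup also filters hallucinated segments before merging; this sketch only shows the turn-merging part.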
| Model | Purpose | Size | Source |
|---|---|---|---|
| distil-whisper-large-v3 | Speech-to-text | ~1.5 GB | MLX Community |
| Sortformer v2.1 | Speaker diarization | ~200 MB | MLX Community (NVIDIA) |
Both models run locally on the Apple Silicon GPU via MLX. Models are downloaded automatically on first run and cached in `~/.cache/huggingface/`.
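The Whisper prompt conditioning feature amounts to passing the tail of the transcript so far as the decoder's initial prompt on each chunk. A sketch of building that rolling context (`build_prompt` and the 200-character budget are illustrative assumptions, not the project's actual code; `initial_prompt` is the standard Whisper transcribe option):

```python
def build_prompt(previous_text: str, max_chars: int = 200) -> str:
    """Keep the last max_chars of prior transcript, trimmed to a word
    boundary, to condition Whisper's decoder on recent context."""
    tail = previous_text[-max_chars:]
    # Drop a partial leading word so the prompt starts cleanly
    if len(previous_text) > max_chars and " " in tail:
        tail = tail.split(" ", 1)[1]
    return tail.strip()

# The prompt would then be fed to the transcriber, e.g.:
# result = mlx_whisper.transcribe(chunk, path_or_hf_repo=MODEL,
#                                 initial_prompt=build_prompt(history))
```

Conditioning on recent text helps Whisper keep spelling, names, and punctuation consistent across chunk boundaries, which is the continuity benefit described in the feature list.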
```
meeting-assistant/
├── AudioCapture/                    # Swift CLI for audio capture
│   ├── Package.swift                # Swift package config
│   └── Sources/AudioCapture/
│       ├── main.swift               # Entry point, signal handling
│       ├── SystemAudioCapture.swift # ScreenCaptureKit system audio
│       ├── MicCapture.swift         # AVAudioEngine mic input
│       ├── AudioMixer.swift         # Dual-buffer mixing + output
│       └── CLIArguments.swift       # CLI argument parsing
├── transcriber/                     # Python transcription pipeline
│   ├── pyproject.toml               # Python dependencies (uv)
│   ├── transcriber.py               # Main live pipeline
│   ├── cleanup_transcript.py        # Lightweight text cleanup (no models)
│   ├── audio_reader.py              # Threaded audio buffer
│   ├── diarizer.py                  # Sortformer speaker diarization
│   ├── display.py                   # Terminal + markdown output
│   ├── vad.py                       # Voice activity detection (webrtcvad)
│   ├── config.py                    # Configuration constants
│   └── summary_prompt.md            # LLM summary prompt template
└── run.sh                           # One-command runner
```
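Because diarization runs per 20 s chunk, the speaker-consistency step in diarizer.py has to reconcile chunk-local speaker IDs with stable global labels. One common approach, greedy matching against running speaker-embedding centroids, can be sketched like this (an illustration of the general technique only, not the project's actual logic):

```python
import numpy as np

def stabilize_speakers(global_centroids, chunk_embeddings, sim_threshold=0.7):
    """Map each chunk-local speaker embedding to a stable global ID.
    Reuses the closest existing centroid when cosine similarity is high
    enough; otherwise registers a new global speaker. Mutates
    global_centroids (a list of unit vectors) and returns one global
    ID per chunk-local speaker."""
    mapping = []
    for emb in chunk_embeddings:
        emb = emb / np.linalg.norm(emb)
        best_id, best_sim = None, sim_threshold
        for gid, cen in enumerate(global_centroids):
            sim = float(np.dot(emb, cen))  # cosine similarity (unit vectors)
            if sim > best_sim:
                best_id, best_sim = gid, sim
        if best_id is None:
            global_centroids.append(emb)
            best_id = len(global_centroids) - 1
        mapping.append(best_id)
    return mapping
```

This keeps "Speaker 2" meaning the same person across chunks even when the per-chunk model numbers speakers in a different order.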
- Speaker limit — Sortformer supports up to 4 simultaneous speakers
- English only — configured for English transcription (change the `language` parameter for others)
- Apple Silicon required — MLX models only run on Apple Silicon Macs
- macOS 13+ — ScreenCaptureKit requires macOS Ventura or later
- Memory — uses ~4-6 GB RAM during live transcription (Whisper + Sortformer + audio buffers)
**"Screen Recording permission denied"**

- Grant permission in System Settings → Privacy & Security → Screen Recording
- Add your terminal app and restart it (not just the tab — quit and reopen)

**"No module named 'mlx_audio.vad'"**

- The diarization module requires the git version of mlx-audio. Run:

  ```sh
  cd transcriber && uv sync
  ```

**"[Errno 2] No such file or directory: 'ffmpeg'"**

- Install ffmpeg:

  ```sh
  brew install ffmpeg
  ```

**No audio captured / silent transcript**

- Check that your terminal has both Screen Recording and Microphone permissions
- Try `./AudioCapture/.build/release/AudioCapture --list-devices` to verify audio devices
- Test the capture pipeline with:

  ```sh
  ./AudioCapture/.build/release/AudioCapture --output stdout 2>/dev/null | head -c 64000 > test.pcm
  ```

**High memory usage**

- The transcriber monitors memory and warns above 12 GB
- Use `--no-diarize` to skip the diarization model (~1-2 GB savings)
- Use the smaller Whisper model: `--model mlx-community/distil-whisper-medium.en`
MIT