A searchable archive system for oral history audio collections. StoryLog transcribes audio files using MLX Whisper, creates semantic search indices, and provides a web interface for discovering and playing oral history interviews.
Features:
- Audio Transcription: Automatic transcription of oral history audio files using MLX Whisper (large-v3 model)
- Semantic Search: Multi-level chunking and semantic search using sentence transformers for finding relevant content
- Keyword Search: Traditional keyword-based search for exact phrase matching
- Metadata Search: Full-text search across collection metadata (titles, contributors, descriptions, subjects)
- Audio Playback: Web-based audio player with automatic volume normalization for quiet recordings
- Collection Crawling: Tools for downloading oral history collections from Library of Congress and other sources
Architecture:
- Backend: FastAPI server with semantic search (sentence transformers) and metadata indexing (Whoosh)
- Frontend: Single-page web application with Tailwind CSS
- Transcription: MLX Whisper for efficient GPU-accelerated transcription
- Audio Processing: FFmpeg for audio cleaning and normalization
Requirements:
- Python 3.9+
- FFmpeg (for audio processing)
- MLX-compatible hardware (Apple Silicon) or CUDA GPU (for transcription)
- PyTorch with MPS/CUDA support (for semantic search)
Installation:
- Clone the repository:
```bash
git clone https://github.com/yourusername/storylog.git
cd storylog
```
- Create a virtual environment and install dependencies:
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```
- Download the MLX Whisper model (downloaded automatically on first use):
  - Model: `mlx-community/whisper-large-v3-mlx`
Use the crawler scripts to download collections from the Library of Congress:
```bash
# Crawl American Folklife Center oral histories
python crawl.py

# Crawl general oral histories
python crawl_oral_histories.py

# Crawl specific collections (see individual crawl scripts)
python crawl_voices_of_slavery.py
```
Audio files and metadata will be saved to the `stories/` directory (or collection-specific directories).
Transcribe all audio files in the `stories/` directory:
```bash
python transcribe.py
```
Options:
- `--languages en,es`: Only keep transcripts in the specified languages (default: `en,es`)
- `--skip-non-language`: Skip saving transcripts unless the detected language matches
- `--keep-clean-audio`: Keep cleaned audio WAV files (uses significant disk space)
The transcription process:
- Cleans audio using FFmpeg (high-pass/low-pass filtering based on quality)
- Detects language from first 30 seconds
- Transcribes using MLX Whisper
- Filters out repetitive junk and hallucinations
- Saves transcripts to the `transcripts/` directory
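The core call behind this pipeline can be sketched with the `mlx_whisper` package. This is a minimal illustration, not an excerpt from `transcribe.py`: the file names are hypothetical, and the cleaning, language filtering, and hallucination filtering steps listed above are omitted.

```python
import json
import mlx_whisper

# Transcribe a cleaned 16kHz mono WAV; the model is fetched
# from Hugging Face on first use.
result = mlx_whisper.transcribe(
    "stories/example_interview.wav",
    path_or_hf_repo="mlx-community/whisper-large-v3-mlx",
)

# The result holds the detected language, the full text, and
# timestamped segments.
print(result["language"])
for seg in result["segments"]:
    print(f'[{seg["start"]:.1f}s - {seg["end"]:.1f}s] {seg["text"]}')

# Persist in the same spirit as the transcripts/ JSON files.
with open("transcripts/example_interview.json", "w") as f:
    json.dump(result, f)
```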
Create semantic and metadata search indices:
```bash
# Build semantic index with default settings
python index.py

# Customize chunking parameters
python index.py --target-words 75 --overlap-words 25 --small-chunk-words 15

# Use a different sentence transformer model
python index.py --model sentence-transformers/all-MiniLM-L6-v2

# Disable multi-level chunking (single chunk size only)
python index.py --no-multi-level
```
The indexing process:
- Creates semantic embeddings for transcript chunks
- Builds Whoosh full-text index for metadata
- Supports multi-level chunking (small chunks for precision, large chunks for context)
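A simplified sketch of both sides of this step is shown below. It is illustrative rather than a copy of `index.py`: the pickle layout, schema fields, and sample documents are assumptions.

```python
import os
import pickle

from sentence_transformers import SentenceTransformer
from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in

# Semantic side: embed transcript chunks (chunking is described below).
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
chunks = ["We came over in 1921...", "My grandmother told stories about..."]
embeddings = model.encode(chunks, normalize_embeddings=True)

with open("search_index.pkl", "wb") as f:
    pickle.dump({"chunks": chunks, "embeddings": embeddings}, f)

# Metadata side: a Whoosh full-text index over the fields the UI searches.
schema = Schema(
    item_id=ID(stored=True, unique=True),
    title=TEXT(stored=True),
    contributors=TEXT(stored=True),
    description=TEXT(stored=True),
    subjects=TEXT(stored=True),
)
os.makedirs("metadata_index", exist_ok=True)
ix = create_in("metadata_index", schema)
writer = ix.writer()
writer.add_document(
    item_id="example-0001",
    title="Example interview",
    contributors="Jane Doe (interviewee)",
    description="A sample oral history interview.",
    subjects="immigration; family life",
)
writer.commit()
```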
Start the FastAPI server:
```bash
python main.py
```
Or with custom options:
```bash
python main.py --model BAAI/bge-large-en-v1.5 --device mps
```
The server will start at http://127.0.0.1:8000.
Open your browser to http://127.0.0.1:8000 and use the search interface:
- Semantic Search: Find content by meaning (e.g., "stories about immigration")
- Keyword Search: Find exact phrases in transcripts
- Metadata Search: Search by title, contributor, description, or subjects
Click on any result to play the audio at the relevant timestamp.
Project structure:
```
storylog/
├── transcribe.py                 # Audio transcription using MLX Whisper
├── index.py                      # Build semantic and metadata search indices
├── main.py                       # FastAPI web server
├── crawl.py                      # Crawler for Library of Congress collections
├── crawl_oral_histories.py       # Crawler for general oral histories
├── crawl_voices_of_slavery.py    # Crawler for a specific collection
├── backfill_audio_levels.py      # Backfill audio-level data for existing files
├── stories/                      # Downloaded audio files and metadata
├── transcripts/                  # Generated transcript JSON files
├── metadata_index/               # Whoosh full-text index
├── search_index.pkl              # Semantic search embeddings
├── static/                       # Web UI (index.html)
└── requirements.txt              # Python dependencies
```
Environment variables:
- `SEMANTIC_MODEL`: Sentence transformer model for semantic search (default: `BAAI/bge-large-en-v1.5`)
- `DEVICE`: Override device selection (`mps`, `cuda`, or `cpu`)
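A sketch of how these variables might drive device selection; this is an assumption about the implementation, not an excerpt from `main.py`:

```python
import os
import torch

def pick_device() -> str:
    """Honor the DEVICE override, otherwise prefer MPS, then CUDA, then CPU."""
    override = os.environ.get("DEVICE")
    if override:
        return override
    if torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

model_name = os.environ.get("SEMANTIC_MODEL", "BAAI/bge-large-en-v1.5")
print(f"Loading {model_name} on {pick_device()}")
```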
Transcription Model: Set in `transcribe.py`:
- Default: `mlx-community/whisper-large-v3-mlx` (high quality, slower)
- Alternative: `mlx-community/whisper-small-mlx` (faster, lower quality)
Semantic Search Model: Set via the `--model` flag or `SEMANTIC_MODEL` env var:
- Default: `BAAI/bge-large-en-v1.5` (higher quality, larger)
- Alternative: `sentence-transformers/all-MiniLM-L6-v2` (faster, smaller)
The transcription pipeline includes intelligent audio preprocessing:
- High-quality audio (≥48kHz, ≥24-bit): Minimal filtering, preserves full frequency range
- Low-quality audio: Aggressive filtering to remove artifacts (high-pass 200Hz, low-pass 3500Hz)
- Volume normalization: Analyzes audio levels for automatic playback normalization
- Format conversion: All audio converted to 16kHz mono WAV for Whisper
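For illustration, the low-quality branch might look like the following FFmpeg invocation wrapped in Python. The filter values come from the list above; the helper name and the exact filter chain in `transcribe.py` are assumptions.

```python
import subprocess

def clean_for_whisper(src: str, dst: str) -> None:
    """Band-limit noisy speech and downmix to the 16kHz mono WAV Whisper expects."""
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", src,
            # Aggressive filtering for low-quality sources:
            # high-pass at 200 Hz, low-pass at 3500 Hz.
            "-af", "highpass=f=200,lowpass=f=3500",
            # 16 kHz sample rate, single channel.
            "-ar", "16000", "-ac", "1",
            dst,
        ],
        check=True,
    )

clean_for_whisper("stories/example_interview.mp3", "example_interview_clean.wav")
```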
The semantic index uses multi-level chunking for optimal search results:
- Small chunks (default: 15 words): Precise matches, exact phrase finding
- Large chunks (default: 75 words): Broader context, semantic understanding
- Overlap: Sliding window prevents missing content at chunk boundaries
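A minimal word-based sliding-window chunker along these lines might look like this. It is a hypothetical helper, and the small-chunk overlap below is an assumed value; `index.py` may differ in detail.

```python
def chunk_words(words: list[str], target: int, overlap: int) -> list[str]:
    """Slide a window of `target` words; consecutive windows share `overlap` words."""
    step = max(1, target - overlap)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + target]))
        if start + target >= len(words):
            break  # the last window already reached the end
    return chunks

words = "the full transcript text goes here".split()
small = chunk_words(words, target=15, overlap=5)   # precision level (overlap assumed)
large = chunk_words(words, target=75, overlap=25)  # context level (README defaults)
```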
Search modes:
- Semantic: Uses cosine similarity on sentence embeddings to find content by meaning
- Keyword: Simple substring matching in transcript text
- Metadata: Full-text search across collection metadata fields
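Semantic mode can be expressed in a few lines of NumPy, assuming the unit-normalized embeddings from the indexing sketch above; again a sketch, not the server's actual code:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def semantic_search(query: str, chunks: list[str], embeddings: np.ndarray, k: int = 5):
    """Rank chunks by cosine similarity with the query embedding.

    With unit-normalized embeddings, cosine similarity reduces to a dot product.
    """
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ q
    top = np.argsort(scores)[::-1][:k]
    return [(float(scores[i]), chunks[i]) for i in top]
```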
API endpoints:
- `GET /search?q=<query>&mode=<semantic|keyword|metadata>&field=<all|title|contributors|description|subjects>`: Search the archive
- `GET /transcript/<file_name>`: Get transcript data with audio levels
- `GET /metadata/<item_id>`: Get full metadata for an item
- `GET /stories/<path>`: Serve audio files
- `GET /`: Web UI
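For example, the search endpoint can be exercised from Python with `requests` (assuming the server is running locally and returns JSON):

```python
import requests

BASE = "http://127.0.0.1:8000"

# Semantic search across transcript chunks.
hits = requests.get(
    f"{BASE}/search",
    params={"q": "stories about immigration", "mode": "semantic"},
).json()

# Metadata search restricted to titles.
titles = requests.get(
    f"{BASE}/search",
    params={"q": "folklife", "mode": "metadata", "field": "title"},
).json()

print(hits, titles)
```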
Performance notes:
- Transcription: ~1-2x real-time on Apple Silicon (M1/M2/M3)
- Indexing: Embedding generation is the bottleneck; use GPU when available
- Search: Semantic search is fast (<100ms) with pre-computed embeddings
- Memory: Large collections may require significant RAM for embeddings
This project is designed to work with oral history collections from:
- Library of Congress (American Folklife Center, Veterans History Project, etc.)
- Other institutions with similar API structures
Note: Always respect terms of service and rate limits when crawling. The included crawlers implement rate limiting and respectful scraping practices.
This project is provided as-is for educational and research purposes. Oral history content remains the property of the respective institutions and creators.
Contributions welcome! Please open an issue or pull request.
Acknowledgments:
- MLX Whisper for efficient transcription
- Sentence Transformers for semantic search
- Library of Congress for making oral history collections available
- FastAPI and Whoosh for the search infrastructure