StoryLog

A searchable archive system for oral history audio collections. StoryLog transcribes audio files using MLX Whisper, creates semantic search indices, and provides a web interface for discovering and playing oral history interviews.

Features

  • Audio Transcription: Automatic transcription of oral history audio files using MLX Whisper (large-v3 model)
  • Semantic Search: Multi-level chunking and semantic search using sentence transformers for finding relevant content
  • Keyword Search: Traditional keyword-based search for exact phrase matching
  • Metadata Search: Full-text search across collection metadata (titles, contributors, descriptions, subjects)
  • Audio Playback: Web-based audio player with automatic volume normalization for quiet recordings
  • Collection Crawling: Tools for downloading oral history collections from Library of Congress and other sources

Architecture

  • Backend: FastAPI server with semantic search (sentence transformers) and metadata indexing (Whoosh)
  • Frontend: Single-page web application with Tailwind CSS
  • Transcription: MLX Whisper for efficient GPU-accelerated transcription
  • Audio Processing: FFmpeg for audio cleaning and normalization

Requirements

  • Python 3.9+
  • FFmpeg (for audio processing)
  • MLX-compatible hardware (Apple Silicon) or CUDA GPU (for transcription)
  • PyTorch with MPS/CUDA support (for semantic search)

Installation

  1. Clone the repository:
git clone https://github.com/jasontitus/storylog.git
cd storylog
  2. Create a virtual environment and install dependencies:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
  3. Download the MLX Whisper model (automatically downloaded on first use):
    • Model: mlx-community/whisper-large-v3-mlx

Usage

1. Download Oral History Collections

Use the crawler scripts to download collections from Library of Congress:

# Crawl American Folklife Center oral histories
python crawl.py

# Crawl general oral histories
python crawl_oral_histories.py

# Crawl specific collections (see individual crawl scripts)
python crawl_voices_of_slavery.py

Audio files and metadata will be saved to the stories/ directory (or collection-specific directories).
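For reference, the sketch below shows the general pattern the crawlers follow, assuming the Library of Congress JSON API (fo=json). The collection slug, page count, and output layout are illustrative; the real scripts also download audio, handle retries, and deal with collection-specific details.

# Hedged sketch of polite LOC crawling; not a drop-in replacement for crawl.py.
import json
import time
from pathlib import Path

import requests

COLLECTION_URL = "https://www.loc.gov/collections/civil-rights-history-project/"  # example slug
OUT_DIR = Path("stories")

def crawl_collection(pages: int = 2, delay_s: float = 2.0) -> None:
    """Fetch collection metadata pages from the LOC JSON API, pausing between requests."""
    OUT_DIR.mkdir(exist_ok=True)
    for page in range(1, pages + 1):
        resp = requests.get(COLLECTION_URL, params={"fo": "json", "sp": page}, timeout=30)
        resp.raise_for_status()
        for item in resp.json().get("results", []):
            item_id = str(item.get("id", "")).rstrip("/").split("/")[-1]
            (OUT_DIR / f"{item_id}.json").write_text(json.dumps(item, indent=2))
        time.sleep(delay_s)  # stay well under the rate limits noted in "Data Sources"

if __name__ == "__main__":
    crawl_collection()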

2. Transcribe Audio Files

Transcribe all audio files in the stories directory:

python transcribe.py

Options:

  • --languages en,es: Only keep transcripts in specified languages (default: en,es)
  • --skip-non-language: Skip saving a transcript when its detected language is not in the --languages list
  • --keep-clean-audio: Keep cleaned audio WAV files (uses significant disk space)

The transcription process (see the sketch after this list):

  • Cleans audio using FFmpeg (high-pass/low-pass filtering based on quality)
  • Detects language from first 30 seconds
  • Transcribes using MLX Whisper
  • Filters out repetitive junk and hallucinations
  • Saves transcripts to transcripts/ directory
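The core step is a single call to the mlx-whisper package. The minimal sketch below shows that call in isolation, with a hypothetical input file and a simplified output layout; the cleaning, language filtering, and hallucination checks live in transcribe.py and are omitted here.

# Minimal transcription sketch; the output JSON layout here is illustrative only.
import json
from pathlib import Path

import mlx_whisper

AUDIO = "stories/example_interview.mp3"            # hypothetical input file
OUT = Path("transcripts/example_interview.json")   # hypothetical output path

result = mlx_whisper.transcribe(AUDIO, path_or_hf_repo="mlx-community/whisper-large-v3-mlx")
OUT.parent.mkdir(exist_ok=True)
OUT.write_text(json.dumps({"language": result["language"], "segments": result["segments"]}, indent=2))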

3. Build Search Indices

Create semantic and metadata search indices:

# Build semantic index with default settings
python index.py

# Customize chunking parameters
python index.py --target-words 75 --overlap-words 25 --small-chunk-words 15

# Use a different sentence transformer model
python index.py --model sentence-transformers/all-MiniLM-L6-v2

# Disable multi-level chunking (single chunk size only)
python index.py --no-multi-level

The indexing process (see the sketch after this list):

  • Creates semantic embeddings for transcript chunks
  • Builds Whoosh full-text index for metadata
  • Supports multi-level chunking (small chunks for precision, large chunks for context)
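A rough sketch of the embedding step, assuming the sentence-transformers package, is shown below. The chunk records and pickle layout are assumptions for illustration; the real index.py also builds the Whoosh metadata index and multiple chunk sizes.

# Illustrative embedding pass; chunk fields and index layout are assumptions.
import pickle

from sentence_transformers import SentenceTransformer

chunks = [  # hypothetical chunks derived from transcripts/ (see "Multi-Level Chunking" below)
    {"file": "example_interview.json", "start": 12.4, "text": "We left for the city in 1952..."},
]

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
embeddings = model.encode([c["text"] for c in chunks], normalize_embeddings=True)

with open("search_index.pkl", "wb") as f:
    pickle.dump({"chunks": chunks, "embeddings": embeddings}, f)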

4. Run the Web Server

Start the FastAPI server:

python main.py

Or with custom options:

python main.py --model BAAI/bge-large-en-v1.5 --device mps

The server will start at http://127.0.0.1:8000

5. Search and Explore

Open your browser to http://127.0.0.1:8000 and use the search interface:

  • Semantic Search: Find content by meaning (e.g., "stories about immigration")
  • Keyword Search: Find exact phrases in transcripts
  • Metadata Search: Search by title, contributor, description, or subjects

Click on any result to play the audio at the relevant timestamp.

Project Structure

storylog/
├── transcribe.py          # Audio transcription using MLX Whisper
├── index.py               # Build semantic and metadata search indices
├── main.py               # FastAPI web server
├── crawl.py              # Crawler for Library of Congress collections
├── crawl_oral_histories.py    # Crawler for general oral histories
├── crawl_voices_of_slavery.py # Collection-specific crawler
├── backfill_audio_levels.py   # Backfill audio-level data for existing items
├── stories/              # Downloaded audio files and metadata
├── transcripts/          # Generated transcript JSON files
├── metadata_index/       # Whoosh full-text index
├── search_index.pkl      # Semantic search embeddings
├── static/               # Web UI (index.html)
└── requirements.txt      # Python dependencies

Configuration

Environment Variables

  • SEMANTIC_MODEL: Sentence transformer model for semantic search (default: BAAI/bge-large-en-v1.5)
  • DEVICE: Override device selection (mps, cuda, or cpu)

Model Selection

Transcription Model: Set in transcribe.py:

  • Default: mlx-community/whisper-large-v3-mlx (high quality, slower)
  • Alternative: mlx-community/whisper-small-mlx (faster, lower quality)

Semantic Search Model: Set via --model flag or SEMANTIC_MODEL env var:

  • Default: BAAI/bge-large-en-v1.5 (high quality, larger)
  • Alternative: sentence-transformers/all-MiniLM-L6-v2 (faster, smaller)

Audio Processing

The transcription pipeline includes audio preprocessing tuned to recording quality (see the example after this list):

  • High-quality audio (≥48kHz, ≥24-bit): Minimal filtering, preserves full frequency range
  • Low-quality audio: Aggressive filtering to remove artifacts (high-pass 200Hz, low-pass 3500Hz)
  • Volume normalization: Analyzes audio levels for automatic playback normalization
  • Format conversion: All audio converted to 16kHz mono WAV for Whisper
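As an illustration of the low-quality path described above, the FFmpeg invocation below applies the same filter values and converts to 16 kHz mono WAV. transcribe.py chooses the filter chain based on the detected sample rate and bit depth, so treat this as a sketch rather than the exact command it runs.

# Illustrative cleanup pass for low-quality audio (filter values as listed above).
import subprocess

def clean_for_whisper(src: str, dst: str) -> None:
    """Convert to 16 kHz mono WAV with high-pass/low-pass filtering via FFmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-af", "highpass=f=200,lowpass=f=3500",
         "-ac", "1", "-ar", "16000", dst],
        check=True,
    )

clean_for_whisper("stories/example_interview.mp3", "example_interview_clean.wav")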

Search Features

Multi-Level Chunking

The semantic index uses multi-level chunking to balance precision and context (see the sketch after this list):

  • Small chunks (default: 15 words): Precise matches, exact phrase finding
  • Large chunks (default: 75 words): Broader context, semantic understanding
  • Overlap: Sliding window prevents missing content at chunk boundaries
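A simplified sliding-window chunker using the default word counts is sketched below. The real index.py operates on transcript segments and preserves timestamps, so this only illustrates the overlap logic; the small-chunk overlap value is an assumption.

# Overlapping word-window chunking (simplified; timestamps omitted).
def chunk_words(words: list[str], target: int = 75, overlap: int = 25) -> list[str]:
    """Split a word list into chunks of roughly `target` words, overlapping by `overlap`."""
    step = max(target - overlap, 1)
    return [" ".join(words[i:i + target]) for i in range(0, max(len(words) - overlap, 1), step)]

words = ("The interview covers farm life in the early 1930s " * 40).split()   # stand-in text
large_chunks = chunk_words(words)                          # ~75-word chunks, 25-word overlap
small_chunks = chunk_words(words, target=15, overlap=5)    # precision-level chunks (overlap assumed)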

Search Modes

  1. Semantic: Finds content by meaning using cosine similarity on sentence embeddings (see the sketch after this list)
  2. Keyword: Simple substring matching in transcript text
  3. Metadata: Full-text search across collection metadata fields
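At query time, semantic mode can be as simple as a dot product against normalized embeddings. The sketch below reuses the chunks/embeddings layout from the indexing sketch; main.py may differ in scoring and result formatting.

# Hedged query-time scoring sketch; assumes normalized embeddings as built above.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def semantic_search(query: str, chunks: list[dict], embeddings: np.ndarray, top_k: int = 5):
    """Rank chunks by cosine similarity between the query and chunk embeddings."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ q                 # cosine similarity, since vectors are unit-length
    best = np.argsort(scores)[::-1][:top_k]
    return [(float(scores[i]), chunks[i]) for i in best]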

API Endpoints

  • GET /search?q=<query>&mode=<semantic|keyword|metadata>&field=<all|title|contributors|description|subjects>
  • GET /transcript/<file_name>: Get transcript data with audio levels
  • GET /metadata/<item_id>: Get full metadata for an item
  • GET /stories/<path>: Serve audio files
  • GET /: Web UI
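The endpoints can be exercised with any HTTP client. A hypothetical example with the requests library is shown below; the response field names are not documented here, so inspect the JSON you get back.

# Example client calls against a locally running server (default host and port).
import requests

BASE = "http://127.0.0.1:8000"

semantic = requests.get(f"{BASE}/search", params={"q": "stories about immigration", "mode": "semantic"})
metadata = requests.get(f"{BASE}/search", params={"q": "railroad", "mode": "metadata", "field": "subjects"})
print(semantic.json())
print(metadata.json())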

Performance Notes

  • Transcription: ~1-2x real-time on Apple Silicon (M1/M2/M3)
  • Indexing: Embedding generation is the bottleneck; use GPU when available
  • Search: Semantic search is fast (<100ms) with pre-computed embeddings
  • Memory: Large collections may require significant RAM for embeddings

Data Sources

This project is designed to work with oral history collections from:

  • Library of Congress (American Folklife Center, Veterans History Project, etc.)
  • Other institutions with similar API structures

Note: Always respect terms of service and rate limits when crawling. The included crawlers implement rate limiting and respectful scraping practices.

License

This project is provided as-is for educational and research purposes. Oral history content remains the property of the respective institutions and creators.

Contributing

Contributions welcome! Please open an issue or pull request.

Acknowledgments

  • MLX Whisper for efficient transcription
  • Sentence Transformers for semantic search
  • Library of Congress for making oral history collections available
  • FastAPI and Whoosh for the search infrastructure
