StoryLog

A searchable archive system for oral history audio collections. StoryLog transcribes audio files using MLX Whisper, creates semantic search indices, and provides a web interface for discovering and playing oral history interviews.

Features

  • Audio Transcription: Automatic transcription of oral history audio files using MLX Whisper (large-v3 model)
  • Semantic Search: Multi-level chunking and semantic search using sentence transformers for finding relevant content
  • Keyword Search: Traditional keyword-based search for exact phrase matching
  • Metadata Search: Full-text search across collection metadata (titles, contributors, descriptions, subjects)
  • Audio Playback: Web-based audio player with automatic volume normalization for quiet recordings
  • Collection Crawling: Tools for downloading oral history collections from Library of Congress and other sources

Architecture

  • Backend: FastAPI server with semantic search (sentence transformers) and metadata indexing (Whoosh)
  • Frontend: Single-page web application with Tailwind CSS
  • Transcription: MLX Whisper for efficient GPU-accelerated transcription
  • Audio Processing: FFmpeg for audio cleaning and normalization

Requirements

  • Python 3.9+
  • FFmpeg (for audio processing)
  • MLX-compatible hardware (Apple Silicon) or CUDA GPU (for transcription)
  • PyTorch with MPS/CUDA support (for semantic search)

Installation

  1. Clone the repository:
git clone https://github.com/jasontitus/storylog.git
cd storylog
  2. Create a virtual environment and install dependencies:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
  3. Download the MLX Whisper model (automatically downloaded on first use):
    • Model: mlx-community/whisper-large-v3-mlx

Usage

1. Download Oral History Collections

Use the crawler scripts to download collections from Library of Congress:

# Crawl American Folklife Center oral histories
python crawl.py

# Crawl general oral histories
python crawl_oral_histories.py

# Crawl specific collections (see individual crawl scripts)
python crawl_voices_of_slavery.py

Audio files and metadata will be saved to the stories/ directory (or collection-specific directories).
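For reference, the sketch below shows the general pattern the crawlers follow, assuming the Library of Congress JSON API (fo=json). The collection slug, page count, and output layout are illustrative; the real scripts also download audio, handle retries, and deal with collection-specific details.

# Hedged sketch of polite LOC crawling; not a drop-in replacement for crawl.py.
import json
import time
from pathlib import Path

import requests

COLLECTION_URL = "https://www.loc.gov/collections/civil-rights-history-project/"  # example slug
OUT_DIR = Path("stories")

def crawl_collection(pages: int = 2, delay_s: float = 2.0) -> None:
    """Fetch collection metadata pages from the LOC JSON API, pausing between requests."""
    OUT_DIR.mkdir(exist_ok=True)
    for page in range(1, pages + 1):
        resp = requests.get(COLLECTION_URL, params={"fo": "json", "sp": page}, timeout=30)
        resp.raise_for_status()
        for item in resp.json().get("results", []):
            item_id = str(item.get("id", "")).rstrip("/").split("/")[-1]
            (OUT_DIR / f"{item_id}.json").write_text(json.dumps(item, indent=2))
        time.sleep(delay_s)  # stay well under the rate limits noted in "Data Sources"

if __name__ == "__main__":
    crawl_collection()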

2. Transcribe Audio Files

Transcribe all audio files in the stories directory:

python transcribe.py

Options:

  • --languages en,es: Only keep transcripts in specified languages (default: en,es)
  • --skip-non-language: Skip saving a transcript when its detected language is not in the --languages list
  • --keep-clean-audio: Keep cleaned audio WAV files (uses significant disk space)

The transcription process (see the sketch after this list):

  • Cleans audio using FFmpeg (high-pass/low-pass filtering based on quality)
  • Detects language from first 30 seconds
  • Transcribes using MLX Whisper
  • Filters out repetitive junk and hallucinations
  • Saves transcripts to transcripts/ directory
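The core step is a single call to the mlx-whisper package. The minimal sketch below shows that call in isolation, with a hypothetical input file and a simplified output layout; the cleaning, language filtering, and hallucination checks live in transcribe.py and are omitted here.

# Minimal transcription sketch; the output JSON layout here is illustrative only.
import json
from pathlib import Path

import mlx_whisper

AUDIO = "stories/example_interview.mp3"            # hypothetical input file
OUT = Path("transcripts/example_interview.json")   # hypothetical output path

result = mlx_whisper.transcribe(AUDIO, path_or_hf_repo="mlx-community/whisper-large-v3-mlx")
OUT.parent.mkdir(exist_ok=True)
OUT.write_text(json.dumps({"language": result["language"], "segments": result["segments"]}, indent=2))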

3. Build Search Indices

Create semantic and metadata search indices:

# Build semantic index with default settings
python index.py

# Customize chunking parameters
python index.py --target-words 75 --overlap-words 25 --small-chunk-words 15

# Use a different sentence transformer model
python index.py --model sentence-transformers/all-MiniLM-L6-v2

# Disable multi-level chunking (single chunk size only)
python index.py --no-multi-level

The indexing process (see the sketch after this list):

  • Creates semantic embeddings for transcript chunks
  • Builds Whoosh full-text index for metadata
  • Supports multi-level chunking (small chunks for precision, large chunks for context)
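A rough sketch of the embedding step, assuming the sentence-transformers package, is shown below. The chunk records and pickle layout are assumptions for illustration; the real index.py also builds the Whoosh metadata index and multiple chunk sizes.

# Illustrative embedding pass; chunk fields and index layout are assumptions.
import pickle

from sentence_transformers import SentenceTransformer

chunks = [  # hypothetical chunks derived from transcripts/ (see "Multi-Level Chunking" below)
    {"file": "example_interview.json", "start": 12.4, "text": "We left for the city in 1952..."},
]

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
embeddings = model.encode([c["text"] for c in chunks], normalize_embeddings=True)

with open("search_index.pkl", "wb") as f:
    pickle.dump({"chunks": chunks, "embeddings": embeddings}, f)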

4. Run the Web Server

Start the FastAPI server:

python main.py

Or with custom options:

python main.py --model BAAI/bge-large-en-v1.5 --device mps

The server will start at http://127.0.0.1:8000

5. Search and Explore

Open your browser to http://127.0.0.1:8000 and use the search interface:

  • Semantic Search: Find content by meaning (e.g., "stories about immigration")
  • Keyword Search: Find exact phrases in transcripts
  • Metadata Search: Search by title, contributor, description, or subjects

Click on any result to play the audio at the relevant timestamp.

Project Structure

storylog/
├── transcribe.py          # Audio transcription using MLX Whisper
├── index.py               # Build semantic and metadata search indices
├── main.py               # FastAPI web server
├── crawl.py              # Crawler for Library of Congress collections
├── crawl_oral_histories.py    # Crawler for general oral histories
├── crawl_voices_of_slavery.py # Collection-specific crawler
├── backfill_audio_levels.py   # Backfill audio-level data for existing items
├── stories/              # Downloaded audio files and metadata
├── transcripts/          # Generated transcript JSON files
├── metadata_index/       # Whoosh full-text index
├── search_index.pkl      # Semantic search embeddings
├── static/               # Web UI (index.html)
└── requirements.txt      # Python dependencies

Configuration

Environment Variables

  • SEMANTIC_MODEL: Sentence transformer model for semantic search (default: BAAI/bge-large-en-v1.5)
  • DEVICE: Override device selection (mps, cuda, or cpu)

Model Selection

Transcription Model: Set in transcribe.py:

  • Default: mlx-community/whisper-large-v3-mlx (high quality, slower)
  • Alternative: mlx-community/whisper-small-mlx (faster, lower quality)

Semantic Search Model: Set via --model flag or SEMANTIC_MODEL env var:

  • Default: BAAI/bge-large-en-v1.5 (high quality, larger)
  • Alternative: sentence-transformers/all-MiniLM-L6-v2 (faster, smaller)

Audio Processing

The transcription pipeline includes audio preprocessing tuned to recording quality (see the example after this list):

  • High-quality audio (≥48kHz, ≥24-bit): Minimal filtering, preserves full frequency range
  • Low-quality audio: Aggressive filtering to remove artifacts (high-pass 200Hz, low-pass 3500Hz)
  • Volume normalization: Analyzes audio levels for automatic playback normalization
  • Format conversion: All audio converted to 16kHz mono WAV for Whisper
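As an illustration of the low-quality path described above, the FFmpeg invocation below applies the same filter values and converts to 16 kHz mono WAV. transcribe.py chooses the filter chain based on the detected sample rate and bit depth, so treat this as a sketch rather than the exact command it runs.

# Illustrative cleanup pass for low-quality audio (filter values as listed above).
import subprocess

def clean_for_whisper(src: str, dst: str) -> None:
    """Convert to 16 kHz mono WAV with high-pass/low-pass filtering via FFmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-af", "highpass=f=200,lowpass=f=3500",
         "-ac", "1", "-ar", "16000", dst],
        check=True,
    )

clean_for_whisper("stories/example_interview.mp3", "example_interview_clean.wav")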

Search Features

Multi-Level Chunking

The semantic index uses multi-level chunking to balance precision and context (see the sketch after this list):

  • Small chunks (default: 15 words): Precise matches, exact phrase finding
  • Large chunks (default: 75 words): Broader context, semantic understanding
  • Overlap: Sliding window prevents missing content at chunk boundaries
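A simplified sliding-window chunker using the default word counts is sketched below. The real index.py operates on transcript segments and preserves timestamps, so this only illustrates the overlap logic; the small-chunk overlap value is an assumption.

# Overlapping word-window chunking (simplified; timestamps omitted).
def chunk_words(words: list[str], target: int = 75, overlap: int = 25) -> list[str]:
    """Split a word list into chunks of roughly `target` words, overlapping by `overlap`."""
    step = max(target - overlap, 1)
    return [" ".join(words[i:i + target]) for i in range(0, max(len(words) - overlap, 1), step)]

words = ("The interview covers farm life in the early 1930s " * 40).split()   # stand-in text
large_chunks = chunk_words(words)                          # ~75-word chunks, 25-word overlap
small_chunks = chunk_words(words, target=15, overlap=5)    # precision-level chunks (overlap assumed)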

Search Modes

  1. Semantic: Finds content by meaning using cosine similarity on sentence embeddings (see the sketch after this list)
  2. Keyword: Simple substring matching in transcript text
  3. Metadata: Full-text search across collection metadata fields
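At query time, semantic mode can be as simple as a dot product against normalized embeddings. The sketch below reuses the chunks/embeddings layout from the indexing sketch; main.py may differ in scoring and result formatting.

# Hedged query-time scoring sketch; assumes normalized embeddings as built above.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def semantic_search(query: str, chunks: list[dict], embeddings: np.ndarray, top_k: int = 5):
    """Rank chunks by cosine similarity between the query and chunk embeddings."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ q                 # cosine similarity, since vectors are unit-length
    best = np.argsort(scores)[::-1][:top_k]
    return [(float(scores[i]), chunks[i]) for i in best]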

API Endpoints

  • GET /search?q=<query>&mode=<semantic|keyword|metadata>&field=<all|title|contributors|description|subjects>
  • GET /transcript/<file_name>: Get transcript data with audio levels
  • GET /metadata/<item_id>: Get full metadata for an item
  • GET /stories/<path>: Serve audio files
  • GET /: Web UI
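The endpoints can be exercised with any HTTP client. A hypothetical example with the requests library is shown below; the response field names are not documented here, so inspect the JSON you get back.

# Example client calls against a locally running server (default host and port).
import requests

BASE = "http://127.0.0.1:8000"

semantic = requests.get(f"{BASE}/search", params={"q": "stories about immigration", "mode": "semantic"})
metadata = requests.get(f"{BASE}/search", params={"q": "railroad", "mode": "metadata", "field": "subjects"})
print(semantic.json())
print(metadata.json())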

Performance Notes

  • Transcription: ~1-2x real-time on Apple Silicon (M1/M2/M3)
  • Indexing: Embedding generation is the bottleneck; use GPU when available
  • Search: Semantic search is fast (<100ms) with pre-computed embeddings
  • Memory: Large collections may require significant RAM for embeddings

Data Sources

This project is designed to work with oral history collections from:

  • Library of Congress (American Folklife Center, Veterans History Project, etc.)
  • Other institutions with similar API structures

Note: Always respect terms of service and rate limits when crawling. The included crawlers implement rate limiting and respectful scraping practices.

License

This project is provided as-is for educational and research purposes. Oral history content remains the property of the respective institutions and creators.

Contributing

Contributions welcome! Please open an issue or pull request.

Acknowledgments

  • MLX Whisper for efficient transcription
  • Sentence Transformers for semantic search
  • Library of Congress for making oral history collections available
  • FastAPI and Whoosh for the search infrastructure
