allenhutchison/podcast-rag

Podcast Transcription using Whisper

This project provides a Python-based tool to automate the transcription of podcasts using the Whisper model. The tool processes directories of podcast MP3 files, transcribes them, and outputs the results in a text format.

Features

  • Download podcasts directly from RSS feeds
  • Batch transcription of MP3 files
  • Dry-run mode to preview files without performing transcription or downloads
  • Logging for detailed process tracking
  • Support for environment configuration using .env

Installation

To use this tool, you'll need to set up a Python environment with the required dependencies and install Whisper for transcription.

Prerequisites

  1. Python 3.8+

  2. Whisper

  3. Install ffmpeg:

  4. Install dependencies:

pip install -r requirements.txt

Setup

  1. Clone the repository:
git clone https://github.com/yourusername/podcast-transcription
cd podcast-transcription
  2. Set up the environment variables by creating a .env file, or export them directly:
export MEDIA_EMBED_BASE_DIRECTORY="/path/to/your/podcasts"
export MEDIA_EMBED_WHISPER_PATH="/path/to/whisper"
  3. Run the transcription tool in dry-run mode:
python src/transcribe_podcasts.py --dry-run

Configuration

The configuration is managed via environment variables:

  • MEDIA_EMBED_BASE_DIRECTORY: Base directory containing podcast subdirectories.
  • MEDIA_EMBED_WHISPER_PATH: Path to the Whisper binary.

The default values can be found in the config.py file.
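
As a rough illustration of this pattern (not the project's actual config.py), environment-driven settings with fallbacks might look like the sketch below. The `Config` class, the `load_config` function, and the default values are hypothetical; only the two variable names come from this README.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical container for the two documented settings."""
    base_directory: str
    whisper_path: str


def load_config(env=None):
    """Build a Config from environment variables, with placeholder fallbacks."""
    env = os.environ if env is None else env
    return Config(
        # Placeholder defaults for illustration; the real ones live in config.py.
        base_directory=env.get("MEDIA_EMBED_BASE_DIRECTORY", "/path/to/your/podcasts"),
        whisper_path=env.get("MEDIA_EMBED_WHISPER_PATH", "/path/to/whisper"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests without touching the real environment.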

Usage

To download and transcribe podcasts:

python src/download_and_transcribe.py --feed https://feeds.megaphone.fm/darknetdiaries

To download podcasts from a list of RSS feeds:

python src/download_and_transcribe.py --feed-file podcast_feeds.txt

To perform a dry run:

python src/download_and_transcribe.py --feed-file podcast_feeds.txt --dry-run
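
For readers curious what a dry run amounts to, here is a minimal sketch of the idea: list the work that would be done and change nothing. It assumes a transcript is stored next to each MP3 with a `.txt` suffix, which this README does not confirm, and the function names `pending_transcriptions` and `preview` are hypothetical.

```python
from pathlib import Path


def pending_transcriptions(base_dir):
    """Return MP3 files under base_dir that have no transcript next to them."""
    pending = []
    for mp3 in sorted(Path(base_dir).rglob("*.mp3")):
        # Assumption: the transcript shares the MP3's name with a .txt suffix.
        if not mp3.with_suffix(".txt").exists():
            pending.append(mp3)
    return pending


def preview(base_dir):
    """Dry run: print what would be transcribed and change nothing on disk."""
    pending = pending_transcriptions(base_dir)
    for mp3 in pending:
        print(f"[dry-run] would transcribe {mp3}")
    return pending
```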

To only download podcasts without transcribing:

python src/download_and_transcribe.py --feed-file podcast_feeds.txt --skip-transcription

To only transcribe existing podcasts without downloading:

python src/download_and_transcribe.py --skip-download

To download only the latest 3 episodes from each feed:

python src/download_and_transcribe.py --feed-file podcast_feeds.txt --limit 3

To skip ChromaDB vector database operations:

python src/download_and_transcribe.py --feed-file podcast_feeds.txt --skip-vectordb

To run only the transcription tool:

python src/transcribe_podcasts.py
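
The flags shown above could be declared with Python's standard argparse module along the following lines. This is a sketch for orientation only: the flag names are taken from the commands in this section, while the help strings, defaults, and the `build_parser` function are assumptions and may differ from the real script.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags documented in this README."""
    parser = argparse.ArgumentParser(description="Download and transcribe podcasts")
    parser.add_argument("--feed", help="URL of a single RSS feed")
    parser.add_argument("--feed-file", help="file listing RSS feed URLs")
    parser.add_argument("--limit", type=int, default=None,
                        help="download only the latest N episodes per feed")
    parser.add_argument("--dry-run", action="store_true",
                        help="preview work without downloading or transcribing")
    parser.add_argument("--skip-download", action="store_true",
                        help="transcribe existing files only")
    parser.add_argument("--skip-transcription", action="store_true",
                        help="download only")
    parser.add_argument("--skip-vectordb", action="store_true",
                        help="skip ChromaDB vector database operations")
    return parser
```

argparse converts dashes to underscores, so `--feed-file` is read back as `args.feed_file` and `--dry-run` as `args.dry_run`.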

Podcast Downloading

The podcast downloader can be used independently of the transcription system:

python src/podcast_downloader.py --feed https://example.com/podcast.xml

Features of the podcast downloader:

  • Downloads episodes directly from RSS feeds
  • Automatically organizes podcasts into directories by podcast name
  • Preserves episode metadata (ID3 tags)
  • Can limit downloads to the most recent episodes
  • Can filter episodes by publication date
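
To make the last two features concrete, the sketch below shows one way to pick episodes from an RSS document using only the standard library. It is not the downloader's actual code: the `latest_episodes` function and its `limit` and `since` parameters are hypothetical, and it assumes every `pubDate` includes a timezone so that dates compare cleanly.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def latest_episodes(rss_xml, limit=None, since=None):
    """Return (title, published, audio_url) tuples, newest first."""
    episodes = []
    for item in ET.fromstring(rss_xml).iter("item"):
        enclosure = item.find("enclosure")
        pub_date = item.findtext("pubDate")
        if enclosure is None or pub_date is None:
            continue  # skip items without audio or a publication date
        published = parsedate_to_datetime(pub_date)
        # `since` must be timezone-aware to compare with feed dates.
        if since is not None and published < since:
            continue
        title = item.findtext("title", default="")
        episodes.append((title, published, enclosure.get("url")))
    episodes.sort(key=lambda ep: ep[1], reverse=True)
    return episodes if limit is None else episodes[:limit]
```

Sorting by the parsed date rather than trusting feed order matters because RSS does not guarantee that items are listed newest first.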

Logging

The tool uses Python's built-in logging module to track progress and errors. By default, logs are written to the console; adding a file handler sends them to a file as well.
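
As one possible way to send logs to a file, the sketch below configures the standard logging module with a console handler and an optional file handler. The `configure_logging` function, its parameters, and the format string are illustrative assumptions, not the project's own setup.

```python
import logging


def configure_logging(log_file=None, level=logging.INFO):
    """Log to the console and, if log_file is given, to that file too."""
    handlers = [logging.StreamHandler()]
    if log_file is not None:
        handlers.append(logging.FileHandler(log_file, encoding="utf-8"))
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        handlers=handlers,
        force=True,  # replace handlers set up earlier (available since Python 3.8)
    )
```

Calling `configure_logging("transcribe.log")` once at startup would keep console output and also append the same records to `transcribe.log`; the file name here is just an example.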

Testing

Unit tests can be run using pytest. To install pytest:

pip install pytest

To run the tests:

pytest

Contributing

Contributions are welcome! Please submit a pull request with any improvements or bug fixes. Ensure all tests pass before submitting your PR.

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.

About

Using a set of MP3 podcasts, create a RAG System to work with the model of your choice.
