Podcast Transcription using Whisper

This project provides a Python tool that automates podcast transcription with the Whisper model. It processes directories of podcast MP3 files, transcribes each episode, and writes the transcripts as text.

Features

Download podcasts directly from RSS feeds
Batch transcription of MP3 files
Dry-run mode to preview files without performing transcription or downloads
Logging for detailed process tracking
Support for environment configuration using .env

Installation

To use this tool, you'll need to set up a Python environment with the required dependencies and install Whisper for transcription.

Prerequisites

Python 3.8+
Whisper
Install ffmpeg:
- Linux:
```
sudo apt-get install ffmpeg
```
- macOS (using Homebrew):
```
brew install ffmpeg
```
- Windows: download and install ffmpeg from https://ffmpeg.org/download.html.
Install dependencies:

pip install -r requirements.txt

Setup

Clone the repository:

git clone https://github.com/allenhutchison/podcast-rag
cd podcast-rag

Set up the environment variables by creating a .env file, or export them directly:

export MEDIA_EMBED_BASE_DIRECTORY="/path/to/your/podcasts"
export MEDIA_EMBED_WHISPER_PATH="/path/to/whisper"

Run the transcription tool in dry-run mode:

python src/transcribe_podcasts.py --dry-run
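
Conceptually, a dry run walks the base directory and reports which MP3 files would be transcribed, without invoking Whisper. The sketch below illustrates that idea only. The function names and the convention that a transcript sits next to its MP3 with a `.txt` suffix are assumptions made for this example, not the project's confirmed behavior.

```python
import logging
from pathlib import Path


def find_pending_mp3s(base_dir):
    """Return MP3 files under base_dir that have no sibling .txt transcript."""
    pending = []
    for mp3 in sorted(Path(base_dir).rglob("*.mp3")):
        # Assumed convention: episode.mp3 -> episode.txt in the same directory.
        if not mp3.with_suffix(".txt").exists():
            pending.append(mp3)
    return pending


def dry_run(base_dir):
    """Log what would be transcribed and return the list, without doing any work."""
    pending = find_pending_mp3s(base_dir)
    for mp3 in pending:
        logging.info("Would transcribe: %s", mp3)
    return pending
```

Calling `dry_run` on the directory named by `MEDIA_EMBED_BASE_DIRECTORY` would list the episodes still waiting for a transcript under these assumptions.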

Configuration

The configuration is managed via environment variables:

MEDIA_EMBED_BASE_DIRECTORY: Base directory containing podcast subdirectories.
MEDIA_EMBED_WHISPER_PATH: Path to the Whisper binary.

The default values can be found in the config.py file.
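
As an illustration of environment-driven configuration, the following sketch reads the two variables above and falls back to defaults when they are unset. The `Config` class, the `load_config` function, and the placeholder default paths are hypothetical; the project's real defaults are whatever config.py defines.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    base_directory: str
    whisper_path: str


def load_config(env=None):
    """Build a Config from environment variables, with placeholder fallbacks."""
    env = os.environ if env is None else env
    return Config(
        # Placeholder defaults for illustration only; see config.py for the real ones.
        base_directory=env.get("MEDIA_EMBED_BASE_DIRECTORY", "/path/to/your/podcasts"),
        whisper_path=env.get("MEDIA_EMBED_WHISPER_PATH", "/path/to/whisper"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests without touching the real environment.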

Usage

To download and transcribe podcasts:

python src/download_and_transcribe.py --feed https://feeds.megaphone.fm/darknetdiaries

To download and transcribe podcasts from a file listing RSS feeds:

python src/download_and_transcribe.py --feed-file podcast_feeds.txt

To perform a dry run:

python src/download_and_transcribe.py --feed-file podcast_feeds.txt --dry-run

To only download podcasts without transcribing:

python src/download_and_transcribe.py --feed-file podcast_feeds.txt --skip-transcription

To only transcribe existing podcasts without downloading:

python src/download_and_transcribe.py --skip-download

To download only the latest 3 episodes from each feed:

python src/download_and_transcribe.py --feed-file podcast_feeds.txt --limit 3
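
To show what limiting a feed to its latest episodes involves, here is a small sketch that parses an RSS 2.0 document with the standard library and keeps the first few audio enclosures. The function name `latest_enclosures` is hypothetical, and the sketch assumes the feed lists its newest episodes first; it is not the project's downloader.

```python
import xml.etree.ElementTree as ET


def latest_enclosures(rss_xml, limit=None):
    """Return (title, audio_url) pairs for the first `limit` items of an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    episodes = []
    for item in root.iterfind("./channel/item"):
        enclosure = item.find("enclosure")
        if enclosure is None or not enclosure.get("url"):
            continue  # skip items that carry no audio attachment
        episodes.append((item.findtext("title", default=""), enclosure.get("url")))
    # Assumption: items appear newest first, so slicing keeps the latest episodes.
    return episodes if limit is None else episodes[:limit]
```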

To skip ChromaDB vector database operations:

python src/download_and_transcribe.py --feed-file podcast_feeds.txt --skip-vectordb

To run only the transcription tool:

python src/transcribe_podcasts.py
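
For orientation, transcribing one file usually comes down to calling the Whisper binary as a subprocess. The sketch below builds such a command using the `--model`, `--output_format`, and `--output_dir` options of the openai-whisper command-line tool. The function names, the `base` model choice, and the decision to write the transcript next to the MP3 are assumptions; the options the project actually passes may differ.

```python
import subprocess
from pathlib import Path


def build_whisper_command(whisper_path, mp3_path, model="base"):
    """Return the argument list for transcribing one MP3 to a .txt file."""
    mp3 = Path(mp3_path)
    return [
        str(whisper_path),
        str(mp3),
        "--model", model,
        "--output_format", "txt",
        # Assumed layout: write the transcript into the episode's own directory.
        "--output_dir", str(mp3.parent),
    ]


def transcribe(whisper_path, mp3_path, dry_run=False):
    """Run Whisper on one file, or only return the command when dry_run is set."""
    cmd = build_whisper_command(whisper_path, mp3_path)
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd
```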

Podcast Downloading

The podcast downloader can be used independently of the transcription system:

python src/podcast_downloader.py --feed https://example.com/podcast.xml

Features of the podcast downloader:

Downloads episodes directly from RSS feeds
Automatically organizes podcasts into directories by podcast name
Preserves episode metadata (ID3 tags)
Can limit downloads to the most recent episodes
Can filter episodes by publication date
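
Two of the features above, organizing by podcast name and filtering by publication date, can be sketched with the standard library alone. Both helper names are hypothetical, and the sanitizing rule is one reasonable choice rather than the rule the downloader actually applies.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from pathlib import Path


def podcast_directory(base_dir, podcast_name):
    """Map a podcast title to a filesystem-safe subdirectory of base_dir."""
    # Drop everything except word characters, hyphens, and spaces.
    safe = re.sub(r"[^\w\- ]+", "", podcast_name).strip().replace(" ", "_")
    return Path(base_dir) / (safe or "unknown_podcast")


def published_on_or_after(pub_date, cutoff):
    """True if an RSS pubDate string (RFC 2822 format) is at or after cutoff.

    cutoff must be a timezone-aware datetime so the comparison is well defined.
    """
    return parsedate_to_datetime(pub_date) >= cutoff
```

For example, `published_on_or_after("Tue, 02 Jan 2024 10:00:00 +0000", datetime(2024, 1, 1, tzinfo=timezone.utc))` evaluates to `True`.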

Logging

The tool uses Python's built-in logging module to track progress and errors. By default, logs are written to the console; adding a file handler to the logging configuration sends them to a file as well.
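
A minimal sketch of sending log output to both the console and a file with the standard logging module follows. The function name `configure_logging` and the format string are illustrative choices, not the project's existing setup.

```python
import logging


def configure_logging(log_file=None, level=logging.INFO):
    """Log to the console, and also to log_file when one is given."""
    handlers = [logging.StreamHandler()]
    if log_file:
        handlers.append(logging.FileHandler(log_file, encoding="utf-8"))
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        handlers=handlers,
        force=True,  # replace any handlers installed earlier (Python 3.8+)
    )
```

Calling `configure_logging("transcribe.log")` once at startup would keep console output and add a persistent log file.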

Testing

Unit tests can be run using pytest. To install pytest:

pip install pytest

To run the tests:

pytest

Contributing

Contributions are welcome! Please submit a pull request with any improvements or bug fixes. Ensure all tests pass before submitting your PR.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
