This project provides a Python-based tool to automate the transcription of podcasts using the Whisper model. The tool processes directories of podcast MP3 files, transcribes them, and outputs the results in a text format.
- Download podcasts directly from RSS feeds
- Batch transcription of MP3 files
- Dry-run mode to preview files without performing transcription or downloads
- Logging for detailed process tracking
- Support for environment configuration using `.env`
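As a rough illustration of how batch processing and dry-run mode fit together, the sketch below walks a base directory, pairs each MP3 with the transcript file it would produce, and only reports the plan when `dry_run` is set. The function names, the `.txt` naming scheme, and the `transcribe` callback are assumptions made for this example, not the project's actual code.

```python
import tempfile
from pathlib import Path


def plan_transcriptions(base_dir):
    """Pair each MP3 under base_dir with the transcript path it would produce."""
    base = Path(base_dir)
    # rglob walks every podcast subdirectory; sorting keeps runs reproducible.
    return [(mp3, mp3.with_suffix(".txt")) for mp3 in sorted(base.rglob("*.mp3"))]


def run(base_dir, dry_run=True, transcribe=None):
    """Return the MP3 files handled; in dry-run mode nothing is transcribed."""
    handled = []
    for mp3, transcript in plan_transcriptions(base_dir):
        if dry_run or transcribe is None:
            print(f"[dry-run] would transcribe {mp3.name} -> {transcript.name}")
        else:
            # Hypothetical callback that would wrap the Whisper invocation.
            transcribe(mp3, transcript)
        handled.append(mp3)
    return handled


if __name__ == "__main__":
    # Demonstrate on a throwaway directory so no real files are touched.
    with tempfile.TemporaryDirectory() as tmp:
        show = Path(tmp) / "example_show"
        show.mkdir()
        (show / "episode1.mp3").touch()
        run(tmp, dry_run=True)
```

Keeping the planning step separate from the execution step means the dry run and the real run see exactly the same file list.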
To use this tool, you'll need to set up a Python environment with the required dependencies and install Whisper for transcription.
- Python 3.8+
- Install `ffmpeg`:
  - Linux: `sudo apt-get install ffmpeg`
  - macOS (using Homebrew): `brew install ffmpeg`
  - Windows: download and install `ffmpeg` from https://ffmpeg.org/download.html.
- Install dependencies: `pip install -r requirements.txt`
- Clone the repository:
git clone https://github.com/yourusername/podcast-transcription
cd podcast-transcription
- Set up the environment variables by creating a `.env` file, or export them directly:
export MEDIA_EMBED_BASE_DIRECTORY="/path/to/your/podcasts"
export MEDIA_EMBED_WHISPER_PATH="/path/to/whisper"
- Run the transcription tool in dry-run mode:
python src/transcribe_podcasts.py --dry-run
The configuration is managed via environment variables:
- `MEDIA_EMBED_BASE_DIRECTORY`: Base directory containing podcast subdirectories.
- `MEDIA_EMBED_WHISPER_PATH`: Path to the Whisper binary.

The default values can be found in the `config.py` file.
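For orientation, reading these two variables in Python might look like the sketch below. The `Settings` class, the `load_settings` function, and the fallback values are hypothetical placeholders chosen for this example; the real defaults are the ones in `config.py`, and loading a `.env` file is not shown here.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    base_directory: Path
    whisper_path: Path


def load_settings(env=None):
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    return Settings(
        # Fallback values are placeholders, not the project's real defaults.
        base_directory=Path(env.get("MEDIA_EMBED_BASE_DIRECTORY", "./podcasts")),
        whisper_path=Path(env.get("MEDIA_EMBED_WHISPER_PATH", "whisper")),
    )


if __name__ == "__main__":
    print(load_settings())
```

Accepting the environment as a parameter makes the function easy to exercise in tests without modifying the real process environment.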
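- Download and transcribe a single feed:
python src/download_and_transcribe.py --feed https://feeds.megaphone.fm/darknetdiaries
- Process every feed listed in a file:
python src/download_and_transcribe.py --feed-file podcast_feeds.txt
- Preview what would be processed without downloading or transcribing:
python src/download_and_transcribe.py --feed-file podcast_feeds.txt --dry-run
- Download only, skipping transcription:
python src/download_and_transcribe.py --feed-file podcast_feeds.txt --skip-transcription
- Skip the download step and work with files already on disk:
python src/download_and_transcribe.py --skip-download
- Limit the number of episodes (here, 3):
python src/download_and_transcribe.py --feed-file podcast_feeds.txt --limit 3
- Skip the vector database step:
python src/download_and_transcribe.py --feed-file podcast_feeds.txt --skip-vectordb
- Run the transcription tool on its own:
python src/transcribe_podcasts.py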
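To make the flags in these commands easier to reason about, here is a minimal `argparse` sketch that accepts the same options. It is an assumption about how such a command line could be declared, not a copy of `download_and_transcribe.py`; the help strings and defaults in particular are guesses.

```python
import argparse


def build_parser():
    """Declare the command-line flags used in the example commands."""
    parser = argparse.ArgumentParser(prog="download_and_transcribe.py")
    parser.add_argument("--feed", help="URL of a single RSS feed")
    parser.add_argument("--feed-file", help="text file listing feed URLs")
    parser.add_argument("--limit", type=int, default=None,
                        help="maximum number of episodes to process")
    # Boolean switches default to False and become True when present.
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--skip-download", action="store_true")
    parser.add_argument("--skip-transcription", action="store_true")
    parser.add_argument("--skip-vectordb", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

Note that `argparse` exposes hyphenated flags with underscores, so `--feed-file` is read as `args.feed_file`.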
The podcast downloader can be used independently of the transcription system:
python src/podcast_downloader.py --feed https://example.com/podcast.xml
Features of the podcast downloader:
- Downloads episodes directly from RSS feeds
- Automatically organizes podcasts into directories by podcast name
- Preserves episode metadata (ID3 tags)
- Can limit downloads to the most recent episodes
- Can filter episodes by publication date
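The directory naming, recency limit, and date filter listed above can be approximated with the standard library alone. The sketch below parses an RSS document that has already been fetched as a string; `podcast_dir_name`, `select_episodes`, and their parameters are hypothetical names for illustration, and the actual downloader may handle feeds differently.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def podcast_dir_name(title):
    """Turn a podcast title into a filesystem-friendly directory name."""
    cleaned = re.sub(r"[^\w\- ]", "", title).strip()
    return re.sub(r"\s+", "_", cleaned)


def select_episodes(rss_xml, since=None, limit=None):
    """Return (podcast_dir, episodes), newest first, filtered by date and count.

    `since` should be a timezone-aware datetime so it can be compared
    with the parsed pubDate values.
    """
    channel = ET.fromstring(rss_xml).find("channel")
    episodes = []
    for item in channel.findall("item"):
        enclosure = item.find("enclosure")
        pub_date = item.findtext("pubDate")
        if enclosure is None or pub_date is None:
            continue  # not a downloadable, dated episode
        published = parsedate_to_datetime(pub_date)
        if since is not None and published < since:
            continue  # older than the requested cutoff
        episodes.append({
            "title": item.findtext("title"),
            "url": enclosure.get("url"),
            "published": published,
        })
    episodes.sort(key=lambda e: e["published"], reverse=True)
    # Slicing with limit=None keeps every episode.
    return podcast_dir_name(channel.findtext("title", "")), episodes[:limit]
```

A call such as `select_episodes(rss_text, since=datetime(2024, 2, 15, tzinfo=timezone.utc), limit=5)` would keep at most five episodes published on or after that date.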
The tool uses Python's built-in logging for tracking progress and errors. By default, logs are displayed in the console, but this can be easily modified to output to a file.
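One way to send the same log output to a file as well as the console is sketched below, using only the standard `logging` module. The logger name, the format string, and the `configure_logging` helper are assumptions for this example rather than the tool's existing setup.

```python
import logging


def configure_logging(log_file=None, level=logging.INFO):
    """Log to the console, and additionally to log_file when one is given."""
    logger = logging.getLogger("podcast_transcription")  # hypothetical logger name
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    handlers = [logging.StreamHandler()]
    if log_file is not None:
        handlers.append(logging.FileHandler(log_file, encoding="utf-8"))
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = configure_logging()  # console only
    log.info("starting transcription run")
```

Passing a path, for example `configure_logging("transcribe.log")`, keeps the console output and appends the same records to that file.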
Unit tests can be run using `pytest`. To install `pytest`:
pip install pytest
To run the tests:
pytest
Contributions are welcome! Please submit a pull request with any improvements or bug fixes. Ensure all tests pass before submitting your PR.
This project is licensed under the Apache 2.0 License. See the `LICENSE` file for details.