Download BBC podcasts, transcribe with Whisper, and chat with transcripts using Gemini AI and RAG
A complete end-to-end system for downloading, transcribing, and intelligently querying BBC audio content. Built entirely with free and open-source tools; no paid APIs required for transcription!
- 100% Free Transcription: Uses OpenAI Whisper locally (no API costs)
- Smart RAG System: Semantic search with ChromaDB vector database
- AI-Powered Chat: Query transcripts using Google's Gemini AI
- Beautiful UI: Intuitive Gradio web interface
- Modern Stack: Managed with uv for fast, reliable dependency management
- Deploy Ready: One-click deployment to HuggingFace Spaces
- Download BBC podcasts via RSS feeds (recommended)
- Support for get_iplayer for BBC iPlayer content
- Batch download multiple episodes
- Popular BBC podcast feeds included
- Powered by OpenAI Whisper (runs on your machine)
- Multiple model sizes: tiny, base, small, medium, large
- No API costs or usage limits
- Batch transcription support
- Automatic audio preprocessing
- Chat with your transcripts using Google Gemini AI
- Retrieval-Augmented Generation (RAG) for accurate answers
- ChromaDB vector database for semantic search
- Source citations for transparency
- Conversation history tracking
- Clean, intuitive Gradio interface
- Three main tabs: Download, Transcribe, Chat
- Real-time progress updates
- File management built-in
- Python 3.9+ - Programming language
- uv - Fast Python package manager and project manager
- Gradio - Web UI framework
- OpenAI Whisper - Speech-to-text transcription (local, free)
- Google Gemini - Large language model for chat (free tier available)
- ChromaDB - Vector database for semantic search
- LangChain - LLM application framework
- feedparser - RSS feed parsing
- requests - HTTP library
- python-dotenv - Environment variable management
- Python 3.9+
- uv (modern Python package manager)
- FFmpeg (for audio processing)
- Google AI API key (free tier available)
- Optional: get_iplayer for BBC iPlayer downloads
# 1. Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone the repository
git clone <your-repo-url>
cd bbc-audio-rag
# 3. Install dependencies
uv sync
# 4. Set up your Google AI API key
cp .env.example .env
# Edit .env and add your GOOGLE_AI_API_KEY
# 5. Run the app
uv run python app.py
Then open your browser to http://localhost:7860 and start downloading, transcribing, and chatting!
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
# Or with pip
pip install uv
cd /home/wut/playground/reith-lecture
# Create virtual environment and install all dependencies
uv sync
This will:
- Create a .venv virtual environment
- Install all dependencies from pyproject.toml
- Set up the project in editable mode
Ubuntu/Debian:
sudo apt update && sudo apt install ffmpeg
macOS:
brew install ffmpeg
Windows: Download from https://ffmpeg.org/download.html
# Ubuntu/Debian
sudo apt install get-iplayer
# macOS
brew install get_iplayer
# Or install from: https://github.com/get-iplayer/get_iplayer
Create a .env file:
cp .env.example .env
Edit .env and add your Google AI API key:
GOOGLE_AI_API_KEY=your_api_key_here
Get a free Google AI API key at: https://makersuite.google.com/app/apikey
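As a rough illustration of how the key configured above might be read at startup, here is a minimal sketch using only the standard library. The function name `get_google_api_key` and the exception `MissingApiKeyError` are hypothetical and are not taken from this project's code; the sketch also assumes the variables from .env have already been loaded into the process environment (python-dotenv's `load_dotenv()` is the usual way to do that).

```python
import os
from typing import Mapping, Optional


class MissingApiKeyError(RuntimeError):
    """Raised when GOOGLE_AI_API_KEY is absent or still a placeholder (hypothetical name)."""


def get_google_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured key, or raise with a hint on how to fix it."""
    # Default to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get("GOOGLE_AI_API_KEY", "").strip()
    # Treat the template value from .env.example as "not configured".
    if not key or key == "your_api_key_here":
        raise MissingApiKeyError(
            "GOOGLE_AI_API_KEY is not set; copy .env.example to .env and add your key."
        )
    return key
```

Failing early with a clear message tends to be easier to debug than letting the first chat request fail.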
This project uses uv for dependency management. pyproject.toml is the source of truth.
To sync your environment with pyproject.toml:
uv sync
If you need a requirements.txt file (e.g., for legacy deployments), you can export it from uv.lock:
# Update lock file first
uv lock
# Export to requirements.txt
uv export --format requirements-txt --no-hashes --no-emit-project > requirements.txt
Run the app:
uv run python app.py
Then open your browser to http://localhost:7860
Download audio from RSS feed:
uv run python -c "from src.scraper.rss_scraper import RSScraper; scraper = RSScraper(); scraper.download_episodes('https://podcasts.files.bbci.co.uk/p00fzl9g.rss', limit=5)"
Transcribe audio:
uv run python -c "from src.transcription.transcriber import WhisperTranscriber; transcriber = WhisperTranscriber(model_size='base'); transcript = transcriber.transcribe_and_save('downloads/episode.mp3'); print(transcript)"
Chat with transcripts:
uv run python -c "from src.chat.chat_engine import ChatEngine; chat = ChatEngine(); response = chat.ask('What are the main themes discussed?'); print(response['response'])"
reith-lecture/
├── app.py                          # Main Gradio application
├── config.py                       # Configuration management
├── requirements.txt                # Python dependencies
├── .env.example                    # Environment variables template
├── README.md                       # This file
├── src/
│   ├── scraper/
│   │   ├── bbc_scraper.py          # BBC website scraper
│   │   ├── get_iplayer_wrapper.py  # get_iplayer wrapper
│   │   └── rss_scraper.py          # RSS feed parser
│   ├── transcription/
│   │   ├── transcriber.py          # Whisper transcription (FREE)
│   │   └── audio_processor.py      # Audio utilities
│   ├── chat/
│   │   ├── vector_store.py         # ChromaDB for RAG
│   │   └── chat_engine.py          # Google AI chat engine
│   └── utils/
│       ├── logger.py               # Logging utilities
│       └── file_manager.py         # File management
├── downloads/                      # Downloaded audio files
├── transcripts/                    # Generated transcripts
├── data/                           # Vector database
└── tests/                          # Unit tests
- Create a new Space at https://huggingface.co/spaces
- Choose "Gradio" as the SDK
- Upload all files from this project
- Add your GOOGLE_AI_API_KEY in Space Settings → Repository secrets
- Your app will be live!
The Reith Lectures are available as a podcast:
RSS Feed: https://podcasts.files.bbci.co.uk/p00fzl9g.rss
Use the Download tab in the Gradio app or:
from src.scraper.rss_scraper import RSScraper
scraper = RSScraper()
scraper.download_episodes('https://podcasts.files.bbci.co.uk/p00fzl9g.rss', limit=10)
Whisper is slow:
- Use a smaller model: tiny or base for faster transcription
- The medium and large models require significant CPU/GPU resources
- Consider using a GPU-enabled machine for 10-100x speedup
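To make the model-size trade-off concrete, the sketch below picks the largest Whisper model that fits a given memory budget. The memory figures are the approximate VRAM requirements listed in the Whisper README and should be treated as rough guidance only; the helper `pick_model_size` is hypothetical and not part of this project.

```python
# Approximate VRAM needed per model size, in GB (rough figures from the Whisper README).
APPROX_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

# Sizes ordered from fastest/least accurate to slowest/most accurate.
MODEL_ORDER = ["tiny", "base", "small", "medium", "large"]


def pick_model_size(available_gb: float) -> str:
    """Return the largest model whose approximate requirement fits the budget.

    Falls back to "tiny" when even the smallest requirement is not met.
    """
    chosen = "tiny"
    for name in MODEL_ORDER:
        if APPROX_VRAM_GB[name] <= available_gb:
            chosen = name
    return chosen


print(pick_model_size(4))
```

The returned name could then be passed as the `model_size` argument shown in the transcription examples.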
Out of memory errors:
- Switch to a smaller Whisper model
- Process shorter audio segments
- Close other applications to free up RAM
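One way to "process shorter audio segments" is to compute fixed-length windows over the recording and transcribe each window separately. The sketch below only computes the window boundaries; the function name `segment_bounds`, the 10-minute default, and the small overlap (so that words at a cut point are not lost) are illustrative assumptions, not this project's behaviour.

```python
from typing import List, Tuple


def segment_bounds(
    total_seconds: float,
    segment_seconds: float = 600.0,
    overlap_seconds: float = 5.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds covering the whole recording."""
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    if not 0 <= overlap_seconds < segment_seconds:
        raise ValueError("overlap_seconds must be in [0, segment_seconds)")

    bounds = []
    start = 0.0
    step = segment_seconds - overlap_seconds  # how far each window advances
    while start < total_seconds:
        end = min(start + segment_seconds, total_seconds)
        bounds.append((start, end))
        if end >= total_seconds:
            break  # the last window already reaches the end of the audio
        start += step
    return bounds


print(segment_bounds(1500, segment_seconds=600, overlap_seconds=0))
```

Each (start, end) pair could then be cut out with FFmpeg's `-ss` (start) and `-t` (duration) options before transcription.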
FFmpeg not found:
- Make sure FFmpeg is installed: ffmpeg -version
- On Linux: sudo apt install ffmpeg
- On macOS: brew install ffmpeg
- On Windows: Download from https://ffmpeg.org/download.html and add to PATH
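The same check can be done from Python before starting a long transcription job. This is a minimal sketch using only the standard library; the helper name `ffmpeg_version` is hypothetical and not part of this project.

```python
import shutil
import subprocess
from typing import Optional


def ffmpeg_version(executable: str = "ffmpeg") -> Optional[str]:
    """Return the first line of `ffmpeg -version`, or None if FFmpeg is unavailable."""
    # shutil.which searches PATH the same way the shell would.
    path = shutil.which(executable)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "-version"],
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


version = ffmpeg_version()
print(version if version else "FFmpeg not found on PATH")
```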
Audio download fails:
- Check your internet connection
- Verify the RSS feed URL is correct
- Some BBC content may be region-restricted
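When checking a feed URL, it can help to confirm that the feed actually lists audio enclosures. The sketch below parses an RSS 2.0 document that has already been fetched and returns the audio links it finds. It uses the standard library's XML parser purely to stay self-contained (the project itself lists feedparser for this job), and the function name `audio_enclosures` and the sample feed are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def audio_enclosures(rss_xml: str) -> List[Tuple[str, str]]:
    """Return (episode title, audio URL) pairs found in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    results = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        enclosure = item.find("enclosure")
        if enclosure is None:
            continue  # text-only entry, nothing to download
        url = enclosure.get("url", "")
        mime = enclosure.get("type", "")
        if url and mime.startswith("audio/"):
            results.append((title, url))
    return results


# Made-up sample feed; a real feed would be fetched over HTTP first.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Episode one</title>
      <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg" length="123"/>
    </item>
    <item>
      <title>Text-only item</title>
    </item>
  </channel>
</rss>"""

print(audio_enclosures(SAMPLE_FEED))
```

An empty result for a feed that loads correctly in a browser suggests the URL points at a web page or a feed without audio attachments.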
"Google AI API key not configured" error:
- Make sure you've created a .env file (copy from .env.example)
- Add your API key: GOOGLE_AI_API_KEY=your_key_here
- Get a free key at: https://makersuite.google.com/app/apikey
- Restart the application after adding the key
"Model not found" error:
- The app uses the gemini-flash-latest model
- Make sure your API key is valid and active
- Check if you have access to Gemini API in your region
No relevant context found:
- Make sure you've loaded transcripts into the vector store (click the "Load Transcripts" button)
- Try rephrasing your question
- Ensure transcripts exist in the transcripts/ directory
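To see why rephrasing a question can change what is retrieved, here is a toy version of the retrieval step in RAG: score each transcript chunk against the question and keep the best matches above a threshold. It uses word-count vectors and cosine similarity as a stand-in for the learned embeddings a vector database such as ChromaDB would normally use, so it matches only on shared words. All names (`retrieve`, `CHUNKS`, the threshold) are hypothetical and not taken from this project.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(
    question: str,
    chunks: List[Tuple[str, str]],
    top_k: int = 2,
    min_score: float = 0.1,
) -> List[Tuple[float, str, str]]:
    """Return up to top_k (score, source, text) tuples, best match first."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(text))), source, text) for source, text in chunks]
    scored = [s for s in scored if s[0] >= min_score]  # drop weak matches
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]


# Made-up chunks: (source file, chunk text).
CHUNKS = [
    ("lecture1.txt", "The lecture argues that artificial intelligence needs human oversight."),
    ("lecture2.txt", "Climate policy and carbon pricing were the main themes."),
]

for score, source, text in retrieve("What does the lecture say about artificial intelligence?", CHUNKS):
    print(f"{score:.2f} {source}: {text}")
```

When nothing scores above the threshold the list is empty, which corresponds to the "no relevant context found" case; the source name kept with each chunk is what makes citations possible.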
get_iplayer not working:
- Update the cache: get_iplayer --refresh
- Check if BBC iPlayer is available in your region
- RSS feeds are recommended as a more reliable alternative
- Researchers: Analyze BBC documentaries and lectures
- Students: Study and reference educational content
- Journalists: Search through interview archives
- Podcast Enthusiasts: Build a searchable podcast library
- Accessibility: Generate transcripts for hearing-impaired users
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
MIT License - Free for personal and educational use.
This tool is for personal use and educational purposes only.
- Respect BBC's terms of service and copyright
- Do not redistribute downloaded content
- Transcripts are generated by AI and may contain errors
- Use responsibly and ethically
- BBC for providing excellent audio content
- OpenAI for the Whisper model
- Google for Gemini AI
- ChromaDB for the vector database
- All the open-source contributors who made this possible
If you encounter any issues or have questions:
- Check the Troubleshooting section
- Search existing GitHub Issues
- Create a new issue with detailed information
Made with ❤️ for BBC audio enthusiasts and AI learners