Utilities to automate processing of PyBay conference videos for publication on YouTube and PyVideo.
This toolkit helps volunteers prepare PyBay conference videos for publication by:
- Downloading videos from Google Drive
- Fetching talk metadata from the PyBay website
- Renaming videos to a consistent, publication-ready format
Key Design Principles:
- Use public information - Relies on publicly accessible pybay.org pages to avoid requiring volunteers to access complex/paid systems (Sessionize, paid Google Drive accounts, etc.)
- Handle variability - Works with inconsistent input from multiple sources that change year-to-year
- Minimize friction - Designed for volunteers who perform this task once per year
Publishing PyBay videos involves reconciling data from multiple sources with varying quality:
- Speaker-provided data (via Sessionize):
  - Talk titles, descriptions, speaker names
  - We don't control this input - speakers can format names inconsistently
  - Changes format/structure year-to-year
- AV team video filenames:
  - Very loose file naming standards that change slightly every year
  - Example from 2025: Robertson - 1000 - Brousseau - Welcome Remarks.mp4
  - May use different time formats (12hr vs 24hr), varying separators, etc.
  - A different person may handle this each year → different conventions
- Google Drive organization:
  - Videos uploaded by AV team
  - Requires authentication to access
  - Original filenames preserved in metadata
Our solution: Use the official schedule published on the public PyBay website as the authoritative source of truth, then match videos using intelligent token-based matching (room + time + speaker name).
# Clone the repo
git clone https://github.com/pybay/pybay-video-publishing-helpers.git
cd pybay-video-publishing-helpers
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Download and rename all videos in one command:
python src/google_drive_video_downloader.py \
--gdrive-url "https://drive.google.com/drive/folders/YOUR_FOLDER_ID" \
--output-path "pybay_videos_destination" \
--year 2025
This single command automatically:
- ✅ Downloads all videos in parallel (4-8x faster)
- ✅ Saves metadata → _pybay_2025_gdrive_metadata.json
- ✅ Fetches talk data → _pybay_2025_talk_data.json
- ✅ Renames to publication format → Title — Speaker (PyBay 2025).mp4
- ✅ Flags unmatched files for review → ![REVIEW_NEEDED]_filename.mp4
- ✅ Verifies downloads with MD5 checksums
- ✅ Skips already-downloaded files (resumable)
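The checksum verification and skip-if-present behavior can be pictured with a minimal sketch like the one below. It is not the code in file_ops.py; the function names are hypothetical, and it assumes the expected MD5 comes from the saved Google Drive metadata.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def already_downloaded(path: Path, expected_md5: str) -> bool:
    """True when the file exists locally and its checksum matches the expected one."""
    return path.is_file() and md5_of_file(path) == expected_md5.lower()
```

A downloader could call already_downloaded() before each transfer to decide whether to skip the file, and again afterwards to confirm the download is intact.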
Using service account authentication:
export GOOGLE_DRIVE_API_KEY_PYBAY='{"type":"service_account",...}'
python src/google_drive_video_downloader.py \
--gdrive-url "YOUR_FOLDER_ID" \
--output-path "pybay_videos_destination" \
--year 2025 \
--service-account
For volunteers who want more control, or want to rename videos with a different pattern after downloading:
# Step 1: Download only (skip renaming)
python src/google_drive_video_downloader.py \
--gdrive-url "YOUR_FOLDER_ID" \
--output-path "pybay_videos_destination" \
--year 2025 \
--download-only
# Step 2: Rename separately (with dry-run preview first)
python src/file_renamer.py \
--video-dir "pybay_videos_destination" \
--year 2025 \
--dry-run
# Then actually rename
python src/file_renamer.py \
--video-dir "pybay_videos_destination" \
--year 2025
File naming:
Downloaded from GDrive: Robertson - 1000 - Brousseau - Welcome Remarks.mp4
Renamed to: Welcome & Opening Remarks — Chris Brousseau (PyBay 2025).mp4
Downloaded from GDrive: Robertson - 1000 - Pliger - PyScript Talk.mp4
Renamed to: Next Level Python Applications with PyScript — Fabio Pliger & Chris Laffra (PyBay 2024).mp4
Handles talks with multiple speakers (panels, co-presentations):
JSON Format:
{
"talk_title": "Next Level Python Applications with PyScript",
"speakers": [
{"firstname": "Fabio", "lastname": "Pliger"},
{"firstname": "Chris", "lastname": "Laffra"}
]
}
Filename Output:
Next Level Python Applications with PyScript — Fabio Pliger & Chris Laffra (PyBay 2024).mp4
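As a rough illustration of how a talk record in the JSON format above could become the output filename, here is a minimal sketch. The function name build_video_filename is hypothetical and this is not necessarily how file_renamer.py does it; it also does not sanitize characters that are invalid in filenames.

```python
def build_video_filename(talk: dict, year: int) -> str:
    """Build 'Title — Speaker & Speaker (PyBay YEAR).mp4' from one talk record."""
    names = []
    for speaker in talk.get("speakers", []):
        # Join whichever name parts exist so single-name speakers still work.
        parts = [speaker.get("firstname", ""), speaker.get("lastname", "")]
        full_name = " ".join(part.strip() for part in parts if part and part.strip())
        if full_name:
            names.append(full_name)
    speakers = " & ".join(names)
    title = talk["talk_title"].strip()
    if speakers:
        return f"{title} — {speakers} (PyBay {year}).mp4"
    # No usable speaker data: fall back to the title alone.
    return f"{title} (PyBay {year}).mp4"
```

Passing the two-speaker record shown above with year 2024 yields the filename in the example output.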
Matches videos to talk metadata with three data elements:
- Room - Case-insensitive (e.g., Robertson, Fisher)
- Time - Normalized to 24-hour format (handles "10:00 am", "1000", "2:30 pm", "1430")
- Name - Partial matching (handles "van Rossum", "Hatfield-Dodds", single names)
For multi-speaker talks, matches if ANY speaker name appears in the filename.
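The room + time + name matching described above might look roughly like the sketch below. It is an illustration, not the renamer's actual code: the "room" and "time" keys on each talk record and the function names are assumptions, and file_time is assumed to be already normalized to the same 24-hour form as the talk records.

```python
import re
from typing import Optional


def tokenize(text: str) -> set:
    """Lowercase a string and split it into alphanumeric tokens."""
    return set(re.findall(r"[^\W_]+", text.lower()))


def speaker_in_tokens(speaker: dict, tokens: set) -> bool:
    """True when every token of the speaker's name appears in the filename tokens."""
    # Prefer the last name; fall back to the first name for single-name speakers.
    name = speaker.get("lastname") or speaker.get("firstname") or ""
    name_tokens = tokenize(name)
    return bool(name_tokens) and name_tokens <= tokens


def find_talk(filename: str, file_time: str, talks: list) -> Optional[dict]:
    """Return the first talk whose room, time, and any speaker match the filename."""
    tokens = tokenize(filename)
    for talk in talks:
        room_tokens = tokenize(talk.get("room", ""))
        if not room_tokens or not room_tokens <= tokens:
            continue
        if talk.get("time") != file_time:
            continue
        if any(speaker_in_tokens(s, tokens) for s in talk.get("speakers", [])):
            return talk
    return None
```

Because names are compared as token sets, "Hatfield-Dodds" and "van Rossum" match regardless of the separator the AV team used. A return value of None corresponds to a file that would be flagged for manual review.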
- ✅ Multiple speakers joined with " & " (typically 1-2 such talks per year, most recently in 2024)
- ✅ Hyphenated last names (e.g., Hatfield-Dodds)
- ✅ Single names (e.g., no last name, which comes from incomplete Sessionize profiles)
- ✅ Multi-part surnames (e.g., van Rossum)
- ✅ Missing name data (uses whatever is available)
- ✅ Files without matching metadata are flagged for manual review by adding a prefix to the final filename
- README_VIDEO_PUBLISHING_WORKFLOW.md - Complete workflows with diagrams
- README_GOOGLE_DRIVE_SETUP.md - Google Drive auth setup
Some tests exist, but coverage is incomplete and more would help.
Test Coverage:
- Multi-speaker handling (22 tests)
- Web scraping and parsing (13 tests)
- Time normalization (15 tests)
pybay-video-publishing-helpers/
├── src/
│ ├── google_drive_video_downloader.py # Main download script (parallel)
│ ├── file_renamer.py # Token-based renamer
│ ├── scraper_pybayorg_talk_metadata.py # Scrapes pybay.org for talk data
│ ├── google_drive_fetch_metadata.py # Standalone metadata fetcher
│ ├── google_drive_ops.py # Google Drive API operations
│ ├── file_ops.py # File verification utilities
│ └── file_ops_parallel.py # Fast parallel download functions
├── tests/
│ ├── test_multi_speaker.py # Multi-speaker functionality tests
│ ├── test_scraper.py # Scraper function tests
│ └── test_time_normalization.py # Time parsing tests
├── README_VIDEO_PUBLISHING_WORKFLOW.md # Complete workflow documentation
├── README_GOOGLE_DRIVE_SETUP.md # Authentication setup guide
└── requirements.txt # Python dependencies
- Source: https://pybay.org/speaking/talk-list-YYYY/
- Format: Sessionize API HTML
- Contains: Talk titles, speaker names, rooms, times, descriptions
- Saved to: _pybay_YYYY_talk_data.json
- Why: Publicly accessible, authoritative source of truth
- Source: Google Drive API
- Contains: Original filenames from AV provider, file sizes, MD5 checksums
- Saved to: _pybay_YYYY_gdrive_metadata.json
- Why: Preserves audit trail of original AV team filenames
- Current format: {Room} - {Time} - {LastName} - {Title}.mp4
- Final format: {Title} — {FirstName} {LastName} ({Year}).mp4
- Note: The AV team's naming conventions vary year-to-year
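To make the current AV filename pattern concrete, the following sketch splits a name of the form {Room} - {Time} - {LastName} - {Title}.mp4 into its parts. The function name parse_av_filename is hypothetical, and the sketch only covers this one pattern; since conventions vary year-to-year, real filenames may need looser handling.

```python
from pathlib import Path
from typing import Optional


def parse_av_filename(filename: str) -> Optional[dict]:
    """Split '{Room} - {Time} - {LastName} - {Title}.mp4' into named parts."""
    stem = Path(filename).stem  # drop the .mp4 extension
    # Split at most three times so a ' - ' inside the title is preserved.
    parts = [part.strip() for part in stem.split(" - ", 3)]
    if len(parts) != 4 or not all(parts):
        return None  # does not follow the expected pattern
    room, time, lastname, title = parts
    return {"room": room, "time": time, "lastname": lastname, "title": title}
```

A None result signals that the filename does not follow the pattern and needs a fallback or manual review.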
Cause: Last-minute speaker changes, alternate speakers not added to the official schedule in Sessionize, or AV team filename variations
Solution:
- Renamer flags unmatched files for manual review
- Manually rename these files, or
- Add missing entries to _pybay_YYYY_talk_data.json
Cause: Inconsistent time formats between AV team and website
Solution:
- Renamer normalizes all times to 24-hour format automatically
- Handles: 10am, 10:00 am, 1000, 1430, 2:30 pm, etc.
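For readers curious how such normalization can work, here is a minimal sketch covering the formats listed above. It is not the renamer's actual implementation; the function name normalize_time and the "HH:MM" output form are assumptions.

```python
import re
from typing import Optional

# Hour, optional minutes (with or without a colon), optional am/pm suffix.
_TIME_PATTERN = re.compile(r"^\s*(\d{1,2})(?::?(\d{2}))?\s*(am|pm)?\s*$", re.IGNORECASE)


def normalize_time(raw: str) -> Optional[str]:
    """Convert strings like '10am', '1000', or '2:30 pm' to 24-hour 'HH:MM'."""
    match = _TIME_PATTERN.match(raw)
    if not match:
        return None
    hour = int(match.group(1))
    minute = int(match.group(2) or 0)
    meridiem = (match.group(3) or "").lower()
    if meridiem == "pm" and hour != 12:
        hour += 12
    elif meridiem == "am" and hour == 12:
        hour = 0
    if hour > 23 or minute > 59:
        return None  # not a valid clock time
    return f"{hour:02d}:{minute:02d}"
```

With this sketch, "10am", "10:00 am", and "1000" all normalize to "10:00", while "1430" and "2:30 pm" both normalize to "14:30".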
Cause: Fresh download didn't create metadata, or files were deleted
Solution:
# Re-fetch Google Drive metadata (doesn't re-download videos)
python src/google_drive_fetch_metadata.py \
--folder "YOUR_DRIVE_URL" \
--year 2025
# Fetch PyBay website metadata
python src/scraper_pybayorg_talk_metadata.py \
--url "https://pybay.org/speaking/talk-list-2025/" \
--output "pybay_videos_destination/_pybay_2025_talk_data.json"
This is a volunteer-driven project. Contributions welcome!
New Features:
- Upload to SF Python YouTube channel and playlist (needed!)
- Automate creation of metadata for PyVideo
- Improve fuzzy matching for edge cases
- Integrate tqdm progress tracker for better download visibility
Test Coverage Gaps:
Areas without tests:
- Google Drive operations (google_drive_ops.py, google_drive_video_downloader.py)
- File operations (file_ops.py, file_ops_parallel.py)
- Credential checking (google_drive_check_credentials.py)
- Metadata fetching (google_drive_fetch_metadata.py)
Note for Future Volunteers: This repo was designed to tolerate the kinds of changes we have seen over the past few years, but if something breaks, check:
- Has the AV team changed their filename format?
- Has pybay.org changed its URL structure?
- Has Sessionize changed its HTML structure?