A platform for analyzing video content with Google's Gemini models. It downloads YouTube videos, processes them with Gemini 2.0 Flash (model name gemini-2.0-flash), and generates answers to questions about the videos.
This platform streamlines the process of working with video datasets by:
- Fetching datasets from HuggingFace
- Downloading videos using pytubefix
- Uploading videos to Google's File API
- Performing inference with Gemini 2.0 Flash
- Saving results to CSV for analysis
The modular architecture separates concerns, making the workflow maintainable and efficient.
The system is composed of three main components:
┌────────────────────┐     ┌────────────────────┐     ┌────────────────────┐
│  fetch_dataset.py  │ ──► │ download_upload.py │ ──► │    inference.py    │
└────────────────────┘     └────────────────────┘     └────────────────────┘
          │                          │                          │
          ▼                          ▼                          ▼
┌────────────────────┐     ┌────────────────────┐     ┌────────────────────┐
│    dataset.csv     │     │ videos + metadata  │     │    results.csv     │
└────────────────────┘     └────────────────────┘     └────────────────────┘
- Efficient Processing: Checks for previously downloaded/uploaded videos to avoid duplication
- Robust Error Handling: Automatic retries with exponential backoff
- Interactive Progress: Detailed logging and progress bars
- Flexible Configuration: Extensive command-line options
- Metadata Tracking: Comprehensive tracking of video data
- Modular Design: Separates fetching, downloading, and inference
- Resume Support: Can be stopped and resumed at any point
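The duplicate check and resume behaviour listed above could be built on a small tracking file. The following is a minimal sketch, not the project's actual code: the file name downloaded_videos.json comes from this README, but the schema (a mapping from video ID to a status record) and the helper names `load_tracking`, `save_tracking`, and `pending` are assumptions made for illustration.

```python
import json
import os


def load_tracking(path="downloaded_videos.json"):
    """Return the tracking dict, or an empty dict on a first run."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def save_tracking(tracking, path="downloaded_videos.json"):
    """Write to a temp file, then swap it in, so an interrupted
    run does not leave a half-written tracking file behind."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(tracking, fh, indent=2)
    os.replace(tmp, path)


def pending(video_ids, tracking):
    """Keep only the IDs that have no tracking entry yet (assumed schema)."""
    return [vid for vid in video_ids if vid not in tracking]
```

Saving after every processed video, rather than once at the end, is what would let a run be stopped and resumed without redoing finished work.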
- Python 3.8+
- HuggingFace Account with access to the dataset
- Google API Key with access to Gemini models
- Clone this repository:
    git clone https://github.com/dadevchia/tiktokllm.git
    cd tiktokllm
- Install required packages:
    pip install -r requirements.txt
- Log in to HuggingFace (required to access datasets):
    huggingface-cli login
- Create a .env file with your API credentials:
    touch .env
- Add the following to your .env file:
    GEMINI_API_KEY=your_google_api_key_here
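For readers curious how a script might pick up this key, here is a standard-library sketch. It is an assumption, not the project's code: the scripts may well use a package such as python-dotenv instead, and the lookup order shown (explicit value, then process environment, then .env) and the helper names `load_env_file` and `resolve_api_key` are hypothetical.

```python
import os


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    if not os.path.exists(path):
        return values
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(cli_value=None, env_path=".env"):
    """Assumed precedence: explicit value, then environment, then .env."""
    if cli_value:
        return cli_value
    key = os.environ.get("GEMINI_API_KEY") or load_env_file(env_path).get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env or pass --api-key")
    return key
```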
# Basic usage
python fetch_dataset.py
# Advanced usage
python fetch_dataset.py --split test --output custom_dataset.csv --cache-dir ./hf_cache
Options:
- --output: Output CSV file path (default: dataset.csv)
- --split: Dataset split to use (default: test)
- --cache-dir: Cache directory for HuggingFace datasets
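The options above map naturally onto argparse. The sketch below shows one way such a parser could be declared; the flag names and the two stated defaults come from this README, while the function name `build_parser` and the `None` default for --cache-dir are assumptions.

```python
import argparse


def build_parser():
    """Declare the three documented fetch_dataset.py options."""
    parser = argparse.ArgumentParser(
        description="Fetch a HuggingFace dataset and save it as CSV"
    )
    parser.add_argument("--output", default="dataset.csv",
                        help="Output CSV file path")
    parser.add_argument("--split", default="test",
                        help="Dataset split to use")
    # Default is an assumption: None would leave the cache location
    # to the HuggingFace library.
    parser.add_argument("--cache-dir", default=None,
                        help="Cache directory for HuggingFace datasets")
    return parser


# Parsing an explicit list keeps the example independent of sys.argv.
args = build_parser().parse_args(["--output", "custom_dataset.csv"])
```

Note that argparse exposes --cache-dir as `args.cache_dir`, with the hyphen turned into an underscore.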
# Basic usage
python download_upload.py
# Advanced usage
python download_upload.py --batch-size 20 --input-csv custom_dataset.csv
Options:
- --api-key: Your Google Generative AI API key (uses .env by default)
- --batch-size: Number of videos to process in one batch (default: 10)
- --start-index: Index to start from in the dataset (default: 0)
- --max-videos: Maximum number of videos to process (default: all)
- --input-csv: Path to input CSV file (default: dataset.csv)
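To make the interaction between --start-index, --max-videos, and --batch-size concrete, here is a small sketch of how the three could combine. It assumes the dataset rows are held in a list and that --max-videos counts from the start index; the helper names `select_rows` and `iter_batches` are hypothetical.

```python
def select_rows(rows, start_index=0, max_videos=None):
    """Apply --start-index and --max-videos as a single slice."""
    end = None if max_videos is None else start_index + max_videos
    return rows[start_index:end]


def iter_batches(rows, batch_size=10):
    """Yield consecutive chunks of at most batch_size rows."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]


# Example: 25 rows, skip the first 10, process the rest in batches of 10.
rows = list(range(25))
batches = list(iter_batches(select_rows(rows, start_index=10), batch_size=10))
```

Under these assumptions the final batch is simply shorter than the others when the row count is not a multiple of the batch size.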
# Basic usage
python inference.py
# Advanced usage
python inference.py --model gemini-2.0-flash --retry 5 --output custom_results.csv
Options:
- --api-key: Your Google Generative AI API key (uses .env by default)
- --model: Model name to use (default: gemini-2.0-flash)
- --retry: Number of retries for failed inferences (default: 3)
- --start-index: Index to start from in the list of videos (default: 0)
- --max-videos: Maximum number of videos to process (default: all)
- --output: Output CSV file for results (default: results.csv)
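The --retry option, together with the exponential backoff mentioned in the feature list, suggests a retry loop along the following lines. This is a generic sketch of the technique, not the project's implementation: the function name `retry_with_backoff`, the delay values, and the jitter are all assumptions.

```python
import random
import time


def retry_with_backoff(func, max_retries=3, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call func(); after each failure wait base_delay * 2**attempt
    (capped at max_delay, plus up to 10% jitter) and try again."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

With `max_retries=3` the call is attempted at most four times, waiting roughly 1, 2, and 4 seconds between attempts. The `sleep` parameter exists only so the waits can be replaced in tests.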
- dataset.csv: HuggingFace dataset in CSV format
- video_metadata.csv: Metadata about downloaded and uploaded videos
- downloaded_videos.json: Tracking info for downloaded and uploaded videos
- results.csv: Final inference results with qid and pred columns
- videos/: Directory containing downloaded videos organized by QID prefix
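As an illustration of the results.csv layout, the sketch below appends one qid/pred row at a time and reads back the set of question IDs already answered. The two column names come from this README; the helper names `append_result` and `load_done_qids`, and the append-per-row strategy, are assumptions.

```python
import csv
import os


def load_done_qids(path="results.csv"):
    """Return the qids that already have a prediction on disk."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as fh:
        return {row["qid"] for row in csv.DictReader(fh)}


def append_result(qid, pred, path="results.csv"):
    """Append one row, writing the header only for a new or empty file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["qid", "pred"])
        if is_new:
            writer.writeheader()
        writer.writerow({"qid": qid, "pred": pred})
```

Using the csv module rather than manual string joins matters here because model answers often contain commas, quotes, or line breaks.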
A typical workflow might look like:
# 1. Set up environment and fetch dataset
python fetch_dataset.py
# 2. Download first 10 videos for testing
python download_upload.py --max-videos 10
# 3. Run inference on these 10 videos
python inference.py --max-videos 10
# 4. Process the remaining videos in batches
python download_upload.py --start-index 10 --batch-size 20
python inference.py --start-index 10
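The same sequence can be scripted so that a failing step stops the run. The following sketch only wraps the commands shown above with subprocess; the function names `workflow_commands` and `run_workflow` and their parameters are hypothetical and not part of the project.

```python
import subprocess
import sys


def workflow_commands(test_size=10, batch_size=20):
    """Build the five commands of the workflow as argument lists."""
    py = sys.executable  # reuse the interpreter running this script
    return [
        [py, "fetch_dataset.py"],
        [py, "download_upload.py", "--max-videos", str(test_size)],
        [py, "inference.py", "--max-videos", str(test_size)],
        [py, "download_upload.py", "--start-index", str(test_size),
         "--batch-size", str(batch_size)],
        [py, "inference.py", "--start-index", str(test_size)],
    ]


def run_workflow(commands, runner=subprocess.run):
    """Run each step in order; check=True raises on a non-zero exit."""
    for cmd in commands:
        runner(cmd, check=True)
```

Calling `run_workflow(workflow_commands())` from the repository root would run the whole sequence. The `runner` parameter is there so the steps can be dry-run or logged without launching anything.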
- Video Download Failures
  - Check internet connection
  - Verify video still exists on YouTube
  - Try updating pytubefix:
      pip install --upgrade pytubefix
- Google API Errors
  - Verify your API key is correct in the .env file
  - Check if you have access to the specified model
  - Ensure your Google Cloud billing is properly set up
- HuggingFace Access Issues
  - Run huggingface-cli login again
  - Verify you have access to the dataset
  - Check your internet connection
All scripts create detailed logs in the console. For persistent logs, redirect output:
python download_upload.py > download_log.txt 2>&1
tiktokllm/
├── fetch_dataset.py # Script to fetch HuggingFace dataset
├── download_upload.py # Script to download/upload videos
├── inference.py # Script for Gemini inference
├── .env # Environment variables (API keys)
├── requirements.txt # Project dependencies
├── dataset.csv # Generated by fetch_dataset.py
├── video_metadata.csv # Generated by download_upload.py
├── downloaded_videos.json # Generated by download_upload.py
├── results.csv # Generated by inference.py
└── videos/ # Downloaded video files
    └── [QID_PREFIX]/ # Organized by QID prefix
- Start with a small batch (e.g., --max-videos 5) to test your setup
- Run scripts using screen or tmux for long-running processes
- Regularly back up your metadata and tracking files
- Monitor disk space, especially if downloading many videos
- Consider setting up a cron job for large datasets
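For the disk-space tip, a check like the one below could be run before each download batch. It is a sketch using only the standard library; the function name `has_free_space` and the 5 GB threshold are illustrative assumptions, not project settings.

```python
import shutil


def has_free_space(path=".", min_free_gb=5.0):
    """Return True if the filesystem holding path has at least
    min_free_gb gibibytes free."""
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    return free_gb >= min_free_gb
```

A driver script could call `has_free_space("videos")` once the videos directory exists and pause or exit when it returns False.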