A streamlined Retrieval-Augmented Generation (RAG) system for searching and analyzing video content using Google Vertex AI. This pipeline extracts metadata from video files and builds a search interface over it, without any intermediate CSV files.
- Direct Upload Approach: No CSV files needed - uploads directly to RAG corpus
- Smart Video Analysis: Extracts metadata from filenames and content
- Multilingual Support: Handles English and Bahasa Indonesia content
- Content Classification: Automatically categorizes romance/drama, sports, and news content
- Optimized Search: Low similarity threshold for better context retrieval
- Ready-to-Use: Simple functions for immediate video content search
- "Cinta Sedalam Rindu" episodes with character analysis
- Episode number extraction and character relationship mapping
- Automatic keyword generation in English and Indonesian
- Indonesian football/soccer match highlights
- Team extraction (Persebaya, Arema FC, Persija, etc.)
- Match week information (Pekan 13, 14, etc.)
- Sports-specific metadata and keywords
- Liputan 6 news coverage
- Current affairs and journalism content
- Event-based categorization and analysis
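The filename-based extraction described above (episode numbers, teams, match weeks) could be approximated with a few regular expressions. The following is a minimal sketch, not the notebook's actual implementation: the function name sketch_extract_fields, the filename patterns, and the team list are assumptions chosen for illustration, and real filenames may need different rules.

```python
import re
from typing import Dict, List

# Assumed team list for illustration; extend it to match your own library.
KNOWN_TEAMS: List[str] = ["Persebaya", "Arema FC", "Persija"]


def sketch_extract_fields(filename: str) -> Dict[str, object]:
    """Pull episode, match week, and team names out of a video filename (hypothetical rules)."""
    # Drop the extension and turn separators into spaces: "A_B-Ep12.mp4" -> "A B Ep12".
    name = re.sub(r"[_\-.]+", " ", filename.rsplit(".", 1)[0])
    lowered = name.lower()
    fields: Dict[str, object] = {"title": name}

    # Episode markers such as "Episode 12", "Ep12", or "Eps 7".
    episode = re.search(r"\b(?:episode|eps?)\s*(\d+)\b", lowered)
    if episode:
        fields["episode"] = int(episode.group(1))

    # Match weeks such as "Pekan 13".
    week = re.search(r"\bpekan\s*(\d+)\b", lowered)
    if week:
        fields["match_week"] = int(week.group(1))

    # Case-insensitive lookup of known team names.
    teams = [team for team in KNOWN_TEAMS if team.lower() in lowered]
    if teams:
        fields["teams"] = teams
    return fields
```

Fields that cannot be found are simply omitted, so a news clip with no episode or match week yields only a title.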
- Google Cloud Project with Vertex AI API enabled
- Video files in a local directory
- Python environment with required packages
pip install google-cloud-aiplatform google-cloud-storage google-genai pandas
- Configure your project settings:
PROJECT_ID = "your-project-id" # Update this
VIDEO_FOLDER = "video" # Update this path
- Run the complete pipeline:
- Extract video metadata
- Create RAG corpus
- Upload analysis directly (no CSV!)
- Create search interface
- Search your content:
# Search for specific content
search_videos("What romance videos do we have?")
search_videos("Show me sports highlights from Pekan 13")
search_videos("Which videos have episode information?")
# Test the system
test_search_system()
This repository includes sample datasets for testing and evaluation:
- 30 test queries for evaluation
- Query types: Soap opera, Sports, News
- Difficulty levels: Easy, Medium
- Languages: English and Indonesian queries
- Expected answers for quality assessment
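One way to use such a query set is to run every query through the search function and check whether the expected answer appears in the response. The sketch below assumes column names (query, expected_answer, category) that may not match the actual file, and it takes the search function as a parameter so it can be tried with a stub. Substring matching is a crude stand-in for a real quality metric.

```python
import csv
import io
from collections import defaultdict
from typing import Callable, Dict, TextIO


def sketch_evaluate(golden_csv: TextIO, search_fn: Callable[[str], str]) -> Dict[str, float]:
    """Return, per category, the fraction of queries whose answer contains the expected text."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for row in csv.DictReader(golden_csv):
        category = row["category"]        # assumed column name
        answer = search_fn(row["query"])  # assumed column name
        totals[category] += 1
        # Case-insensitive substring check against the assumed expected_answer column.
        if row["expected_answer"].lower() in answer.lower():
            hits[category] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}


# Tiny in-memory stand-in for the real file, plus a stub search function.
sample = io.StringIO(
    "query,expected_answer,category\n"
    "Which matches are from Pekan 13?,Pekan 13,Sports\n"
    "What romance videos do we have?,Cinta Sedalam Rindu,Soap opera\n"
)


def stub_search(query: str) -> str:
    return "Found highlights from Pekan 13" if "Pekan" in query else "No results"


scores = sketch_evaluate(sample, stub_search)
```

With a real corpus, the stub would be replaced by the search function, and the per-category scores give a quick view of which content type retrieves poorly.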
- Sample output format showing metadata extraction results
- 52 entries with comprehensive video analysis
- Includes embeddings and content classification
- Note: The new pipeline doesn't require CSV - this is for reference only
Video Files → Metadata → CSV → Cloud Storage → RAG Import → Search
Video Files → Metadata → Direct RAG Upload → Search
- No CSV intermediate files - saves storage and complexity
- 50% fewer processing steps - faster pipeline
- Direct metadata control - better file organization
- Optimized retrieval settings - improved search accuracy
- Simple debugging - easier troubleshooting
- "Show me videos about Aluna and Galaxy"
- "Which episodes have character relationships?"
- "What romance drama content is available?"
- "Find football highlights from this week"
- "Show me matches between Persebaya and Arema"
- "What sports content do we have?"
- "Tell me about recent news videos"
- "Show me Liputan 6 coverage"
- "What current affairs content is available?"
- Embedding Model: text-multilingual-embedding-002 (supports Bahasa Indonesia)
- Similarity Threshold: 0.1 (optimized for better retrieval)
- Top-K Results: 15 (comprehensive context)
- Chunk Size: 1000 tokens with 200 overlap
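To make the chunking setting concrete, here is a rough sketch of fixed-size chunking with overlap. It operates on a pre-tokenized list and only illustrates what a 1000-token chunk with a 200-token overlap means. The actual splitting is performed by the RAG service, whose tokenizer and boundary rules may differ; the function name is hypothetical.

```python
from typing import List, Sequence


def sketch_chunk(tokens: Sequence[str], chunk_size: int = 1000, overlap: int = 200) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with their neighbor."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each new chunk starts this many tokens after the previous one
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window has reached the end, to avoid a trailing chunk of pure overlap.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the defaults, consecutive chunks start 800 tokens apart, so the last 200 tokens of one chunk are repeated at the start of the next.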
- Filename parsing for metadata extraction
- Character recognition for drama series
- Team and match extraction for sports
- Event categorization for news
- Enhanced query expansion with multilingual terms
- Context-aware retrieval with fallback strategies
- Specific metadata inclusion in responses
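Query expansion with multilingual terms might look roughly like the following. The term map and the function name are illustrative assumptions, not the notebook's actual vocabulary. The idea is simply to append the Indonesian equivalent of an English term (or the reverse) so the multilingual embedding has more to match against.

```python
import re
from typing import Dict, List

# Assumed English -> Bahasa Indonesia term pairs, for illustration only.
TERM_PAIRS: Dict[str, str] = {
    "football": "sepak bola",
    "sports": "olahraga",
    "news": "berita",
    "soap opera": "sinetron",
}


def sketch_expand_query(query: str) -> str:
    """Append the other-language equivalent of any known term found in the query."""
    lowered = query.lower()
    extras: List[str] = []
    for english, indonesian in TERM_PAIRS.items():
        # Check both directions so Indonesian queries gain English terms too.
        for source, target in ((english, indonesian), (indonesian, english)):
            found = re.search(rf"\b{re.escape(source)}\b", lowered)
            if found and target not in lowered and target not in extras:
                extras.append(target)
    if not extras:
        return query  # nothing to add; leave the query untouched
    return f"{query} ({' '.join(extras)})"
```

For example, sketch_expand_query("Find football highlights") returns "Find football highlights (sepak bola)", while a query with no known terms is returned unchanged.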
- Metadata Extraction: Smart analysis of video filenames
- Content Classification: Automatic genre and type detection
- Keyword Generation: Multilingual searchable terms
- Direct Upload: Stream to RAG corpus without CSV
- Search Interface: Optimized retrieval and generation
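The direct upload step can be pictured as rendering each video's metadata into a small text document and handing it straight to an upload function, with no CSV in between. In the sketch below, render_document, upload_all, and the upload_fn callback are hypothetical names. upload_fn stands in for whatever call actually sends a file to the RAG corpus, which is deliberately left out here.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def render_document(meta: Dict[str, object]) -> str:
    """Render one video's metadata as 'key: value' lines, sorted by key for stable output."""
    lines = [f"{key}: {value}" for key, value in sorted(meta.items())]
    return "\n".join(lines) + "\n"


def upload_all(records: Iterable[Dict[str, object]], upload_fn: Callable[[Path], None]) -> List[str]:
    """Write one text file per video and pass it to upload_fn; return per-file failure messages."""
    failed: List[str] = []
    with tempfile.TemporaryDirectory() as tmp:
        for meta in records:
            name = str(meta["filename"])  # assumes every record carries a 'filename' key
            doc_path = Path(tmp) / (Path(name).stem + ".txt")
            doc_path.write_text(render_document(meta), encoding="utf-8")
            try:
                upload_fn(doc_path)
            except Exception as exc:
                # Record the failure and continue, so one bad file does not stop the batch.
                failed.append(f"{name}: {exc}")
    return failed
```

Handling each video as its own document is what makes individual file tracking and error isolation possible: a failed upload is reported by filename while the rest of the batch continues.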
The streamlined approach provides:
- Faster processing: Direct upload eliminates conversion steps
- Better accuracy: Optimized similarity thresholds
- Improved organization: Individual file tracking
- Enhanced debugging: Clear error isolation
Extend the analyze_filename_content() function to support additional video categories.
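How analyze_filename_content() is structured internally is not shown here, so the following is only one possible way to keep category detection extensible: a table of (category, pattern) rules to which a new category can be appended. All names and patterns in the sketch are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed starting rules; the first matching rule wins.
CATEGORY_RULES: List[Tuple[str, re.Pattern]] = [
    ("romance/drama", re.compile(r"cinta|episode|\bep\s*\d+", re.IGNORECASE)),
    ("sports", re.compile(r"pekan|persebaya|arema|persija", re.IGNORECASE)),
    ("news", re.compile(r"liputan\s*6", re.IGNORECASE)),
]


def register_category(name: str, pattern: str) -> None:
    """Add a new category rule, checked after the existing ones."""
    CATEGORY_RULES.append((name, re.compile(pattern, re.IGNORECASE)))


def classify(filename: str) -> str:
    """Return the first category whose pattern matches the filename, or 'unknown'."""
    text = re.sub(r"[_\-.]+", " ", filename)  # treat separators as spaces
    for name, pattern in CATEGORY_RULES:
        if pattern.search(text):
            return name
    return "unknown"


# Example: add a cooking category keyed on Indonesian terms for "recipe" and "cook".
register_category("cooking", r"resep|masak")
```

Because rules are checked in order, a new category only applies to filenames that none of the earlier patterns claim.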
Modify retrieval configuration in create_search_interface() for your specific needs.
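The two settings most likely to need tuning are the similarity threshold and top-K. The sketch below is independent of Vertex AI and only illustrates how they interact, assuming each retrieved chunk carries a similarity score where higher means more relevant (the actual service may report a distance instead). The fallback to the single best chunk is one possible fallback strategy, not necessarily the one the notebook uses, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class RetrievalSettings:
    similarity_threshold: float = 0.1  # low threshold keeps more context
    top_k: int = 15                    # maximum number of chunks returned


def select_chunks(
    scored: Sequence[Tuple[str, float]],
    settings: RetrievalSettings = RetrievalSettings(),
) -> List[Tuple[str, float]]:
    """Keep chunks at or above the threshold, best first, capped at top_k."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    kept = [pair for pair in ranked if pair[1] >= settings.similarity_threshold]
    if not kept and ranked:
        # Fallback: nothing passed the threshold, so return the single best chunk.
        kept = ranked[:1]
    return kept[: settings.top_k]
```

Raising the threshold or lowering top-K trades recall for precision; for instance, RetrievalSettings(similarity_threshold=0.3, top_k=5) returns fewer but more closely matching chunks.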
The pipeline supports Gemini API integration for actual video content analysis beyond filename parsing.
├── Rag_search.ipynb # Main pipeline notebook
├── GoldenSet.csv # Test queries dataset
├── video_metadata_analysis.csv # Sample output reference
└── README.md # This documentation
- Media Content Management: Organize and search video libraries
- Educational Content: Find specific episodes or topics
- Sports Analysis: Search match highlights and statistics
- News Monitoring: Track coverage and events
- Content Discovery: Intelligent video recommendation
- Clone this repository
- Open Rag_search.ipynb in Jupyter
- Configure your Google Cloud project settings
- Add your video files to the specified folder
- Run all cells to create your RAG search system
- Start searching with search_videos("your question")
Note: This pipeline is designed for educational and demonstration purposes. Ensure you have proper permissions for your video content and comply with relevant data usage policies.