Skip to content

analyticsrepo01/video_search

Repository files navigation

Video RAG Search Pipeline

A streamlined Retrieval-Augmented Generation (RAG) system for searching and analyzing video content using Google Vertex AI. This pipeline extracts metadata from video files and creates an intelligent search interface without requiring CSV intermediate steps.

🎯 Features

  • Direct Upload Approach: No CSV files needed - uploads directly to RAG corpus
  • Smart Video Analysis: Extracts metadata from filenames and content
  • Multilingual Support: Handles English and Bahasa Indonesia content
  • Content Classification: Automatically categorizes romance/drama, sports, and news content
  • Optimized Search: Low similarity threshold for better context retrieval
  • Ready-to-Use: Simple functions for immediate video content search

πŸ“Š Available Content Types - ALL types Supported -- just add in prompt

🎭 Romance/Drama Series (Currently added are these -- but any content can be added )

  • "Cinta Sedalam Rindu" episodes with character analysis
  • Episode number extraction and character relationship mapping
  • Automatic keyword generation in English and Indonesian

⚽ Sports Highlights

  • Indonesian football/soccer match highlights
  • Team extraction (Persebaya, Arema FC, Persija, etc.)
  • Match week information (Pekan 13, 14, etc.)
  • Sports-specific metadata and keywords

πŸ“Ί News Reports

  • Liputan 6 news coverage
  • Current affairs and journalism content
  • Event-based categorization and analysis

πŸš€ Quick Start

Prerequisites

  • Google Cloud Project with Vertex AI API enabled
  • Video files in a local directory
  • Python environment with required packages

Installation

pip install google-cloud-aiplatform google-cloud-storage google-genai pandas pathlib

Usage

  1. Configure your project settings:
PROJECT_ID = "your-project-id"  # Update this
VIDEO_FOLDER = "video"          # Update this path
  1. Run the complete pipeline:

    • Extract video metadata
    • Create RAG corpus
    • Upload analysis directly (no CSV!)
    • Create search interface
  2. Search your content:

# Search for specific content
search_videos("What romance videos do we have?")
search_videos("Show me sports highlights from Pekan 13")
search_videos("Which videos have episode information?")

# Test the system
test_search_system()

πŸ“ Dataset Files

This repository includes sample datasets for testing and evaluation:

GoldenSet.csv

  • 30 test queries for evaluation
  • Query types: Soap opera, Sports, News
  • Difficulty levels: Easy, Medium
  • Languages: English and Indonesian queries
  • Expected answers for quality assessment

video_metadata_analysis.csv (Optional Reference)

  • Sample output format showing metadata extraction results
  • 52 entries with comprehensive video analysis
  • Includes embeddings and content classification
  • Note: The new pipeline doesn't require CSV - this is for reference only

πŸ—οΈ Architecture

Traditional RAG Approach:

Video Files β†’ Metadata β†’ CSV β†’ Cloud Storage β†’ RAG Import β†’ Search

Our Streamlined Approach:

Video Files β†’ Metadata β†’ Direct RAG Upload β†’ Search

πŸ’‘ Key Benefits

  • βœ… No CSV intermediate files - saves storage and complexity
  • βœ… 50% fewer processing steps - faster pipeline
  • βœ… Direct metadata control - better file organization
  • βœ… Optimized retrieval settings - improved search accuracy
  • βœ… Simple debugging - easier troubleshooting

πŸ“ Example Queries

Romance/Drama Content:

  • "Show me videos about Aluna and Galaxy"
  • "Which episodes have character relationships?"
  • "What romance drama content is available?"

Sports Content:

  • "Find football highlights from this week"
  • "Show me matches between Persebaya and Arema"
  • "What sports content do we have?"

News Content:

  • "Tell me about recent news videos"
  • "Show me Liputan 6 coverage"
  • "What current affairs content is available?"

πŸ”§ Technical Details

RAG Configuration:

  • Embedding Model: text-multilingual-embedding-002 -- Bahasa support
  • Similarity Threshold: 0.1 (optimized for better retrieval)
  • Top-K Results: 15 (comprehensive context)
  • Chunk Size: 1000 tokens with 200 overlap

Content Analysis:

  • Filename parsing for metadata extraction
  • Character recognition for drama series
  • Team and match extraction for sports
  • Event categorization for news

Search Optimization:

  • Enhanced query expansion with multilingual terms
  • Context-aware retrieval with fallback strategies
  • Specific metadata inclusion in responses

🎬 Video Analysis Pipeline

  1. Metadata Extraction: Smart analysis of video filenames
  2. Content Classification: Automatic genre and type detection
  3. Keyword Generation: Multilingual searchable terms
  4. Direct Upload: Stream to RAG corpus without CSV
  5. Search Interface: Optimized retrieval and generation

πŸ“ˆ Performance

The streamlined approach provides:

  • Faster processing: Direct upload eliminates conversion steps
  • Better accuracy: Optimized similarity thresholds
  • Improved organization: Individual file tracking
  • Enhanced debugging: Clear error isolation

πŸ› οΈ Customization

Adding New Content Types:

Extend the analyze_filename_content() function to support additional video categories.

Adjusting Search Parameters:

Modify retrieval configuration in create_search_interface() for your specific needs.

Enhanced Video Analysis:

The pipeline supports Gemini API integration for actual video content analysis beyond filename parsing.

πŸ“‹ File Structure

β”œβ”€β”€ Rag_search.ipynb         # Main pipeline notebook
β”œβ”€β”€ GoldenSet.csv           # Test queries dataset
β”œβ”€β”€ video_metadata_analysis.csv  # Sample output reference
└── README.md              # This documentation

🎯 Use Cases

  • Media Content Management: Organize and search video libraries
  • Educational Content: Find specific episodes or topics
  • Sports Analysis: Search match highlights and statistics
  • News Monitoring: Track coverage and events
  • Content Discovery: Intelligent video recommendation

πŸš€ Getting Started

  1. Clone this repository
  2. Open Rag_search.ipynb in Jupyter
  3. Configure your Google Cloud project settings
  4. Add your video files to the specified folder
  5. Run all cells to create your RAG search system
  6. Start searching with search_videos("your question")

Note: This pipeline is designed for educational and demonstration purposes. Ensure you have proper permissions for your video content and comply with relevant data usage policies.

About

Searching through Videos is still tough -- lets make it easy

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors