This repository contains four scripts designed for efficient data processing, retrieval, and summarization:
- yt_scraper.py: Processes YouTube videos by searching, downloading transcripts, summarizing content, and storing metadata.
- web_scraper.py: Extracts content and links from a specified domain, generates summaries, and tracks progress.
- app.py: Implements a Retrieval-Augmented Generation (RAG) chatbot using Milvus for context retrieval and a custom API for generating responses.
- load_db.py: Processes text files, generates embeddings, and inserts them into a Milvus vector database for efficient retrieval.
## Table of Contents
- YouTube Scraper (`yt_scraper.py`)
- Web Scraper (`web_scraper.py`)
- RAG Chatbot (`app.py`)
- Milvus Data Loader (`load_db.py`)
- How to Run
- Future Enhancements
## YouTube Scraper (`yt_scraper.py`)

### Dependencies
`sys`, `time`, `threading`, `random`, `os`, `json`, `dotenv`, `requests`, `youtube_transcript_api`, `youtube_search`, `googleapiclient.discovery`

### Configuration
- `YOUTUBE_API_KEY`: Set in the `.env` file for YouTube API authentication.
- `OUTPUT_DIR`: Directory for storing text data.
- `SUMMARY_DIR`: Directory for saving summaries.
- `CACHE_FILE`: File for tracking processed video IDs.
- `QUEUE_FILE`: File for managing queued tasks.
- `CONTEXT_WINDOW`: Maximum context size for summarization chunks.
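These constants are typically populated from a `.env` file via `python-dotenv`. The following is a minimal sketch; only `YOUTUBE_API_KEY` is confirmed by the script, and the paths and limits shown are illustrative assumptions.

```python
# .env (illustrative):
# YOUTUBE_API_KEY=your-api-key-here

import os
from dotenv import load_dotenv

load_dotenv()  # reads YOUTUBE_API_KEY from the .env file

YOUTUBE_API_KEY = os.getenv("YOUTUBE_API_KEY")

# Illustrative values; the script's actual paths and limits may differ.
OUTPUT_DIR = "data/text_data"               # text files with video details and transcripts
SUMMARY_DIR = "data/summaries"              # generated summaries
CACHE_FILE = "data/processed_videos.json"   # processed video IDs
QUEUE_FILE = "data/queue.json"              # current processing queue
CONTEXT_WINDOW = 8000                       # max characters per summarization chunk
```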
Both output directories are created automatically at startup:

```python
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(SUMMARY_DIR, exist_ok=True)
```
### Directory Structure
```
data/
├── text_data/               # Stores text files with video details and transcripts.
├── summaries/               # Stores generated summaries.
├── processed_videos.json    # Tracks processed video IDs.
└── queue.json               # Tracks the current processing queue.
```
### Key Functions
- `print_console_stats()`: Displays live statistics.
- `load_cache()` / `save_cache()`: Manage the processed-video cache.
- `load_queue()` / `save_queue()`: Manage the processing queue.
- `log_error()`: Logs errors to a rotating list.
- `download_transcript(video_id)`: Fetches English transcripts.
- `get_video_details(video_id)`: Retrieves video metadata.
- `save_as_text(video_id, video_details, transcript, output_dir)`: Saves metadata and transcripts.
- `search_and_download_videos(query, output_dir, max_videos, cached_video_ids, queue)`: Searches for videos and downloads their transcripts.
- `process_queue(queue)`: Processes queued files, splits text into chunks, summarizes them, and saves summaries.
- `split_into_chunks(text, chunk_size, overlap=500)`: Splits text into overlapping chunks (see the sketch below).
- `summarize(text)`: Summarizes content using a custom API.
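As a rough illustration of the chunking and summarization step, one possible shape of `split_into_chunks` and `summarize` is sketched below. The endpoint, payload fields, and model name are assumptions (mirroring the local generation API used elsewhere in this repository), not the script's exact code.

```python
import requests

GENERATION_API = "http://127.0.0.1:11434/api/generate"  # assumed local endpoint
MODEL_NAME = "llama3"                                    # placeholder model name

def split_into_chunks(text, chunk_size, overlap=500):
    """Split text into chunks of `chunk_size` characters, overlapping by `overlap` characters."""
    step = max(1, chunk_size - overlap)
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def summarize(text):
    """Send one chunk to the generation API and return the summary text."""
    payload = {
        "model": MODEL_NAME,
        "prompt": f"Summarize the following transcript excerpt:\n\n{text}",
        "stream": False,
    }
    response = requests.post(GENERATION_API, json=payload, timeout=120)
    response.raise_for_status()
    return response.json().get("response", "")

# Example: summaries = [summarize(c) for c in split_into_chunks(transcript, CONTEXT_WINDOW)]
```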
### Workflow
- Reads topics from `input.txt`.
- Initializes caches and the processing queue.
- Launches background threads:
  - Queue Processor: Processes tasks in the queue.
  - Statistics Display: Updates console stats.
- Iteratively:
  - Searches videos for each topic.
  - Downloads metadata and transcripts.
  - Queues files for summarization.
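A condensed sketch of how these steps might fit together, assuming the functions listed above; the topic loop, thread setup, and `max_videos` value are assumptions about the orchestration, not the script's exact code.

```python
import threading
import time

def main():
    # One search topic per line.
    with open("input.txt", encoding="utf-8") as f:
        topics = [line.strip() for line in f if line.strip()]

    cached_video_ids = load_cache()   # previously processed video IDs
    queue = load_queue()              # pending summarization tasks

    # Background workers: one drains the queue, one refreshes console stats.
    threading.Thread(target=process_queue, args=(queue,), daemon=True).start()
    threading.Thread(target=print_console_stats, daemon=True).start()

    for topic in topics:
        # Search, save metadata/transcripts, and enqueue the files for summarization.
        search_and_download_videos(topic, OUTPUT_DIR, max_videos=10,
                                   cached_video_ids=cached_video_ids, queue=queue)
        save_cache(cached_video_ids)
        save_queue(queue)
        time.sleep(1)  # stay friendly to the YouTube API

if __name__ == "__main__":
    main()
```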
## Web Scraper (`web_scraper.py`)

### Dependencies
`threading`, `requests`, `bs4`, `os`, `time`, `json`, `sys`, `re`, `dotenv`

### Configuration
- `BASE_URL`: Root URL for scraping.
- `UNWANTED_PATH_SEGMENTS`: Path segments to exclude.
- `OUTPUT_DIR`: Directory for raw text.
- `SUMMARY_DIR`: Directory for summaries.
- `CACHE_FILE`: File for visited URLs.
- `QUEUE_FILE`: File for task management.
- `CONTEXT_WINDOW`: Character limit for content chunks.
### Directory Structure
```
data/website/
├── text_data/               # Stores scraped content.
├── summaries/               # Stores generated summaries.
├── processed_videos.json    # Tracks visited URLs.
└── queue.json               # Tracks the current processing queue.
```
### Key Functions
- `is_same_domain(link, base_domain)`: Checks whether a link belongs to the same domain.
- `normalize_url(url)`: Removes fragments from a URL.
- `filter_links_by_segments(links, base_domain, unwanted_segments)`: Filters out unwanted links.
- `load_cache()` / `save_cache()`: Manage visited URLs.
- `load_queue()` / `save_queue()`: Manage the processing queue.
- `log_error()`: Logs errors.
- `get_all_links_and_text(url, base_domain)`: Fetches page content and links.
- `scrape_domain(base_url, output_dir, queue, max_pages=100, unwanted_segments=None, cache=None)`: Recursively scrapes pages (a sketch of the link helpers follows below).
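A rough sketch of the link helpers using `requests` and `BeautifulSoup`, following the signatures listed above; the parsing details are assumptions about the implementation, and link filtering is shown as a separate step.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, urldefrag

def normalize_url(url):
    """Strip the #fragment so the same page is not queued twice."""
    return urldefrag(url)[0]

def is_same_domain(link, base_domain):
    """True if the link points at the domain being scraped."""
    return urlparse(link).netloc == base_domain

def filter_links_by_segments(links, base_domain, unwanted_segments):
    """Keep same-domain links whose path contains none of the unwanted segments."""
    unwanted_segments = unwanted_segments or []
    return [
        link for link in links
        if is_same_domain(link, base_domain)
        and not any(seg in urlparse(link).path for seg in unwanted_segments)
    ]

def get_all_links_and_text(url, base_domain):
    """Fetch a page and return its visible text plus the absolute, normalized links it contains."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    text = soup.get_text(separator="\n", strip=True)
    links = {normalize_url(urljoin(url, a["href"])) for a in soup.find_all("a", href=True)}
    return text, {link for link in links if is_same_domain(link, base_domain)}
```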
### Workflow
- Loads configuration from `.env`.
- Initializes caches and the queue.
- Launches background threads for queue processing and stats.
- Calls `scrape_domain()` to:
  - Extract links and content.
  - Save content and queue summarization tasks.
## RAG Chatbot (`app.py`)

### Dependencies
`pymilvus`, `requests`, `tkinter`, `json`

### Components
- Retriever: Fetches context from Milvus.
- Generator: Generates responses using a custom API.
- ChatbotUI: Interactive chatbot GUI.

### Configuration
- Host: `127.0.0.1`
- Port: `19530`
- Collection: `embedded_texts`
- Dimension: `4096`
- Embedding API: `http://127.0.0.1:11434/api/embed`
- Generation API: `http://127.0.0.1:11434/api/generate`
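As an illustration of the Retriever side, the sketch below embeds a query through the local embedding API and searches Milvus. The vector and text field names, index parameters, and model name are assumptions, not necessarily what `app.py` uses.

```python
import requests
from pymilvus import connections, Collection

EMBEDDING_API = "http://127.0.0.1:11434/api/embed"

connections.connect(host="127.0.0.1", port="19530")
collection = Collection("embedded_texts")
collection.load()

def embed(text):
    """Return an embedding vector from the local embedding API (model name is a placeholder)."""
    resp = requests.post(EMBEDDING_API, json={"model": "llama3", "input": text}, timeout=60)
    resp.raise_for_status()
    return resp.json()["embeddings"][0]

def retrieve_context(query, top_k=3):
    """Search Milvus for the chunks closest to the query embedding and return their text."""
    results = collection.search(
        data=[embed(query)],
        anns_field="embedding",                                   # assumed vector field name
        param={"metric_type": "L2", "params": {"nprobe": 10}},
        limit=top_k,
        output_fields=["text"],                                   # assumed text field name
    )
    return [hit.entity.get("text") for hit in results[0]]
```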
### Workflow
- User Input: The user enters a query.
- Context Retrieval:
  - The query is embedded and sent to Milvus.
  - Relevant texts are retrieved.
- Response Generation:
  - A prompt is constructed from the retrieved context.
  - The prompt is sent to the Generation API.
- Display Response: The answer is shown in the GUI.
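The Generator step might then fold the retrieved chunks into a prompt and call the generation API, roughly as follows; the prompt wording, payload fields, and model name are assumptions.

```python
import requests

GENERATION_API = "http://127.0.0.1:11434/api/generate"

def generate_answer(query, context_chunks):
    """Build a context-grounded prompt and return the generated answer text."""
    context = "\n\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    payload = {"model": "llama3", "prompt": prompt, "stream": False}  # placeholder model name
    resp = requests.post(GENERATION_API, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json().get("response", "")

# Example usage with the retriever sketched above:
# answer = generate_answer("What is Milvus?", retrieve_context("What is Milvus?"))
```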
## Milvus Data Loader (`load_db.py`)

### Dependencies
`pymilvus`, `requests`, `os`, `tqdm`

### Components
- MilvusHandler: Manages Milvus connections and data insertion.
- TextEmbeddingProcessor: Generates embeddings and splits text.
- DataLoader: Handles file operations.
- EmbeddingPipeline: Orchestrates data loading and insertion.

### Configuration
- Host: `127.0.0.1`
- Port: `19530`
- Collection: `embedded_texts`
- Dimension: `4096`
- Embedding API: `http://127.0.0.1:11434/api/embed`
### Usage
Place text files in `./data`. The pipeline will:
- Load the text files.
- Split the text into chunks.
- Generate embeddings using the API.
- Insert the data into Milvus.
- Create an index for efficient retrieval.
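A compressed sketch of this pipeline, folding the four classes above into one linear flow; the collection schema, field names, chunk sizes, and index parameters are assumptions, and `load_db.py` splits this work across `MilvusHandler`, `TextEmbeddingProcessor`, `DataLoader`, and `EmbeddingPipeline`.

```python
import os
import requests
from pymilvus import (Collection, CollectionSchema, DataType, FieldSchema,
                      connections, utility)
from tqdm import tqdm

EMBEDDING_API = "http://127.0.0.1:11434/api/embed"
DIM = 4096

connections.connect(host="127.0.0.1", port="19530")

# Create the collection if it does not exist yet (schema is an assumption).
if not utility.has_collection("embedded_texts"):
    schema = CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema("text", DataType.VARCHAR, max_length=65535),
        FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=DIM),
    ])
    Collection("embedded_texts", schema)
collection = Collection("embedded_texts")

def embed(text):
    """Get an embedding from the local API (model name is a placeholder)."""
    resp = requests.post(EMBEDDING_API, json={"model": "llama3", "input": text}, timeout=60)
    resp.raise_for_status()
    return resp.json()["embeddings"][0]

def split_into_chunks(text, size=1000, overlap=100):
    """Split text into overlapping character chunks (sizes are illustrative)."""
    step = max(1, size - overlap)
    return [text[i:i + size] for i in range(0, len(text), step)]

# Load every .txt file in ./data, embed its chunks, and insert them into Milvus.
for name in tqdm(os.listdir("./data")):
    if not name.endswith(".txt"):
        continue
    with open(os.path.join("./data", name), encoding="utf-8") as f:
        chunks = split_into_chunks(f.read())
    collection.insert([chunks, [embed(c) for c in chunks]])

# Index the vector field for efficient retrieval (index type is an assumption).
collection.create_index("embedding",
                        {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}})
collection.flush()
```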
## How to Run
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Start required services (e.g., Milvus).
- Configure the `.env` files.
- Run the desired script:
  ```bash
  python yt_scraper.py
  python web_scraper.py
  python app.py
  python load_db.py
  ```
## Future Enhancements
- Error Recovery: Retry failed API calls.
- Scalability: Batch processing for large datasets.
- Multi-Collection Support: Handle multiple datasets dynamically.
- Improved Indexing: Support advanced index types for Milvus.