# AI Web Scraper

An intelligent web scraping solution that combines automated browser-based scraping with AI-powered content parsing and Google Sheets integration.
## Overview

AI Web Scraper is a Streamlit-based application that handles the complete web scraping workflow:

- **Automated Scraping**: Uses Selenium with a Bright Data proxy to handle anti-bot measures and CAPTCHAs
- **Intelligent Parsing**: Leverages an Ollama LLM (Large Language Model) to extract specific information from web content
- **Result Storage**: Stores results in both a local cache and Google Sheets for easy access and sharing
- **User-Friendly Interface**: Provides a clean web interface for all scraping operations
## Features

- **CAPTCHA Handling**: Automatically solves CAPTCHAs using Bright Data's Scraping Browser
- **Intelligent Content Extraction**: Uses a local LLM to extract exactly what you need from scraped content
- **Caching System**: Efficiently caches scraped content to minimize redundant requests
- **Google Sheets Integration**: Stores and indexes scraped data for collaborative access
- **Search & Retrieve**: Find previously parsed content through text search
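Before content can be parsed by the LLM, raw HTML is typically reduced to visible text. The sketch below is a minimal, stdlib-only illustration of that kind of cleanup; the class and function names are hypothetical, not the project's actual `scrape.py` code (which uses Selenium to fetch pages).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> blocks."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()
```

A real implementation would likely also drop navigation and boilerplate, but the principle is the same: the LLM sees clean text, not markup.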
## Prerequisites

- Python 3.8+
- Ollama with the Llama3 model installed locally
- Bright Data account with Scraping Browser access
- Google Cloud Platform account (for Google Sheets integration)
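If Ollama is installed but the model has not been downloaded yet, it can be pulled ahead of time with the standard Ollama CLI (requires the Ollama daemon to be running):

```shell
# Download the Llama3 model so Ollama can serve it locally
ollama pull llama3
```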
## Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/yourusername/AI_Web_Scraper.git
   cd AI_Web_Scraper
   ```
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
3. Create a `.env` file in the project root with the following variables:

   ```
   BRD_AUTH=your_bright_data_auth_key
   GOOGLE_CREDENTIALS_FILE=credentials.json
   ```
4. For Google Sheets integration:
   - Follow the instructions in the Google Cloud Console to create a service account
   - Download the credentials JSON file and save it as `credentials.json` in the project root
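At startup, the environment variables above might be consumed along these lines. This is a sketch, not the project's actual config code, and it assumes python-dotenv (or the shell) has already populated the process environment:

```python
import os


def load_settings() -> dict:
    """Read required settings from the environment, failing fast if missing.

    BRD_AUTH has no safe default, so its absence raises immediately;
    GOOGLE_CREDENTIALS_FILE falls back to the documented default filename.
    """
    brd_auth = os.environ.get("BRD_AUTH")
    if not brd_auth:
        raise RuntimeError("BRD_AUTH is not set - add it to your .env file")
    return {
        "brd_auth": brd_auth,
        "google_credentials_file": os.environ.get(
            "GOOGLE_CREDENTIALS_FILE", "credentials.json"
        ),
    }
```

Failing fast on a missing proxy credential is friendlier than a cryptic connection error later in the Selenium session.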
## Usage

1. Start the Streamlit app:

   ```bash
   streamlit run main.py
   ```
2. Access the web interface at `http://localhost:8501`
3. Enter a URL to scrape
4. Describe the information you want to extract
5. View and search parsed results in the app or Google Sheets
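Behind steps 3-4, a scraped page is often longer than the LLM's context window, so the text is typically split into chunks and parsed piece by piece. A hypothetical sketch of that kind of splitting (not the project's actual `parse.py`):

```python
def chunk_content(text: str, max_chars: int = 6000) -> list[str]:
    """Split scraped text into pieces small enough for the LLM's context window.

    Splits on line boundaries where possible so each chunk stays coherent;
    a single line longer than max_chars is kept whole in this simplified sketch.
    """
    chunks, current = [], ""
    for para in text.split("\n"):
        # Start a new chunk when adding this line would exceed the budget
        if current and len(current) + len(para) + 1 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be sent to the LLM with the user's extraction prompt, and the per-chunk answers merged into one result.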
## Project Structure

- `main.py`: Streamlit web interface
- `scrape.py`: Web scraping functionality using Selenium
- `parse.py`: Content parsing using the Ollama LLM
- `cache_manager.py`: Local caching system
- `gsheets_storage.py`: Google Sheets integration
- `find_sheet.py`: Utility to find available Google Sheets
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Uses Bright Data for CAPTCHA solving and proxy services
- Powered by Ollama for local LLM inference
- Built with Streamlit for the web interface
- Integrates with the Google Sheets API for data storage