AI Web Scraper

An intelligent web scraping solution that combines automated browser-based scraping with AI-powered content parsing and Google Sheets integration.

Overview

AI Web Scraper is a Streamlit-based application that handles the complete web scraping workflow:

- Automated Scraping: Uses Selenium with a Bright Data proxy to handle anti-bot measures and CAPTCHAs
- Intelligent Parsing: Uses a local LLM (Large Language Model) served by Ollama to extract specific information from web content
- Result Storage: Stores results in both a local cache and Google Sheets for easy access and sharing
- User-Friendly Interface: Provides a clean web interface for all scraping operations

Features

- CAPTCHA Handling: Automatically solves CAPTCHAs using Bright Data's Scraping Browser
- Intelligent Content Extraction: Uses a local LLM to extract exactly what you need from scraped content
- Caching System: Caches scraped content locally to avoid redundant requests
- Google Sheets Integration: Stores and indexes scraped data for collaborative access
- Search & Retrieve: Finds previously parsed content through text search
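
The caching idea can be illustrated with a minimal sketch. This is not the code in cache_manager.py; the function names, the JSON record layout, and the choice of hashing the URL to get a file name are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def cache_path(cache_dir: Path, url: str) -> Path:
    """Map a URL to a stable file name by hashing it (hypothetical scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.json"


def save_page(cache_dir: Path, url: str, content: str) -> None:
    """Store scraped content for a URL as a small JSON record."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    record = {"url": url, "content": content}
    cache_path(cache_dir, url).write_text(json.dumps(record), encoding="utf-8")


def load_page(cache_dir: Path, url: str) -> Optional[str]:
    """Return cached content for a URL, or None on a cache miss."""
    path = cache_path(cache_dir, url)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))["content"]
```

A caller would try `load_page` first and only launch the browser when it returns `None`, which is how a cache of this kind avoids repeat requests.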

Requirements

- Python 3.8+
- Ollama with the Llama3 model installed locally
- Bright Data account with Scraping Browser access
- Google Cloud Platform account (for Google Sheets integration)

Installation

Clone this repository:

```
git clone https://github.com/yourusername/AI_Web_Scraper.git
cd AI_Web_Scraper
```

Install dependencies:
```
pip install -r requirements.txt
```

Create a .env file in the project root with the following variables:

```
BRD_AUTH=your_bright_data_auth_key
GOOGLE_CREDENTIALS_FILE=credentials.json
```
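
As a rough illustration of how these two variables could be read at startup, here is a standard-library-only sketch. The project may well use a helper library such as python-dotenv instead; `read_env_file` and `load_settings` are hypothetical names, and only the two variable names come from this README.

```python
import os
from pathlib import Path
from typing import Dict


def read_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(path: Path = Path(".env")) -> Dict[str, str]:
    """Collect required settings, preferring real environment variables."""
    file_values = read_env_file(path) if path.exists() else {}
    settings: Dict[str, str] = {}
    for key in ("BRD_AUTH", "GOOGLE_CREDENTIALS_FILE"):
        value = os.environ.get(key) or file_values.get(key)
        if not value:
            raise RuntimeError(f"Missing required setting: {key}")
        settings[key] = value
    return settings
```

Failing early with a clear message when a key is missing tends to be easier to debug than a proxy or Sheets authentication error later in the run.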

For Google Sheets integration:
- In the Google Cloud Console, create a service account for the project
- Download the service account's credentials JSON file and save it as credentials.json in the project root
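
Before starting the app, it can help to confirm that the downloaded file really is a service account key. The sketch below is a hypothetical pre-flight check, not part of the project; it relies only on the `type`, `client_email`, and `private_key` fields that Google service account key files contain.

```python
import json
from pathlib import Path
from typing import List

# Fields checked here; real key files contain several more.
REQUIRED_FIELDS = ("type", "client_email", "private_key")


def check_service_account_file(path: Path) -> List[str]:
    """Return a list of problems; an empty list means the file looks usable."""
    if not path.exists():
        return [f"{path} does not exist"]
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{path} is not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return [f"{path} does not contain a JSON object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not data.get(name)]
    if data.get("type") and data["type"] != "service_account":
        problems.append('"type" should be "service_account"')
    return problems
```

The `client_email` value is also worth noting: a service account can generally only open spreadsheets that have been shared with that address.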

Usage

Start the Streamlit app:
```
streamlit run main.py
```
- Access the web interface at http://localhost:8501
- Enter a URL to scrape
- Describe the information you want to extract
- View and search parsed results in the app or Google Sheets
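
The text search over earlier results could work along these lines. This is a sketch under assumptions: the record fields (`url`, `description`, `result`) and the function name are hypothetical, and the app's real search may match differently.

```python
from typing import Dict, List


def search_results(records: List[Dict[str, str]], query: str) -> List[Dict[str, str]]:
    """Return records where any field contains the query, ignoring case."""
    needle = query.strip().lower()
    if not needle:
        return []
    return [
        record
        for record in records
        if any(needle in str(value).lower() for value in record.values())
    ]


# Example records with a hypothetical layout.
records = [
    {"url": "https://example.com/shop", "description": "product prices", "result": "Widget: 10 EUR"},
    {"url": "https://example.com/blog", "description": "article titles", "result": "Hello World"},
]
matches = search_results(records, "widget")
```

Here `matches` contains only the first record, because "widget" appears in its `result` field.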

Project Structure

- main.py: Streamlit web interface
- scrape.py: Web scraping functionality using Selenium
- parse.py: Content parsing using Ollama LLM
- cache_manager.py: Local caching system
- gsheets_storage.py: Google Sheets integration
- find_sheet.py: Utility to find available Google Sheets
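
To show how these modules might fit together, here is a hedged sketch of the scrape, cache, parse, store flow. The `run_pipeline` function and the callables passed to it are hypothetical stand-ins; the actual function names and signatures in scrape.py, parse.py, cache_manager.py, and gsheets_storage.py are not documented here.

```python
from typing import Callable, Dict


def run_pipeline(
    url: str,
    description: str,
    scrape: Callable[[str], str],
    parse: Callable[[str, str], str],
    cache: Dict[str, str],
    store: Callable[[str, str, str], None],
) -> str:
    """Scrape a page (or reuse the cached copy), parse it, and store the result."""
    content = cache.get(url)
    if content is None:
        # Cache miss: fetch the page once and remember it.
        content = scrape(url)
        cache[url] = content
    result = parse(content, description)
    store(url, description, result)
    return result
```

Passing the steps in as callables keeps the sketch independent of Selenium, Ollama, and the Sheets API, so the flow can be exercised with simple stand-in functions.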

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

- Uses Bright Data for CAPTCHA solving and proxy services
- Powered by Ollama for local LLM inference
- Built with Streamlit for the web interface
- Integrates with Google Sheets API for data storage

Cecile-Hirschauer/AI_Web_Scraper
