🕸️ AI Web Scraper using Python & LLM

An intelligent web scraping application built with Python and powered by Selenium, BeautifulSoup, LangChain, and Ollama. The app takes a website URL, scrapes and cleans the DOM content, and lets you query the scraped data with natural language prompts. It is well suited to extracting structured information from unstructured web pages using a locally hosted LLM.

🚀 Features

🔗 Enter any website URL to scrape
🧠 Extract meaningful content from raw DOM using AI
💬 Ask natural language questions to parse data
🔍 View full scraped DOM content
⚙️ Tools used: Selenium, BeautifulSoup, LangChain, Ollama, Streamlit

📷 Preview

Result

🛠️ Tech Stack

Tool	Purpose
Python	Core language
Streamlit	Web interface
Selenium	Web scraping
BeautifulSoup	HTML parsing
LangChain	Prompt template + chaining logic
Ollama	Local LLM backend (LLaMA 3)

📂 Project Structure

.
├── main.py              # Streamlit frontend
├── scrape.py            # Scraping logic (Selenium + BS4)
├── parse.py             # Parsing logic using LangChain + Ollama
├── chromedriver         # Chrome driver for Selenium
├── requirements.txt     # Python dependencies
└── README.md

📦 Installation

Clone the repository

git clone https://github.com/your-username/ai-web-scraper.git
cd ai-web-scraper

Create and activate a virtual environment

python -m venv venv
source venv/bin/activate    # On Windows: venv\Scripts\activate

Install dependencies

pip install -r requirements.txt

Install and run Ollama

Follow the instructions at https://ollama.com to install Ollama, then download and run the LLaMA 3 model:

ollama run llama3

Run the app

streamlit run main.py

✨ How It Works

You enter a website URL.
Selenium loads the page and gets the HTML content.
BeautifulSoup extracts the <body> element and cleans its content.
You type a natural language question.
The DOM is split into chunks and passed to LLaMA 3 via LangChain prompts.
The AI parses and returns specific content matching your query.
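
The chunking step above can be sketched as follows. This is a minimal illustration, not the code from parse.py or scrape.py: the function name `split_dom_content` and the 6000-character default are assumptions, and the real project may split on different boundaries or sizes.

```python
def split_dom_content(dom_content: str, max_length: int = 6000) -> list[str]:
    """Split cleaned page text into consecutive chunks of at most max_length characters.

    The name and the default chunk size are hypothetical; pick a size that
    fits comfortably inside the context window of the model you run.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    # Step through the text in fixed-size windows; the last chunk may be shorter.
    return [
        dom_content[start:start + max_length]
        for start in range(0, len(dom_content), max_length)
    ]


if __name__ == "__main__":
    chunks = split_dom_content("a" * 14000, max_length=6000)
    print([len(chunk) for chunk in chunks])  # [6000, 6000, 2000]
```

Fixed-size slicing is the simplest option, but it can cut a sentence or table row in half. Splitting on line breaks or overlapping adjacent chunks are possible refinements.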

🧠 Sample Use Cases

🔍 Scrape a blog and ask: "Give me all the dates mentioned in the blog posts."
📋 Extract headlines from a news site: "List all article headlines from this page."
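
A question like the ones above has to be combined with each chunk of page content before it reaches the model. The sketch below shows one way to do that with the standard library's `string.Template`. The project itself uses LangChain prompt templates, so the template wording and the names `PROMPT_TEMPLATE` and `build_prompts` are assumptions made for illustration only.

```python
from string import Template

# Hypothetical prompt wording; the project's actual LangChain template may differ.
PROMPT_TEMPLATE = Template(
    "Extract only the information that matches this request: $query\n"
    "Return an empty string if nothing matches.\n\n"
    "Page content:\n$chunk"
)


def build_prompts(chunks: list[str], query: str) -> list[str]:
    """Return one filled-in prompt per chunk of page content."""
    return [PROMPT_TEMPLATE.substitute(query=query, chunk=chunk) for chunk in chunks]


if __name__ == "__main__":
    prompts = build_prompts(
        ["<h2>Headline one</h2>", "<h2>Headline two</h2>"],
        "List all article headlines from this page.",
    )
    for prompt in prompts:
        print(prompt)
        print("---")
```

Each prompt would then be sent to the model separately and the non-empty answers joined into the final result.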

📝 To-Do

Add scroll support to Selenium scraper
Add support for multi-page scraping
Save parsed results to CSV/JSON
Add dark mode to UI
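
For the CSV/JSON item, one possible starting point is sketched below using only the standard library. Nothing here exists in the project yet: the function name `save_results`, the row format (a list of dictionaries sharing the same keys), and the file names are all assumptions.

```python
import csv
import json
from pathlib import Path


def save_results(rows: list[dict], json_path: str, csv_path: str) -> None:
    """Write parsed rows to a JSON file and a CSV file.

    Assumes every row is a dict with the same keys (hypothetical format).
    """
    # JSON: keep non-ASCII characters readable instead of escaping them.
    Path(json_path).write_text(
        json.dumps(rows, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    # CSV: column order follows the keys of the first row.
    fieldnames = list(rows[0].keys()) if rows else []
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    save_results(
        [{"headline": "Example headline", "date": "2024-01-01"}],
        "results.json",
        "results.csv",
    )
```

The model returns free text, so a parsing step that turns its answer into rows would still be needed before calling a function like this.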

donjoo/Ai_WebScraper
