An intelligent web scraping application built with Python, powered by Selenium, BeautifulSoup, LangChain, and Ollama. This app takes a website URL, scrapes and cleans the DOM content, and lets you interact with the scraped data using natural language prompts. Perfect for extracting structured info from unstructured web pages using AI.
- 🔗 Enter any website URL to scrape
- 🧠 Extract meaningful content from raw DOM using AI
- 💬 Ask natural language questions to parse data
- 🔍 View full scraped DOM content
- ⚙️ Tools used: Selenium, BeautifulSoup, LangChain, Ollama, Streamlit
Tool | Purpose |
---|---|
Python | Core language |
Streamlit | Web interface |
Selenium | Web scraping |
BeautifulSoup | HTML parsing |
LangChain | Prompt template + chaining logic |
Ollama | Local LLM backend (LLaMA3) |
.
├── main.py # Streamlit frontend
├── scrape.py # Scraping logic (Selenium + BS4)
├── parse.py # Parsing logic using LangChain + Ollama
├── chromedriver # Chrome driver for Selenium
├── requirements.txt # Python dependencies
└── README.md
- Clone the repository
git clone https://github.com/your-username/ai-web-scraper.git
cd ai-web-scraper
- Create and activate a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies
pip install -r requirements.txt
- Install and run Ollama
Follow the instructions at https://ollama.com to install Ollama, then download and run the LLaMA 3 model:
ollama run llama3
- Run the app
streamlit run main.py
- You enter a website URL.
- Selenium loads the page and gets the HTML content.
- BeautifulSoup extracts and cleans the `<body>` tag.
- You type a natural language question.
- The DOM is split into chunks and passed to LLaMA 3 via LangChain prompts.
- The AI parses and returns specific content matching your query.
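
The chunking step can be sketched roughly as follows. This is a minimal illustration, not the code in `parse.py` or `scrape.py`: the function name `split_dom_content` and the 6000-character default are assumptions chosen for the example, and the real limit would depend on the context window of the model you run.

```python
def split_dom_content(dom_content: str, max_length: int = 6000) -> list[str]:
    """Split cleaned page text into consecutive chunks of at most max_length characters.

    Hypothetical helper: the name and the default size are illustrative assumptions.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    # Step through the text in fixed-size windows; the last chunk may be shorter.
    return [
        dom_content[start:start + max_length]
        for start in range(0, len(dom_content), max_length)
    ]


if __name__ == "__main__":
    # Stand-in for the text extracted from the <body> tag.
    cleaned_text = "headline\n" * 2000
    chunks = split_dom_content(cleaned_text)
    print(len(chunks), "chunks; the first has", len(chunks[0]), "characters")
```

Each chunk would then be sent to the model with the same question, and the per-chunk answers joined together. Fixed-size splitting is the simplest choice; it can cut a sentence in half at a chunk boundary, so splitting on line breaks may give cleaner results.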
- 🔍 Scrape a blog and ask: "Give me all the dates mentioned in the blog posts."
- 📋 Extract headlines from a news site: "List all article headlines from this page."
- Add scroll support to Selenium scraper
- Add support for multi-page scraping
- Save parsed results to CSV/JSON
- Add dark mode to UI