📚 BookToScrape – Async Web Scraper (Python)

An asynchronous web scraping project built with Python, aiohttp, asyncio, and lxml to extract book data from the Books To Scrape demo website.

This project is an upgraded version of my first-ever web scraper, redesigned for performance, scalability, and cleaner architecture.

🚀 Project Highlights

⚡ Async-first design using asyncio and aiohttp
🕷️ Scrapes all books across paginated pages
📦 Extracts detailed product information from individual product pages
🧭 Collects sidebar category links
📁 Saves structured data into CSV files
⏱️ Massive performance improvement
- Old (BeautifulSoup, synchronous): ~24 minutes
- New (Async): ~2.5 minutes

🧰 Libraries Used

aiohttp – Asynchronous HTTP requests
asyncio – Concurrent task execution
lxml – Fast and powerful HTML parsing
pandas – Data cleaning and CSV generation
urllib.parse – URL handling and normalization
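
As a rough illustration of the URL handling role listed above, the sketch below resolves relative links against the page they were found on. The helper name `absolutize` and the example paths are assumptions made for illustration; they are not taken from the project's notebook, and the live site's paths may differ.

```python
from urllib.parse import urljoin, urldefrag

# Illustrative listing-page URL; the exact paths on the live site may differ.
LISTING_PAGE = "https://books.toscrape.com/catalogue/page-2.html"


def absolutize(page_url: str, href: str) -> str:
    """Resolve a (possibly relative) href against the page it appeared on."""
    absolute = urljoin(page_url, href)
    # Drop any #fragment so the same page is not queued twice.
    return urldefrag(absolute).url


# Hypothetical hrefs of the kind a listing page might contain.
product_url = absolutize(LISTING_PAGE, "a-light-in-the-attic_1000/index.html")
next_page_url = absolutize(LISTING_PAGE, "page-3.html")
parent_url = absolutize(LISTING_PAGE, "../index.html#top")
```

Resolving against the current page URL, rather than the site root, matters because relative links on nested pages are interpreted relative to that page's directory.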

🧠 Data Extracted

📘 Product Data

For each book:

Title
Description
Rating
Image URL
Product Page URL
UPC
Price (including & excluding tax)
Availability
Product Type
Number of Reviews
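
One way to picture the fields above is as a single record per book. The sketch below is an assumption about shape only: the class name `BookRecord`, the field names, and the placeholder values are hypothetical, and it uses the standard-library `csv` module as a stand-in for the pandas export the project actually uses.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class BookRecord:
    """Hypothetical record mirroring the fields listed above."""
    title: str
    description: str
    rating: int
    image_url: str
    product_page_url: str
    upc: str
    price_excl_tax: str
    price_incl_tax: str
    availability: str
    product_type: str
    number_of_reviews: int


def records_to_csv(records) -> str:
    """Serialize records to CSV text (stdlib stand-in for a pandas export)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(BookRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Placeholder values for illustration only; this is not scraped data.
sample = BookRecord(
    title="Example Book",
    description="Placeholder description",
    rating=3,
    image_url="https://example.invalid/cover.jpg",
    product_page_url="https://example.invalid/book/index.html",
    upc="PLACEHOLDER-UPC",
    price_excl_tax="0.00",
    price_incl_tax="0.00",
    availability="In stock",
    product_type="Books",
    number_of_reviews=0,
)
csv_text = records_to_csv([sample])
```

Keeping one typed record per book makes the CSV column order explicit and catches a missing field at construction time rather than at export time.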

🧭 Sidebar Data

Category name
Category URL

🕸️ How It Works

Fetches the homepage
Extracts sidebar categories
Iterates through all paginated product listing pages
Concurrently visits each product page
Scrapes and aggregates structured product data
Exports data to CSV files
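
The concurrent step above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `fetch_product` is a hypothetical stand-in that sleeps instead of issuing an aiohttp request and parsing with lxml, the URLs are made up, and the concurrency cap is an assumed value.

```python
import asyncio

MAX_CONCURRENCY = 10  # assumed cap; tune to what the target site tolerates


async def fetch_product(url: str, semaphore: asyncio.Semaphore) -> dict:
    """Stand-in for fetching and parsing one product page."""
    async with semaphore:  # at most MAX_CONCURRENCY requests in flight
        await asyncio.sleep(0.01)  # simulates network latency
        return {"product_page_url": url}


async def scrape_all(urls: list) -> list:
    """Visit every product URL concurrently and collect the records."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    tasks = [fetch_product(url, semaphore) for url in urls]
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)


# Made-up URLs standing in for links collected from the listing pages.
urls = [f"https://example.invalid/book-{i}/index.html" for i in range(25)]

# In a notebook, use `results = await scrape_all(urls)` instead.
results = asyncio.run(scrape_all(urls))
```

The semaphore bounds how many requests run at once, which keeps the scraper polite while still overlapping the time spent waiting on the network.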

▶️ How to Run

Make sure you are using Python 3.8+

pip install aiohttp lxml pandas

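Run the scraper. In a notebook, where an event loop is already running, call:

await page_scrapper()

In a standalone script, top-level await is not available, so call asyncio.run(page_scrapper()) instead.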

CSV files will be generated inside:

new_scraper_data/

📈 Performance Comparison

Version	Approach	Time Taken
Old	BeautifulSoup + Sync	~24 minutes
New	aiohttp + asyncio	~2.5 minutes

✅ ~10x faster execution
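
The figures above give a speedup of 24 / 2.5 = 9.6, which rounds to the ~10x stated. To show where that kind of gain comes from, the sketch below times simulated requests run one after another against the same requests run concurrently. The delay, the request count, and all function names are assumptions; the simulation does not reproduce the project's measurements.

```python
import asyncio
import time

DELAY = 0.05   # assumed per-request latency in seconds
N_PAGES = 8    # assumed number of pages


async def fake_request(i: int) -> int:
    """Simulated request: waits for DELAY without blocking the event loop."""
    await asyncio.sleep(DELAY)
    return i


async def run_sequential() -> list:
    # Each request starts only after the previous one has finished.
    return [await fake_request(i) for i in range(N_PAGES)]


async def run_concurrent() -> list:
    # All requests are started together and their waits overlap.
    return await asyncio.gather(*(fake_request(i) for i in range(N_PAGES)))


def timed(coro_factory):
    """Run a coroutine function to completion and return (result, seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_factory())
    return result, time.perf_counter() - start


seq_result, seq_seconds = timed(run_sequential)
con_result, con_seconds = timed(run_concurrent)

# Speedup implied by the figures reported in the table above.
reported_speedup = 24 / 2.5

print(f"sequential: {seq_seconds:.2f}s, concurrent: {con_seconds:.2f}s")
```

Sequential time grows with the number of requests, while concurrent time stays close to the latency of a single request, so the gap widens as more pages are scraped.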

🎯 Key Learnings

Importance of asynchronous I/O for web scraping
Efficient task scheduling with asyncio
Faster parsing using lxml
Structuring scalable scrapers
Clean data export using pandas
