Vaisakh-Nirupam/Python_Async_Scrapper

📚 BookToScrape – Async Web Scraper (Python)

An asynchronous web scraping project built with Python, aiohttp, asyncio, and lxml to extract book data from the Books To Scrape demo website.

This project is an upgraded version of my first-ever web scraper, redesigned for performance, scalability, and cleaner architecture.


🚀 Project Highlights

  • Async-first design using asyncio and aiohttp

  • 🕷️ Scrapes all books across paginated pages

  • 📦 Extracts detailed product information from individual product pages

  • 🧭 Collects sidebar category links

  • 📁 Saves structured data into CSV files

  • ⏱️ Roughly 10x faster than the original scraper

    • Old (BeautifulSoup, synchronous): ~24 minutes
    • New (Async): ~2.5 minutes

🧰 Libraries Used

  • aiohttp – Asynchronous HTTP requests
  • asyncio – Concurrent task execution
  • lxml – Fast and powerful HTML parsing
  • pandas – Data cleaning and CSV generation
  • urllib.parse – URL handling and normalization

🧠 Data Extracted

📘 Product Data

For each book:

  • Title
  • Description
  • Rating
  • Image URL
  • Product Page URL
  • UPC
  • Price (including & excluding tax)
  • Availability
  • Product Type
  • Number of Reviews
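
Several of these fields arrive as raw strings that need cleaning before export. The following is a minimal sketch of that step, not the project's actual code. It assumes the rating is encoded as a word in a CSS class (for example `star-rating Three`), that prices look like `£51.77`, and that availability reads like `In stock (22 available)`. The helper names `parse_rating`, `parse_price`, and `parse_stock` are hypothetical.

```python
import re
from typing import Optional

# Assumed mapping from the rating word in the CSS class to a number.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_rating(css_class: str) -> Optional[int]:
    """Return 1-5 for a class string such as 'star-rating Three', else None."""
    for word in css_class.split():
        if word in RATING_WORDS:
            return RATING_WORDS[word]
    return None


def parse_price(text: str) -> float:
    """Extract the numeric part of a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    return float(match.group())


def parse_stock(text: str) -> int:
    """Return the count from 'In stock (22 available)', or 0 if none is given."""
    match = re.search(r"\((\d+) available\)", text)
    return int(match.group(1)) if match else 0
```

Matching only the digits in the price also tolerates a mis-decoded currency symbol, which can appear when a page's encoding is guessed incorrectly.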

🧭 Sidebar Data

  • Category name
  • Category URL
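
Links on the site are typically relative, so each one has to be resolved against the page it was found on before it can be fetched. Here is a small sketch of that normalization with `urllib.parse.urljoin`. The start URL and the two example hrefs are assumptions for illustration, and `absolutize` is a hypothetical helper name.

```python
from urllib.parse import urljoin

# Assumed start page of the demo site.
BASE_URL = "https://books.toscrape.com/index.html"


def absolutize(page_url: str, href: str) -> str:
    """Resolve an href relative to the URL of the page it appeared on."""
    return urljoin(page_url, href)


# A sidebar link found on the homepage (illustrative href).
category_url = absolutize(BASE_URL, "catalogue/category/books/travel_2/index.html")

# A product link found on that category page (illustrative href).
product_url = absolutize(category_url, "../../../its-only-the-himalayas_981/index.html")
```

Resolving against the page where the link was found, rather than against the site root, matters here because hrefs of the `../../../` form only make sense relative to their own page.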

🕸️ How It Works

  1. Fetches the homepage
  2. Extracts sidebar categories
  3. Iterates through all paginated product listing pages
  4. Concurrently visits each product page
  5. Scrapes and aggregates structured product data
  6. Exports data to CSV files
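
The concurrent step (step 4) could be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's implementation. The `fetch` coroutine is injected so the example runs without network access, and `fetch_all`, `fake_fetch`, and the `limit` parameter are hypothetical names. In the real scraper, `fetch` would presumably wrap an aiohttp session request.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# Any coroutine function that takes a URL and returns the page body.
Fetch = Callable[[str], Awaitable[str]]


async def fetch_all(urls: List[str], fetch: Fetch, limit: int = 10) -> Dict[str, str]:
    """Fetch every URL concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> str:
        # The semaphore caps concurrency so the target server is not flooded.
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    pages = await asyncio.gather(*(bounded(url) for url in urls))
    return dict(zip(urls, pages))


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request; sleeps briefly to simulate I/O."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


if __name__ == "__main__":
    demo_urls = [f"https://example.com/page-{i}.html" for i in range(1, 6)]
    result = asyncio.run(fetch_all(demo_urls, fake_fetch, limit=2))
    print(len(result), "pages fetched")
```

Capping concurrency with a semaphore is a design choice of this sketch. It trades some speed for politeness toward the server and fewer dropped connections.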

▶️ How to Run

Make sure you are using Python 3.8 or newer, then install the dependencies:

pip install aiohttp lxml pandas

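Run the scraper. In a notebook, where an event loop is already running, await the coroutine directly:

await page_scrapper()

In a plain script, top-level await is not available, so run the coroutine through asyncio instead:

import asyncio
asyncio.run(page_scrapper())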

CSV files will be generated inside:

new_scraper_data/

📈 Performance Comparison

Version | Approach             | Time Taken
Old     | BeautifulSoup + sync | ~24 minutes
New     | aiohttp + asyncio    | ~2.5 minutes

~10x faster execution


🎯 Key Learnings

  • Importance of asynchronous I/O for web scraping
  • Efficient task scheduling with asyncio
  • Faster parsing using lxml
  • Structuring scalable scrapers
  • Clean data export using pandas
