An asynchronous web scraping project built with Python, aiohttp, asyncio, and lxml to extract book data from the Books To Scrape demo website.
This project is an upgraded version of my first-ever web scraper, redesigned for performance, scalability, and cleaner architecture.
- ⚡ Async-first design using `asyncio` and `aiohttp`
- 🕷️ Scrapes all books across paginated pages
- 📦 Extracts detailed product information from individual product pages
- 🧭 Collects sidebar category links
- 📁 Saves structured data into CSV files
- ⏱️ Massive performance improvement
  - Old (BeautifulSoup, synchronous): ~24 minutes
  - New (async): ~2.5 minutes
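The async-first design boils down to issuing many HTTP requests concurrently instead of one at a time. A minimal sketch of that pattern (the function names are illustrative, not this project's actual API):

```python
import asyncio

import aiohttp

BASE_URL = "https://books.toscrape.com/"  # the demo site this project scrapes


async def fetch(session, url):
    """Download a single page and return its HTML."""
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.text()


async def fetch_all(urls):
    """Download many pages concurrently rather than sequentially."""
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)


# Entry point when run as a script:
# asyncio.run(fetch_all([BASE_URL]))
```

Because `asyncio.gather` schedules all requests at once, total wall time is dominated by the slowest page rather than the sum of all pages, which is where the speedup comes from.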
- `aiohttp` – Asynchronous HTTP requests
- `asyncio` – Concurrent task execution
- `lxml` – Fast and powerful HTML parsing
- `pandas` – Data cleaning and CSV generation
- `urllib.parse` – URL handling and normalization
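For example, most links on the site are relative, and `urllib.parse.urljoin` resolves them against the current page URL (the paths below are illustrative):

```python
from urllib.parse import urljoin

base = "https://books.toscrape.com/catalogue/page-1.html"

# A sibling listing page resolves within the same directory
print(urljoin(base, "page-2.html"))
# → https://books.toscrape.com/catalogue/page-2.html

# "../" climbs one level before resolving
print(urljoin(base, "../index.html"))
# → https://books.toscrape.com/index.html
```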
For each book:
- Title
- Description
- Rating
- Image URL
- Product Page URL
- UPC
- Price (including & excluding tax)
- Availability
- Product Type
- Number of Reviews
- Category name
- Category URL
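As a sketch, fields like these can be pulled out with `lxml` XPath queries. The HTML snippet below is a minimal stand-in for a product page, not the site's exact markup:

```python
from lxml import html

# Minimal stand-in for a Books To Scrape product page (structure is illustrative)
PAGE = """
<html><body>
  <div class="product_main">
    <h1>A Light in the Attic</h1>
    <p class="star-rating Three"></p>
  </div>
  <table class="table">
    <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
    <tr><th>Price (incl. tax)</th><td>£51.77</td></tr>
  </table>
</body></html>
"""

tree = html.fromstring(PAGE)
book = {
    "title": tree.xpath("//div[@class='product_main']/h1/text()")[0],
    # The rating is encoded in the element's class, e.g. "star-rating Three"
    "rating": tree.xpath("//p[contains(@class,'star-rating')]/@class")[0].split()[-1],
    # Pair each table header cell with its value cell
    **dict(zip(tree.xpath("//table//th/text()"), tree.xpath("//table//td/text()"))),
}
print(book["title"])   # → A Light in the Attic
print(book["rating"])  # → Three
print(book["UPC"])     # → a897fe39b1053632
```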
- Fetches the homepage
- Extracts sidebar categories
- Iterates through all paginated product listing pages
- Concurrently visits each product page
- Scrapes and aggregates structured product data
- Exports data to CSV files
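The concurrent product-page step can be sketched like this, with a semaphore capping simultaneous requests (the function names and the limit of 20 are illustrative, not the project's actual code):

```python
import asyncio


async def scrape_product(sem, url):
    """Placeholder for fetching and parsing one product page."""
    async with sem:  # cap how many requests run at once
        await asyncio.sleep(0)  # real code would await an aiohttp request here
        return {"url": url}


async def scrape_all(product_urls):
    sem = asyncio.Semaphore(20)  # be polite: bound concurrent connections
    tasks = [scrape_product(sem, u) for u in product_urls]
    return await asyncio.gather(*tasks)  # preserves input order


rows = asyncio.run(scrape_all([f"product-{i}.html" for i in range(3)]))
print(len(rows))  # → 3
```

Bounding concurrency keeps the scraper fast without hammering the target server with hundreds of simultaneous connections.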
Make sure you are using Python 3.8+
```bash
pip install aiohttp lxml pandas
```

Run the scraper (script or notebook):

```python
await page_scrapper()
```

CSV files will be generated inside:

```
new_scraper_data/
```
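The final export step is plain `pandas`; a minimal sketch with made-up rows (the `books.csv` file name is an assumption):

```python
import os

import pandas as pd

# Illustrative rows; the real scraper aggregates one dict per book
rows = [
    {"title": "A Light in the Attic", "price_incl_tax": "£51.77", "rating": "Three"},
    {"title": "Tipping the Velvet", "price_incl_tax": "£53.74", "rating": "One"},
]

os.makedirs("new_scraper_data", exist_ok=True)
df = pd.DataFrame(rows)
df.to_csv("new_scraper_data/books.csv", index=False)
print(len(df))  # → 2
```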
| Version | Approach | Time Taken |
|---|---|---|
| Old | BeautifulSoup + Sync | ~24 minutes |
| New | aiohttp + asyncio | ~2.5 minutes |
✅ ~10x faster execution
- Importance of asynchronous I/O for web scraping
- Efficient task scheduling with asyncio
- Faster parsing using lxml
- Structuring scalable scrapers
- Clean data export using `pandas`