diff --git a/README.md b/README.md
index c33bc3f..c784245 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,23 @@
 # QuotesBot
+
 This is a Scrapy project to scrape quotes from famous people from http://quotes.toscrape.com ([github repo](https://github.com/scrapinghub/spidyquotes)).

 This project is only meant for educational purposes.

+## About This Project
+
+The [quotes.toscrape.com](http://quotes.toscrape.com) website is a scraping sandbox designed to teach web scraping techniques. It provides multiple endpoints, each demonstrating different challenges that modern web scrapers encounter:
+
+- Basic HTML pages (CSS/XPath selectors)
+- JavaScript-rendered content
+- Infinite scroll with API backends
+- Login-protected pages with CSRF tokens
+- Table-based layouts
+- Form submissions with ViewState
+- Dynamic/random content endpoints
+
+**QuotesBot provides working spider examples for all of these endpoints**, making it a complete learning companion for the quotes.toscrape.com sandbox. Each spider demonstrates the appropriate Scrapy techniques for handling its target endpoint's specific challenges.
+

 ## Extracted data

@@ -17,20 +32,107 @@ The extracted data looks like this sample:
 ## Spiders
-
-This project contains two spiders and you can list them using the `list`
-command:
+This project contains spiders for each endpoint available on quotes.toscrape.com.
+You can list them using the `list` command:

     $ scrapy list
     toscrape-css
     toscrape-xpath
+    toscrape-scroll
+    toscrape-js
+    toscrape-login
+    toscrape-table
+    toscrape-viewstate
+    toscrape-random
+
+Each spider targets a specific quotes.toscrape.com endpoint designed to teach a different scraping technique.
+The spiders are grouped below by the kind of challenge they address:
+
+### Basic Spiders
+- **`toscrape-css`** → `/` endpoint
+  - Standard HTML scraping using CSS selectors
+  - Best starting point for Scrapy beginners
+
+- **`toscrape-xpath`** → `/` endpoint
+  - Same endpoint as above, but using XPath expressions
+  - Demonstrates an alternative selector strategy
+
+### JavaScript & API Endpoints
+- **`toscrape-js`** → `/js/` endpoint
+  - Extracts data from JavaScript-rendered pages
+  - Parses JSON embedded in `