143 changes: 136 additions & 7 deletions README.md
@@ -1,8 +1,23 @@
# QuotesBot

This is a Scrapy project to scrape quotes from famous people on http://quotes.toscrape.com ([github repo](https://github.com/scrapinghub/spidyquotes)).

This project is only meant for educational purposes.

## About This Project

The [quotes.toscrape.com](http://quotes.toscrape.com) website is a scraping sandbox designed to teach web scraping techniques. It provides multiple endpoints, each demonstrating different challenges that modern web scrapers encounter:

- Basic HTML pages (CSS/XPath selectors)
- JavaScript-rendered content
- Infinite scroll with API backends
- Login-protected pages with CSRF tokens
- Table-based layouts
- Form submissions with ViewState
- Dynamic/random content endpoints

**QuotesBot provides working spider examples for all of these endpoints**, making it a complete learning companion for the quotes.toscrape.com sandbox. Each spider demonstrates the appropriate Scrapy techniques for handling its target endpoint's specific challenges.


## Extracted data

@@ -17,20 +32,107 @@ The extracted data looks like this sample:


## Spiders

This project contains spiders for each endpoint available on quotes.toscrape.com. You can list them using the `list` command:

    $ scrapy list
    toscrape-css
    toscrape-js
    toscrape-login
    toscrape-random
    toscrape-scroll
    toscrape-table
    toscrape-viewstate
    toscrape-xpath

Each spider targets a different endpoint or challenge on the website:
> **Review comment (Member), on lines -20 to +48:** There seems to be some artifacts here.
### Basic Spiders

- **`toscrape-css`** → `/` endpoint
  - Standard HTML scraping using CSS selectors
  - Best starting point for Scrapy beginners

- **`toscrape-xpath`** → `/` endpoint
  - Same endpoint as above but using XPath expressions
  - Learn an alternative selector strategy

### JavaScript & API Endpoints

- **`toscrape-js`** → `/js/` endpoint
  - Extracts data from JavaScript-rendered pages
  - Parses JSON embedded in `<script>` tags as `var data = [...]`
  - Demonstrates reverse-engineering JS content
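The embedded-data trick used by `toscrape-js` looks like this in miniature (the script snippet below is a made-up example of the `var data = [...]` pattern):

```python
import json
import re

# The /js/ page embeds its quotes as a JavaScript array literal; a regex
# isolates the literal and json.loads parses it.
script = 'var data = [{"text": "Ex.", "author": {"name": "Jane Roe"}, "tags": ["demo"]}];'
match = re.search(r"var data = (\[.*?\]);", script, re.DOTALL)
data = json.loads(match.group(1))
print(data[0]["author"]["name"])  # → Jane Roe
```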
- **`toscrape-scroll`** → `/scroll` endpoint
  - Scrapes the JSON API behind the infinite-scroll page (`/api/quotes?page=N`)
  - Demonstrates paginating through an API instead of parsing HTML

### Authentication & Forms
- **`toscrape-login`** → `/login` endpoint
  - Demonstrates form-based authentication flow
  - Uses `FormRequest.from_response()` for automatic CSRF token handling
  - Scrapes content only accessible after login

- **`toscrape-viewstate`** → `/search.aspx` endpoint
  - Handles ASP.NET ViewState forms
  - Extracts and submits hidden `__VIEWSTATE` fields
  - Teaches stateful form submissions for enterprise applications
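Both of these spiders lean on the same mechanism: `FormRequest.from_response()` carries a form's hidden `<input>` fields (a CSRF token, `__VIEWSTATE`, etc.) into the submitted data. A rough stdlib sketch of that idea, with a made-up form and field names:

```python
from html.parser import HTMLParser

# Collect hidden <input> fields from a form so that tokens travel along
# with the credentials when the form is submitted.
class HiddenInputParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden":
            self.fields[a["name"]] = a.get("value", "")

form_html = (
    '<form action="/login" method="post">'
    '<input type="hidden" name="csrf_token" value="abc123">'
    '<input type="text" name="username">'
    '</form>'
)
parser = HiddenInputParser()
parser.feed(form_html)

# Merge the hidden fields with the data we actually want to submit
formdata = {**parser.fields, "username": "myuser", "password": "mypassword"}
print(formdata["csrf_token"])  # → abc123
```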

### Complex Layouts

- **`toscrape-table`** → `/tableful/` endpoint
  - Parses quotes organized in an HTML table structure
  - Iterates through table rows and extracts cell data
  - Demonstrates table scraping patterns

- **`toscrape-random`** → `/random` endpoint
  - Scrapes an endpoint that returns different content each time
  - Handles dynamic/random content sources

### Learning Path

1. **Start with the basics**: Run `toscrape-css` and `toscrape-xpath` to understand fundamental Scrapy concepts
2. **Explore JavaScript & APIs**: Try `toscrape-js` for JS-rendered content and `toscrape-scroll` for consuming an API directly
3. **Master authentication**: Use `toscrape-login` to handle login forms and sessions
4. **Tackle complex structures**: Use `toscrape-table` for table layouts, `toscrape-viewstate` for stateful forms, and `toscrape-random` for dynamic content

Each spider is a complete, working example that you can run, inspect, and modify.

You can learn more about Scrapy fundamentals by going through the
[Scrapy Tutorial](http://doc.scrapy.org/en/latest/intro/tutorial.html).

### Exploring the Endpoints

Visit each URL in your browser to understand what you're scraping, and use browser DevTools to inspect the page source, network requests, and JavaScript behind each endpoint:

- Basic: `http://quotes.toscrape.com/`
- JavaScript: `http://quotes.toscrape.com/js/`
- Scroll/API: `http://quotes.toscrape.com/scroll/` (backed by `/api/quotes?page=1`)
- Login: `http://quotes.toscrape.com/login`
- Table: `http://quotes.toscrape.com/tableful/`
- ViewState: `http://quotes.toscrape.com/search.aspx`
- Random: `http://quotes.toscrape.com/random`
## Installation

```
cd quotesbot
pip install scrapy
```

## Running the spiders

@@ -41,3 +143,30 @@ You can run a spider using the `scrapy crawl` command, such as:
If you want to save the scraped data to a file, you can pass the `-o` option:

    $ scrapy crawl toscrape-css -o quotes.json

### Example Commands

```bash
# Basic scraping with CSS selectors
scrapy crawl toscrape-css -o quotes-css.json

# Scrape JavaScript-rendered content
scrapy crawl toscrape-js -o quotes-js.json

# Scrape with authentication
scrapy crawl toscrape-login -o quotes-login.json

# Scrape using API (infinite scroll)
scrapy crawl toscrape-scroll -o quotes-scroll.json

# Scrape from table layout
scrapy crawl toscrape-table -o quotes-table.json
```

## Tips for Students

- **Compare CSS vs XPath**: Run both `toscrape-css` and `toscrape-xpath` to see different selector strategies
- **Inspect the target websites**: Use browser DevTools to understand page structure before writing selectors
- **Check the API**: For `toscrape-scroll`, visit `http://quotes.toscrape.com/api/quotes?page=1` in your browser to see the JSON structure
- **Authentication testing**: The `/login` page accepts any username/password combination, so the placeholder credentials used by `toscrape-login` work as-is
- **Experiment safely**: This sandbox is designed for learning, so feel free to modify and experiment with the spiders
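The infinite-scroll API mentioned above drives pagination through the `has_next` and `page` fields of each JSON response. A minimal offline sketch of that contract, with canned responses standing in for real HTTP requests to `/api/quotes?page=N`:

```python
import json

# Canned responses simulating two pages of the quotes API
pages = {
    1: '{"quotes": [{"text": "A"}], "has_next": true, "page": 1}',
    2: '{"quotes": [{"text": "B"}], "has_next": false, "page": 2}',
}

def fetch(page):
    # Stand-in for an HTTP GET to http://quotes.toscrape.com/api/quotes?page=<page>
    return json.loads(pages[page])

# Keep requesting page + 1 until the API reports has_next is false
texts, page = [], 1
while True:
    data = fetch(page)
    texts.extend(q["text"] for q in data["quotes"])
    if not data["has_next"]:
        break
    page = data["page"] + 1

print(texts)  # → ['A', 'B']
```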
6 changes: 3 additions & 3 deletions quotesbot/items.py
@@ -9,6 +9,6 @@


class QuotesbotItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
16 changes: 8 additions & 8 deletions quotesbot/spiders/toscrape-css.py
@@ -1,22 +1,22 @@
# -*- coding: utf-8 -*-
import scrapy
from quotesbot.items import QuotesbotItem


class ToScrapeCSSSpider(scrapy.Spider):
    name = "toscrape-css"
    start_urls = [
        "http://quotes.toscrape.com/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield QuotesbotItem(
                text=quote.css("span.text::text").get(),
                author=quote.css("small.author::text").get(),
                tags=quote.css("div.tags > a.tag::text").getall(),
            )

        next_page_url = response.css("li.next > a::attr(href)").get()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

40 changes: 40 additions & 0 deletions quotesbot/spiders/toscrape-js.py
@@ -0,0 +1,40 @@
import json
import re
import scrapy
from quotesbot.items import QuotesbotItem


class ToScrapeJSSpider(scrapy.Spider):
    name = "toscrape-js"
    start_urls = ["http://quotes.toscrape.com/js/"]

    def parse(self, response):
        script_data = response.xpath(
            '//script[contains(text(), "var data =")]/text()'
        ).get()
        if script_data:
            # Extract the JSON list embedded in the script text
            match = re.search(
                r"var data = (\[.*?\]);", script_data, re.DOTALL
            )
            if match:
                data = json.loads(match.group(1))
                for quote in data:
                    yield QuotesbotItem(
                        text=quote["text"],
                        author=quote["author"]["name"],
                        tags=quote["tags"],
                    )

        # The quote content on /js/ is generated client-side, but the
        # pagination links are regular <a> tags in the HTML, so we can
        # follow them directly.
        next_page = response.css("li.next > a::attr(href)").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
37 changes: 37 additions & 0 deletions quotesbot/spiders/toscrape-login.py
@@ -0,0 +1,37 @@
import scrapy
from quotesbot.items import QuotesbotItem


class ToScrapeLoginSpider(scrapy.Spider):
    name = "toscrape-login"
    login_url = "http://quotes.toscrape.com/login"
    start_urls = [login_url]

    def parse(self, response):
        # Submit the login form; the CSRF token is handled automatically by
        # FormRequest.from_response() since it sits in a hidden field
        return scrapy.FormRequest.from_response(
            response,
            formdata={"username": "myuser", "password": "mypassword"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Check whether the login succeeded
        if "Logout" in response.text:
            self.logger.info("Login successful!")
            # Now scrape the quotes
            for quote in response.css("div.quote"):
                yield QuotesbotItem(
                    text=quote.css("span.text::text").get(),
                    author=quote.css("small.author::text").get(),
                    tags=quote.css("div.tags > a.tag::text").getall(),
                )

            next_page = response.css("li.next > a::attr(href)").get()
            if next_page:
                yield scrapy.Request(
                    response.urljoin(next_page), callback=self.after_login
                )
        else:
            self.logger.error("Login failed")

> **Review comment (Member), on lines +23 to +35:** Is there a point to scrape anything from this spider? There is no content behind the login, so it would extract the same as if you were not logged in. I wonder if we should just log whether or not the login was successful.
18 changes: 18 additions & 0 deletions quotesbot/spiders/toscrape-random.py
@@ -0,0 +1,18 @@
import scrapy
from quotesbot.items import QuotesbotItem


class ToScrapeRandomSpider(scrapy.Spider):
    name = "toscrape-random"
    start_urls = ["http://quotes.toscrape.com/random"]

    def parse(self, response):
        yield QuotesbotItem(
            text=response.css("span.text::text").get(),
            author=response.css("small.author::text").get(),
            tags=response.css("div.tags > a.tag::text").getall(),
        )

        # To get multiple random quotes, yield new requests to the same URL,
        # taking care not to create an unintended infinite loop. Here we
        # stop after one quote; a counter could be used to fetch more.
> **Review comment (Member), on lines +16 to +18:** What about removing this comment, and instead defining a `QUOTE_COUNT` constant at the module level, e.g. set to 3, and adding ` * QUOTE_COUNT` to the end of the `start_urls` line?
22 changes: 22 additions & 0 deletions quotesbot/spiders/toscrape-scroll.py
@@ -0,0 +1,22 @@
import json
import scrapy
from quotesbot.items import QuotesbotItem


class ToScrapeScrollSpider(scrapy.Spider):
    name = "toscrape-scroll"
    start_urls = ["http://quotes.toscrape.com/api/quotes?page=1"]

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data["quotes"]:
            yield QuotesbotItem(
                text=quote["text"], author=quote["author"]["name"], tags=quote["tags"]
            )

        if data["has_next"]:
            next_page = data["page"] + 1
            yield scrapy.Request(
                url=f"http://quotes.toscrape.com/api/quotes?page={next_page}",
                callback=self.parse,
            )
24 changes: 24 additions & 0 deletions quotesbot/spiders/toscrape-table.py
@@ -0,0 +1,24 @@
import scrapy
from quotesbot.items import QuotesbotItem


class ToScrapeTableSpider(scrapy.Spider):
    name = "toscrape-table"
    start_urls = ["http://quotes.toscrape.com/tableful"]

    def parse(self, response):
        # The tableful page has no CSS classes to hook into: each quote
        # spans two consecutive table rows, the first holding the quote
        # text and author ("“...” Author: Name") and the second holding
        # the tag links, so we pair each tags row with the row above it.
        rows = response.css("table tr")
        for index, row in enumerate(rows):
            tags = row.css('a[href*="/tag/"]::text').getall()
            if not tags or index == 0:
                continue
            cell = rows[index - 1].css("td::text").get(default="").strip()
            text, _, author = cell.rpartition(" Author: ")
            yield QuotesbotItem(text=text, author=author, tags=tags)

        # Pagination on this page is a plain "Next" link in a table cell
        # rather than the li.next element used on the other endpoints
        next_page = response.xpath('//a[contains(text(), "Next")]/@href').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
37 changes: 37 additions & 0 deletions quotesbot/spiders/toscrape-viewstate.py
@@ -0,0 +1,37 @@
import scrapy
from quotesbot.items import QuotesbotItem


class ToScrapeViewStateSpider(scrapy.Spider):
    name = "toscrape-viewstate"
    start_urls = ["http://quotes.toscrape.com/search.aspx"]

    def parse(self, response):
        # Extract quotes from the current page first
        for quote in response.css("div.quote"):
            yield QuotesbotItem(
                text=quote.css("span.text::text").get(),
                author=quote.css("small.author::text").get(),
                tags=quote.css("div.tags > a.tag::text").getall(),
            )

        # Demonstrate ViewState form submission: if there is a filter form
        # (tag dropdown/input), submit it with FormRequest.from_response(),
        # which automatically carries over __VIEWSTATE and other hidden fields
        if response.css('form select[name="tag"], form input[name="tag"]'):
            yield scrapy.FormRequest.from_response(
                response,
                formdata={"tag": "love"},
                callback=self.parse_filtered_results,
                dont_click=True,  # Submit the form without simulating a button click
            )

    def parse_filtered_results(self, response):
        # Parse the filtered results
        for quote in response.css("div.quote"):
            yield QuotesbotItem(
                text=quote.css("span.text::text").get(),
                author=quote.css("small.author::text").get(),
                tags=quote.css("div.tags > a.tag::text").getall(),
            )