143 changes: 136 additions & 7 deletions README.md
@@ -1,8 +1,23 @@
# QuotesBot

This is a Scrapy project to scrape quotes from famous people on http://quotes.toscrape.com ([github repo](https://github.com/scrapinghub/spidyquotes)).

This project is only meant for educational purposes.

## About This Project

The [quotes.toscrape.com](http://quotes.toscrape.com) website is a scraping sandbox designed to teach web scraping techniques. It provides multiple endpoints, each demonstrating different challenges that modern web scrapers encounter:

- Basic HTML pages (CSS/XPath selectors)
- JavaScript-rendered content
- Infinite scroll with API backends
- Login-protected pages with CSRF tokens
- Table-based layouts
- Form submissions with ViewState
- Dynamic/random content endpoints

**QuotesBot provides working spider examples for all of these endpoints**, making it a complete learning companion for the quotes.toscrape.com sandbox. Each spider demonstrates the appropriate Scrapy techniques for handling its target endpoint's specific challenges.


## Extracted data

@@ -17,20 +32,107 @@ The extracted data looks like this sample:


## Spiders

This project contains spiders for each endpoint available on quotes.toscrape.com. You can list them using the `list` command:

    $ scrapy list
    toscrape-css
    toscrape-js
    toscrape-login
    toscrape-random
    toscrape-scroll
    toscrape-table
    toscrape-viewstate
    toscrape-xpath

Each spider targets a different endpoint or challenge on the website:
> **Review comment (Member), on lines -20 to +48:** There seems to be some artifacts here.
### Basic Spiders

- **`toscrape-css`** → `/` endpoint
  - Standard HTML scraping using CSS selectors
  - Best starting point for Scrapy beginners

- **`toscrape-xpath`** → `/` endpoint
  - Same endpoint as above but using XPath expressions
  - Learn an alternative selector strategy

### JavaScript & API Endpoints

- **`toscrape-js`** → `/js/` endpoint
  - Extracts data from JavaScript-rendered pages
  - Parses JSON embedded in `<script>` tags as `var data = [...]`
  - Demonstrates reverse-engineering JS content
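The embedded-data trick used by `toscrape-js` looks like this in miniature (the script snippet below is a made-up example of the `var data = [...]` pattern):

```python
import json
import re

# The /js/ page embeds its quotes as a JavaScript array literal; a regex
# isolates the literal and json.loads parses it.
script = 'var data = [{"text": "Ex.", "author": {"name": "Jane Roe"}, "tags": ["demo"]}];'
match = re.search(r"var data = (\[.*?\]);", script, re.DOTALL)
data = json.loads(match.group(1))
print(data[0]["author"]["name"])  # → Jane Roe
```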
- **`toscrape-scroll`** → `/scroll` endpoint
  - Scrapes the JSON API behind the infinite-scroll page (`/api/quotes?page=N`)
  - Demonstrates paginating through an API instead of parsing HTML

### Authentication & Forms
- **`toscrape-login`** → `/login` endpoint
  - Demonstrates form-based authentication flow
  - Uses `FormRequest.from_response()` for automatic CSRF token handling
  - Scrapes content only accessible after login

- **`toscrape-viewstate`** → `/search.aspx` endpoint
  - Handles ASP.NET ViewState forms
  - Extracts and submits hidden `__VIEWSTATE` fields
  - Teaches stateful form submissions for enterprise applications
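Both of these spiders lean on the same mechanism: `FormRequest.from_response()` carries a form's hidden `<input>` fields (a CSRF token, `__VIEWSTATE`, etc.) into the submitted data. A rough stdlib sketch of that idea, with a made-up form and field names:

```python
from html.parser import HTMLParser

# Collect hidden <input> fields from a form so that tokens travel along
# with the credentials when the form is submitted.
class HiddenInputParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden":
            self.fields[a["name"]] = a.get("value", "")

form_html = (
    '<form action="/login" method="post">'
    '<input type="hidden" name="csrf_token" value="abc123">'
    '<input type="text" name="username">'
    '</form>'
)
parser = HiddenInputParser()
parser.feed(form_html)

# Merge the hidden fields with the data we actually want to submit
formdata = {**parser.fields, "username": "myuser", "password": "mypassword"}
print(formdata["csrf_token"])  # → abc123
```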

### Complex Layouts

- **`toscrape-table`** → `/tableful/` endpoint
  - Parses quotes organized in an HTML table structure
  - Iterates through table rows and extracts cell data
  - Demonstrates table scraping patterns

- **`toscrape-random`** → `/random` endpoint
  - Scrapes an endpoint that returns different content each time
  - Handles dynamic/random content sources

### Learning Path

1. **Start with the basics**: Run `toscrape-css` and `toscrape-xpath` to understand fundamental Scrapy concepts
2. **Explore JavaScript & APIs**: Try `toscrape-js` for JS-rendered content and `toscrape-scroll` for consuming an API directly
3. **Master authentication**: Use `toscrape-login` to handle login forms and sessions
4. **Tackle complex structures**: Use `toscrape-table` for table layouts, `toscrape-viewstate` for stateful forms, and `toscrape-random` for dynamic content

Each spider is a complete, working example that you can run, inspect, and modify.

You can learn more about Scrapy fundamentals by going through the
[Scrapy Tutorial](http://doc.scrapy.org/en/latest/intro/tutorial.html).

### Exploring the Endpoints

Visit each URL in your browser to understand what you're scraping, and use browser DevTools to inspect the page source, network requests, and JavaScript behind each endpoint:

- Basic: `http://quotes.toscrape.com/`
- JavaScript: `http://quotes.toscrape.com/js/`
- Scroll/API: `http://quotes.toscrape.com/scroll/` (backed by `/api/quotes?page=1`)
- Login: `http://quotes.toscrape.com/login`
- Table: `http://quotes.toscrape.com/tableful/`
- ViewState: `http://quotes.toscrape.com/search.aspx`
- Random: `http://quotes.toscrape.com/random`
## Installation

```
cd quotesbot
pip install scrapy
```

## Running the spiders

@@ -41,3 +143,30 @@ You can run a spider using the `scrapy crawl` command, such as:
If you want to save the scraped data to a file, you can pass the `-o` option:

    $ scrapy crawl toscrape-css -o quotes.json

### Example Commands

```bash
# Basic scraping with CSS selectors
scrapy crawl toscrape-css -o quotes-css.json

# Scrape JavaScript-rendered content
scrapy crawl toscrape-js -o quotes-js.json

# Scrape with authentication
scrapy crawl toscrape-login -o quotes-login.json

# Scrape using API (infinite scroll)
scrapy crawl toscrape-scroll -o quotes-scroll.json

# Scrape from table layout
scrapy crawl toscrape-table -o quotes-table.json
```

## Tips for Students

- **Compare CSS vs XPath**: Run both `toscrape-css` and `toscrape-xpath` to see different selector strategies
- **Inspect the target websites**: Use browser DevTools to understand page structure before writing selectors
- **Check the API**: For `toscrape-scroll`, visit `http://quotes.toscrape.com/api/quotes?page=1` in your browser to see the JSON structure
- **Authentication testing**: The `/login` page accepts any username/password combination, so the placeholder credentials used by `toscrape-login` work as-is
- **Experiment safely**: This sandbox is designed for learning, so feel free to modify and experiment with the spiders
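The infinite-scroll API mentioned above drives pagination through the `has_next` and `page` fields of each JSON response. A minimal offline sketch of that contract, with canned responses standing in for real HTTP requests to `/api/quotes?page=N`:

```python
import json

# Canned responses simulating two pages of the quotes API
pages = {
    1: '{"quotes": [{"text": "A"}], "has_next": true, "page": 1}',
    2: '{"quotes": [{"text": "B"}], "has_next": false, "page": 2}',
}

def fetch(page):
    # Stand-in for an HTTP GET to http://quotes.toscrape.com/api/quotes?page=<page>
    return json.loads(pages[page])

# Keep requesting page + 1 until the API reports has_next is false
texts, page = [], 1
while True:
    data = fetch(page)
    texts.extend(q["text"] for q in data["quotes"])
    if not data["has_next"]:
        break
    page = data["page"] + 1

print(texts)  # → ['A', 'B']
```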
6 changes: 3 additions & 3 deletions quotesbot/items.py
@@ -9,6 +9,6 @@


class QuotesbotItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
16 changes: 8 additions & 8 deletions quotesbot/spiders/toscrape-css.py
@@ -1,22 +1,22 @@
# -*- coding: utf-8 -*-
import scrapy
from quotesbot.items import QuotesbotItem


class ToScrapeCSSSpider(scrapy.Spider):
    name = "toscrape-css"
    start_urls = [
        "http://quotes.toscrape.com/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield QuotesbotItem(
                text=quote.css("span.text::text").get(),
                author=quote.css("small.author::text").get(),
                tags=quote.css("div.tags > a.tag::text").getall(),
            )

        next_page_url = response.css("li.next > a::attr(href)").get()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

40 changes: 40 additions & 0 deletions quotesbot/spiders/toscrape-js.py
@@ -0,0 +1,40 @@
import json
import re
import scrapy
from quotesbot.items import QuotesbotItem


class ToScrapeJSSpider(scrapy.Spider):
    name = "toscrape-js"
    start_urls = ["http://quotes.toscrape.com/js/"]

    def parse(self, response):
        script_data = response.xpath(
            '//script[contains(text(), "var data =")]/text()'
        ).get()
        if script_data:
            # Extract the JSON list embedded in the script text
            match = re.search(
                r"var data = (\[.*?\]);", script_data, re.DOTALL
            )
            if match:
                data = json.loads(match.group(1))
                for quote in data:
                    yield QuotesbotItem(
                        text=quote["text"],
                        author=quote["author"]["name"],
                        tags=quote["tags"],
                    )

        # The quote content on /js/ is generated client-side, but the
        # pagination links are regular <a> tags in the HTML, so we can
        # follow them directly.
        next_page = response.css("li.next > a::attr(href)").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
37 changes: 37 additions & 0 deletions quotesbot/spiders/toscrape-login.py
@@ -0,0 +1,37 @@
import scrapy
from quotesbot.items import QuotesbotItem


class ToScrapeLoginSpider(scrapy.Spider):
    name = "toscrape-login"
    login_url = "http://quotes.toscrape.com/login"
    start_urls = [login_url]

    def parse(self, response):
        # Submit the login form; the CSRF token is handled automatically by
        # FormRequest.from_response() since it sits in a hidden field
        return scrapy.FormRequest.from_response(
            response,
            formdata={"username": "myuser", "password": "mypassword"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Check whether the login succeeded
        if "Logout" in response.text:
            self.logger.info("Login successful!")
            # Now scrape the quotes
            for quote in response.css("div.quote"):
                yield QuotesbotItem(
                    text=quote.css("span.text::text").get(),
                    author=quote.css("small.author::text").get(),
                    tags=quote.css("div.tags > a.tag::text").getall(),
                )

            next_page = response.css("li.next > a::attr(href)").get()
            if next_page:
                yield scrapy.Request(
                    response.urljoin(next_page), callback=self.after_login
                )
        else:
            self.logger.error("Login failed")

> **Review comment (Member), on lines +23 to +35:** Is there a point to scrape anything from this spider? There is no content behind the login, so it would extract the same as if you were not logged in. I wonder if we should just log whether or not the login was successful.
18 changes: 18 additions & 0 deletions quotesbot/spiders/toscrape-random.py
@@ -0,0 +1,18 @@
import scrapy
from quotesbot.items import QuotesbotItem


class ToScrapeRandomSpider(scrapy.Spider):
    name = "toscrape-random"
    start_urls = ["http://quotes.toscrape.com/random"]

    def parse(self, response):
        yield QuotesbotItem(
            text=response.css("span.text::text").get(),
            author=response.css("small.author::text").get(),
            tags=response.css("div.tags > a.tag::text").getall(),
        )

        # To get multiple random quotes, yield new requests to the same URL,
        # taking care not to create an unintended infinite loop. Here we
        # stop after one quote; a counter could be used to fetch more.
> **Review comment (Member), on lines +16 to +18:** What about removing this comment, and instead defining a `QUOTE_COUNT` constant at the module level, e.g. set to 3, and adding ` * QUOTE_COUNT` to the end of the `start_urls` line?
22 changes: 22 additions & 0 deletions quotesbot/spiders/toscrape-scroll.py
@@ -0,0 +1,22 @@
import json
import scrapy
from quotesbot.items import QuotesbotItem


class ToScrapeScrollSpider(scrapy.Spider):
    name = "toscrape-scroll"
    start_urls = ["http://quotes.toscrape.com/api/quotes?page=1"]

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data["quotes"]:
            yield QuotesbotItem(
                text=quote["text"], author=quote["author"]["name"], tags=quote["tags"]
            )

        if data["has_next"]:
            next_page = data["page"] + 1
            yield scrapy.Request(
                url=f"http://quotes.toscrape.com/api/quotes?page={next_page}",
                callback=self.parse,
            )
24 changes: 24 additions & 0 deletions quotesbot/spiders/toscrape-table.py
@@ -0,0 +1,24 @@
import scrapy
from quotesbot.items import QuotesbotItem


class ToScrapeTableSpider(scrapy.Spider):
    name = "toscrape-table"
    start_urls = ["http://quotes.toscrape.com/tableful"]

    def parse(self, response):
        # The tableful page has no CSS classes to hook into: each quote
        # spans two consecutive table rows, the first holding the quote
        # text and author ("“...” Author: Name") and the second holding
        # the tag links, so we pair each tags row with the row above it.
        rows = response.css("table tr")
        for index, row in enumerate(rows):
            tags = row.css('a[href*="/tag/"]::text').getall()
            if not tags or index == 0:
                continue
            cell = rows[index - 1].css("td::text").get(default="").strip()
            text, _, author = cell.rpartition(" Author: ")
            yield QuotesbotItem(text=text, author=author, tags=tags)

        # Pagination on this page is a plain "Next" link in a table cell
        # rather than the li.next element used on the other endpoints
        next_page = response.xpath('//a[contains(text(), "Next")]/@href').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
37 changes: 37 additions & 0 deletions quotesbot/spiders/toscrape-viewstate.py
@@ -0,0 +1,37 @@
import scrapy
from quotesbot.items import QuotesbotItem


class ToScrapeViewStateSpider(scrapy.Spider):
    name = "toscrape-viewstate"
    start_urls = ["http://quotes.toscrape.com/search.aspx"]

    def parse(self, response):
        # Extract quotes from the current page first
        for quote in response.css("div.quote"):
            yield QuotesbotItem(
                text=quote.css("span.text::text").get(),
                author=quote.css("small.author::text").get(),
                tags=quote.css("div.tags > a.tag::text").getall(),
            )

        # Demonstrate ViewState form submission: if there is a filter form
        # (tag dropdown/input), submit it with FormRequest.from_response(),
        # which automatically carries over __VIEWSTATE and other hidden fields
        if response.css('form select[name="tag"], form input[name="tag"]'):
            yield scrapy.FormRequest.from_response(
                response,
                formdata={"tag": "love"},
                callback=self.parse_filtered_results,
                dont_click=True,  # Submit the form without simulating a button click
            )

    def parse_filtered_results(self, response):
        # Parse the filtered results
        for quote in response.css("div.quote"):
            yield QuotesbotItem(
                text=quote.css("span.text::text").get(),
                author=quote.css("small.author::text").get(),
                tags=quote.css("div.tags > a.tag::text").getall(),
            )