Add examples for new quotes.toscrape.com endpoints (fixes #15) #16
Open
rjanain wants to merge 7 commits into scrapy:master
Add text, author, and tags fields to QuotesbotItem to properly structure the scraped quote data. This makes the item definition more explicit and useful for all spiders in the project.
Add 6 new spiders to demonstrate various modern web scraping scenarios:
- toscrape-js: Extract data from JavaScript-rendered content by parsing embedded JSON data in script tags
- toscrape-scroll: Handle infinite scroll pages using the API endpoint with JSON responses
- toscrape-login: Demonstrate form-based authentication with CSRF token handling using FormRequest.from_response
- toscrape-table: Scrape data from table layouts by selecting table rows and cells
- toscrape-viewstate: Handle ASP.NET ViewState forms commonly found in legacy enterprise applications
- toscrape-random: Scrape single random quote endpoint

These spiders provide practical examples for students learning to handle the challenges of modern websites beyond basic HTML parsing.
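The embedded-JSON technique described for toscrape-js can be sketched without Scrapy. This is a minimal illustration, not the spider from this PR: the sample markup, the `extract_quotes` helper, and the nested `author.name` shape are assumptions made for the example, and the real page may differ.

```python
import json
import re

# Hypothetical page fragment; the real /js/ page markup may differ.
SAMPLE_HTML = """
<script>
    var data = [
        {"text": "A quote.", "author": {"name": "Jane Doe"}, "tags": ["life", "books"]},
        {"text": "Another quote.", "author": {"name": "John Roe"}, "tags": []}
    ];
    for (var i in data) { /* render quotes */ }
</script>
"""

# Capture the array literal assigned to `data`, up to the first "];".
DATA_PATTERN = re.compile(r"var\s+data\s*=\s*(\[.*?\]);", re.DOTALL)


def extract_quotes(html):
    """Return a list of flat quote dicts parsed from the embedded JSON array."""
    match = DATA_PATTERN.search(html)
    if match is None:
        return []
    return [
        {"text": q["text"], "author": q["author"]["name"], "tags": q["tags"]}
        for q in json.loads(match.group(1))
    ]


quotes = extract_quotes(SAMPLE_HTML)
```

In a spider callback the same helper could be fed `response.text`. The pattern stops at the first `];`, so it would need hardening if a quote's text could itself contain that sequence.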
Enhance the README to provide comprehensive documentation for all spiders:
- Add "What's New" section highlighting the modern scraping techniques
- Organize spiders into Basic and Advanced categories for better learning
- Add detailed descriptions of each spider's purpose and technique
- Include Installation section with setup instructions
- Add Example Commands section with practical usage examples
- Add Tips for Students section with learning recommendations
- Provide suggested learning path from basic to advanced techniques

This makes the project more accessible to students learning modern web scraping and clearly communicates the value of the new additions.
Update documentation to emphasize that:
- quotes.toscrape.com provides 8 different endpoints for learning
- Each endpoint teaches specific modern scraping techniques
- QuotesBot now provides complete coverage of all endpoints
- Spiders are mapped directly to their target endpoints

Changes:
- Reframe "About This Project" to focus on endpoint coverage
- Map each spider to its specific endpoint URL
- Organize by technique category (HTML, JS/API, Auth, Layouts)
- Add direct endpoint URLs in "Tips for Learning" section
- Update learning path to progress through endpoints logically

This positions QuotesBot as the complete companion for the quotes.toscrape.com learning sandbox.
…dize string formatting
Fix formatting issue in README for Basic Spiders section.
Added installation instructions for Scrapy in quotesbot.
Gallaecio reviewed Feb 5, 2026
Comment on lines -20 to +48

    This project contains two spiders and you can list them using the `list`
    command:
    piders for each endpoint available on quotes.toscrape.com. You can list them using the `list` command:

    $ scrapy list
    toscrape-css
    toscrape-xpath
    toscrape-scroll
    toscrape-js
    toscrape-login
    toscrape-table
    toscrape-viewstate
    toscrape-random

    Each spider targets a specific quotes.toscrape.com endpoint designed to teach different scraping techniques
    Each spider targets a different endpoint or challenge on the website:

Member
There seem to be some artifacts here.
Gallaecio reviewed Feb 5, 2026
Comment on lines +23 to +35

    # Now scrape the quotes
    for quote in response.css("div.quote"):
        yield QuotesbotItem(
            text=quote.css("span.text::text").get(),
            author=quote.css("small.author::text").get(),
            tags=quote.css("div.tags > a.tag::text").getall(),
        )

    next_page = response.css("li.next > a::attr(href)").get()
    if next_page:
        yield scrapy.Request(
            response.urljoin(next_page), callback=self.after_login
        )

Member
Is there a point in scraping anything from this spider? There is no content behind the login, so it would extract the same data as if you were not logged in. I wonder if we should just log whether or not the login was successful.
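One way the "just log the outcome" idea could look, sketched with the standard library only. The success marker is an assumption: it supposes the logged-in page exposes a link to `/logout`, which should be checked against the actual site. The names `login_succeeded` and `report_login` and the two sample fragments are hypothetical.

```python
import logging

logger = logging.getLogger("toscrape-login")


def login_succeeded(html):
    """Return True if the page shows a logout link (assumed success marker)."""
    return 'href="/logout"' in html


def report_login(html):
    """Log the login outcome instead of scraping quotes; return the outcome."""
    succeeded = login_succeeded(html)
    if succeeded:
        logger.info("Login succeeded")
    else:
        logger.error("Login failed")
    return succeeded


# Hypothetical response bodies for the two outcomes.
LOGGED_IN_HTML = '<p><a href="/logout">Logout</a></p>'
LOGGED_OUT_HTML = '<p><a href="/login">Login</a></p>'
```

Inside the spider, the `after_login` callback could call the same check on `response.text` and report through `self.logger`, yielding no items at all.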
Gallaecio reviewed Feb 5, 2026
Comment on lines +16 to +18

    # To get multiple random quotes, we can yield new requests to the same URL
    # Be careful not to create an infinite loop if not intended.
    # Here we'll just stop after one, or we could use a counter.

Member
What about removing this comment and instead defining a `QUOTE_COUNT` constant at the module level, e.g. set to 3, and adding `* QUOTE_COUNT` to the end of the `start_urls` line?
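A minimal sketch of that suggestion follows. To keep it runnable without Scrapy installed, the class is a plain stand-in (`RandomSpiderSketch` is a hypothetical name); the real spider would subclass `scrapy.Spider`. The exact URL is also an assumption, since the diff line is not shown here.

```python
# Number of random quotes to request; 3 is the value suggested in the review.
QUOTE_COUNT = 3


class RandomSpiderSketch:
    """Plain-class stand-in; the real spider would subclass scrapy.Spider."""

    name = "toscrape-random"
    # Repeating the URL makes the spider request the endpoint QUOTE_COUNT times.
    start_urls = ["http://quotes.toscrape.com/random"] * QUOTE_COUNT
```

This relies on requests generated from `start_urls` not being dropped as duplicates, which is believed to be Scrapy's default behavior for start requests; it is worth confirming against the Scrapy version the project targets.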
This PR adds spiders for 6 new endpoints available on quotes.toscrape.com, providing comprehensive coverage of modern web scraping techniques. Currently, quotesbot covers only 2 of the 8 endpoints on the sandbox; this update brings complete endpoint coverage.
Closes #15
The quotes.toscrape.com sandbox has evolved significantly with new endpoints designed to teach modern web scraping challenges. While quotesbot has remained an excellent starting point with CSS and XPath examples, it doesn't demonstrate techniques for JavaScript rendering, APIs, authentication, and other scenarios that students encounter in real-world scraping projects.
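As an illustration of the API scenario, the loop below follows the `has_next` and `page` fields that the change list names for the `/api/quotes?page=N` endpoint. It is a sketch under assumptions: `iter_quotes`, the `fetch_page` callable, the `quotes` key, and the fake payloads are all hypothetical stand-ins for real HTTP responses.

```python
def iter_quotes(fetch_page):
    """Yield quotes page by page until the payload reports no further page.

    `fetch_page` is any callable mapping a page number to a decoded JSON
    payload (a dict with "page", "has_next" and, by assumption, "quotes").
    """
    page = 1
    while True:
        payload = fetch_page(page)
        yield from payload["quotes"]
        if not payload["has_next"]:
            break
        page = payload["page"] + 1


# Fake two-page API used in place of real HTTP requests.
FAKE_API = {
    1: {"page": 1, "has_next": True, "quotes": [{"text": "first"}]},
    2: {"page": 2, "has_next": False, "quotes": [{"text": "second"}]},
}

texts = [quote["text"] for quote in iter_quotes(FAKE_API.__getitem__)]
```

In a Scrapy spider the same decision would typically sit in the callback: parse the JSON body, yield the items, and yield a request for the next page only while `has_next` is true.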
Changes
1. New Spiders for Modern Endpoints

JavaScript & API Handling:
- toscrape-js.py → `/js/` endpoint (`<script>` tags as `var data = [...]`)
- toscrape-scroll.py → `/api/quotes?page=N` endpoint (`has_next` and `page` fields)

Authentication & Forms:
- toscrape-login.py → `/login` endpoint (`FormRequest.from_response()` for automatic CSRF handling)
- toscrape-viewstate.py → `/search.aspx` endpoint (`__VIEWSTATE` hidden fields)

Complex Layouts:
- toscrape-table.py → `/tableful/` endpoint
- toscrape-random.py → `/random` endpoint

2. Updated Existing Spiders
- `QuotesbotItem` instead of plain dicts for consistency
- `.extract_first()` and `.extract()` replaced with `.get()` and `.getall()`

3. Enhanced Data Model
- items.py: Added explicit field definitions for `text`, `author`, and `tags`
- `pass` replaced with commented placeholder

4. Comprehensive Documentation