Feature/improve scraping rules: Improve Scraping Rules & Error Handling #7
Open
heroheman wants to merge 5 commits into scp-data:main
- Updated the LinkExtractor in ScpTaleSpider to deny links matching specific patterns, improving the relevance of parsed tales.
- This fixes the unintended removal of the LinkExtractor rule
- Add checks for empty responses and missing 'body' in JSON
- Log errors for various failure scenarios to improve debugging
- Ensure robust parsing of history HTML to prevent crashes
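The checks described in this commit can be sketched roughly as follows. This is a minimal illustration rather than the code from the PR: the function name `extract_history_html`, the `url` parameter, and the log messages are hypothetical, and only the standard `json` and `logging` modules are used.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def extract_history_html(raw_text: str, url: str = "") -> Optional[str]:
    """Return the 'body' HTML from a JSON response, or None on any failure.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    # Guard against empty or whitespace-only responses.
    if not raw_text or not raw_text.strip():
        logger.error("Empty history response for %s", url)
        return None

    # Guard against responses that are not valid JSON.
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        logger.error("Invalid JSON in history response for %s: %s", url, exc)
        return None

    # Guard against valid JSON that lacks the expected 'body' field.
    if not isinstance(payload, dict) or "body" not in payload:
        logger.error("History response for %s has no 'body' field", url)
        return None

    return payload["body"]
```

Returning `None` on every failure path lets the caller skip history parsing for that page instead of crashing the crawl.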
- Handle empty history cases by returning an empty list
- Support both dict and list formats for history input
- Safely parse date strings with error handling
- Sort revisions by date, ensuring robustness against missing values
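One way to read the four points of this commit is the sketch below. It is an assumption-laden illustration, not the PR's implementation: the function names, the `"date"` field, and the `DATE_FORMAT` string are all hypothetical, since the actual revision schema is not shown here.

```python
from datetime import datetime

# Assumed date format; the real data may use a different one.
DATE_FORMAT = "%Y-%m-%d %H:%M"


def parse_date(value):
    """Parse a date string, returning None instead of raising on bad input."""
    if not isinstance(value, str):
        return None
    try:
        return datetime.strptime(value, DATE_FORMAT)
    except ValueError:
        return None


def normalize_history(history):
    """Return revisions as a list sorted by date (hypothetical helper)."""
    # Empty or missing history yields an empty list.
    if not history:
        return []
    # Accept both dict (keyed by revision id) and list inputs.
    if isinstance(history, dict):
        revisions = list(history.values())
    elif isinstance(history, list):
        revisions = list(history)
    else:
        return []
    # Drop entries that are not dicts so the sort key below cannot fail.
    revisions = [r for r in revisions if isinstance(r, dict)]
    # Missing or unparseable dates fall back to datetime.min and sort first.
    return sorted(revisions, key=lambda r: parse_date(r.get("date")) or datetime.min)
```

Falling back to `datetime.min` keeps the sort total, so a single revision with a missing date cannot raise a `TypeError` during comparison.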
- Use `get` method to safely access history in hubs and items
- Prevent potential KeyError by ensuring history key is present
Description
Improves crawling efficiency and robustness by filtering out irrelevant pages and enhancing error handling during history retrieval.
Changes
Tale Spider Optimization
Problem: The spider was crawling 13,792+ pages, but only ~6,374 were actual tales, wasting 47+ minutes on irrelevant system pages.
Solution: Added `deny` rules to filter out non-content pages.
Impact: Reduces crawl time by avoiding ~7,400 unnecessary page requests.
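The PR's actual `deny` patterns are not reproduced here. As a rough sketch of the idea: Scrapy's `LinkExtractor` accepts a `deny` argument of regular expressions, and links whose URLs match any of them are not extracted. The stand-alone example below mimics that filtering with only the standard `re` module; the patterns, URLs, and function names are hypothetical placeholders, not the ones used in this PR.

```python
import re

# Hypothetical deny patterns for non-content pages; the PR's real list may differ.
DENY_PATTERNS = [r"/system:", r"/forum/", r"/nav:", r"/component:"]
_DENY_RES = [re.compile(p) for p in DENY_PATTERNS]


def is_allowed(url: str) -> bool:
    """Return True if the URL matches none of the deny patterns."""
    return not any(rx.search(url) for rx in _DENY_RES)


def filter_links(urls):
    """Keep only URLs that pass the deny filter, preserving their order."""
    return [u for u in urls if is_allowed(u)]


if __name__ == "__main__":
    sample = [
        "https://example.org/some-tale",
        "https://example.org/system:recent-changes",
        "https://example.org/forum/t-123",
    ]
    print(filter_links(sample))
```

In a Scrapy spider the same list would presumably be passed as `LinkExtractor(deny=DENY_PATTERNS)` inside a `Rule`, so denied pages are never requested in the first place.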
Enhanced Error Handling
- Added `try-except` blocks to gracefully handle missing `history` data
- Ensure the `history` key exists in items before processing to prevent `KeyError`s
Robustness Improvements
- Handle cases where the `history` lookup fails (returns an empty dict instead of crashing)
- Validate the `history` data structure before sorting
Testing