Description
In order to better define what we can and what we cannot retrieve during scraping, we need to explore with a toy scraper.
Sources potentially needed:
- elpitazo.com
- twitter.com
- primicia.com.ve
- efectotocuyo.com
- laprensalara.com.ve
- diariolosandes
- elimpulso.com
- el-carabobeno.com
- cronica.uno
- elnacional.com
- eluniversal.com
Proposed Solution
By Luis,
Scraper Creation
In this guide, we will go through the process of creating a new scraper, which can be summed up
in the following steps:
- Select an output data format
- Implement a `BaseScraper` subclass
- Wire the new scraper into the list of installed scrapers
Selecting an output data format
Every page may expose different scrapable information: hashtags on
Twitter, a news-section name on a news site, and so on. In any case, we don't want to
lose such valuable information. Select one of the available data formats if you think it fits your needs.
If no existing data format
in `scraper/scraped_data_classes` fits your scrapable data, you can
write a new one by creating a file in `scraper/scraped_data_classes` with a class implementing the base class `BaseDataFormat`, located in `scraper/scraped_data_classes/base_scraped_data.py`. Such a class should implement
`to_scraped_data(self) -> ScrapedData`. That method maps your data format to our currently supported database scheme (represented by the `ScrapedData` class).
This indirection is needed because scrapers vary in their needs and scraped data. If, for instance, you require extra clean-up logic, you can write it on your custom data format and test it more easily.
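A custom data format might look like the sketch below. The `BaseDataFormat` and `ScrapedData` definitions here are minimal stand-ins for the project's real classes (which live in `scraper/scraped_data_classes/`), included only so the example is self-contained; the `SectionedNewsData` class and its fields are hypothetical:

```python
from dataclasses import dataclass

# Stand-ins for the project's real classes, so this sketch runs on its own.
@dataclass
class ScrapedData:
    url: str
    title: str
    body: str

class BaseDataFormat:
    def to_scraped_data(self) -> "ScrapedData":
        raise NotImplementedError

# Hypothetical format for a site that also exposes a news-section name,
# an extra field we don't want to lose.
@dataclass
class SectionedNewsData(BaseDataFormat):
    url: str
    title: str
    body: str
    section: str

    def to_scraped_data(self) -> ScrapedData:
        # Site-specific clean-up logic lives here, where it is easy to test.
        return ScrapedData(
            url=self.url,
            title=f"[{self.section}] {self.title.strip()}",
            body=self.body.strip(),
        )
```

Keeping the clean-up inside `to_scraped_data` means it can be unit-tested without touching the network or the database.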
Implementing BaseScraper subclass
This step depends on the kind of scraper you want to write.
You might want to write a Scrapy-based scraper; if so, we provide a utility class to make it easier. Otherwise, we also provide a base class whose methods you implement to add a new scraper.
Scrapy-based scrapers:
- Create a Scrapy spider as you usually would and save it in
`scraper/spiders`. Its `parse` method should return the data format selected in the previous step.
- Create a file/module in `scraper/scrapers` implementing a class inheriting `BaseScrapyScraper`, located in `scraper/scrapers/base_scrapy_scraper.py`.
- The only thing that class should add is two class variables:
  - `intended_domain : str` = domain intended to be scraped by this scraper
  - `spider : Type[Spider]` = spider defined in step 1
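The steps above reduce to a very small subclass. The sketch below uses stand-ins for `scrapy.Spider` and the project's `BaseScrapyScraper` so it runs on its own; `ElPitazoSpider` and `ElPitazoScraper` are hypothetical names for a scraper targeting one of the listed sources:

```python
from typing import Type

# Stand-ins: the real ones are scrapy.Spider and the class in
# scraper/scrapers/base_scrapy_scraper.py.
class Spider:
    name = "base"

class BaseScrapyScraper:
    intended_domain: str
    spider: Type[Spider]

# Hypothetical spider, as would be saved in scraper/spiders (step 1).
class ElPitazoSpider(Spider):
    name = "el_pitazo"

# The subclass only needs to wire the two class variables together.
class ElPitazoScraper(BaseScrapyScraper):
    intended_domain = "elpitazo.com"  # domain this scraper handles
    spider = ElPitazoSpider           # spider defined in step 1
```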
From scratch
- Define a new file/module in
`scraper/scrapers` with a class inheriting and implementing the `BaseScraper` class located in `scraper/scrapers/base_scraper.py`.
- Such a class should implement, at the least, the following methods:
  - `parse(self, response : Any) -> ScrapedData`: function to get data from a successful response to a page (the response may be an arbitrary type depending on implementation details)
  - `scrape(self, url : str) -> ScrapedData`: function to get the data for a given URL

Note that every other method is still overridable.
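A from-scratch scraper could look like the sketch below. `BaseScraper` and `ScrapedData` are again minimal stand-ins for the project's classes, and `PlainTextScraper` is a hypothetical example that fetches pages with the standard library:

```python
import urllib.request
from dataclasses import dataclass
from typing import Any, Tuple

# Stand-ins for the project's real classes.
@dataclass
class ScrapedData:
    url: str
    body: str

class BaseScraper:
    def parse(self, response: Any) -> ScrapedData:
        raise NotImplementedError
    def scrape(self, url: str) -> ScrapedData:
        raise NotImplementedError

class PlainTextScraper(BaseScraper):
    def parse(self, response: Tuple[str, bytes]) -> ScrapedData:
        # Here the "response" is a (url, raw_bytes) pair; the type is an
        # implementation detail of this particular scraper.
        url, raw = response
        return ScrapedData(url=url, body=raw.decode("utf-8", errors="replace"))

    def scrape(self, url: str) -> ScrapedData:
        # Fetch the page and delegate extraction to parse.
        with urllib.request.urlopen(url) as resp:
            return self.parse((url, resp.read()))
```

Splitting `scrape` (I/O) from `parse` (extraction) keeps the extraction logic testable without network access.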
Wiring the new scraper
Just go to the `scraper/settings.py` file, import your new scraper, and add it to the `INSTALLED_SCRAPERS` list.
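For example (a config fragment; the module path and scraper name are hypothetical):

```python
# scraper/settings.py
from scraper.scrapers.el_pitazo_scraper import ElPitazoScraper  # your new scraper

INSTALLED_SCRAPERS = [
    # ...existing scrapers...
    ElPitazoScraper,
]
```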
