Create toy scraper for main sources #67

@marianelamin

Description

To better define what we can and cannot retrieve during scraping, we need to explore with a toy scraper.

Sources potentially needed:

  • elpitazo.com
  • twitter.com
  • primicia.com.ve
  • efectotocuyo.com
  • laprensalara.com.ve
  • diariolosandes
  • elimpulso.com
  • el-carabobeno.com
  • cronica.uno
  • elnacional.com
  • eluniversal.com

Proposed Solution

By Luis:

Scraper Creation

In this guide, we will go through the process of creating a new scraper, which can be summed up
in the following steps:

  1. Select an output data format
  2. Implement a BaseScraper subclass
  3. Wire the new scraper into the installed scrapers

Selecting an output data format


Every page may have different scrapable information: hashtags on Twitter, say, or the news section name on a news site. In any case, we don't want to lose such valuable information. Select one of the available formats if it fits your needs.

If no existing data format in scraper/scraped_data_classes fits your scrapable data, you can write a new one by creating a file in scraper/scraped_data_classes that implements the base class BaseDataFormat, located in scraper/scraped_data_classes/base_scraped_data.py. That class should implement to_scraped_data(self) -> ScrapedData, which maps from your data format to our currently supported database schema (represented by the ScrapedData class).

This is needed since scrapers vary in their needs and scraped data. If, for instance, you require extra clean-up logic, you can write it on your custom data format and test it more easily.
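
For illustration, a minimal sketch of such a custom format might look like the following. The TwitterData name, its fields, and the ScrapedData constructor arguments are all assumptions for illustration; check the actual class definitions in scraper/scraped_data_classes before reusing this.

```python
# A minimal sketch: the field names (content, hashtags) and the
# ScrapedData constructor arguments are hypothetical -- check the real
# definitions in scraper/scraped_data_classes before copying this.
from dataclasses import dataclass, field
from typing import List

from scraper.scraped_data_classes.base_scraped_data import BaseDataFormat, ScrapedData


@dataclass
class TwitterData(BaseDataFormat):
    """Custom format for tweets, keeping hashtags we don't want to lose."""

    content: str
    hashtags: List[str] = field(default_factory=list)

    def to_scraped_data(self) -> ScrapedData:
        # Map our custom fields onto the shared database schema.
        # Extra clean-up logic (e.g. stripping the '#' prefix) can live
        # here and be unit-tested in isolation.
        return ScrapedData(
            content=self.content,
            tags=[h.lstrip("#") for h in self.hashtags],
        )
```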

Implementing BaseScraper subclass


This step depends on the kind of scraper you want to write.
You might want to write a scrapy-based scraper; if so, we provide a utility class to make it easier. Otherwise, we also provide a base class whose methods you implement to easily add a new scraper.

Scrapy-based scrapers:

  1. Create a scrapy spider as you usually would and save it in scraper/spiders. Its parse method should return the data format selected in the previous step.
  2. Create a file/module in scraper/scrapers implementing a class that inherits from BaseScrapyScraper, located in scraper/scrapers/base_scrapy_scraper.py.
  3. The only things that class should add are two class variables, as in the sketch after this list:
    • intended_domain : str = the domain intended to be scraped by this scraper
    • spider : Type[Spider] = the spider defined in step 1
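
Putting steps 2 and 3 together, a scrapy-based scraper might look like the sketch below. ElPitazoSpider and ElPitazoScraper are hypothetical names, and the spider is assumed to already exist from step 1.

```python
# A minimal sketch of steps 2 and 3. ElPitazoSpider is a hypothetical
# spider assumed to already live in scraper/spiders (step 1).
from typing import Type

from scrapy import Spider

from scraper.scrapers.base_scrapy_scraper import BaseScrapyScraper
from scraper.spiders.el_pitazo import ElPitazoSpider


class ElPitazoScraper(BaseScrapyScraper):
    """Scraper for elpitazo.com; crawling is delegated to the spider."""

    intended_domain: str = "elpitazo.com"
    spider: Type[Spider] = ElPitazoSpider
```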

From scratch

  1. Define a new file/module in scraper/scrapers with a class inheriting and implementing the BaseScraper class located in scraper/scrapers/base_scraper.py
  2. Such a class should implement, at the least, the following methods:
    • parse(self, response : Any) -> ScrapedData : extracts data from a successful page response (the response may be an arbitrary type depending on implementation details)
    • scrape(self, url : str) -> ScrapedData : retrieves the page at the given url and returns its scraped data

Note that every other method is still overridable.
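
As a sketch of a from-scratch scraper, the class below fetches pages with the requests library and parses them with BeautifulSoup. The CSS selectors, the ScrapedData arguments, and the choice of requests.Response as the parse input are assumptions; only the two method signatures come from BaseScraper.

```python
# A from-scratch sketch using requests + BeautifulSoup. The CSS
# selectors and ScrapedData arguments below are placeholders.
import requests
from bs4 import BeautifulSoup

from scraper.scraped_data_classes.base_scraped_data import ScrapedData
from scraper.scrapers.base_scraper import BaseScraper


class ExampleScraper(BaseScraper):

    def scrape(self, url: str) -> ScrapedData:
        # Fetch the page, then delegate extraction to parse().
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return self.parse(response)

    def parse(self, response: requests.Response) -> ScrapedData:
        # The response type is implementation-defined; here we chose
        # requests.Response and parse its HTML body.
        soup = BeautifulSoup(response.text, "html.parser")
        title = soup.select_one("h1")
        body = soup.select_one("article")
        return ScrapedData(
            title=title.get_text(strip=True) if title else "",
            content=body.get_text(strip=True) if body else "",
        )
```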

Wiring the new scraper


Just go to the scraper/settings.py file, import your new scraper, and add it to the INSTALLED_SCRAPERS list.
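
For example, assuming the hypothetical ExampleScraper from the previous section:

```python
# scraper/settings.py -- the existing contents of INSTALLED_SCRAPERS
# will differ in the repo; just append your new scraper class.
from scraper.scrapers.example_scraper import ExampleScraper

INSTALLED_SCRAPERS = [
    # ...existing scrapers...
    ExampleScraper,
]
```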
