
Enhancement: Inversion of control scraper base class #195

Open
@josephlewis42

Description

When recently building the DevDocs scraper, I realized there are a ton of things I was relying on @benoit74's expertise for to make the scraper sustainable for ZimFarm, but that weren't core to building a functional scraper. These include:

  • Caching if S3 is available.
  • Expected CLI flags.
  • Zim file parameters/naming/destinations.
  • ZimFarm-operator-friendly logging setup.
  • Progress tracker.
  • HTTP client setup.
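
To give a flavour of the last point: every scraper currently re-implements retry/delay behaviour around its HTTP client. A minimal sketch of what a base class could bundle instead (all names here are illustrative, not existing scraperlib API):

```python
# Hypothetical sketch: retry/delay behaviour a scraper base class could
# bundle into its HTTP client setup. Names are illustrative, not scraperlib API.
import time
import urllib.request
from urllib.error import URLError


def fetch_with_retries(url, retries=3, delay=1.0, opener=urllib.request.urlopen):
    """Fetch a URL, retrying transient failures with a linear backoff."""
    for attempt in range(retries):
        try:
            with opener(url) as response:
                return response.read()
        except URLError:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(delay * (attempt + 1))
```

The `opener` parameter is injectable purely so the logic is testable without network access; a real base class would presumably also layer in the S3 caching mentioned above.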

The scraper still isn't fully there, but I've already spent a lot of time implementing and testing some of these things. I've written Logstash plugins and Elastic Beats (which are quite similar to Zim scrapers), and those were dramatically easier to build because that common logic was abstracted away.

I'd love for something like the following (example only!) to be all I had to build, knowing most of the above would be taken care of:

# New type ZimMetadata contains the properties used to populate config_metadata() on a Zim.
# There are specific types based on whether the scraper produces a single Zim or multiple (in which
# case it supports placeholders).
# Methods can be overridden for fine-grained control, e.g. to add additional formatting parameters.
import argparse
from typing import TypeVar

M = TypeVar('M', bound=ZimMetadata)

class MyScraper(MultiZimScraper):

  # Parent class includes a logger, HTTP client, and potentially other items.

  def add_flags(self, parser: argparse.ArgumentParser):
    '''Add custom flags to the program.'''
    pass

  def setup(self, namespace: argparse.Namespace):
    '''Parse flags and set up resources for execution.

    After this call, MultiZimScraper may have additional internal variables set up,
    e.g. an HTTP client that automatically caches to S3 when running in ZimFarm, with
    retries/delay.
    '''
    pass

  def list_zims(self) -> list[M]:
    '''Called after setup to list all Zims to be created.'''
    pass

  def add_contents(self, creator: Creator, metadata: M):
    '''Called for each item returned by list_zims().

    The JSON progress file is updated between calls, logs for progress/next ZIM/timing are
    written, and a scraper check utility could be asserted afterwards.
    '''
    pass

I don't think all scrapers would need to use this format, but something like it would have dramatically cut down on the amount of testing and knowledge needed for me to produce a quality Zim scraper.
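To make the shape concrete, here is one way the base class could drive that lifecycle. Everything below (ZimMetadata, MultiZimScraper, run()) is a hypothetical sketch of the inversion-of-control idea, not existing scraperlib API:

```python
# Hypothetical sketch of a MultiZimScraper base class driving the lifecycle
# described above; all names are illustrative, not existing scraperlib API.
import argparse
import logging
from abc import ABC, abstractmethod
from typing import Generic, Optional, Sequence, TypeVar


class ZimMetadata:
    """Illustrative stand-in for per-Zim metadata (name, title, ...)."""

    def __init__(self, name: str) -> None:
        self.name = name


M = TypeVar("M", bound=ZimMetadata)


class MultiZimScraper(ABC, Generic[M]):
    """Base class owning the cross-cutting concerns listed above."""

    def __init__(self) -> None:
        self.logger = logging.getLogger(type(self).__name__)

    @abstractmethod
    def add_flags(self, parser: argparse.ArgumentParser) -> None: ...

    @abstractmethod
    def setup(self, namespace: argparse.Namespace) -> None: ...

    @abstractmethod
    def list_zims(self) -> Sequence[M]: ...

    @abstractmethod
    def add_contents(self, creator: object, metadata: M) -> None: ...

    def run(self, argv: Optional[Sequence[str]] = None) -> int:
        """Shared entry point: parse flags, set up, then build each Zim."""
        parser = argparse.ArgumentParser()
        # ...the base class would register the expected common flags here...
        self.add_flags(parser)
        self.setup(parser.parse_args(argv))
        for metadata in self.list_zims():
            self.logger.info("building Zim %s", metadata.name)
            creator = object()  # placeholder for a fully configured Creator
            self.add_contents(creator, metadata)
            # ...the base class would update the JSON progress file here...
        return 0
```

A scraper author would then only implement the four abstract methods; caching, flags, logging, and progress tracking live in `run()` and the base class internals.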

Metadata

Assignees

No one assigned

    Labels

    question (Further information is requested)
