
Add a scraper check utility #124

Open
@benoit74


Currently, we rely on various objects in scraperlib to:

  • create the ZIM
  • re-encode videos and images
  • cache these assets on the optimization cache

We might consider adding a mechanism to perform sanity checks on scraper behavior:

  • did we cache all re-encoded images / videos when a cache is present?
  • did we remove temporary files from the filesystem once they have been added to the ZIM? (while we prefer in-memory/streaming approaches, many scrapers still use the temporary file approach, and some situations have to rely on it)
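
To make the two checks above more concrete, here is a minimal sketch of what a tracker could look like. Everything in it is hypothetical: `ScraperCheck` and its `record_*` methods are names invented for illustration and do not exist in scraperlib, and the sketch assumes the re-encoding, caching and ZIM-adding code paths would each report to it.

```python
import tempfile
from pathlib import Path


class ScraperCheck:
    """Hypothetical tracker for the two sanity checks; not a scraperlib API."""

    def __init__(self, cache_enabled: bool) -> None:
        self.cache_enabled = cache_enabled
        self.reencoded: set[str] = set()    # keys of assets we re-encoded
        self.cached: set[str] = set()       # keys of assets pushed to the cache
        self.temp_files: set[Path] = set()  # temp files already added to the ZIM

    def record_reencoded(self, key: str) -> None:
        self.reencoded.add(key)

    def record_cached(self, key: str) -> None:
        self.cached.add(key)

    def record_added_to_zim(self, path: Path) -> None:
        self.temp_files.add(path)

    def problems(self) -> list[str]:
        """Return one human-readable message per failed check."""
        issues = []
        if self.cache_enabled:
            # Check 1: every re-encoded asset should also have been cached.
            for key in sorted(self.reencoded - self.cached):
                issues.append(f"re-encoded but not cached: {key}")
        # Check 2: files already added to the ZIM should be gone from disk.
        for path in sorted(self.temp_files):
            if path.exists():
                issues.append(f"temporary file not removed: {path}")
        return issues


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        leftover = Path(tmp) / "video.webm"
        leftover.write_bytes(b"fake")
        check = ScraperCheck(cache_enabled=True)
        check.record_reencoded("videos/a.webm")  # never cached
        check.record_added_to_zim(leftover)      # never deleted
        for issue in check.problems():
            print(issue)
```

Returning a list of messages rather than raising leaves the "fail or only warn" decision to the caller.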

What I do not yet know:

  • should we make the scraper fail if these checks fail?
  • is there any chance we can automate these checks? (i.e. no need to modify the scrapers, or as little as possible; at least, a call to "check_i_m_ok" should not be mandatory, because scraper developers might forget about it as well. I doubt this is possible because there are many kinds of situations)
  • can we do these checks early? (so that we fail the scraper asap instead of wasting time and resources)
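
One possible direction for these three questions, sketched under heavy assumptions: wrap whatever callable adds a file to the ZIM so that the check runs automatically and immediately after each add, with a `strict` flag choosing between failing and only logging. The names `with_temp_file_check` and `ScraperCheckError` are hypothetical and not part of scraperlib; `add_file` stands in for the scraper's real "add to ZIM" call.

```python
import logging
import tempfile
from pathlib import Path
from typing import Callable

logger = logging.getLogger("scraper-check")


class ScraperCheckError(RuntimeError):
    """Raised in strict mode when a sanity check fails (hypothetical)."""


def with_temp_file_check(
    add_file: Callable[[Path], None], *, strict: bool
) -> Callable[[Path], None]:
    """Wrap an 'add this file to the ZIM' callable with an immediate check."""

    def wrapper(path: Path) -> None:
        add_file(path)
        # Early check: run right after the add, not at the end of the scrape.
        if path.exists():
            message = f"temporary file still present after add: {path}"
            if strict:
                raise ScraperCheckError(message)
            logger.warning(message)

    return wrapper


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "image.png"
        path.write_bytes(b"fake")
        # Stand-in for a scraper that forgets to delete its temp file.
        forgetful_add = with_temp_file_check(lambda p: None, strict=True)
        try:
            forgetful_add(path)
        except ScraperCheckError as exc:
            print(f"check failed: {exc}")
```

This only covers the simplest situation. If files are removed later (for example through a callback once the writer is done with them), an immediate check like this would report false positives, which is one reason a fully automatic check may not fit every scraper.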

Labels: enhancement, question
