# Architecture & components

The Microscope library is composed of components that can be summarized as follows (a sketch of the end-to-end flow comes right after the architecture diagram below):

* **Scraper**: Scrapes data from **known** urls for specific websites (**not every website is scrapable**),
returning `ScrapedData` instances, the schema for the data expected from a page.
* **Crawler**: Crawls new urls from specific sources; this data should be fed to the scraper at some point.
* **Persistency Manager**: Stores data scraped by the scraper in some persistent storage; an SQLite-based
manager is provided by default.
* **Classifier**: Classifies a `ScrapedData` instance, telling whether it describes a public service problem or not.
* **Experiment**: This class controls an experiment run; it's useful for managing logging and results for experiments. It also
makes it possible for every experiment to be run in more or less the same way, making it easier for newcomers.
* **ExperimentFSManager**: A simple class controlling how experiment filesystems are stored, enabling a unified filesystem
layout for every experiment. You can implement an object with the same interface if you want to provide an alternative
storage method for experiments.

<p align="center">
  <img src="../../img/microscope_architecture.png">
</p>
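
To make the flow between these components concrete, here is a minimal sketch of the intended pipeline. Only `ms.Manager.from_default`, `m.scrape`, and `pretty_repr` are calls confirmed by the examples below; every other attribute and method name (`crawl_new_urls`, `persistency_manager.save`, `classifier.classify`) is **hypothetical** and used for illustration only:
```
import c4v.microscope as ms

# Create the default manager (confirmed API, see the scraper examples below)
m = ms.Manager.from_default()

# 1. Crawl: discover new urls from known sources.
#    HYPOTHETICAL name, for illustration only.
urls = m.crawl_new_urls()

# 2. Scrape: turn each known url into a ScrapedData instance (confirmed API)
for data in m.scrape(urls):
    # 3. Persist: store the scraped data.
    #    HYPOTHETICAL name, for illustration only.
    m.persistency_manager.save(data)

    # 4. Classify: tell whether the data describes a public service problem.
    #    HYPOTHETICAL name, for illustration only.
    print(m.classifier.classify(data))
```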

!!! Warning
    The **classifier** should become more specific in the future: it should be able not only to tell whether a news
    article talks about public services, but also to identify the kind of problem itself.

---
# Scraper
The **Scraper** component is just a **single function** that receives a list of urls to scrape and selects
the right **scraper object** for each url (based on its domain), or **raises an error** if it's not able to find any **matching scraper**.

## Example usage
The following examples show how to use the scraper to scrape a list of urls, handle a url with no matching scraper,
and filter out urls that may not be scrapable.
### Scraping multiple urls with the Manager object
The easiest way to scrape is using the manager object, as follows:
```
import c4v.microscope as ms

# Creates the default manager
m = ms.Manager.from_default()

urls = [
    "https://primicia.com.ve/mas/servicios/siete-trucos-caseros-para-limpiar-la-plancha-de-ropa/",
    "https://primicia.com.ve/guayana/ciudad/suenan-con-urbanismo-en-core-8/"
]

# Output may depend on your internet connection and page availability
for result in m.scrape(urls):
    print(result.pretty_repr(max_content_len = 100))
```
### Scraping a single url
```
import c4v.microscope as ms

m = ms.Manager.from_default()

url = "https://primicia.com.ve/mas/servicios/siete-trucos-caseros-para-limpiar-la-plancha-de-ropa/"

# Output may depend on your internet connection and page availability
result = m.scrape(url)
print(result.pretty_repr(max_content_len = 100))
```
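
### Handling a url with no matching scraper
As mentioned above, the scraper **raises an error** when it can't find a matching scraper for a url's domain. The exact
exception class is an assumption here (it isn't documented on this page), so this sketch catches a generic `Exception`:
```
import c4v.microscope as ms

m = ms.Manager.from_default()

# This domain has no matching scraper, so scraping it should fail
url = "https://supernotscrapable.com"

try:
    result = m.scrape(url)
    print(result.pretty_repr(max_content_len = 100))
except Exception as e:  # ASSUMPTION: the concrete exception class isn't documented here
    print(f"Could not scrape {url}: {e}")
```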

### Removing non-scrapable urls
Here we can see how to separate scrapable urls from non-scrapable ones. It can be helpful to know in advance which urls can be processed:
```
import c4v.microscope as ms

m = ms.Manager.from_default()

urls = [
    "https://primicia.com.ve",
    "https://elpitazo.net",
    "https://supernotscrapable.com"
]

assert m.split_non_scrapable(urls) == (urls[:2], urls[2:])
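Continuing from the snippet above, a natural pattern (a sketch using only the calls already shown) is to filter first and then scrape only the urls known to be scrapable:
```
scrapable, non_scrapable = m.split_non_scrapable(urls)

# Scrape only the urls we know how to handle
for result in m.scrape(scrapable):
    print(result.pretty_repr(max_content_len = 100))
```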
### TODO
Add more useful examples.
## Creation
You can create a new scraper in order to support scraping for new sites. More details can be found in ["creating a scraper"](./creating-a-scraper.md).
# Crawler
TODO
## Creation
You can create a new crawler in order to support exploring new urls for new sites. More details can be found in ["creating a crawler"](./creating-a-crawler.md).
# Persistency Manager
TODO
## Creation
You can create a new `Persistency Manager` object in order to support new ways of storing data. More details can be found in ["creating a persistency manager"](./creating-a-persistency-manager.md).
# Experiment
TODO
# ExperimentFSManager
TODO