Changes from all commits
85 commits
cef1d63
reverted bad git merge
LDiazN Dec 3, 2021
1101e6b
added initial layout for c4v app
LDiazN Dec 3, 2021
c8cdf74
added logging functions to CLIClient
LDiazN Dec 7, 2021
686733b
added logging levels to cli tool
LDiazN Dec 7, 2021
3b39257
added test script to help with hand testing
LDiazN Dec 7, 2021
5e1c071
added scrape and crawl functions to manager class
LDiazN Dec 8, 2021
612a954
changed dependencies to add big query and cloud dependencies
LDiazN Dec 9, 2021
da65b85
started initial version of big query persistency manager
LDiazN Dec 9, 2021
95558db
added cloud functions; updated lockfile
LDiazN Dec 10, 2021
63b1fe1
debugged big query persistency manager
LDiazN Dec 10, 2021
99b319e
added initial version of cloud crawl function
LDiazN Dec 14, 2021
9485fdf
added initial version of scrape cloud function
LDiazN Dec 14, 2021
7e12fce
added some docs
LDiazN Dec 14, 2021
184f788
added initial version of gcloud storage manager
LDiazN Dec 15, 2021
1fd3bed
added object to manage files from google storage
LDiazN Dec 16, 2021
01d920e
added some docs to functions
LDiazN Dec 16, 2021
a990bce
added tar compression on upload of model file
LDiazN Dec 16, 2021
a13d45c
added some error checking and info prints
LDiazN Dec 16, 2021
13dd72e
replaced info prints with logging.info
LDiazN Dec 16, 2021
fe950d2
added config to set up a bucket where to look for files
LDiazN Dec 16, 2021
5264ce2
added functions to upload and download models from the cloud
LDiazN Dec 16, 2021
c12f093
removed typo
LDiazN Dec 17, 2021
7e73097
fixed dumb bug
LDiazN Dec 17, 2021
fb21184
added cli functions to upload and download models
LDiazN Dec 17, 2021
61c7425
added functions on dashboard and cli tool to upload and download models
LDiazN Dec 17, 2021
048a43b
added cloud backend to dashboard app
LDiazN Jan 3, 2022
ec78032
added dict representation to scraped data
LDiazN Jan 5, 2022
ad47808
some docs
LDiazN Jan 5, 2022
75a1c4e
added some firestore based persistency manager functions
LDiazN Jan 7, 2022
82b935b
added gcloud persistency manager
LDiazN Jan 10, 2022
bc6ea32
added firestore
LDiazN Jan 11, 2022
32154b0
updated dependencies
LDiazN Jan 11, 2022
d783878
finished BQ persistency manager
LDiazN Jan 11, 2022
38a365b
added crawling function
LDiazN Jan 12, 2022
f32fa90
removed debug comment
LDiazN Jan 12, 2022
d8d0f14
added requirements for cloud function
LDiazN Jan 12, 2022
3d41286
obscure patch to permission issue with firestore
LDiazN Jan 12, 2022
4bffbea
some testing in dashboard app; fixed a bad message in cli tool
LDiazN Jan 12, 2022
c8ec36a
changed argument name
LDiazN Jan 12, 2022
5f31bce
better configuration for scrape function
LDiazN Jan 12, 2022
e0f5ccf
now scrapy can be called many times
LDiazN Jan 12, 2022
8b8d815
added scrapydo as dependency
LDiazN Jan 12, 2022
68e8b63
finished scrape function
LDiazN Jan 12, 2022
169ccdd
added initial structure for classification function
LDiazN Jan 12, 2022
ad72664
initial classify function code updated; some changes to manager api
LDiazN Jan 13, 2022
43bceb5
modified default values of big query persistency manager and base per…
LDiazN Jan 14, 2022
6c1e94b
added limit to classification function
LDiazN Jan 14, 2022
a7b536b
added move function to move data from firestore to big query when it'…
LDiazN Jan 14, 2022
e541ae3
added cloud functions to dashboard on cloud backend
LDiazN Jan 18, 2022
e18634f
changed big query date format
LDiazN Jan 18, 2022
693c580
deleted useless file
LDiazN Jan 18, 2022
5a6d561
removed debug print
LDiazN Jan 18, 2022
3ce68c5
merging multilabel classification
LDiazN Jan 18, 2022
802d07d
fixed bug about using invalid label name
LDiazN Jan 18, 2022
e4bd87a
bug fixing
LDiazN Jan 19, 2022
750c568
fixed bug about using the wrong type of score on multilabel
LDiazN Jan 19, 2022
bc7ef6c
using wrong label
LDiazN Jan 19, 2022
5d6ed0b
fixing missing type of classification
LDiazN Jan 19, 2022
2d5dc48
removed debug print
LDiazN Jan 19, 2022
8a6584c
added default service classification of non relevant rows
LDiazN Jan 19, 2022
e4f05d4
updated transformers
LDiazN Jan 19, 2022
4721c52
updated sample about service classification
LDiazN Jan 19, 2022
67db582
added upload button
LDiazN Jan 19, 2022
b6579a9
added move button to dashboard
LDiazN Jan 20, 2022
ff5b238
added informative info message for an error
LDiazN Jan 20, 2022
c52bdac
changed big query manager to expect date in string format
LDiazN Jan 20, 2022
2237f3e
changed cloud function to clean environment on exit
LDiazN Jan 20, 2022
c954280
added final version of classification cloud function
LDiazN Jan 21, 2022
449d1f7
fixed labels function in labelset
LDiazN Jan 22, 2022
02c46a9
added 'creating a crawler' page
LDiazN Jan 25, 2022
7872856
added explanation about crawlers
LDiazN Jan 26, 2022
a45e827
added guide to create a scraper
LDiazN Jan 26, 2022
fa61af5
extended page about architecture
LDiazN Jan 26, 2022
d319944
added installation & configuration profiles to docs
LDiazN Jan 27, 2022
9b32694
added configuration by environment variables page; added link to inst…
LDiazN Jan 27, 2022
2a169f3
added page with tutorial about the cli tool
LDiazN Jan 27, 2022
246751f
removed debug print
LDiazN Jan 28, 2022
f463cfe
small fixes
LDiazN Jan 28, 2022
f99993b
black reformat
LDiazN Jan 28, 2022
532f45e
removed debug comment
LDiazN Jan 28, 2022
19ee4d3
black reformat
LDiazN Jan 28, 2022
d009226
removed useless import
LDiazN Jan 28, 2022
4218384
updated lockfile
LDiazN Jan 28, 2022
cf638bc
added missing docs
LDiazN Jan 28, 2022
6d12c0d
Merge branch 'luis/dashboaord-app' into luis/final-docs
LDiazN Jan 29, 2022
4,801 changes: 2,400 additions & 2,401 deletions data/processed/huggingface/service_training_dataset.csv

Large diffs are not rendered by default.

70 changes: 64 additions & 6 deletions docs/docs/development/architecture.md
@@ -76,19 +76,77 @@ urls = [

assert m.split_non_scrapable(urls) == (urls[:2], urls[2:])
```
## Creation
You can create a new scraper in order to support scraping for new sites. More details about this in ["creating a scraper"](./creating-a-scraper.md)
# Crawler
The easiest way to call a crawler to get more urls is to use the `microscope.Manager` object:

### Crawling N urls for a given site
```python
import c4v.microscope as ms

manager = ms.Manager.from_default()
# Crawls 100 urls using the crawler named "primicia"
manager.crawl("primicia", limit = 100)
```

But how do you know which crawler names this command supports?
### Getting possible crawlers
```python
import c4v.microscope as ms

manager = ms.Manager.from_default()

print(manager.available_crawlers())
# ['primicia', 'el_pitazo']
```
You can use this list to check whether a given crawler name is valid, as shown below.
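For instance, a small sketch that validates a crawler name before crawling, reusing the calls shown above (the name and limit are just examples):
```python
import c4v.microscope as ms

manager = ms.Manager.from_default()

crawler_name = "primicia"
if crawler_name in manager.available_crawlers():
    # Only crawl when the crawler actually exists
    manager.crawl(crawler_name, limit=10)
else:
    print(f"Unknown crawler: {crawler_name}")
```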

### Crawling new urls
You can crawl urls that are not yet stored in the configured database using the following code. You can also choose not to save them if you prefer.
```python
import c4v.microscope as ms

manager = ms.Manager.from_default()

# None of the retrieved urls are already in the database; store them after retrieval
print(manager.crawl_new_urls_for(['primicia'], limit = 100))

# None of the retrieved urls are already in the database; don't save them
print(manager.crawl_new_urls_for(['primicia'], limit = 100, save_to_db=False))
```

### Crawl and scrape
Perhaps you just want to crawl urls and scrape them right afterwards; you can do that easily by following this example:
```python
import c4v.microscope as ms

manager = ms.Manager.from_default()

# Crawl, scrape, and save to db at the same time
manager.crawl_and_scrape_for(['primicia'], limit = 100)
```
## Creation
You can create a new crawler in order to support exploring new urls for new sites. More details about this in ["creating a crawler"](./creating-a-crawler.md)
# Persistency Manager
The persistency manager component lets you specify how data is persisted.
There are multiple persistency managers, and users can even provide their own, but all of them should expose the same API. Right now, we have two especially important persistency managers:

1. `SQLiteManager` : A persistency manager to **store data in a local SQLite db**, used by the default `microscope.Manager` object configuration and the cli tool.
2. `BigQueryManager` : A persistency manager **that stores data in Google Cloud, using Firestore and BigQuery**. When a new instance comes from a crawl, **it's persisted to Firestore** as long as its data has not yet been filled by a scraper and a classifier. Once all components have run and the data is filled, **it's moved to BigQuery**.

## Creation
You can create a new `Persistency Manager` object in order to support new ways of storing data. More details about this in ["creating a persistency manager"](./creating-a-persistency-manager.md)
# Experiment
An experiment is **a Python script specifying a training run for a model class**; you can use experiments to create models and iterate quickly. Since they're Python files, you can perform all the data manipulation you need before you run your experiment. Right now we have the following models and experiments:

- `service_classification_experiment.py` : An experiment to train a multi-label model for service type classification.

- `test_lang_model_train.py` : An experiment to train a model on a fill-mask task in order to improve its accuracy on a specific Spanish dialect.

- `test_relevance_classifier.py` : An experiment to train a model to tell if an article is relevant or not.

# ExperimentFSManager
It's an object that manages how experiments are saved. Experiments are usually identified by a **branch** and an **experiment name**, which map to a folder structure inside the c4v folder.
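
As a small illustration, the classifier experiment from the experiments guide is addressed exactly this way; the branch and experiment names below are only examples:
```python
from c4v.classifier.classifier_experiment import ClassifierExperiment

# "samples" is the branch and "relevance_classifier" is the experiment name;
# files produced by this run end up under the matching folder inside the c4v folder.
exp = ClassifierExperiment.from_branch_and_experiment("samples", "relevance_classifier")
```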


77 changes: 76 additions & 1 deletion docs/docs/development/creating-a-crawler.md
@@ -1,2 +1,77 @@
# Creating a Crawler
The base crawler works by scanning a site's sitemap looking for new urls to scrape. Creating a new crawler is about **telling the crawler how to
look for new urls in the sitemap.**

To create a crawler, all you need to do is subclass the base class `BaseCrawler` in `c4v.scraper.crawler.crawlers.base_crawler`. To do so, you have to provide an implementation for the required methods, a crawler `name`, and its `start_sitemap_url`.

For example:
```python
from c4v.scraper.crawler.crawlers.base_crawler import BaseCrawler

class SampleCrawler(BaseCrawler):

    start_sitemap_url = "https://samplesite.com/sitemap.xml"
    name = "sample"
```

1. `start_sitemap_url` is the **site's sitemap**, which can often be found at `<your site domain>/sitemap.xml`
2. `name` is the crawler's name, which is **necessary to refer to it** in multiple operations

Now we have to implement the **static method** `should_crawl`:
```python
class SampleCrawler(BaseCrawler):

    # crawler name and start sitemap url...

    @staticmethod
    def should_crawl(url: str) -> bool:
        return url.startswith("https://samplesite.com/some-prefix-")
```

Usually, sitemaps are composed of more sitemaps, and this method tells the crawler **which of those sub sitemaps are desired**. A common approach to check whether a specific sub sitemap is useful is **checking its url**, as it often contains patterns that imply its meaning.
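
For example, with the `should_crawl` implementation above, a sub sitemap whose url contains the expected prefix is kept and any other is discarded (these sub sitemap urls are made up for illustration):
```python
sub_sitemaps = [
    "https://samplesite.com/some-prefix-2022-01.xml",  # matches the prefix -> crawled
    "https://samplesite.com/other-section.xml",        # filtered out
]
print([SampleCrawler.should_crawl(url) for url in sub_sitemaps])
# [True, False]
```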


### Fine grained url filtering
Maybe you want to perform some checking on each url that will be scraped later. To do this, you can *optionally* implement the `should_scrape` method, which checks whether each url is a valid one.


```python
class SampleCrawler(BaseCrawler):

    # crawler name and start sitemap url...
    # should_crawl function...

    def should_scrape(url: str) -> bool:
        return url.startswith("https://samplesite.com") and len(url) > 42
```

With all this done, we already have a fully functional crawler, but **it's not yet available to the library** for some common operations, such as triggering a crawl for this new site.
## Adding a crawler

To add a crawler, all we need to do is to add its class to the `INSTALLED_CRAWLERS` list in the `c4v.scraper.settings` file:
```python
import sample_crawler

INSTALLED_CRAWLERS: List[Type[BaseCrawler]] = [
    # ...more crawlers,
    sample_crawler.SampleCrawler,
]
```
Now our new crawler is available for common operations with scrapers: it's recognized by the `microscope.Manager` object and available from the CLI.
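
For example, once the crawler is registered you should be able to drive it through the manager, reusing the `crawl_and_scrape_for` call shown in the architecture page (a sketch, assuming the crawler name `sample` defined above):
```python
import c4v.microscope as ms

manager = ms.Manager.from_default()

# "sample" is the name we gave our crawler above
manager.crawl_and_scrape_for(["sample"], limit=10)
```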

## Adding irrelevant articles filtering
Sometimes we need to create a dataset with **non-relevant labeled data**, and labeling a lot of rows by hand can be a tiresome and time-consuming task. In this case, you can tell the crawler to crawl **only links that hold irrelevant information**.

To do this, you can provide a list `IRRELEVANT_URLS` with **regex patterns** for links that you already know are irrelevant for your use case.
```python
class SampleCrawler(BaseCrawler):
    # name and start sitemap link...
    IRRELEVANT_URLS = [
        ".*samplesite.com/irrelevant-section1/.*",
        ".*samplesite.com/irrelevant-section2/.*",
        # ...
    ]

    # should_crawl definition...
```
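
As a quick sanity check (plain Python, outside the library), you can verify that the patterns match the kind of urls you expect to be labeled as irrelevant:
```python
import re

pattern = re.compile(".*samplesite.com/irrelevant-section1/.*")
print(bool(pattern.match("https://samplesite.com/irrelevant-section1/some-article")))  # True
print(bool(pattern.match("https://samplesite.com/news/some-article")))                 # False
```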
97 changes: 96 additions & 1 deletion docs/docs/development/creating-a-scraper.md
@@ -1,2 +1,97 @@
# Creating a Scraper
A scraper is a component that receives some urls and returns all the data it can gather from those web pages. Usually, **the kind of data you can get for each page might vary a lot**, but since we need to store data for a lot of sites, we need to make this data canonical across sites. To do this, data is usually scraped into a site-specific format, and that format should **provide a way to transform itself into the canonical data format**, the `ScrapedData` dataclass.

!!!Info
    More about python dataclasses [here](https://docs.python.org/3/library/dataclasses.html)

The easiest way to create a new scraper is to implement a *scrapy-based scraper*; the following section explains how to do that. If you need a fine-grained approach to scrape a specific web page, that's possible as well by implementing the `BaseScraper` class.

## Creating a Scrapy-based scraper
To create a Scrapy-based scraper, we have to follow four simple steps:

1. **Create the data format**: We will create a data format describing the data we can gather from our site.
2. **Create a spider**: A scrapy object we use to parse data from the site.
3. **Create a scraper class**: The object that describes a scraper to the `c4v` library.
4. **Wire the scraper to the library**: Connect your brand new scraper to the rest of the features that the `c4v` library has to offer.

### Creating the data format
Let's say that we want to scrape articles from the site `samplesite.com`. First, we need to **create a data format for our specific site**, which will describe the data we know we can extract for every article on that site. To do so, we create a `dataclass` inheriting the base class `BaseDataFormat` in `c4v.scraper.scraped_data_classes.base_scraped_data`:
```python
from dataclasses import dataclass
from typing import List

from c4v.scraper.scraped_data_classes.base_scraped_data import BaseDataFormat

@dataclass(frozen=True)
class SampleSiteData(BaseDataFormat):
    """
    Base data format for samplesite.com
    """
    tags: List[str]
    categories: List[str]
    title: str
    author: str
    date: str
    body: str

```

Now that we have our data format, we have to implement the `to_scraped_data` method, so the scraper knows how to map from this data to the canonical `ScrapedData` format:

```python
from dataclasses import dataclass

from c4v.scraper.scraped_data_classes.base_scraped_data import BaseDataFormat
# Note: the exact module providing ScrapedData and Sources is assumed here
from c4v.scraper.scraped_data_classes.scraped_data import ScrapedData, Sources

@dataclass(frozen=True)
class SampleSiteData(BaseDataFormat):
    # Fields...

    def to_scraped_data(self) -> ScrapedData:
        return ScrapedData(
            author=self.author,
            date=self.date,
            title=self.title,
            categories=self.tags + self.categories,
            content=self.body,
            url=self.url,
            last_scraped=self.last_scraped,
            source=Sources.SCRAPING,
        )
```

!!!Warning
    The `source` field is important as it tells you where this `ScrapedData` instance came from. In this case, it came from a scraper.

### Creating the spider
Now that we have our data format, we need to create a scrapy spider that will retrieve information for each article. The process for creating a scrapy spider is well documented in the [scrapy documentation](https://docs.scrapy.org/en/latest/intro/tutorial.html#our-first-spider), so the only thing you need to know is that the `parse` method for your spider should return an instance of the data format we just created, the `SampleSiteData` class.
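
A rough sketch of such a spider is shown below. The spider name and the CSS selectors are made-up placeholders (adapt them to the real page markup), it assumes `SampleSiteData` is defined in the same module, and any extra fields required by the base data format are omitted for brevity:
```python
import scrapy

class SampleSiteSpider(scrapy.Spider):
    name = "samplesite"

    def parse(self, response):
        # Build the site-specific data format from an article page
        yield SampleSiteData(
            tags=response.css(".tag::text").getall(),
            categories=response.css(".category::text").getall(),
            title=response.css("h1::text").get(),
            author=response.css(".author::text").get(),
            date=response.css("time::attr(datetime)").get(),
            body=" ".join(response.css(".article-body p::text").getall()),
        )
```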

### Creating the scraper class
Now we need to create a scraper class that inherits the `BaseScrapyScraper` class. This class will describe our scraper to the rest of the library:

```python
from c4v.scraper.scrapers.base_scrapy_scraper import BaseScrapyScraper

class SampleSiteScraper(BaseScrapyScraper):
"""
Scrapes data from SampleSite
"""

intended_domain = "samplesite.com"
spider = SampleSiteSpider

```

1. With `intended_domain` we're telling the library which urls are scrapable with this class (the ones that belong to this domain)
2. With `spider`, we're telling the scraper that this is the spider it should use when scraping the site with scrapy

### Wiring the scraper
Now that we have a fully functional scraper, we might want to add it to the `c4v` library as another scraper, so it can be used in the cli tool, the dashboard, and the `microscope.Manager` object.

To do this, the only thing we have to do is go to the `c4v.scraper.settings` module and add our scraper to the `INSTALLED_SCRAPERS` list:
```python
INSTALLED_SCRAPERS: List[Type[BaseScraper]] = [
    # More scrapers...
    SampleSiteScraper,
]
```

And that's it, now you have a fully functional scraper that's available to the entire `c4v` library 🎉.
77 changes: 77 additions & 0 deletions docs/docs/development/experiments.md
@@ -0,0 +1,77 @@
## Experiments and classification
The microscope library doesn't provide a trained model to perform classifications, but you can train one yourself using an experiment.

The library provides a few sample experiments; in this section you will learn how to run an experiment and how to create your own for a custom classification type.

## Running a sample experiment

You can find a few experiments in the `experiments_samples` folder in the root project directory. Here we'll use the `test_relevance_classifier.py` experiment to train a model for relevance classification. Let's **create a file in the root project directory** and paste the following code from that experiment:

```python
"""
Sample for a classifier experiment
"""
from c4v.classifier.classifier_experiment import ClassifierArgs, ClassifierExperiment
from c4v.scraper.scraped_data_classes.scraped_data import RelevanceClassificationLabels
args = ClassifierArgs(
    training_args={
        "per_device_train_batch_size": 10,
        "per_device_eval_batch_size": 1,
        "num_train_epochs": 3,
        "warmup_steps": 1000,
        "save_steps": 1000,
        "save_total_limit": 1,
        "evaluation_strategy": "epoch",
        "save_strategy": "epoch",
        "load_best_model_at_end": True,
        "eval_accumulation_steps": 1,
        "learning_rate": 5e-07,
        "adafactor": True,
    },
    columns=["title"],
    train_dataset_name="relevance_training_dataset.csv",
    confirmation_dataset_name="relevance_confirmation_dataset.csv",
    label_column="label_relevance",
    description="Classifier sample",
    labelset=RelevanceClassificationLabels,
)

exp = ClassifierExperiment.from_branch_and_experiment("samples", "relevance_classifier")
exp.run_experiment(args)
```
We're going to explain step by step what's happening here:

1. The `training_args` dictionary holds arguments that are later passed to the huggingface training API as training arguments. More about huggingface training arguments [here](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).
2. The `columns` field tells the model which fields to use as input from the specified dataset. All our datasets should have at least the fields in the `ScrapedData` data format. This experiment uses only the title as input.
3. The `train_dataset_name` is the name of a dataset stored in `data/processed/huggingface`.
4. The `label_column` field is the name of the column inside the provided dataset to use as the target label. For example, if you want to predict results in column `A`, then this field is set to `A`.
5. `description` is a human-readable description for this experiment. You can omit it, but it's highly recommended so you can later understand what you were trying to achieve with this model.
6. `labelset` is an enum class specifying which labels this model should use, as this is a multilabel training run.

Now, to run this experiment we just run it like any other Python program: `python sample_experiment.py`.

This is how an experiment usually looks. Most experiments aimed at improving a classifier's performance will change just this: the values passed to the model. The surrounding setup is encapsulated by the `Experiment` class.

## Creating a new experiment

To create your own experiment, you just have to extend the following classes, which you can find in `src/c4v/classifier/experiment.py` (see the sketch after this list):

* `BaseExperimentSummary` : A class that provides information about the experiment's results. You should add fields relevant to your experiment, and then override how the information is printed (by overriding the `__str__` method). This class is used to store information about the experiment in local storage.
* `BaseExperimentArguments` : A class that holds the data your experiment needs. It's important so the library can record which arguments you used in previous experiments.
* `BaseExperiment` : This class is the experiment that will be run. All you have to do is implement your experiment's setup and execution in `experiment_to_run`. Note that this function receives a `BaseExperimentArguments` instance as input and returns a `BaseExperimentSummary` instance as output.
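
Below is a minimal sketch of what such an experiment might look like. It is not the library's API verbatim: the field names, the toy computation, and the way the base classes are extended are all assumptions for illustration only.

```python
from c4v.classifier.experiment import (
    BaseExperiment,
    BaseExperimentArguments,
    BaseExperimentSummary,
)

class WordCountArguments(BaseExperimentArguments):
    # Hypothetical argument: which dataset column to measure
    column: str = "title"

class WordCountSummary(BaseExperimentSummary):
    # Hypothetical result field
    avg_word_count: float = 0.0

    def __str__(self) -> str:
        return f"Average word count: {self.avg_word_count:.2f}"

class WordCountExperiment(BaseExperiment):
    def experiment_to_run(self, args: WordCountArguments) -> WordCountSummary:
        # Toy "experiment": measure an average word count instead of training a model
        rows = ["sample title one", "another sample title"]  # stand-in for real data
        summary = WordCountSummary()
        summary.avg_word_count = sum(len(r.split()) for r in rows) / len(rows)
        return summary
```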

Now, if you implement your experiments this way, all the data related to them will be managed automatically by the `c4v-py` library, and the resulting model itself will be available for classification!

## Classifier classes
Right now, we provide support for relevance classification and service classification, but note that those are not separate model classes; they're the same `Classifier` model class with different configuration options. You can create them like this:

```python
# Create a relevance classifier
relevance_model = Classifier.relevance(<Same args as in the classifier class>)

# Create a service classifier
service_model = Classifier.service(<Same args as in the classifier class>)
```

This is the preferred way to create multilabel classifiers. For a fine-grained implementation, you should inherit from the `BaseModel` class in `src/c4v/classifier/base_model.py`.
6 changes: 5 additions & 1 deletion docs/docs/index.md
@@ -21,6 +21,10 @@ Use pip to install the package:
pip install c4v-py
```

### Installation profiles
You can choose from multiple installation profiles to fit your specific needs and match the type of environment you work in. More about it [here](usage/installation.md).

## Usage
The c4v-py package can be used either as a command line tool or as a library.
@@ -53,7 +57,7 @@ d = manager.crawl_new_urls_for(
print(d) # A (possibly empty) list of urls as string
print(len(d)) # a number <= 10
```
More about it [here](usage/microscope-as-a-library.md).

## Contributing
