Changes from all commits
85 commits
cef1d63
reverted bad git merge
LDiazN Dec 3, 2021
1101e6b
added initial layout for c4v app
LDiazN Dec 3, 2021
c8cdf74
added logging functions to CLIClient
LDiazN Dec 7, 2021
686733b
added logging levels to cli tool
LDiazN Dec 7, 2021
3b39257
added test script to help with hand testing
LDiazN Dec 7, 2021
5e1c071
added scrape and crawl functions to manager class
LDiazN Dec 8, 2021
612a954
changed dependencies to add big query and cloud dependencies
LDiazN Dec 9, 2021
da65b85
started initial version of big query persistency manager
LDiazN Dec 9, 2021
95558db
added cloud functions; updated lockfile
LDiazN Dec 10, 2021
63b1fe1
debugged big query persistency manager
LDiazN Dec 10, 2021
99b319e
added initial version of cloud crawl function
LDiazN Dec 14, 2021
9485fdf
added initial version of scrape cloud function
LDiazN Dec 14, 2021
7e12fce
added some docs
LDiazN Dec 14, 2021
184f788
added initial version of gcloud storage manager
LDiazN Dec 15, 2021
1fd3bed
added object to manage files from google storage
LDiazN Dec 16, 2021
01d920e
added some docs to functions
LDiazN Dec 16, 2021
a990bce
added tar compression on upload of model file
LDiazN Dec 16, 2021
a13d45c
added some error checking and info prints
LDiazN Dec 16, 2021
13dd72e
replaced info prints with logging.info
LDiazN Dec 16, 2021
fe950d2
added config to set up a bucket where to look for files
LDiazN Dec 16, 2021
5264ce2
added functions to upload and download models from the cloud
LDiazN Dec 16, 2021
c12f093
removed typo
LDiazN Dec 17, 2021
7e73097
fixed dumb bug
LDiazN Dec 17, 2021
fb21184
added cli functions to upload and download models
LDiazN Dec 17, 2021
61c7425
added functions on dashboard and cli tool to upload and download models
LDiazN Dec 17, 2021
048a43b
added cloud backend to dashboard app
LDiazN Jan 3, 2022
ec78032
added dict representation to scraped data
LDiazN Jan 5, 2022
ad47808
some docs
LDiazN Jan 5, 2022
75a1c4e
added some firestore based persistency manager functions
LDiazN Jan 7, 2022
82b935b
added gcloud persistency manager
LDiazN Jan 10, 2022
bc6ea32
added firestore
LDiazN Jan 11, 2022
32154b0
updated dependencies
LDiazN Jan 11, 2022
d783878
finished BQ persistency manager
LDiazN Jan 11, 2022
38a365b
added crawling function
LDiazN Jan 12, 2022
f32fa90
removed debug comment
LDiazN Jan 12, 2022
d8d0f14
added requirements for cloud function
LDiazN Jan 12, 2022
3d41286
obscure patch to permission issue with firestore
LDiazN Jan 12, 2022
4bffbea
some testing in dashboard app; fixed a bad message in cli tool
LDiazN Jan 12, 2022
c8ec36a
changed argument name
LDiazN Jan 12, 2022
5f31bce
better configuration for scrape function
LDiazN Jan 12, 2022
e0f5ccf
now scrapy can be called many times
LDiazN Jan 12, 2022
8b8d815
added scrapydo as dependency
LDiazN Jan 12, 2022
68e8b63
finished scrape function
LDiazN Jan 12, 2022
169ccdd
added initial structure for classification function
LDiazN Jan 12, 2022
ad72664
initial classify function code updated; some changes to manager api
LDiazN Jan 13, 2022
43bceb5
modified default values of big query persistency manager and base per…
LDiazN Jan 14, 2022
6c1e94b
added limit to classification function
LDiazN Jan 14, 2022
a7b536b
added move function to move data from firestore to big query when it'…
LDiazN Jan 14, 2022
e541ae3
added cloud functions to dashboard on cloud backend
LDiazN Jan 18, 2022
e18634f
changed big query date format
LDiazN Jan 18, 2022
693c580
deleted useless file
LDiazN Jan 18, 2022
5a6d561
removed debug print
LDiazN Jan 18, 2022
3ce68c5
merging multilabel classification
LDiazN Jan 18, 2022
802d07d
fixed bug about using invalid label name
LDiazN Jan 18, 2022
e4bd87a
bug fixing
LDiazN Jan 19, 2022
750c568
fixed bug about using the wrong type of score on multilabel
LDiazN Jan 19, 2022
bc7ef6c
using wrong label
LDiazN Jan 19, 2022
5d6ed0b
fixing missing type of classification
LDiazN Jan 19, 2022
2d5dc48
removed debug print
LDiazN Jan 19, 2022
8a6584c
added default service classification of non relevant rows
LDiazN Jan 19, 2022
e4f05d4
updated transformers
LDiazN Jan 19, 2022
4721c52
updated sample about service classification
LDiazN Jan 19, 2022
67db582
added upload button
LDiazN Jan 19, 2022
b6579a9
added move button to dashboard
LDiazN Jan 20, 2022
ff5b238
added informative info message for an error
LDiazN Jan 20, 2022
c52bdac
changed big query manager to expect date in string format
LDiazN Jan 20, 2022
2237f3e
changed cloud function to clean environment on exit
LDiazN Jan 20, 2022
c954280
added final version of classification cloud function
LDiazN Jan 21, 2022
449d1f7
fixed labels function in labelset
LDiazN Jan 22, 2022
02c46a9
added 'creating a crawler' page
LDiazN Jan 25, 2022
7872856
added explanation about crawlers
LDiazN Jan 26, 2022
a45e827
added guide to create a scraper
LDiazN Jan 26, 2022
fa61af5
extended page about architecture
LDiazN Jan 26, 2022
d319944
added installation & configuration profiles to docs
LDiazN Jan 27, 2022
9b32694
added configuration by environment variables page; added link to inst…
LDiazN Jan 27, 2022
2a169f3
added page with tutorial about the cli tool
LDiazN Jan 27, 2022
246751f
removed debug print
LDiazN Jan 28, 2022
f463cfe
small fixes
LDiazN Jan 28, 2022
f99993b
black reformat
LDiazN Jan 28, 2022
532f45e
removed debug comment
LDiazN Jan 28, 2022
19ee4d3
black reformat
LDiazN Jan 28, 2022
d009226
removed useless import
LDiazN Jan 28, 2022
4218384
updated lockfile
LDiazN Jan 28, 2022
cf638bc
added missing docs
LDiazN Jan 28, 2022
6d12c0d
Merge branch 'luis/dashboaord-app' into luis/final-docs
LDiazN Jan 29, 2022
4,801 changes: 2,400 additions & 2,401 deletions data/processed/huggingface/service_training_dataset.csv

Large diffs are not rendered by default.

70 changes: 64 additions & 6 deletions docs/docs/development/architecture.md
@@ -76,19 +76,77 @@ urls = [

assert m.split_non_scrapable(urls) == (urls[:2], urls[2:])
```
## Creation
You can create a new scraper in order to support scraping for new sites. More details about this in ["creating a scraper"](./creating-a-scraper.md)
# Crawler
The easiest way to call a crawler to get more urls is to use the `microscope.Manager` object:

### Crawling N urls for a given site
```python
import c4v.microscope as ms

manager = ms.Manager.from_default()
# Crawls 100 urls using the crawler named "primicia"
manager.crawl("primicia", limit = 100)
```

But how do you know which crawler names this command supports?
### Getting possible crawlers
```python
import c4v.microscope as ms

manager = ms.Manager.from_default()

print(manager.available_crawlers())
# ['primicia', 'el_pitazo']
```
You can use this list to check whether a given crawler name is valid, as shown below.
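For instance, a small sketch that validates a crawler name before crawling, reusing the calls shown above (the name and limit are just examples):
```python
import c4v.microscope as ms

manager = ms.Manager.from_default()

crawler_name = "primicia"
if crawler_name in manager.available_crawlers():
    # Only crawl when the crawler actually exists
    manager.crawl(crawler_name, limit=10)
else:
    print(f"Unknown crawler: {crawler_name}")
```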

### Crawling new urls
You can crawl urls that are not yet stored in the configured database using the following code. You can also choose not to save them if you prefer.
```python
import c4v.microscope as ms

manager = ms.Manager.from_default()

# None of the retrieved urls are already in the database; store them after retrieval
print(manager.crawl_new_urls_for(['primicia'], limit = 100))

# None of the retrieved urls are already in the database; don't save them
print(manager.crawl_new_urls_for(['primicia'], limit = 100, save_to_db=False))
```

### Crawl and scrape
Perhaps you just want to crawl urls and scrape them right afterwards; you can do that easily by following this example:
```python
import c4v.microscope as ms

manager = ms.Manager.from_default()

# Crawl, scrape, and save to db at the same time
manager.crawl_and_scrape_for(['primicia'], limit = 100)
```
## Creation
You can create a new crawler in order to support exploring new urls for new sites. More details about this in ["creating a crawler"](./creating-a-crawler.md)
# Persistency Manager
The persistency manager component lets you specify how data is persisted.
There are multiple persistency managers, and users can even provide their own, but all of them should expose the same API. Right now, we have two especially important persistency managers:

1. `SQLiteManager` : A persistency manager to **store data in a local SQLite db**, used by the default `microscope.Manager` object configuration and the cli tool.
2. `BigQueryManager` : A persistency manager **that stores data in Google Cloud, using Firestore and BigQuery**. When a new instance comes from a crawl, **it's persisted to Firestore** as long as its data has not yet been filled by a scraper and a classifier. Once all components have run and the data is filled, **it's moved to BigQuery**.

## Creation
You can create a new `Persistency Manager` object in order to support new ways of storing data. More details about this in ["creating a persistency manager"](./creating-a-persistency-manager.md)
# Experiment
An experiment is **a Python script specifying a training run for a model class**; you can use experiments to create models and iterate quickly. Since they're Python files, you can perform all the data manipulation you need before you run your experiment. Right now we have the following models and experiments:

- `service_classification_experiment.py` : An experiment to train a multi-label model for service type classification.

- `test_lang_model_train.py` : An experiment to train a model on a fill-mask task in order to improve its accuracy on a specific Spanish dialect.

- `test_relevance_classifier.py` : An experiment to train a model to tell if an article is relevant or not.

# ExperimentFSManager
It's an object that manages how experiments are saved. Experiments are usually identified by a **branch** and an **experiment name**, which map to a folder structure inside the c4v folder.
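
As a small illustration, the classifier experiment from the experiments guide is addressed exactly this way; the branch and experiment names below are only examples:
```python
from c4v.classifier.classifier_experiment import ClassifierExperiment

# "samples" is the branch and "relevance_classifier" is the experiment name;
# files produced by this run end up under the matching folder inside the c4v folder.
exp = ClassifierExperiment.from_branch_and_experiment("samples", "relevance_classifier")
```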


77 changes: 76 additions & 1 deletion docs/docs/development/creating-a-crawler.md
@@ -1,2 +1,77 @@
# Creating a Crawler
The base crawler works by scanning a site's sitemap looking for new urls to scrape. Creating a new crawler is about **telling the crawler how to
look for new urls in the sitemap.**

To create a crawler, all you need to do is subclass the base class `BaseCrawler` in `c4v.scraper.crawler.crawlers.base_crawler`. To do so, you have to provide an implementation for the required methods, a crawler `name`, and its `start_sitemap_url`.

For example:
```python
from c4v.scraper.crawler.crawlers.base_crawler import BaseCrawler

class SampleCrawler(BaseCrawler):

    start_sitemap_url = "https://samplesite.com/sitemap.xml"
    name = "sample"
```

1. `start_sitemap_url` is the **site's sitemap**, which can often be found at `<your site domain>/sitemap.xml`
2. `name` is the crawler's name, which is **necessary to refer to it** in multiple operations

Now we have to implement the **static method** `should_crawl`:
```python
class SampleCrawler(BaseCrawler):

    # crawler name and start sitemap url...

    @staticmethod
    def should_crawl(url: str) -> bool:
        return url.startswith("https://samplesite.com/some-prefix-")
```

Usually, sitemaps are composed of more sitemaps, and this method tells the crawler **which of those sub sitemaps are desired**. A common approach to check whether a specific sub sitemap is useful is **checking its url**, as it often contains patterns that imply its meaning.
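
For example, with the `should_crawl` implementation above, a sub sitemap whose url contains the expected prefix is kept and any other is discarded (these sub sitemap urls are made up for illustration):
```python
sub_sitemaps = [
    "https://samplesite.com/some-prefix-2022-01.xml",  # matches the prefix -> crawled
    "https://samplesite.com/other-section.xml",        # filtered out
]
print([SampleCrawler.should_crawl(url) for url in sub_sitemaps])
# [True, False]
```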


### Fine grained url filtering
Maybe you want to perform some checking on each url that will be scraped later. To do this, you can *optionally* implement the `should_scrape` method, which checks whether each url is a valid one.


```python
class SampleCrawler(BaseCrawler):

    # crawler name and start sitemap url...
    # should_crawl function...

    def should_scrape(url: str) -> bool:
        return url.startswith("https://samplesite.com") and len(url) > 42
```

With all this done, we already have a fully functional crawler, but **it's not yet available to the library** for some common operations, such as triggering a crawl for this new site.
## Adding a crawler

To add a crawler, all we need to do is to add its class to the `INSTALLED_CRAWLERS` list in the `c4v.scraper.settings` file:
```python
import sample_crawler

INSTALLED_CRAWLERS: List[Type[BaseCrawler]] = [
    # ...more crawlers,
    sample_crawler.SampleCrawler,
]
```
Now our new crawler is available for common operations with scrapers: it's recognized by the `microscope.Manager` object and available from the CLI.
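
For example, once the crawler is registered you should be able to drive it through the manager, reusing the `crawl_and_scrape_for` call shown in the architecture page (a sketch, assuming the crawler name `sample` defined above):
```python
import c4v.microscope as ms

manager = ms.Manager.from_default()

# "sample" is the name we gave our crawler above
manager.crawl_and_scrape_for(["sample"], limit=10)
```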

## Adding irrelevant articles filtering
Sometimes we need to create a dataset with **non-relevant labeled data**, and labeling a lot of rows by hand can be a tiresome and time-consuming task. In this case, you can tell the crawler to crawl **only links that hold irrelevant information**.

To do this, you can provide a list `IRRELEVANT_URLS` with **regex patterns** for links that you already know are irrelevant for your use case.
```python
class SampleCrawler(BaseCrawler):
    # name and start sitemap link...
    IRRELEVANT_URLS = [
        ".*samplesite.com/irrelevant-section1/.*",
        ".*samplesite.com/irrelevant-section2/.*",
        # ...
    ]

    # should_crawl definition...
```
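
As a quick sanity check (plain Python, outside the library), you can verify that the patterns match the kind of urls you expect to be labeled as irrelevant:
```python
import re

pattern = re.compile(".*samplesite.com/irrelevant-section1/.*")
print(bool(pattern.match("https://samplesite.com/irrelevant-section1/some-article")))  # True
print(bool(pattern.match("https://samplesite.com/news/some-article")))                 # False
```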
97 changes: 96 additions & 1 deletion docs/docs/development/creating-a-scraper.md
@@ -1,2 +1,97 @@
# Creating a Scraper
A scraper is a component that receives some urls and returns all the data it can gather from those web pages. Usually, **the kind of data you can get for each page might vary a lot**, but since we need to store data for a lot of sites, we need to make this data canonical across sites. To do this, data is usually scraped into a site-specific format, and that format should **provide a way to transform itself into the canonical data format**, the `ScrapedData` dataclass.

!!!Info
    More about python dataclasses [here](https://docs.python.org/3/library/dataclasses.html)

The easiest way to create a new scraper is to implement a *scrapy-based scraper*; the following section explains how to do that. If you need a fine-grained approach to scrape a specific web page, that's possible as well by implementing the `BaseScraper` class.

## Creating a Scrapy-based scraper
To create a Scrapy-based scraper, we have to follow four simple steps:

1. **Create the data format**: We will create a data format describing the data we can gather from our site.
2. **Create a spider**: A scrapy object we use to parse data from the site.
3. **Create a scraper class**: The object that describes a scraper to the `c4v` library.
4. **Wire the scraper to the library**: Connect your brand new scraper to the rest of the features that the `c4v` library has to offer.

### Creating the data format
Let's say that we want to scrape articles from the site `samplesite.com`. First, we need to **create a data format for our specific site**, which will describe the data we know we can extract for every article on that site. To do so, we create a `dataclass` inheriting the base class `BaseDataFormat` in `c4v.scraper.scraped_data_classes.base_scraped_data`:
```python
from dataclasses import dataclass
from typing import List

from c4v.scraper.scraped_data_classes.base_scraped_data import BaseDataFormat

@dataclass(frozen=True)
class SampleSiteData(BaseDataFormat):
    """
    Base data format for samplesite.com
    """
    tags: List[str]
    categories: List[str]
    title: str
    author: str
    date: str
    body: str

```

Now that we have our data format, we have to implement the `to_scraped_data` method, so the scraper knows how to map from this data to the canonical `ScrapedData` format:

```python
from dataclasses import dataclass

from c4v.scraper.scraped_data_classes.base_scraped_data import BaseDataFormat
# Note: the exact module providing ScrapedData and Sources is assumed here
from c4v.scraper.scraped_data_classes.scraped_data import ScrapedData, Sources

@dataclass(frozen=True)
class SampleSiteData(BaseDataFormat):
    # Fields...

    def to_scraped_data(self) -> ScrapedData:
        return ScrapedData(
            author=self.author,
            date=self.date,
            title=self.title,
            categories=self.tags + self.categories,
            content=self.body,
            url=self.url,
            last_scraped=self.last_scraped,
            source=Sources.SCRAPING,
        )
```

!!!Warning
    The `source` field is important as it tells you where this `ScrapedData` instance came from. In this case, it came from a scraper.

### Creating the spider
Now that we have our data format, we need to create a scrapy spider that will retrieve information for each article. The process for creating a scrapy spider is well documented in the [scrapy documentation](https://docs.scrapy.org/en/latest/intro/tutorial.html#our-first-spider), so the only thing you need to know is that the `parse` method for your spider should return an instance of the data format we just created, the `SampleSiteData` class.
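
A rough sketch of such a spider is shown below. The spider name and the CSS selectors are made-up placeholders (adapt them to the real page markup), it assumes `SampleSiteData` is defined in the same module, and any extra fields required by the base data format are omitted for brevity:
```python
import scrapy

class SampleSiteSpider(scrapy.Spider):
    name = "samplesite"

    def parse(self, response):
        # Build the site-specific data format from an article page
        yield SampleSiteData(
            tags=response.css(".tag::text").getall(),
            categories=response.css(".category::text").getall(),
            title=response.css("h1::text").get(),
            author=response.css(".author::text").get(),
            date=response.css("time::attr(datetime)").get(),
            body=" ".join(response.css(".article-body p::text").getall()),
        )
```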

### Creating the scraper class
Now we need to create a scraper class that inherits the `BaseScrapyScraper` class. This class will describe our scraper to the rest of the library:

```python
from c4v.scraper.scrapers.base_scrapy_scraper import BaseScrapyScraper

class SampleSiteScraper(BaseScrapyScraper):
"""
Scrapes data from SampleSite
"""

intended_domain = "samplesite.com"
spider = SampleSiteSpider

```

1. With `intended_domain` we're telling the library which urls are scrapable with this class (the ones that belong to this domain)
2. With `spider`, we're telling the scraper that this is the spider it should use when scraping the site with scrapy

### Wiring the scraper
Now that we have a fully functional scraper, we might want to add it to the `c4v` library as another scraper, so it can be used in the cli tool, the dashboard, and the `microscope.Manager` object.

To do this, the only thing we have to do is go to the `c4v.scraper.settings` module and add our scraper to the `INSTALLED_SCRAPERS` list:
```python
INSTALLED_SCRAPERS: List[Type[BaseScraper]] = [
    # More scrapers...
    SampleSiteScraper,
]
```

And that's it, now you have a fully functional scraper that's available to the entire `c4v` library 🎉.
77 changes: 77 additions & 0 deletions docs/docs/development/experiments.md
@@ -0,0 +1,77 @@
## Experiments and classification
The microscope library doesn't provide a trained model to perform classifications, but you can train one yourself using an experiment.

The library provides a few sample experiments; in this section you will learn how to run an experiment and how to create your own for a custom classification type.

## Running a sample experiment

You can find a few experiments in the `experiments_samples` folder in the root project directory. Here we'll use the `test_relevance_classifier.py` experiment to train a model for relevance classification. Let's **create a file in the root project directory** and paste the following code from that experiment:

```python
"""
Sample for a classifier experiment
"""
from c4v.classifier.classifier_experiment import ClassifierArgs, ClassifierExperiment
from c4v.scraper.scraped_data_classes.scraped_data import RelevanceClassificationLabels
args = ClassifierArgs(
    training_args={
        "per_device_train_batch_size": 10,
        "per_device_eval_batch_size": 1,
        "num_train_epochs": 3,
        "warmup_steps": 1000,
        "save_steps": 1000,
        "save_total_limit": 1,
        "evaluation_strategy": "epoch",
        "save_strategy": "epoch",
        "load_best_model_at_end": True,
        "eval_accumulation_steps": 1,
        "learning_rate": 5e-07,
        "adafactor": True,
    },
    columns=["title"],
    train_dataset_name="relevance_training_dataset.csv",
    confirmation_dataset_name="relevance_confirmation_dataset.csv",
    label_column="label_relevance",
    description="Classifier sample",
    labelset=RelevanceClassificationLabels,
)

exp = ClassifierExperiment.from_branch_and_experiment("samples", "relevance_classifier")
exp.run_experiment(args)
```
We're going to explain step by step what's happening here:

1. The `training_args` dictionary holds arguments that are later passed to the huggingface training API as training arguments. More about huggingface training arguments [here](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).
2. The `columns` field tells the model which fields to use as input from the specified dataset. All our datasets should have at least the fields in the `ScrapedData` data format. This experiment uses only the title as input.
3. The `train_dataset_name` is the name of a dataset stored in `data/processed/huggingface`.
4. The `label_column` field is the name of the column inside the provided dataset to use as the target label. For example, if you want to predict results in column `A`, then this field is set to `A`.
5. `description` is a human-readable description for this experiment. You can omit it, but it's highly recommended so you can later understand what you were trying to achieve with this model.
6. `labelset` is an enum class specifying which labels this model should use, as this is a multilabel training run.

Now, to run this experiment we just run it like any other Python program: `python sample_experiment.py`.

This is how an experiment usually looks. Most experiments aimed at improving a classifier's performance will change just this: the values passed to the model. The surrounding setup is encapsulated by the `Experiment` class.

## Creating a new experiment

To create your own experiment, you just have to extend the following classes, which you can find in `src/c4v/classifier/experiment.py` (see the sketch after this list):

* `BaseExperimentSummary` : A class that provides information about the experiment's results. You should add fields relevant to your experiment, and then override how the information is printed (by overriding the `__str__` method). This class is used to store information about the experiment in local storage.
* `BaseExperimentArguments` : A class that holds the data your experiment needs. It's important so the library can record which arguments you used in previous experiments.
* `BaseExperiment` : This class is the experiment that will be run. All you have to do is implement your experiment's setup and execution in `experiment_to_run`. Note that this function receives a `BaseExperimentArguments` instance as input and returns a `BaseExperimentSummary` instance as output.
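
Below is a minimal sketch of what such an experiment might look like. It is not the library's API verbatim: the field names, the toy computation, and the way the base classes are extended are all assumptions for illustration only.

```python
from c4v.classifier.experiment import (
    BaseExperiment,
    BaseExperimentArguments,
    BaseExperimentSummary,
)

class WordCountArguments(BaseExperimentArguments):
    # Hypothetical argument: which dataset column to measure
    column: str = "title"

class WordCountSummary(BaseExperimentSummary):
    # Hypothetical result field
    avg_word_count: float = 0.0

    def __str__(self) -> str:
        return f"Average word count: {self.avg_word_count:.2f}"

class WordCountExperiment(BaseExperiment):
    def experiment_to_run(self, args: WordCountArguments) -> WordCountSummary:
        # Toy "experiment": measure an average word count instead of training a model
        rows = ["sample title one", "another sample title"]  # stand-in for real data
        summary = WordCountSummary()
        summary.avg_word_count = sum(len(r.split()) for r in rows) / len(rows)
        return summary
```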

Now, if you implement your experiments this way, all the data related to them will be managed automatically by the `c4v-py` library, and the resulting model itself will be available for classification!

## Classifier classes
Right now, we provide support for relevance classification and service classification, but note that those are not separate model classes; they're the same `Classifier` model class with different configuration options. You can create them like this:

```python
# Create a relevance classifier
relevance_model = Classifier.relevance(<Same args as in the classifier class>)

# Create a service classifier
service_model = Classifier.service(<Same args as in the classifier class>)
```

This is the preferred way to create multilabel classifiers. For a fine-grained implementation, you should inherit from the `BaseModel` class in `src/c4v/classifier/base_model.py`.
6 changes: 5 additions & 1 deletion docs/docs/index.md
@@ -21,6 +21,10 @@ Use pip to install the package:
pip install c4v-py
```

### Installation profiles
You can choose from multiple installation profiles to fit your specific needs and match the type of environment you work in. More about it [here](usage/installation.md).

## Usage
The c4v-py package can be used either as a command line tool or as a library.
@@ -53,7 +57,7 @@ d = manager.crawl_new_urls_for(
print(d) # A (possibly empty) list of urls as string
print(len(d)) # a number <= 10
```
More about it [here](usage/microscope-as-a-library.md).

## Contributing
