Merged
Commits
117 commits
279ffad
added initial docs layout
LDiazN Sep 1, 2021
c1be69b
added mkdocs as dependency
LDiazN Sep 1, 2021
88bcbad
added section about c4v-py as library and as cli
LDiazN Sep 2, 2021
f51a751
merge conflict
LDiazN Sep 2, 2021
e3eb31a
Some refactors for the api to be easier to use
LDiazN Sep 6, 2021
085d11b
added reimports to __init__.py files to expose the public api; fixed …
LDiazN Sep 7, 2021
ea7337a
added usage pages to docs
LDiazN Sep 7, 2021
8429c68
added low level api section to microscope as a library page; added na…
LDiazN Sep 8, 2021
bdd3919
added development pages
LDiazN Sep 8, 2021
96e39e2
added reimports to classifier module
LDiazN Sep 8, 2021
0bbe32a
test script to check how to evaluate the base language model accuracy
LDiazN Sep 14, 2021
cb2f85a
added helpful message
LDiazN Sep 14, 2021
9f0eb77
removed useless imports
LDiazN Sep 14, 2021
0157d99
language model class still WIP, working on dataset creation
LDiazN Sep 15, 2021
d94cc53
added default language model to config
LDiazN Sep 15, 2021
6a5d0b0
added dataset creation function
LDiazN Sep 15, 2021
0c85a87
added base class for models
LDiazN Sep 17, 2021
21e076d
added language model training for fillmask task
LDiazN Sep 17, 2021
08ac456
fixed training and eval methods in language model; added model and to…
LDiazN Sep 20, 2021
8375d2b
added 3.7 to nox file tests
LDiazN Sep 21, 2021
3a102d2
added GPUtil profiling tool
LDiazN Sep 21, 2021
7e40634
added language model experiment
LDiazN Sep 21, 2021
542f336
added function to check if should train base language model
LDiazN Sep 21, 2021
9951fab
properly configures torch in pyproject
LDiazN Sep 22, 2021
12f9130
testing 3.7 versioning
LDiazN Sep 22, 2021
305face
trying to lower gpu memory needed to do an evaluation loop
LDiazN Sep 22, 2021
b048f1b
fixed OOM issue, but in exchange for the compute metrics function
LDiazN Sep 23, 2021
e90de09
initial version of manager reevaluation of language model function
LDiazN Sep 23, 2021
dc63558
updated lockfile
LDiazN Sep 23, 2021
acd8cd5
added tokenizer saving and loading in order to load pretrained tokeni…
LDiazN Sep 23, 2021
777eb0b
finished function to test accuracy of lang model from microscope manager
LDiazN Sep 23, 2021
03639ba
fixed issue with summary report for experiments
LDiazN Sep 24, 2021
600ed52
added sorting and testing to persistency manager
LDiazN Sep 24, 2021
49c9f30
added confirmation dataset creation script
LDiazN Sep 29, 2021
e2a9cbe
added confirmation dataset; renamed training dataset
LDiazN Sep 29, 2021
f319e69
fixed label assigning in irrelevant news
LDiazN Sep 29, 2021
2dc9362
some refactor
LDiazN Sep 29, 2021
2ab44f1
removed useless import
LDiazN Sep 29, 2021
c26ec07
refactor to reduce memory usage
LDiazN Sep 29, 2021
c4897f6
removing test file
LDiazN Sep 29, 2021
2ffb23f
added csv with newer data
LDiazN Sep 29, 2021
f73410d
classifier refactor to add classifier base model name as an optional …
LDiazN Sep 29, 2021
fb53bc8
black reformat
LDiazN Sep 29, 2021
255ec41
removed useless imports
LDiazN Sep 29, 2021
4da4d30
added experiments samples folder
LDiazN Sep 29, 2021
3fc4d27
refactor; removed already-done to-do
LDiazN Sep 30, 2021
a80b3f1
updated comments
LDiazN Sep 30, 2021
c77dbde
added inicial version of metadata class
LDiazN Sep 30, 2021
e407b8d
added metadata to instance member for the manager object; added funct…
LDiazN Sep 30, 2021
188b879
fixed json creation
LDiazN Sep 30, 2021
3791cf6
fixed json loading
LDiazN Sep 30, 2021
21e0b8b
added function to filter by known urls; fixed bug where the crawler w…
LDiazN Oct 1, 2021
1f7c234
added summary and experiments listing
LDiazN Oct 4, 2021
f7d87c2
added bulk classification
LDiazN Oct 5, 2021
7b6fef8
added scrape pending command; fixed issue when saving labels and source
LDiazN Oct 7, 2021
3ae8521
removed useless print
LDiazN Oct 8, 2021
47da342
added limit for how much instances to classify
LDiazN Oct 11, 2021
37b5c83
updated some docstrings; refactor to move classify pending logic to m…
LDiazN Oct 13, 2021
40d5c84
upated docstrings
LDiazN Oct 13, 2021
847c460
black reformat
LDiazN Oct 13, 2021
a6c6906
removed useless imports
LDiazN Oct 13, 2021
6b313ba
workaround process issue
LDiazN Oct 13, 2021
d58034f
updated poetry.lock file
LDiazN Oct 14, 2021
764e9bb
fixed merge conflict
LDiazN Oct 15, 2021
a244e7f
added image diagram; added content to scraper section
LDiazN Oct 15, 2021
4e91e86
added example persistency manager; fixed signature defaults in base p…
LDiazN Oct 18, 2021
981f993
finished persistency manager creation page; started persistency manag…
LDiazN Oct 20, 2021
7a99e46
added sample persistency manager
LDiazN Oct 20, 2021
ba9b4a3
created persistency manager folder; added tests for sample persistenc…
LDiazN Oct 20, 2021
a2a7f38
finished testing section in persistency manager creation example
LDiazN Oct 20, 2021
b781542
finished testing section in persistency manager creation article
LDiazN Oct 20, 2021
47b0a17
Create publish_docs.yml
LDiazN Oct 21, 2021
f2aca68
Update publish_docs.yml
LDiazN Oct 21, 2021
e672e3a
Create logger.py
LDiazN Oct 21, 2021
cb2b966
Create stage.py
LDiazN Oct 21, 2021
fc931d5
Create quality_check.py
LDiazN Oct 21, 2021
8d0aa57
added mkdocs to requirements
LDiazN Oct 21, 2021
e2094f0
Merge branch 'luis/docs' of https://github.com/code-for-venezuela/c4v…
LDiazN Oct 21, 2021
f67aa9d
Update publish_docs.yml
LDiazN Oct 21, 2021
da8c605
changed mkdocs version n requirements
LDiazN Oct 21, 2021
88ec38f
Merge branch 'luis/docs' of https://github.com/code-for-venezuela/c4v…
LDiazN Oct 21, 2021
8a22b3f
upgraded nox
LDiazN Oct 21, 2021
b83e363
moved CI scripts folder
LDiazN Oct 21, 2021
9d5219a
Update publish_docs.yml
LDiazN Oct 21, 2021
b63f9f8
Merge branch 'luis/docs' of https://github.com/code-for-venezuela/c4v…
LDiazN Oct 21, 2021
556d6d4
mkdocs files moved to docs folder
LDiazN Oct 21, 2021
1edc879
moved docs location
LDiazN Oct 21, 2021
0dde0e9
configured pages branch and access token
LDiazN Oct 26, 2021
085c2b0
Update from commit: ${GITHUB_SHA} - configured pages branch and acces…
Oct 26, 2021
43fe997
Revert "Update from commit: ${GITHUB_SHA} - configured pages branch a…
LDiazN Oct 26, 2021
bad5bae
Update from commit: ${GITHUB_SHA} - Revert "Update from commit: ${GIT…
Oct 26, 2021
56143e1
Revert "Update from commit: ${GITHUB_SHA} - Revert "Update from commi…
LDiazN Oct 26, 2021
5eac93b
changed push branch for docs
LDiazN Oct 26, 2021
3f06dfa
Update publish_docs.yml
LDiazN Oct 27, 2021
31b0733
added warning about the persistency manager implementation
LDiazN Oct 27, 2021
6464a3f
Merge branch 'luis/docs' of https://github.com/code-for-venezuela/c4v…
LDiazN Oct 27, 2021
3e9baa2
changed github sha value
LDiazN Oct 27, 2021
ab6a0d0
added db as argument in from default builder function
LDiazN Oct 28, 2021
1cbf93a
Merge branch 'luis/docs' of https://github.com/code-for-venezuela/c4v…
LDiazN Oct 28, 2021
d312def
some corrections; removed builded site
LDiazN Oct 28, 2021
2bb2021
some usability refactors
LDiazN Oct 28, 2021
58b1270
updated tensorflow
LDiazN Oct 28, 2021
2245b71
added more information in the common workflow example
LDiazN Oct 28, 2021
2681fea
moved section of common workflow to the examples section
LDiazN Oct 28, 2021
e5ad59d
removed test code
LDiazN Nov 3, 2021
5f55002
added description of the scraped field
LDiazN Nov 3, 2021
d70a1a2
fixed extra endline typo
LDiazN Nov 3, 2021
8962fd4
fixed lockfile bug again
LDiazN Nov 3, 2021
ec49876
fixed bug using the wrong function when crawling
LDiazN Nov 8, 2021
02aea9a
fixed bug for filtering known urls
LDiazN Nov 8, 2021
0943d9f
updated poetry
LDiazN Nov 9, 2021
6fd84f4
making versions match
LDiazN Nov 9, 2021
fdaa50b
added documentation link to readme
LDiazN Nov 9, 2021
a931fd1
changed files layout to appease python 3.6 import system
LDiazN Nov 11, 2021
dd0414f
changed main branch to trigger a docs publishing
LDiazN Nov 11, 2021
50249e9
fixed wrong import
LDiazN Nov 11, 2021
d322012
updated lockfile
LDiazN Nov 11, 2021
18 changes: 18 additions & 0 deletions .github/workflows/ci/logger.py
@@ -0,0 +1,18 @@
import logging

# create logger
logger = logging.getLogger("microscope_doc_ci_logging")
logger.setLevel(logging.DEBUG)

# create console handler and set level to debug
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)

# create formatter
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")

# add formatter to ch
ch.setFormatter(formatter)

# add ch to logger
logger.addHandler(ch)
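
A minimal sketch of what a logger configured this way emits. It rebuilds the same level, handler and format locally so the block is self-contained; the `build_logger` helper, the `_demo` logger name and the in-memory stream are illustrative assumptions and not part of this PR.

```python
import io
import logging


def build_logger(name: str, stream) -> logging.Logger:
    """Hypothetical helper mirroring the setup in logger.py."""
    log = logging.getLogger(name)
    log.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(stream)
    handler.setLevel(logging.DEBUG)
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    )
    log.addHandler(handler)
    return log


# An in-memory stream stands in for the console so the output can be inspected.
buffer = io.StringIO()
demo_logger = build_logger("microscope_doc_ci_logging_demo", buffer)
demo_logger.info("Sanitizing file docs_en/index.md...")
output = buffer.getvalue()
```

Each record is written as timestamp, logger name, level and message, separated by ` - `, which is the shape the other CI scripts rely on when they import `logger`.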
10 changes: 10 additions & 0 deletions .github/workflows/ci/quality_check.py
@@ -0,0 +1,10 @@
import sys
from logger import logger

mkdocs_build_output = sys.argv[1]
if mkdocs_build_output != str(0):
    logger.error('Error building mkdocs. Warnings were found.')
    raise Exception('Error building mkdocs. Warnings were found.')
else:
    logger.info('No warnings found building mkdocs.')
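
The comparison above can be sketched as a small function, which makes the pass/fail behaviour easy to see. The PR does not show how the script is invoked, so this assumes the caller passes a status string such as the `mkdocs build` exit code as the first argument; `check_build_status` is a hypothetical name.

```python
def check_build_status(status: str) -> bool:
    """Return True when status is exactly "0", raise otherwise.

    Hypothetical function form of the comparison in quality_check.py.
    """
    if status != "0":
        raise RuntimeError("Error building mkdocs. Warnings were found.")
    return True


ok = check_build_status("0")

try:
    check_build_status("1")
    failed = False
except RuntimeError:
    failed = True
```

Because the check is a string comparison, any value other than the literal `"0"` (including `"00"` or an empty string) is treated as a failure.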

73 changes: 73 additions & 0 deletions .github/workflows/ci/stage.py
@@ -0,0 +1,73 @@
'''
This script provides functionality for staging tags in documents.
It removes document sections enclosed between "## stage ##" tag lines (without quotes).
It is intended to run before the documentation build.
'''

import os
from typing import List
from logger import logger

# Only process Markdown files inside these folders under START_PATH
FOLDERS = ('docs_en', 'docs_es')
ACTIVE_FILE_EXT = ('.md',)  # one-element tuple; ('.md') would be a plain string
START_PATH = './docs/docs/'
STAGE_TAG = '## stage ##\n'

def get_stage_tags_positions(list_of_elems, element):
    '''Returns the indexes of all occurrences of the given element in
    list_of_elems'''
    index_pos_list = []
    index_pos = 0
    while True:
        try:
            # Search for the element from index_pos to the end of the list
            index_pos = list_of_elems.index(element, index_pos)
            # Record the position and continue searching after it
            index_pos_list.append(index_pos)
            index_pos += 1
        except ValueError:
            break
    return index_pos_list

def sanitize_text(stage_tag_positions: List[int], lines: List[str]):
    '''Returns the text lines that are not enclosed in ## stage ## tags'''
    stage_tags_pairs = list(zip(stage_tag_positions[::2], stage_tag_positions[1::2]))
    c = 0
    # Delete from the last pair to the first so that earlier indexes stay valid
    for p1, p2 in reversed(stage_tags_pairs):
        del lines[p1:p2+1]
        c += 1
    logger.info(f'Total sanitized entries {c}')
    return lines

def sanitize_file(root, file_name: str):
    file = os.path.join(root, file_name)
    with open(file, 'r') as f:
        lines = f.readlines()
    stage_tags_linear = get_stage_tags_positions(lines, STAGE_TAG)
    logger.info(f'Sanitizing file {file}...')
    if len(stage_tags_linear) == 0:
        logger.info('No staging sections to remove')
        return
    new_text = sanitize_text(stage_tags_linear, lines)
    with open(file, 'w') as f:
        f.writelines(new_text)

def is_markdown_file(filename):
    return filename.endswith(ACTIVE_FILE_EXT)

def process_files(top_tuple):
    for topfolder, subfolder, filesintop in top_tuple:
        # Process files in topfolder first
        for file in filesintop:
            if is_markdown_file(file):
                sanitize_file(topfolder, file)
        # Process subfolders (os.walk also descends on its own, so nested
        # files can be visited more than once)
        for folder in subfolder:
            folder_walk = os.walk(os.path.join(topfolder, folder))
            process_files(folder_walk)

root_folders_walk = [ walk for walk in os.walk(START_PATH) if walk[0].endswith(FOLDERS) ]
process_files(root_folders_walk)
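
A minimal sketch of the tag-pair removal on an in-memory page with two staged sections, so no files are touched. It finds the tag positions up front and deletes the pairs from last to first, which keeps the earlier positions valid after each deletion. `remove_staged` and the sample page are illustrative assumptions, not code from this PR.

```python
from typing import List

STAGE_TAG = "## stage ##\n"


def remove_staged(lines: List[str], tag: str = STAGE_TAG) -> List[str]:
    """Return a copy of lines without the sections enclosed by tag lines."""
    positions = [i for i, line in enumerate(lines) if line == tag]
    # Pair opening and closing tags; an unpaired trailing tag is ignored.
    pairs = list(zip(positions[::2], positions[1::2]))
    result = list(lines)
    # Work backwards so that deleting a section does not shift earlier indexes.
    for start, end in reversed(pairs):
        del result[start:end + 1]
    return result


page = [
    "# Title\n",
    "## stage ##\n",
    "draft A\n",
    "## stage ##\n",
    "published\n",
    "## stage ##\n",
    "draft B\n",
    "## stage ##\n",
    "footer\n",
]
cleaned = remove_staged(page)
```

Here `cleaned` keeps only the title, the published line and the footer, while `page` itself is left unchanged because the function works on a copy.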
56 changes: 56 additions & 0 deletions .github/workflows/publish_docs.yml
@@ -0,0 +1,56 @@
# This workflow will update the online documentation so that it matches the one we have written in mkdocs, it
# will be triggered by pushes to master.
# It will publish everything by pushing the new site to the docs branch (see publish_branch below).

name: c4v-py-docs

on:
  push:
    branches:
      - master # for testing, change to master when this work is ready to PR
jobs:
  deploy:
    runs-on: ubuntu-18.04
    steps:
      - uses: actions/checkout@v2

      - name: Setup Python
        uses: actions/setup-python@v1
        with:
          python-version: '3.8'
          architecture: 'x64'

      - name: Cache dependencies
        uses: actions/cache@v1
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-pip-

      - name: Install dependencies
        run: |
          python3 -m pip install --upgrade pip
          python3 -m pip install -r ./requirements.txt

      - name: Run staging sanitizer
        run: python ./.github/workflows/ci/stage.py

      - run: |
          pushd docs
          mkdocs build
          popd
          # cp ./CNAME ./docs/site/CNAME
          # cp ./.nojekyll ./docs/site/.nojekyll

      - name: Deploy
        uses: peaceiris/actions-gh-pages@v3
        with:
          personal_token: ${{ secrets.DOCS_ACCESS_TOKEN }}
          publish_dir: ./docs/site
          publish_branch: docs
          commit_message: "Update from commit: ${{ github.sha }} - ${{ github.event.head_commit.message }}"
          allow_empty_commit: false
          user_name: devops-c4v
          user_email: [email protected]

2 changes: 1 addition & 1 deletion README.md
@@ -7,7 +7,7 @@
> Solving Venezuela pressing matters one commmit at a time

`c4v-py` is a library used to address Venezuela's pressing issues
-using computer and data science.
+using computer and data science. Check the [online documentation](https://code-for-venezuela.github.io/c4v-py/)

- [Installation](#installation)
- [Development](#development)