3.0 (#128)
* Partial update

* Bugfix

* API update

* Bugfixing and API

* Bugfix

* Fix long words OOM by skipping sentences

* bugfixing and api update

* Added language flavour

* Added early stopping condition

* Corrected naming

* Corrected permissions

* Bugfix

* Added GPU support at runtime

* Wrong config package

* Refactoring

* refactoring

* add lightning to dependencies

* Dummy test

* Dummy test

* Tweak

* Tweak

* Update test

* Test

* Finished loading for UD CoNLL-U format

* Working on tagger

* Work on tagger

* tagger training

* tagger training

* tagger training

* Sync

* Sync

* Sync

* Sync

* Tagger working

* Better weight for aux loss

* Better weight for aux loss

* Added save and printing for tagger and shared options class

* Multilanguage evaluation

* Saving multiple models

* Updated ignore list

* Added XLM-Roberta support

* Using custom ro model

* Score update

* Bugfixing

* Code refactor

* Refactor

* Added option to load external config

* Added option to select LM-model from CLI or config

* added option to overwrite config lm from CLI

* Bugfix

* Working on parser

* Sync work on parser

* Parser working

* Removed load limit

* Bugfix in evaluation

* Added bi-affine attention

* Added experimental ChuLiuEdmonds tree decoding

* Better config for parser and bugfix

* Added residuals to tagging

* Model update

* Switched to AdamW optimizer

* Working on tokenizer

* Working on tokenizer

* Training working - validation to do

* Bugfix in language id

* Working on tokenization validation

* Tokenizer working

* YAML update

* Bug in LMHelper

* Tagger is working

* Tokenizer is working

* bfix

* bfix

* Bugfix for bugfix :)

* Sync

* Tokenizer working

* Tagger working

* Trainer updates

* Trainer process now working

* Added .DS_Store

* Added datasets for Compound Word Expander and Lemmatizer

* Added collate function for lemma+compound

* Added training and validation step

* Updated config for Lemmatizer

* Minor fixes

* Removed duplicate entries from lemma and cwe

* Added training support for lemmatizer

* Removed debug directives

* Lemmatizer in testing phase

* removed unused line

* Bugfix in Lemma dataset

* Corrected validation issue with gs labels being sent to the forward method and removed loss computation during testing

* Lemmatizer training done

* Compound word expander ready

* Sync

* Added support for FastText, Transformers and Languasito LM models

* Added multi-lm support for tokenizer

* Added support for multiword tokens

* Sync

* Bugfix in evaluation

* Added Languasito as a subpackage

* Added path to local Languasito

* Bugfixing all around

* Removed debug printing

* Bugfix for no-space languages that actually contain spaces :)

* Bugfix for no-space languages that actually contain spaces :)

* Fixed GPU support

* Biaffine transform for LAS and relative head location (RHL) for UAS

* Bugfix

* Tweaks

* moved rhl to lower layer

* Added configurable option for RHL

* Safety net for spaces in languages that should use no spaces

* Better defaults

* Sync

* Cleanup parser

* Bilinear xpos and attrs

* Added Biaffine module from Stanza

* Tagger with reduced number of parameters:

* Parser with conditional attrs

* Working on tokenizer runtime

* Tokenizer process 90% done

* Added runtime for parser, tokenizer and tagger

* Added quick test for runtime

* Test for e2e

* Added support for multiple word embeddings at the same time

* Bugfix

* Added multiple word representations for tokenizer

* moved mask_concat to utils.py

* Added XPOS prediction to pipeline

* Bugfix in tokenizer shifted word embeddings

* Using Languasito tokenizer for HF tokenization

* Bugfix

* Bugfixing

* Bugfixing

* Bugfix

* Runtime fixing

* Sync

* Added spa for FT and Languasito

* Added spa for FT and Languasito

* Minor tweaks

* Added configuration for RNN layers

* Bugfix for spa

* HF runtime fix

* Mixed test fasttext+transformer

* Added word reconstruction and MHA

* Sync

* Bugfix

* bugfix

* Added masked attention

* Sync

* Added test for runtime

* Bugfix in mask values

* Updated test

* Added full mask dropout

* Added resume option

* Removed useless printouts

* Removed useless printouts

* Switched to eval at runtime

* multiprocessing added

* Added full mask dropout for word decoder

* Bugfix

* Residual

* Added lexical-contextual cosine loss

* Removed full mask dropout from WordDecoder

* Bugfix

* Training script generation update

* Added residual

* Updated languasito to pickle tokenized lines

* Updated languasito to pickle tokenized lines

* Updated languasito to pickle tokenized lines

* Not training for seq len > max_seq_len

* Added seq limits for collates

* Passing seq limits from collate to tokenizer

* Skipping complex parsing

* Working on word decomposer

* Model update

* Sync

* Bugfix

* Bugfix

* Bugfix

* Using all reprs

* Dropped immediate context

* Multi train script added

* Changed gpu parameter type to string, for multiple gpus int failed

* Updated pytorch_lightning callback method to work with newer version

* Updated pytorch_lightning callback method to work with newer version

* Transparently pass PL args from the command line; skip over empty compound word datasets

* Fix typo

* Refactoring and on the way to working API

* API load working

* Partial __call__ working

* Partial __call__ working

* Added partly working api and refactored everything back to cube/. Compound not working yet and tokenizer needs retraining.

* api is working

* Fixing api

* Updated readme

* Update Readme to include flavours

* Device support

* api update

* Updated package

* Tweak + results

* Clarification

* Test update

* Update

* Sync

* Update README

* Bugfixing

* Bugfix and api update

* Fixed compound

* Evaluation update

* Bugfix

* Package update

* Bugfix for large sentences

* Pip package update

* Corrected spanish evaluation

* Package version update

* Fixed tokenization issues on transformers

* Removed pinned memory

* Bugfix for GPU tensors

* Update package version

* Automatically detecting hidden state size

* Automatically detecting hidden state size

* Automatically detecting hidden state size

* Sync

* Evaluation update

* Package update

* Bugfix

* Bugfixing

* Package version update

* Bugfix

* Package version update

* Update evaluation for Italian

* tentative support torchtext>=0.9.0 (#127)

as mentioned in Lightning-AI/pytorch-lightning#6211 and #100

* Update package dependencies

Co-authored-by: Stefan Dumitrescu <[email protected]>
Co-authored-by: dumitrescustefan <[email protected]>
Co-authored-by: Tiberiu Boros <[email protected]>
Co-authored-by: Tiberiu Boros <[email protected]>
Co-authored-by: Koichi Yasuoka <[email protected]>
6 people authored Aug 27, 2021
1 parent a16373a commit c759633
Showing 633 changed files with 27,805 additions and 5,676 deletions.
68 changes: 13 additions & 55 deletions .circleci/config.yml
@@ -1,61 +1,19 @@
-version: 2
+version: 2.1
+
+orbs:
+  python: circleci/[email protected]
 
 jobs:
-  test_api_and_main_and_upload:
-    docker:
-      - image: circleci/python
+  build-and-test:
+    executor: python/default
     steps:
       - checkout
-      - run:
-          name: init .pypirc
-          command: |
-            echo -e "[pypi]" >> ~/.pypirc
-      - run:
-          name: install requirements
-          command: |
-            sudo apt-get install -y libblas3 liblapack3
-            sudo apt-get install -y liblapack-dev libblas-dev
-            cd /home/circleci/project/
-            pip3 install --user -r requirements.txt
-      - run:
-          name: test main
-          command: |
-            cd /home/circleci/project/
-            python3 tests/main_tests.py
-      - run:
-          name: test api
-          command: |
-            cd /home/circleci/project/
-            python3 tests/api_tests.py
-      - run:
-          name: create packages
-          command: |
-            python3 setup.py sdist
-            python3 setup.py bdist_wheel
-      - run:
-          name: upload to pypi
-          command: |
-            if [[ "$PYPI_USERNAME" == "" ]]; then
-              echo "Skip upload"
-              exit 0
-            fi
-            python3 -m pip install --user jq
-            if [[ "$CIRCLE_BRANCH" == "master" ]]; then
-              PYPI="pypi.org"
-            else
-              PYPI="test.pypi.org"
-            fi
-            LATEST_VERSION="$(curl -s https://$PYPI/pypi/nlpcube/json | jq -r '.info.version')"
-            THIS_VERSION=`python3 <<< "import pkg_resources;print(pkg_resources.require('nlpcube')[0].version)"`
-            if [[ $THIS_VERSION != $LATEST_VERSION ]]; then
-              echo "\n\nthis: $THIS_VERSION - latest: $LATEST_VERSION => releasing to $PYPI\n\n"
-              python3 -m pip install --user --upgrade twine
-              python3 -m twine upload --repository-url https://$PYPI/legacy/ dist/* -u $PYPI_USERNAME -p $PYPI_PASSWORD || echo "Package already exists"
-            else
-              echo "this: $THIS_VERSION = latest: $LATEST_VERSION => skip release"
-            fi
+      - python/load-cache
+      - python/install-deps
+      - python/save-cache
+      - run: echo "done"
 
 workflows:
-  version: 2
-  test_api_and_main_and_upload:
+  main:
     jobs:
-      - test_api_and_main_and_upload
+      - build-and-test
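The removed upload step gated PyPI releases on a version comparison: it fetched the latest published version from the PyPI JSON API and uploaded only when the local package version differed. As a minimal sketch (hypothetical helper name, stdlib only), the gate reduces to:

```python
def should_release(this_version: str, latest_version: str) -> bool:
    # Release only when the local package version differs from the
    # version already published on PyPI (mirrors the removed shell gate).
    return this_version != latest_version

# In CI, this_version came from pkg_resources metadata and
# latest_version from https://pypi.org/pypi/nlpcube/json via jq.
print(should_release("3.0.0", "2.0.8"))  # True -> upload with twine
print(should_release("3.0.0", "3.0.0"))  # False -> skip release
```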
16 changes: 15 additions & 1 deletion .gitignore
@@ -1,3 +1,15 @@
+.DS_Store
+Languasito/data/
+*.txt
+lightning_logs
+*.gz
+*.encodings
+*.npy
+data/*
+nlp-cube-models/*
+corpus/
+models/
+scripts/packer
 *.pyc
 build/
 dist/
@@ -11,12 +23,14 @@ cube/venv/*
 .idea/*
 venv/*
 cube/*.py
+*.json
 
-models/
 scratch/
 tests/scratch/*
+scripts/*.json
+scripts/*.conllu
 scripts/*.md
 scripts/wikiextractor.py
 
 # Jupyter notebooks
 notebooks/.ipynb_checkpoints/*
8 changes: 8 additions & 0 deletions Languasito/.idea/.gitignore

7 changes: 2 additions & 5 deletions cube/.idea/cube.iml → Languasito/.idea/Languasito.iml

47 changes: 47 additions & 0 deletions Languasito/.idea/inspectionProfiles/Project_Default.xml

6 changes: 6 additions & 0 deletions Languasito/.idea/inspectionProfiles/profiles_settings.xml

2 changes: 1 addition & 1 deletion cube/.idea/misc.xml → Languasito/.idea/misc.xml

2 changes: 1 addition & 1 deletion cube/.idea/modules.xml → Languasito/.idea/modules.xml

6 changes: 6 additions & 0 deletions Languasito/.idea/other.xml

File renamed without changes.
63 changes: 63 additions & 0 deletions Languasito/languasito/api.py
@@ -0,0 +1,63 @@
import sys
import torch
from typing import *

sys.path.append('')

from languasito.model import Languasito
from languasito.utils import LanguasitoCollate
from languasito.utils import Encodings


class LanguasitoAPI:

    def __init__(self, languasito: Languasito, encodings: Encodings):
        self._languasito = languasito
        self._languasito.eval()
        self._encodings = encodings
        self._collate = LanguasitoCollate(encodings, live=True)
        self._device = 'cpu'

    def to(self, device: str):
        self._languasito.to(device)
        self._device = device

    def __call__(self, batch):
        with torch.no_grad():
            x = self._collate.collate_fn(batch)
            for key in x:
                if isinstance(x[key], torch.Tensor):
                    x[key] = x[key].to(self._device)
            rez = self._languasito(x)
            emb = []
            pred_emb = rez['emb'].detach().cpu().numpy()
            for ii in range(len(batch)):
                c_emb = []
                for jj in range(len(batch[ii])):
                    c_emb.append(pred_emb[ii, jj])
                emb.append(c_emb)
            return emb

    @staticmethod
    def load(model_name: str):
        from pathlib import Path
        home = str(Path.home())
        filename = '{0}/.languasito/{1}'.format(home, model_name)
        import os
        if os.path.exists(filename + '.encodings'):
            return LanguasitoAPI.load_local(filename)
        else:
            print("UserWarning: Model not found and automatic downloading is not yet supported")
            return None

    @staticmethod
    def load_local(model_name: str):
        enc = Encodings()
        enc.load('{0}.encodings'.format(model_name))
        model = Languasito(enc)
        tmp = torch.load('{0}.best'.format(model_name), map_location='cpu')
        # model.load(tmp['state_dict'])
        model.load_state_dict(tmp['state_dict'])
        model.eval()
        api = LanguasitoAPI(model, enc)
        return api
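The `__call__` method above mostly unpads the model's `(batch, max_len, dim)` embedding tensor back into per-sentence lists of token vectors. A self-contained sketch of that unpadding step with dummy data (no model required; names are illustrative, not part of the package):

```python
import numpy as np

def unpad_embeddings(pred_emb: np.ndarray, batch):
    # pred_emb: (batch_size, max_len, dim) padded embedding matrix
    # batch: list of tokenized sentences; keep one vector per real token
    emb = []
    for ii in range(len(batch)):
        emb.append([pred_emb[ii, jj] for jj in range(len(batch[ii]))])
    return emb

batch = [['Hello', 'world', '!'], ['Hi']]
pred = np.zeros((2, 3, 8))  # padded to the longest sentence
print([len(s) for s in unpad_embeddings(pred, batch)])  # [3, 1]
```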