3.0 (#128)
* Partial update

* Bugfix

* API update

* Bugfixing and API

* Bugfix

* Fix long words OOM by skipping sentences

* bugfixing and api update

* Added language flavour

* Added early stopping condition

* Corrected naming

* Corrected permissions

* Bugfix

* Added GPU support at runtime

* Wrong config package

* Refactoring

* refactoring

* add lightning to dependencies

* Dummy test

* Dummy test

* Tweak

* Tweak

* Update test

* Test

* Finished loading for UD CoNLL-U format

* Working on tagger

* Work on tagger

* tagger training

* tagger training

* tagger training

* Sync

* Sync

* Sync

* Sync

* Tagger working

* Better weight for aux loss

* Better weight for aux loss

* Added save and printing for tagger and shared options class

* Multilanguage evaluation

* Saving multiple models

* Updated ignore list

* Added XLM-Roberta support

* Using custom ro model

* Score update

* Bugfixing

* Code refactor

* Refactor

* Added option to load external config

* Added option to select LM-model from CLI or config

* added option to overwrite config lm from CLI

* Bugfix

* Working on parser

* Sync work on parser

* Parser working

* Removed load limit

* Bugfix in evaluation

* Added bi-affine attention

* Added experimental ChuLiuEdmonds tree decoding

* Better config for parser and bugfix

* Added residuals to tagging

* Model update

* Switched to AdamW optimizer

* Working on tokenizer

* Working on tokenizer

* Training working - validation to do

* Bugfix in language id

* Working on tokenization validation

* Tokenizer working

* YAML update

* Bug in LMHelper

* Tagger is working

* Tokenizer is working

* bfix

* bfix

* Bugfix for bugfix :)

* Sync

* Tokenizer working

* Tagger working

* Trainer updates

* Trainer process now working

* Added .DS_Store

* Added datasets for Compound Word Expander and Lemmatizer

* Added collate function for lemma+compound

* Added training and validation step

* Updated config for Lemmatizer

* Minor fixes

* Removed duplicate entries from lemma and cwe

* Added training support for lemmatizer

* Removed debug directives

* Lemmatizer in testing phase

* removed unused line

* Bugfix in Lemma dataset

* Corrected validation issue with gs labels being sent to the forward method and removed loss computation during testing

* Lemmatizer training done

* Compound word expander ready

* Sync

* Added support for FastText, Transformers and Languasito LM models

* Added multi-lm support for tokenizer

* Added support for multiword tokens

* Sync

* Bugfix in evaluation

* Added Languasito as a subpackage

* Added path to local Languasito

* Bugfixing all around

* Removed debug printing

* Bugfix for no-space languages that actually contain spaces :)

* Bugfix for no-space languages that actually contain spaces :)

* Fixed GPU support

* Biaffine transform for LAS and relative head location (RHL) for UAS

* Bugfix

* Tweaks

* moved rhl to lower layer

* Added configurable option for RHL

* Safety net for spaces in languages that should use no spaces

* Better defaults

* Sync

* Cleanup parser

* Bilinear xpos and attrs

* Added Biaffine module from Stanza

* Tagger with reduced number of parameters:

* Parser with conditional attrs

* Working on tokenizer runtime

* Tokenizer process 90% done

* Added runtime for parser, tokenizer and tagger

* Added quick test for runtime

* Test for e2e

* Added support for multiple word embeddings at the same time

* Bugfix

* Added multiple word representations for tokenizer

* moved mask_concat to utils.py

* Added XPOS prediction to pipeline

* Bugfix in tokenizer shifted word embeddings

* Using Languasito tokenizer for HF tokenization

* Bugfix

* Bugfixing

* Bugfixing

* Bugfix

* Runtime fixing

* Sync

* Added spa for FT and Languasito

* Added spa for FT and Languasito

* Minor tweaks

* Added configuration for RNN layers

* Bugfix for spa

* HF runtime fix

* Mixed test fasttext+transformer

* Added word reconstruction and MHA

* Sync

* Bugfix

* bugfix

* Added masked attention

* Sync

* Added test for runtime

* Bugfix in mask values

* Updated test

* Added full mask dropout

* Added resume option

* Removed useless printouts

* Removed useless printouts

* Switched to eval at runtime

* multiprocessing added

* Added full mask dropout for word decoder

* Bugfix

* Residual

* Added lexical-contextual cosine loss

* Removed full mask dropout from WordDecoder

* Bugfix

* Training script generation update

* Added residual

* Updated languasito to pickle tokenized lines

* Updated languasito to pickle tokenized lines

* Updated languasito to pickle tokenized lines

* Not training for seq len > max_seq_len

* Added seq limits for collates

* Passing seq limits from collate to tokenizer

* Skipping complex parsing

* Working on word decomposer

* Model update

* Sync

* Bugfix

* Bugfix

* Bugfix

* Using all reprs

* Dropped immediate context

* Multi train script added

* Changed gpu parameter type to string, for multiple gpus int failed

* Updated pytorch_lightning callback method to work with newer version

* Updated pytorch_lightning callback method to work with newer version

* Transparently pass PL args from the command line; skip over empty compound word datasets

* Fix typo

* Refactoring and on the way to working API

* API load working

* Partial __call__ working

* Partial __call__ working

* Added partly working api and refactored everything back to cube/. Compound not working yet and tokenizer needs retraining.

* api is working

* Fixing api

* Updated readme

* Update Readme to include flavours

* Device support

* api update

* Updated package

* Tweak + results

* Clarification

* Test update

* Update

* Sync

* Update README

* Bugfixing

* Bugfix and api update

* Fixed compound

* Evaluation update

* Bugfix

* Package update

* Bugfix for large sentences

* Pip package update

* Corrected spanish evaluation

* Package version update

* Fixed tokenization issues on transformers

* Removed pinned memory

* Bugfix for GPU tensors

* Update package version

* Automatically detecting hidden state size

* Automatically detecting hidden state size

* Automatically detecting hidden state size

* Sync

* Evaluation update

* Package update

* Bugfix

* Bugfixing

* Package version update

* Bugfix

* Package version update

* Update evaluation for Italian

* tentative support torchtext>=0.9.0 (#127)

as mentioned in Lightning-AI/pytorch-lightning#6211 and #100

* Update package dependencies

Co-authored-by: Stefan Dumitrescu <[email protected]>
Co-authored-by: dumitrescustefan <[email protected]>
Co-authored-by: Tiberiu Boros <[email protected]>
Co-authored-by: Tiberiu Boros <[email protected]>
Co-authored-by: Koichi Yasuoka <[email protected]>
6 people authored Aug 27, 2021
1 parent a16373a commit c759633
Showing 633 changed files with 27,805 additions and 5,676 deletions.
68 changes: 13 additions & 55 deletions .circleci/config.yml
@@ -1,61 +1,19 @@
-version: 2
+version: 2.1
+
+orbs:
+  python: circleci/[email protected]
 
 jobs:
-  test_api_and_main_and_upload:
-    docker:
-      - image: circleci/python
+  build-and-test:
+    executor: python/default
     steps:
       - checkout
-      - run:
-          name: init .pypirc
-          command: |
-            echo -e "[pypi]" >> ~/.pypirc
-      - run:
-          name: install requirements
-          command: |
-            sudo apt-get install -y libblas3 liblapack3
-            sudo apt-get install -y liblapack-dev libblas-dev
-            cd /home/circleci/project/
-            pip3 install --user -r requirements.txt
-      - run:
-          name: test main
-          command: |
-            cd /home/circleci/project/
-            python3 tests/main_tests.py
-      - run:
-          name: test api
-          command: |
-            cd /home/circleci/project/
-            python3 tests/api_tests.py
-      - run:
-          name: create packages
-          command: |
-            python3 setup.py sdist
-            python3 setup.py bdist_wheel
-      - run:
-          name: upload to pypi
-          command: |
-            if [[ "$PYPI_USERNAME" == "" ]]; then
-              echo "Skip upload"
-              exit 0
-            fi
-            python3 -m pip install --user jq
-            if [[ "$CIRCLE_BRANCH" == "master" ]]; then
-              PYPI="pypi.org"
-            else
-              PYPI="test.pypi.org"
-            fi
-            LATEST_VERSION="$(curl -s https://$PYPI/pypi/nlpcube/json | jq -r '.info.version')"
-            THIS_VERSION=`python3 <<< "import pkg_resources;print(pkg_resources.require('nlpcube')[0].version)"`
-            if [[ $THIS_VERSION != $LATEST_VERSION ]]; then
-              echo "\n\nthis: $THIS_VERSION - latest: $LATEST_VERSION => releasing to $PYPI\n\n"
-              python3 -m pip install --user --upgrade twine
-              python3 -m twine upload --repository-url https://$PYPI/legacy/ dist/* -u $PYPI_USERNAME -p $PYPI_PASSWORD || echo "Package already exists"
-            else
-              echo "this: $THIS_VERSION = latest: $LATEST_VERSION => skip release"
-            fi
+      - python/load-cache
+      - python/install-deps
+      - python/save-cache
+      - run: echo "done"
 
 workflows:
-  version: 2
-  test_api_and_main_and_upload:
+  main:
     jobs:
-      - test_api_and_main_and_upload
+      - build-and-test
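The removed upload step gated PyPI releases on a version comparison: it fetched the latest published version from the PyPI JSON API and uploaded only when the local package version differed. As a minimal sketch (hypothetical helper name, stdlib only), the gate reduces to:

```python
def should_release(this_version: str, latest_version: str) -> bool:
    # Release only when the local package version differs from the
    # version already published on PyPI (mirrors the removed shell gate).
    return this_version != latest_version

# In CI, this_version came from pkg_resources metadata and
# latest_version from https://pypi.org/pypi/nlpcube/json via jq.
print(should_release("3.0.0", "2.0.8"))  # True -> upload with twine
print(should_release("3.0.0", "3.0.0"))  # False -> skip release
```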
16 changes: 15 additions & 1 deletion .gitignore
@@ -1,3 +1,15 @@
+.DS_Store
+Languasito/data/
+*.txt
+lightning_logs
+*.gz
+*.encodings
+*.npy
+data/*
+nlp-cube-models/*
+corpus/
+models/
+scripts/packer
 *.pyc
 build/
 dist/
@@ -11,12 +23,14 @@ cube/venv/*
 .idea/*
 venv/*
 cube/*.py
+*.json
 
-models/
 scratch/
 tests/scratch/*
+scripts/*.json
+scripts/*.conllu
 scripts/*.md
 scripts/wikiextractor.py
 
 # Jupyter notebooks
 notebooks/.ipynb_checkpoints/*
8 changes: 8 additions & 0 deletions Languasito/.idea/.gitignore

7 changes: 2 additions & 5 deletions cube/.idea/cube.iml → Languasito/.idea/Languasito.iml

47 changes: 47 additions & 0 deletions Languasito/.idea/inspectionProfiles/Project_Default.xml

6 changes: 6 additions & 0 deletions Languasito/.idea/inspectionProfiles/profiles_settings.xml

2 changes: 1 addition & 1 deletion cube/.idea/misc.xml → Languasito/.idea/misc.xml

2 changes: 1 addition & 1 deletion cube/.idea/modules.xml → Languasito/.idea/modules.xml

6 changes: 6 additions & 0 deletions Languasito/.idea/other.xml

File renamed without changes.
63 changes: 63 additions & 0 deletions Languasito/languasito/api.py
@@ -0,0 +1,63 @@
import sys
import torch
from typing import *

sys.path.append('')

from languasito.model import Languasito
from languasito.utils import LanguasitoCollate
from languasito.utils import Encodings


class LanguasitoAPI:

    def __init__(self, languasito: Languasito, encodings: Encodings):
        self._languasito = languasito
        self._languasito.eval()
        self._encodings = encodings
        self._collate = LanguasitoCollate(encodings, live=True)
        self._device = 'cpu'

    def to(self, device: str):
        self._languasito.to(device)
        self._device = device

    def __call__(self, batch):
        with torch.no_grad():
            x = self._collate.collate_fn(batch)
            for key in x:
                if isinstance(x[key], torch.Tensor):
                    x[key] = x[key].to(self._device)
            rez = self._languasito(x)
            emb = []
            pred_emb = rez['emb'].detach().cpu().numpy()
            for ii in range(len(batch)):
                c_emb = []
                for jj in range(len(batch[ii])):
                    c_emb.append(pred_emb[ii, jj])
                emb.append(c_emb)
            return emb

    @staticmethod
    def load(model_name: str):
        from pathlib import Path
        home = str(Path.home())
        filename = '{0}/.languasito/{1}'.format(home, model_name)
        import os
        if os.path.exists(filename + '.encodings'):
            return LanguasitoAPI.load_local(filename)
        else:
            print("UserWarning: Model not found and automatic downloading is not yet supported")
            return None

    @staticmethod
    def load_local(model_name: str):
        enc = Encodings()
        enc.load('{0}.encodings'.format(model_name))
        model = Languasito(enc)
        tmp = torch.load('{0}.best'.format(model_name), map_location='cpu')
        # model.load(tmp['state_dict'])
        model.load_state_dict(tmp['state_dict'])
        model.eval()
        api = LanguasitoAPI(model, enc)
        return api
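The `__call__` method above mostly unpads the model's `(batch, max_len, dim)` embedding tensor back into per-sentence lists of token vectors. A self-contained sketch of that unpadding step with dummy data (no model required; names are illustrative, not part of the package):

```python
import numpy as np

def unpad_embeddings(pred_emb: np.ndarray, batch):
    # pred_emb: (batch_size, max_len, dim) padded embedding matrix
    # batch: list of tokenized sentences; keep one vector per real token
    emb = []
    for ii in range(len(batch)):
        emb.append([pred_emb[ii, jj] for jj in range(len(batch[ii]))])
    return emb

batch = [['Hello', 'world', '!'], ['Hi']]
pred = np.zeros((2, 3, 8))  # padded to the longest sentence
print([len(s) for s in unpad_embeddings(pred, batch)])  # [3, 1]
```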