QuickUMLS v.1.4 (#55)

soldni · web-flow · commit bad4c50ae654 · 2020-05-13T08:52:13.000-07:00
Release Notes:
- [NEW] Added support for [unqlite](https://github.com/coleifer/unqlite-python) as an alternative to leveldb for storage of CUIs and Semantic Types. This allows creating multiple QuickUMLS matchers from the same installation.
- [NEW] Added support for conversion of all-uppercase words ([#48](#48), thank you @sandertan!).
- [NEW] Automatically downloads SpaCy data for the selected language if missing.
- [FIX] Mitigated [#52](#52).
diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-[**NEW: v.1.3 is pip-ready!**](https://giphy.com/embed/BlVnrxJgTGsUw) You can now install QuickUMLS through a simple `pip install quickumls`.
+[**NEW: v.1.4 supports starting multiple QuickUMLS matchers concurrently!**](https://giphy.com/embed/BlVnrxJgTGsUw) I've finally added support for [unqlite](https://github.com/coleifer/unqlite-python) as an alternative to leveldb for storage of CUIs and Semantic Types (see [here](https://github.com/Georgetown-IR-Lab/QuickUMLS/wiki/Migration-QuickUMLS-1.3-to-1.4) for more details). unqlite-backed QuickUMLS installations support multiple matchers running at the same time. Besides better multi-processing support, unqlite should also have better support for unicode.
 
 # QuickUMLS
 
@@ -11,12 +11,12 @@ This project should be compatible with Python 3 (Python 2 is [no longer supporte
 ## Installation
 
 1. **Obtain a UMLS installation** This tool requires you to have a valid UMLS installation on disk. To install UMLS, you must first obtain a [license](https://uts.nlm.nih.gov/license.html) from the National Library of Medicine; then you should download all UMLS files from [this page](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html); finally, you can install UMLS using the [MetamorphoSys](https://www.nlm.nih.gov/pubs/factsheets/umlsmetamorph.html) tool as [explained in this guide](https://www.nlm.nih.gov/research/umls/implementation_resources/metamorphosys/help.html).  The installation can be removed once the system has been initialized.
-2. **Install QuickUMLS**: You can do so by either running `pip install quickumls` or `python setup.py install`. On macOS, using anaconda is **strongly recommended**<sup>†</sup>. 
-3. **Obrain a SpaCy corpus**: After you install QuickUMLS and its dependencies, you should be able to do so by running `python -m spacy download en`.
+2. **Install QuickUMLS**: You can do so by either running `pip install quickumls` or `python setup.py install`. On macOS, using anaconda is **strongly recommended**<sup>†</sup>.
 3. **Create a QuickUMLS installation** Initialize the system by running `python -m quickumls.install <umls_installation_path> <destination_path>`, where `<umls_installation_path>` is where the installation files are (in particular, we need `MRCONSO.RRF` and `MRSTY.RRF`) and `<destination_path>` is the directory where the QuickUmls data files should be installed. This process will take between 5 and 30 minutes depending how fast the CPU and the drive where UMLS and QuickUMLS files are stored are (on a system with a Intel i7 6700K CPU and a 7200 RPM hard drive, initialization takes 8.5 minutes). `python -m quickumls.install` supports the following optional arguments:
     - `-L` / `--lowercase`: if used, all concept terms are folded to lowercase before being processed. This option typically increases recall, but it might reduce precision;
     - `-U` / `--normalize-unicode`: if used, expressions with non-ASCII characters are converted to the closest combination of ASCII characters.
     - `-E` / `--language`: Specify the language to consider for UMLS concepts; by default, English is used. For a complete list of languages, please see [this table provided by NLM](https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html#LAT).
+    - `-d` / `--database-backend`: Specify which database backend to use for QuickUMLS. The two options are `leveldb` and `unqlite`. The latter supports multi-process reading and has better unicode compatibility, and it is used as the default for all new 1.4 installations; the former is still used as the default when instantiating a QuickUMLS client on an installation that does not record its backend. More info about the differences between the two databases, along with migration instructions, is available [here](https://github.com/Georgetown-IR-Lab/QuickUMLS/wiki/Migration-QuickUMLS-1.3-to-1.4).
 
 
 **†**: If the installation fails on macOS when using Anaconda, install `leveldb` first by running `conda install -c conda-forge python-leveldb`.
@@ -50,6 +50,8 @@ matcher.match(text, best_match=True, ignore_syntax=False)
 
 Set `best_match` to `False` if you want to return overlapping candidates, `ignore_syntax` to `True` to disable all heuristics introduced in (Soldaini and Goharian, 2016).
 
+If the matcher throws a warning during initialization, read [this page](https://github.com/Georgetown-IR-Lab/QuickUMLS/wiki/Migration-QuickUMLS-1.3-to-1.4) to learn why and how to stop it from doing so.
+
 
 ## Server / Client Support
 
diff --git a/quickumls/about.py b/quickumls/about.py
@@ -4,9 +4,9 @@
 # https://github.com/explosion/spaCy/blob/master/spacy/about.py
 
 __title__ = 'quickumls'
-__version__ = '1.3.0r4'
+__version__ = '1.4.0r1'
 __author__ = 'Luca Soldaini'
 __email__ = 'luca@ir.cs.georgetown.edu'
 __license__ = 'MIT'
 __uri__ = "https://github.com/Georgetown-IR-Lab/QuickUMLS"
-__copyright__ = '2014-2019, Georgetown University Information Retrieval Lab'
+__copyright__ = '2014-2020, Georgetown University Information Retrieval Lab'
diff --git a/quickumls/core.py b/quickumls/core.py
@@ -126,6 +126,19 @@ def __init__(
             )
             spacy_lang = constants.SPACY_LANGUAGE_MAP[self.language_flag]
 
+        database_backend_fp = os.path.join(quickumls_fp, 'database_backend.flag')
+        if os.path.exists(database_backend_fp):
+            with open(database_backend_fp) as f:
+                self._database_backend = f.read().strip()
+        else:
+            print('[WARNING] This installation was created with QuickUMLS v.1.3 or earlier, '
+                  'which does not support multiple database backends. For now, I\'ll '
+                  'assume that leveldb was used as default; this implicit assumption will '
+                  'change in future versions of QuickUMLS. More info here: '
+                  'https://github.com/Georgetown-IR-Lab/QuickUMLS/wiki/Migration-QuickUMLS-1.3-to-1.4',
+                  file=sys.stderr)
+            self._database_backend = 'leveldb'
+
         # domain specific stopwords
         self._stopwords = self._stopwords.union(constants.DOMAIN_SPECIFIC_STOPWORDS)
 
@@ -149,7 +162,9 @@ def __init__(
         self.ss_db = toolbox.SimstringDBReader(
             simstring_fp, similarity_name, threshold
         )
-        self.cuisem_db = toolbox.CuiSemTypesDB(cuisem_fp)
+        self.cuisem_db = toolbox.CuiSemTypesDB(
+            cuisem_fp, database_backend=self._database_backend
+        )
 
     def get_info(self):
         """Computes a summary of the matcher options.
diff --git a/quickumls/install.py b/quickumls/install.py
@@ -1,24 +1,28 @@
 from __future__ import unicode_literals, division, print_function
 
 # built in modules
+import argparse
+import codecs
 import os
+from six.moves import input
+import shutil
 import sys
 import time
-import codecs
-import shutil
-import argparse
-from six.moves import input
-
-# project modules
-from .toolbox import countlines, CuiSemTypesDB, SimstringDBWriter, mkdir
-from .constants import HEADERS_MRCONSO, HEADERS_MRSTY, LANGUAGES
-
 try:
     from unidecode import unidecode
 except ImportError:
     pass
 
 
+# third-party dependencies
+import spacy
+
+
+# project modules
+from .toolbox import countlines, CuiSemTypesDB, SimstringDBWriter, mkdir
+from .constants import HEADERS_MRCONSO, HEADERS_MRSTY, LANGUAGES, SPACY_LANGUAGE_MAP
+
+
 def get_semantic_types(path, headers):
     sem_types = {}
     with codecs.open(path, encoding='utf-8') as f:
@@ -98,13 +102,13 @@ def extract_from_mrconso(
     print(status)
 
 
-def parse_and_encode_ngrams(extracted_it, simstring_dir, cuisty_dir):
+def parse_and_encode_ngrams(extracted_it, simstring_dir, cuisty_dir, database_backend):
     # Create destination directories for the two databases
     mkdir(simstring_dir)
     mkdir(cuisty_dir)
 
     ss_db = SimstringDBWriter(simstring_dir)
-    cuisty_db = CuiSemTypesDB(cuisty_dir)
+    cuisty_db = CuiSemTypesDB(cuisty_dir, database_backend=database_backend)
 
     simstring_terms = set()
 
@@ -116,6 +120,20 @@ def parse_and_encode_ngrams(extracted_it, simstring_dir, cuisty_dir):
         cuisty_db.insert(term, cui, stys, preferred)
 
 
+def install_spacy(lang):
+    """Tries to create a spacy object; if it fails, downloads the dataset"""
+
+    print(f'Determining if SpaCy for language "{lang}" is installed...')
+
+    if lang in SPACY_LANGUAGE_MAP:
+        try:
+            spacy.load(SPACY_LANGUAGE_MAP[lang])
+            print(f'SpaCy is installed and available for {lang}!')
+        except OSError:
+            print('SpaCy is not available! Attempting to download and install...')
+            spacy.cli.download(SPACY_LANGUAGE_MAP[lang])
+
+
 def parse_args():
     ap = argparse.ArgumentParser()
     ap.add_argument(
@@ -135,6 +153,10 @@ def parse_args():
         '-U', '--normalize-unicode', action='store_true',
         help='Normalize unicode strings to their closest ASCII representation'
     )
+    ap.add_argument(
+        '-d', '--database-backend', choices=('leveldb', 'unqlite'), default='unqlite',
+        help='KV database to use to store CUIs and semantic types'
+    )
     ap.add_argument(
         '-E', '--language', default='ENG', choices=LANGUAGES,
         help='Extract concepts of the specified language'
@@ -146,6 +168,8 @@ def parse_args():
 def main():
     opts = parse_args()
 
+    install_spacy(opts.language)
+
     if not os.path.exists(opts.destination_path):
         msg = ('Directory "{}" does not exists; should I create it? [y/N] '
                ''.format(opts.destination_path))
@@ -189,6 +213,10 @@ def main():
     with open(flag_fp, 'w') as f:
         f.write(opts.language)
 
+    flag_fp = os.path.join(opts.destination_path, 'database_backend.flag')
+    with open(flag_fp, 'w') as f:
+        f.write(opts.database_backend)
+
     mrconso_path = os.path.join(opts.umls_installation_path, 'MRCONSO.RRF')
     mrsty_path = os.path.join(opts.umls_installation_path, 'MRSTY.RRF')
 
@@ -197,7 +225,8 @@ def main():
     simstring_dir = os.path.join(opts.destination_path, 'umls-simstring.db')
     cuisty_dir = os.path.join(opts.destination_path, 'cui-semtypes.db')
 
-    parse_and_encode_ngrams(mrconso_iterator, simstring_dir, cuisty_dir)
+    parse_and_encode_ngrams(mrconso_iterator, simstring_dir, cuisty_dir,
+                            database_backend=opts.database_backend)
 
 
 if __name__ == '__main__':
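
The installer change above records the chosen backend in `database_backend.flag`, and the `core.py` change reads it back when a matcher is created, falling back to `leveldb` when the file is missing. The round trip can be sketched in isolation as follows. This is a minimal sketch, not the project's code: `build_parser`, `write_backend_flag`, and `read_backend_flag` are hypothetical helper names introduced for illustration, and the parser is reduced to the two arguments that matter here.

```python
import argparse
import os
import sys
import tempfile

BACKENDS = ('leveldb', 'unqlite')
FLAG_NAME = 'database_backend.flag'


def build_parser():
    """Reduced, illustrative parser: destination path plus the backend option."""
    ap = argparse.ArgumentParser()
    ap.add_argument('destination_path')
    ap.add_argument(
        '-d', '--database-backend', choices=BACKENDS, default='unqlite',
        help='KV database to use to store CUIs and semantic types'
    )
    return ap


def write_backend_flag(destination_path, database_backend):
    """Record the backend name inside the installation directory."""
    flag_fp = os.path.join(destination_path, FLAG_NAME)
    with open(flag_fp, 'w') as f:
        f.write(database_backend)
    return flag_fp


def read_backend_flag(quickumls_fp):
    """Return the recorded backend; assume leveldb if no flag file exists."""
    flag_fp = os.path.join(quickumls_fp, FLAG_NAME)
    if os.path.exists(flag_fp):
        with open(flag_fp) as f:
            return f.read().strip()
    # Installations without a flag file predate the backend option.
    print('[WARNING] no {} found; assuming leveldb'.format(FLAG_NAME),
          file=sys.stderr)
    return 'leveldb'


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        opts = build_parser().parse_args([tmp, '-d', 'leveldb'])
        write_backend_flag(opts.destination_path, opts.database_backend)
        print(read_backend_flag(tmp))
```

Note the asymmetry this makes visible: a fresh install defaults to `unqlite`, while a directory without a flag file is treated as `leveldb`, which is why pre-1.4 installations trigger the warning.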
diff --git a/quickumls/toolbox.py b/quickumls/toolbox.py
@@ -3,6 +3,7 @@
 # build-in modules
 import re
 import os
+from functools import wraps
 import six
 import unicodedata
 from string import punctuation
@@ -12,6 +13,11 @@
 # installed modules
 import numpy
 import leveldb
+try:
+    import unqlite
+    UNQLITE_AVAILABLE = True
+except ImportError:
+    UNQLITE_AVAILABLE = False
 
 # project imports
 from quickumls_simstring import simstring
@@ -216,21 +222,37 @@ def append(self, interval):
 
 
 class CuiSemTypesDB(object):
-    def __init__(self, path):
+    def __init__(self, path, database_backend='leveldb'):
         if not (os.path.exists(path) or os.path.isdir(path)):
             err_msg = (
                 '"{}" is not a valid directory').format(path)
             raise IOError(err_msg)
 
-        self.cui_db = leveldb.LevelDB(
-            os.path.join(path, 'cui.leveldb'))
-        self.semtypes_db = leveldb.LevelDB(
-            os.path.join(path, 'semtypes.leveldb'))
+        if database_backend == 'unqlite':
+            assert UNQLITE_AVAILABLE, (
+                'You selected unqlite as database backend, but it is not '
+                'installed. Please install it via `pip install unqlite`'
+            )
+            self.cui_db = unqlite.UnQLite(os.path.join(path, 'cui.unqlite'))
+            self.cui_db_put = self.cui_db.store
+            self.cui_db_get = self.cui_db.fetch
+            self.semtypes_db = unqlite.UnQLite(os.path.join(path, 'semtypes.unqlite'))
+            self.semtypes_db_put = self.semtypes_db.store
+            self.semtypes_db_get = self.semtypes_db.fetch
+        elif database_backend == 'leveldb':
+            self.cui_db = leveldb.LevelDB(os.path.join(path, 'cui.leveldb'))
+            self.cui_db_put = self.cui_db.Put
+            self.cui_db_get = self.cui_db.Get
+            self.semtypes_db = leveldb.LevelDB(os.path.join(path, 'semtypes.leveldb'))
+            self.semtypes_db_put = self.semtypes_db.Put
+            self.semtypes_db_get = self.semtypes_db.Get
+        else:
+            raise ValueError(f'database_backend {database_backend} not recognized')
 
     def has_term(self, term):
         term = prepare_string_for_db_input(safe_unicode(term))
         try:
-            self.cui_db.Get(db_key_encode(term))
+            self.cui_db_get(db_key_encode(term))
             return True
         except KeyError:
             return
@@ -242,28 +264,31 @@ def insert(self, term, cui, semtypes, is_preferred):
         # some terms have multiple cuis associated with them,
         # so we store them all
         try:
-            cuis = pickle.loads(self.cui_db.Get(db_key_encode(term)))
+            cuis = pickle.loads(self.cui_db_get(db_key_encode(term)))
         except KeyError:
             cuis = set()
 
         cuis.add((cui, is_preferred))
-        self.cui_db.Put(db_key_encode(term), pickle.dumps(cuis))
+        self.cui_db_put(db_key_encode(term), pickle.dumps(cuis))
 
         try:
-            self.semtypes_db.Get(db_key_encode(cui))
+            self.semtypes_db_get(db_key_encode(cui))
         except KeyError:
-            self.semtypes_db.Put(
+            self.semtypes_db_put(
                 db_key_encode(cui), pickle.dumps(set(semtypes))
             )
 
     def get(self, term):
         term = prepare_string_for_db_input(safe_unicode(term))
+        try:
+            cuis = pickle.loads(self.cui_db_get(db_key_encode(term)))
+        except KeyError:
+            cuis = set()
 
-        cuis = pickle.loads(self.cui_db.Get(db_key_encode(term)))
         matches = (
             (
                 cui,
-                pickle.loads(self.semtypes_db.Get(db_key_encode(cui))),
+                pickle.loads(self.semtypes_db_get(db_key_encode(cui))),
                 is_preferred
             )
             for cui, is_preferred in cuis
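
The `CuiSemTypesDB` change above hides the difference between the two backends by binding each library's read and write methods (`Put`/`Get` for leveldb, `store`/`fetch` for unqlite) to common attribute names at construction time. The pattern can be sketched without either library installed. In the sketch below, `FakeLevelDB`, `FakeUnQLite`, and `TermStore` are hypothetical in-memory stand-ins written for illustration, and the CUI strings are made-up placeholders; the sketch assumes, as the diff does, that both backends raise `KeyError` on a missing key.

```python
import pickle


class FakeLevelDB(object):
    """Hypothetical in-memory stand-in with leveldb-style method names."""
    def __init__(self):
        self._data = {}

    def Put(self, key, value):
        self._data[key] = value

    def Get(self, key):
        return self._data[key]  # raises KeyError when the key is missing


class FakeUnQLite(object):
    """Hypothetical in-memory stand-in with unqlite-style method names."""
    def __init__(self):
        self._data = {}

    def store(self, key, value):
        self._data[key] = value

    def fetch(self, key):
        return self._data[key]  # raises KeyError when the key is missing


class TermStore(object):
    """Binds backend-specific methods to common names, as in the diff."""
    def __init__(self, database_backend='leveldb'):
        if database_backend == 'unqlite':
            self.db = FakeUnQLite()
            self.db_put, self.db_get = self.db.store, self.db.fetch
        elif database_backend == 'leveldb':
            self.db = FakeLevelDB()
            self.db_put, self.db_get = self.db.Put, self.db.Get
        else:
            raise ValueError(
                'database_backend {} not recognized'.format(database_backend))

    def insert(self, term, cui, is_preferred):
        key = term.encode('utf-8')
        try:
            cuis = pickle.loads(self.db_get(key))
        except KeyError:
            cuis = set()  # first CUI seen for this term
        cuis.add((cui, is_preferred))
        self.db_put(key, pickle.dumps(cuis))

    def get(self, term):
        try:
            return pickle.loads(self.db_get(term.encode('utf-8')))
        except KeyError:
            return set()  # unknown terms yield no matches


if __name__ == '__main__':
    for backend in ('leveldb', 'unqlite'):
        store = TermStore(database_backend=backend)
        store.insert('headache', 'C0000001', True)   # placeholder CUI
        store.insert('headache', 'C0000002', False)  # placeholder CUI
        print(backend, sorted(store.get('headache')))
```

Binding the methods once in `__init__` keeps `insert` and `get` free of per-call backend checks, at the cost of relying on both libraries signalling a missing key the same way.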
diff --git a/requirements.txt b/requirements.txt
@@ -3,4 +3,5 @@ numpy>=1.8.2
 spacy>=1.6.0
 unidecode>=0.4.19
 nltk>=3.3
-quickumls_simstring>=1.1.5r1
+quickumls_simstring>=1.1.5r1
+unqlite>=0.8.1