Commit bad4c50

QuickUMLS v.1.4 (#55)
Release Notes:
- [NEW] Added support for [unqlite](https://github.com/coleifer/unqlite-python) as an alternative to leveldb for storage of CUIs and Semantic Types. This allows creating multiple QuickUMLS matchers from the same installation.
- [NEW] Added support for conversion of all uppercase words ([#48](#48), thank you sandertan@!).
- [NEW] Automatically downloads SpaCy data for the selected language if missing.
- [FIX] Mitigated [#52](#52).
1 parent bd58713 commit bad4c50

6 files changed

+103
-31
lines changed

README.md

+5-3
@@ -1,4 +1,4 @@
-[**NEW: v.1.3 is pip-ready!**](https://giphy.com/embed/BlVnrxJgTGsUw) You can now install QuickUMLS through a simple `pip install quickumls`.
+[**NEW: v.1.4 supports starting multiple QuickUMLS matchers concurrently!**](https://giphy.com/embed/BlVnrxJgTGsUw) I've finally added support for [unqlite](https://github.com/coleifer/unqlite-python) as an alternative to leveldb for storage of CUIs and Semantic Types (see [here](https://github.com/Georgetown-IR-Lab/QuickUMLS/wiki/Migration-QuickUMLS-1.3-to-1.4) for more details). unqlite-backed QuickUMLS installations support multiple matchers running at the same time. Besides better multi-processing support, unqlite should have better support for unicode.

 # QuickUMLS

@@ -11,12 +11,12 @@ This project should be compatible with Python 3 (Python 2 is [no longer supporte
 ## Installation

 1. **Obtain a UMLS installation** This tool requires you to have a valid UMLS installation on disk. To install UMLS, you must first obtain a [license](https://uts.nlm.nih.gov/license.html) from the National Library of Medicine; then you should download all UMLS files from [this page](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html); finally, you can install UMLS using the [MetamorphoSys](https://www.nlm.nih.gov/pubs/factsheets/umlsmetamorph.html) tool as [explained in this guide](https://www.nlm.nih.gov/research/umls/implementation_resources/metamorphosys/help.html). The installation can be removed once the system has been initialized.
-2. **Install QuickUMLS**: You can do so by either running `pip install quickumls` or `python setup.py install`. On macOS, using anaconda is **strongly recommended**<sup>†</sup>.
-3. **Obtain a SpaCy corpus**: After you install QuickUMLS and its dependencies, you should be able to do so by running `python -m spacy download en`.
+2. **Install QuickUMLS**: You can do so by either running `pip install quickumls` or `python setup.py install`. On macOS, using anaconda is **strongly recommended**<sup>†</sup>.
 3. **Create a QuickUMLS installation** Initialize the system by running `python -m quickumls.install <umls_installation_path> <destination_path>`, where `<umls_installation_path>` is where the installation files are (in particular, we need `MRCONSO.RRF` and `MRSTY.RRF`) and `<destination_path>` is the directory where the QuickUMLS data files should be installed. This process will take between 5 and 30 minutes depending on how fast the CPU and the drive where UMLS and QuickUMLS files are stored are (on a system with an Intel i7 6700K CPU and a 7200 RPM hard drive, initialization takes 8.5 minutes). `python -m quickumls.install` supports the following optional arguments:
     - `-L` / `--lowercase`: if used, all concept terms are folded to lowercase before being processed. This option typically increases recall, but it might reduce precision;
     - `-U` / `--normalize-unicode`: if used, expressions with non-ASCII characters are converted to the closest combination of ASCII characters.
     - `-E` / `--language`: Specify the language to consider for UMLS concepts; by default, English is used. For a complete list of languages, please see [this table provided by NLM](https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html#LAT).
+    - `-d` / `--database-backend`: Specify which database backend to use for QuickUMLS. The two options are `leveldb` and `unqlite`. The latter supports multi-process reading and has better unicode compatibility, and it is used as default for all new 1.4 installations; the former is still used as default when instantiating a QuickUMLS client. More info about differences between the two databases and migration info are available [here](https://github.com/Georgetown-IR-Lab/QuickUMLS/wiki/Migration-QuickUMLS-1.3-to-1.4).


 **<sup>†</sup>**: If the installation fails on macOS when using Anaconda, install `leveldb` first by running `conda install -c conda-forge python-leveldb`.
@@ -50,6 +50,8 @@ matcher.match(text, best_match=True, ignore_syntax=False)

 Set `best_match` to `False` if you want to return overlapping candidates, and `ignore_syntax` to `True` to disable all heuristics introduced in (Soldaini and Goharian, 2016).

+If the matcher throws a warning during initialization, read [this page](https://github.com/Georgetown-IR-Lab/QuickUMLS/wiki/Migration-QuickUMLS-1.3-to-1.4) to learn why and how to stop it from doing so.
+

 ## Server / Client Support

quickumls/about.py

+2-2
@@ -4,9 +4,9 @@
 # https://github.com/explosion/spaCy/blob/master/spacy/about.py

 __title__ = 'quickumls'
-__version__ = '1.3.0r4'
+__version__ = '1.4.0r1'
 __author__ = 'Luca Soldaini'
 __email__ = '[email protected]'
 __license__ = 'MIT'
 __uri__ = "https://github.com/Georgetown-IR-Lab/QuickUMLS"
-__copyright__ = '2014-2019, Georgetown University Information Retrieval Lab'
+__copyright__ = '2014-2020, Georgetown University Information Retrieval Lab'

quickumls/core.py

+16-1
@@ -126,6 +126,19 @@ def __init__(
             )
         spacy_lang = constants.SPACY_LANGUAGE_MAP[self.language_flag]

+        database_backend_fp = os.path.join(quickumls_fp, 'database_backend.flag')
+        if os.path.exists(database_backend_fp):
+            with open(database_backend_fp) as f:
+                self._database_backend = f.read().strip()
+        else:
+            print('[WARNING] This installation was created with QuickUMLS v.1.3 or earlier, '
+                  'which does not support multiple database backends. For now, I\'ll '
+                  'assume that leveldb was used as default; this implicit assumption will '
+                  'change in future versions of QuickUMLS. More info here: '
+                  'https://github.com/Georgetown-IR-Lab/QuickUMLS/wiki/Migration-QuickUMLS-1.3-to-1.4',
+                  file=sys.stderr)
+            self._database_backend = 'leveldb'
+
         # domain specific stopwords
         self._stopwords = self._stopwords.union(constants.DOMAIN_SPECIFIC_STOPWORDS)

@@ -149,7 +162,9 @@ def __init__(
         self.ss_db = toolbox.SimstringDBReader(
             simstring_fp, similarity_name, threshold
         )
-        self.cuisem_db = toolbox.CuiSemTypesDB(cuisem_fp)
+        self.cuisem_db = toolbox.CuiSemTypesDB(
+            cuisem_fp, database_backend=self._database_backend
+        )

     def get_info(self):
         """Computes a summary of the matcher options.

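The backend lookup added to `core.py` above can be tried in isolation. The following is a minimal sketch, not the project's code: `read_database_backend` is a hypothetical helper name, and the two temporary directories stand in for a pre-1.4 installation (no flag file) and a 1.4 installation (flag file present).

```python
import os
import sys
import tempfile


def read_database_backend(quickumls_fp, default='leveldb'):
    """Return the backend named in database_backend.flag, or `default`.

    Hypothetical helper mirroring the logic in the diff: installations
    without a flag file are assumed to have been created with leveldb.
    """
    flag_fp = os.path.join(quickumls_fp, 'database_backend.flag')
    if os.path.exists(flag_fp):
        with open(flag_fp) as f:
            return f.read().strip()
    print('[WARNING] no database_backend.flag found; assuming ' + default,
          file=sys.stderr)
    return default


with tempfile.TemporaryDirectory() as old_install, \
        tempfile.TemporaryDirectory() as new_install:
    # Simulate what a 1.4 installer would leave on disk.
    with open(os.path.join(new_install, 'database_backend.flag'), 'w') as f:
        f.write('unqlite\n')

    legacy_backend = read_database_backend(old_install)
    current_backend = read_database_backend(new_install)
```

Under these assumptions the installation without a flag file falls back to `leveldb` and prints a warning to stderr, while the flagged one reports `unqlite`.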
quickumls/install.py

+41-12
@@ -1,24 +1,28 @@
 from __future__ import unicode_literals, division, print_function

 # built in modules
+import argparse
+import codecs
 import os
+from six.moves import input
+import shutil
 import sys
 import time
-import codecs
-import shutil
-import argparse
-from six.moves import input
-
-# project modules
-from .toolbox import countlines, CuiSemTypesDB, SimstringDBWriter, mkdir
-from .constants import HEADERS_MRCONSO, HEADERS_MRSTY, LANGUAGES
-
 try:
     from unidecode import unidecode
 except ImportError:
     pass


+# third-party dependencies
+import spacy
+
+
+# project modules
+from .toolbox import countlines, CuiSemTypesDB, SimstringDBWriter, mkdir
+from .constants import HEADERS_MRCONSO, HEADERS_MRSTY, LANGUAGES, SPACY_LANGUAGE_MAP
+
+
 def get_semantic_types(path, headers):
     sem_types = {}
     with codecs.open(path, encoding='utf-8') as f:
@@ -98,13 +102,13 @@ def extract_from_mrconso(
     print(status)


-def parse_and_encode_ngrams(extracted_it, simstring_dir, cuisty_dir):
+def parse_and_encode_ngrams(extracted_it, simstring_dir, cuisty_dir, database_backend):
     # Create destination directories for the two databases
     mkdir(simstring_dir)
     mkdir(cuisty_dir)

     ss_db = SimstringDBWriter(simstring_dir)
-    cuisty_db = CuiSemTypesDB(cuisty_dir)
+    cuisty_db = CuiSemTypesDB(cuisty_dir, database_backend=database_backend)

     simstring_terms = set()

@@ -116,6 +120,20 @@ def parse_and_encode_ngrams(extracted_it, simstring_dir, cuisty_dir):
         cuisty_db.insert(term, cui, stys, preferred)


+def install_spacy(lang):
+    """Tries to create a spacy object; if it fails, downloads the dataset"""
+
+    print(f'Determining if SpaCy for language "{lang}" is installed...')
+
+    if lang in SPACY_LANGUAGE_MAP:
+        try:
+            spacy.load(SPACY_LANGUAGE_MAP[lang])
+            print(f'SpaCy is installed and available for {lang}!')
+        except OSError:
+            print('SpaCy is not available! Attempting to download and install...')
+            spacy.cli.download(SPACY_LANGUAGE_MAP[lang])
+
+
 def parse_args():
     ap = argparse.ArgumentParser()
     ap.add_argument(
@@ -135,6 +153,10 @@ def parse_args():
         '-U', '--normalize-unicode', action='store_true',
         help='Normalize unicode strings to their closest ASCII representation'
     )
+    ap.add_argument(
+        '-d', '--database-backend', choices=('leveldb', 'unqlite'), default='unqlite',
+        help='KV database to use to store CUIs and semantic types'
+    )
     ap.add_argument(
         '-E', '--language', default='ENG', choices=LANGUAGES,
         help='Extract concepts of the specified language'
@@ -146,6 +168,8 @@ def parse_args():
 def main():
     opts = parse_args()

+    install_spacy(opts.language)
+
     if not os.path.exists(opts.destination_path):
         msg = ('Directory "{}" does not exist; should I create it? [y/N] '
                ''.format(opts.destination_path))
@@ -189,6 +213,10 @@ def main():
     with open(flag_fp, 'w') as f:
         f.write(opts.language)

+    flag_fp = os.path.join(opts.destination_path, 'database_backend.flag')
+    with open(flag_fp, 'w') as f:
+        f.write(opts.database_backend)
+
     mrconso_path = os.path.join(opts.umls_installation_path, 'MRCONSO.RRF')
     mrsty_path = os.path.join(opts.umls_installation_path, 'MRSTY.RRF')

@@ -197,7 +225,8 @@ def main():
     simstring_dir = os.path.join(opts.destination_path, 'umls-simstring.db')
     cuisty_dir = os.path.join(opts.destination_path, 'cui-semtypes.db')

-    parse_and_encode_ngrams(mrconso_iterator, simstring_dir, cuisty_dir)
+    parse_and_encode_ngrams(mrconso_iterator, simstring_dir, cuisty_dir,
+                            database_backend=opts.database_backend)


 if __name__ == '__main__':
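
The installer changes above combine two steps: a `-d` / `--database-backend` option restricted to two choices, and a flag file that records the choice for later readers. A minimal sketch of that round trip follows. It assumes a reduced parser with a single positional argument (the real one also takes `<umls_installation_path>` and other options), and `build_parser` and `write_backend_flag` are hypothetical names.

```python
import argparse
import os
import tempfile


def build_parser():
    """Reduced, illustrative parser; only the backend option mirrors the diff."""
    ap = argparse.ArgumentParser()
    ap.add_argument('destination_path')
    ap.add_argument(
        '-d', '--database-backend', choices=('leveldb', 'unqlite'),
        default='unqlite',
        help='KV database to use to store CUIs and semantic types'
    )
    return ap


def write_backend_flag(destination_path, database_backend):
    """Record the chosen backend next to the data files and return the path."""
    flag_fp = os.path.join(destination_path, 'database_backend.flag')
    with open(flag_fp, 'w') as f:
        f.write(database_backend)
    return flag_fp


with tempfile.TemporaryDirectory() as dest:
    # Explicitly request leveldb, then read back what was recorded.
    opts = build_parser().parse_args([dest, '-d', 'leveldb'])
    flag_fp = write_backend_flag(opts.destination_path, opts.database_backend)
    with open(flag_fp) as f:
        recorded = f.read()

# Without -d, the parser falls back to the new default.
default_backend = build_parser().parse_args(['some/dir']).database_backend
```

Because `choices` is set, argparse rejects any other backend name before the installer touches the disk, so the flag file can only ever contain one of the two supported values.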

quickumls/toolbox.py

+37-12
@@ -3,6 +3,7 @@
 # built-in modules
 import re
 import os
+from functools import wraps
 import six
 import unicodedata
 from string import punctuation
@@ -12,6 +13,11 @@
 # installed modules
 import numpy
 import leveldb
+try:
+    import unqlite
+    UNQLITE_AVAILABLE = True
+except ImportError:
+    UNQLITE_AVAILABLE = False

 # project imports
 from quickumls_simstring import simstring
@@ -216,21 +222,37 @@ def append(self, interval):


 class CuiSemTypesDB(object):
-    def __init__(self, path):
+    def __init__(self, path, database_backend='leveldb'):
         if not (os.path.exists(path) or os.path.isdir(path)):
             err_msg = (
                 '"{}" is not a valid directory').format(path)
             raise IOError(err_msg)

-        self.cui_db = leveldb.LevelDB(
-            os.path.join(path, 'cui.leveldb'))
-        self.semtypes_db = leveldb.LevelDB(
-            os.path.join(path, 'semtypes.leveldb'))
+        if database_backend == 'unqlite':
+            assert UNQLITE_AVAILABLE, (
+                'You selected unqlite as database backend, but it is not '
+                'installed. Please install it via `pip install unqlite`'
+            )
+            self.cui_db = unqlite.UnQLite(os.path.join(path, 'cui.unqlite'))
+            self.cui_db_put = self.cui_db.store
+            self.cui_db_get = self.cui_db.fetch
+            self.semtypes_db = unqlite.UnQLite(os.path.join(path, 'semtypes.unqlite'))
+            self.semtypes_db_put = self.semtypes_db.store
+            self.semtypes_db_get = self.semtypes_db.fetch
+        elif database_backend == 'leveldb':
+            self.cui_db = leveldb.LevelDB(os.path.join(path, 'cui.leveldb'))
+            self.cui_db_put = self.cui_db.Put
+            self.cui_db_get = self.cui_db.Get
+            self.semtypes_db = leveldb.LevelDB(os.path.join(path, 'semtypes.leveldb'))
+            self.semtypes_db_put = self.semtypes_db.Put
+            self.semtypes_db_get = self.semtypes_db.Get
+        else:
+            raise ValueError(f'database_backend {database_backend} not recognized')

     def has_term(self, term):
         term = prepare_string_for_db_input(safe_unicode(term))
         try:
-            self.cui_db.Get(db_key_encode(term))
+            self.cui_db_get(db_key_encode(term))
             return True
         except KeyError:
             return
@@ -242,28 +264,31 @@ def insert(self, term, cui, semtypes, is_preferred):
         # some terms have multiple cuis associated with them,
         # so we store them all
         try:
-            cuis = pickle.loads(self.cui_db.Get(db_key_encode(term)))
+            cuis = pickle.loads(self.cui_db_get(db_key_encode(term)))
         except KeyError:
             cuis = set()

         cuis.add((cui, is_preferred))
-        self.cui_db.Put(db_key_encode(term), pickle.dumps(cuis))
+        self.cui_db_put(db_key_encode(term), pickle.dumps(cuis))

         try:
-            self.semtypes_db.Get(db_key_encode(cui))
+            self.semtypes_db_get(db_key_encode(cui))
         except KeyError:
-            self.semtypes_db.Put(
+            self.semtypes_db_put(
                 db_key_encode(cui), pickle.dumps(set(semtypes))
             )

     def get(self, term):
         term = prepare_string_for_db_input(safe_unicode(term))
+        try:
+            cuis = pickle.loads(self.cui_db_get(db_key_encode(term)))
+        except KeyError:
+            cuis = set()

-        cuis = pickle.loads(self.cui_db.Get(db_key_encode(term)))
         matches = (
             (
                 cui,
-                pickle.loads(self.semtypes_db.Get(db_key_encode(cui))),
+                pickle.loads(self.semtypes_db_get(db_key_encode(cui))),
                 is_preferred
             )
             for cui, is_preferred in cuis
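
The `toolbox.py` change hides the naming difference between the two backends (`Put`/`Get` versus `store`/`fetch`) by binding the backend's methods to shared attribute names once, in the constructor. The sketch below illustrates only that pattern. `FakeLevelDB`, `FakeUnQLite`, and `CuiStore` are hypothetical in-memory stand-ins, not the real libraries; keys are encoded with plain UTF-8 rather than the project's `db_key_encode`; and the second CUI is made up for the example.

```python
import pickle


class FakeLevelDB:
    """Hypothetical in-memory stand-in with leveldb-style Put/Get names."""

    def __init__(self):
        self._data = {}

    def Put(self, key, value):
        self._data[key] = value

    def Get(self, key):
        return self._data[key]  # raises KeyError on a missing key


class FakeUnQLite:
    """Hypothetical in-memory stand-in with unqlite-style store/fetch names."""

    def __init__(self):
        self._data = {}

    def store(self, key, value):
        self._data[key] = value

    def fetch(self, key):
        return self._data[key]  # raises KeyError on a missing key


class CuiStore:
    """Binds backend-specific methods to shared names at construction time."""

    def __init__(self, database_backend='leveldb'):
        if database_backend == 'unqlite':
            self.db = FakeUnQLite()
            self.db_put, self.db_get = self.db.store, self.db.fetch
        elif database_backend == 'leveldb':
            self.db = FakeLevelDB()
            self.db_put, self.db_get = self.db.Put, self.db.Get
        else:
            raise ValueError(
                'database_backend {} not recognized'.format(database_backend))

    def insert(self, term, cui, is_preferred):
        key = term.encode('utf-8')
        try:
            cuis = pickle.loads(self.db_get(key))
        except KeyError:
            cuis = set()
        cuis.add((cui, is_preferred))
        self.db_put(key, pickle.dumps(cuis))

    def get(self, term):
        # Unknown terms yield an empty set instead of raising.
        try:
            return pickle.loads(self.db_get(term.encode('utf-8')))
        except KeyError:
            return set()


results = {}
for backend in ('leveldb', 'unqlite'):
    store = CuiStore(database_backend=backend)
    store.insert('aspirin', 'C0004057', True)
    store.insert('aspirin', 'C0000000', False)  # made-up CUI for illustration
    results[backend] = (store.get('aspirin'), store.get('missing term'))
```

Every other method calls `db_put` and `db_get` without knowing which backend is in use, which is why the rest of the diff is a mechanical rename. The `except KeyError` fallback in `get` matches the behavior the commit adds for unknown terms.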

requirements.txt

+2-1
@@ -3,4 +3,5 @@ numpy>=1.8.2
 spacy>=1.6.0
 unidecode>=0.4.19
 nltk>=3.3
-quickumls_simstring>=1.1.5r1
+quickumls_simstring>=1.1.5r1
+unqlite>=0.8.1
