Commit e701161

Added support for pip install (#47)
Fixes #18.
1 parent 512fd64 commit e701161

15 files changed (+180, -184 lines changed)

MANIFEST.in

+2
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,2 @@
1+
include requirements.txt
2+
include README.md

README.md

+19-25
@@ -1,39 +1,33 @@
1-
**We recommend to download the latest tested version from the [releases section](https://github.com/Georgetown-IR-Lab/QuickUMLS/releases)**.
2-
3-
**NEW: v.1.2 now includes client/server support!** Start a QuickUMLS server once, avoid loading QuickUMLS each time your experiments run! See <a href="#client_server">below</a> for more info.
1+
[**NEW: v.1.3 is pip-ready!**](https://giphy.com/embed/BlVnrxJgTGsUw) You can now install QuickUMLS through a simple `pip install quickumls`.
42

53
# QuickUMLS
64

75
QuickUMLS (Soldaini and Goharian, 2016) is a tool for fast, unsupervised biomedical concept extraction from medical text.
86
It takes advantage of [Simstring](http://www.chokkan.org/software/simstring/) (Okazaki and Tsujii, 2010) for approximate string matching.
97
For more details on how QuickUMLS works, please refer to our paper.
108

11-
This project should be compatible with both Python 2 and 3 and run on any UNIX system (support for Windows is experimental, please report bugs!). **If you find any bugs, please file an issue on GitHub or email the author at [email protected]**.
9+
This project should be compatible with Python 3 (Python 2 is [no longer supported](https://pythonclock.org/)) and run on any UNIX system (support for Windows is experimental, please report bugs!). **If you find any bugs, please file an issue on GitHub or email the author at [email protected]**.
1210

1311
## Installation
1412

15-
#### Before Starting
16-
17-
1. Make sure that your Python installation include C headers (e.g., on Ubuntu, make sure `python3-dev` or `python-dev` are installed).
18-
2. This software requires all packages listed in the `requirements.txt` file. You can install all of them by running `pip install -r requirements.txt`.
19-
3. Note that, in order to use `spacy`, you are required to download its corpus. You can do that by running `python -m spacy download en`.
20-
4. This system requires you to have a valid UMLS installation on disk. To install UMLS, you must first obtain a [license](https://uts.nlm.nih.gov/license.html) from the National Library of Medicine; then you should download all UMLS files from [this page](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html); finally, you can install UMLS using the [MetamorphoSys](https://www.nlm.nih.gov/pubs/factsheets/umlsmetamorph.html) tool as [explained in this guide](https://www.nlm.nih.gov/research/umls/implementation_resources/metamorphosys/help.html). The installation can be removed once the system has been initialized.
13+
1. **Obtain a UMLS installation**: This tool requires you to have a valid UMLS installation on disk. To install UMLS, you must first obtain a [license](https://uts.nlm.nih.gov/license.html) from the National Library of Medicine; then you should download all UMLS files from [this page](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html); finally, you can install UMLS using the [MetamorphoSys](https://www.nlm.nih.gov/pubs/factsheets/umlsmetamorph.html) tool as [explained in this guide](https://www.nlm.nih.gov/research/umls/implementation_resources/metamorphosys/help.html). The installation can be removed once the system has been initialized.
14+
2. **Install QuickUMLS**: You can do so by running either `pip install quickumls` or `python setup.py install`. On macOS, using Anaconda is **strongly recommended**<sup>†</sup>.
15+
3. **Obtain a spaCy corpus**: After you install QuickUMLS and its dependencies, you should be able to do so by running `python -m spacy download en`.
16+
4. **Create a QuickUMLS installation**: Initialize the system by running `python -m quickumls.install <umls_installation_path> <destination_path>`, where `<umls_installation_path>` is the directory containing the UMLS installation files (in particular, `MRCONSO.RRF` and `MRSTY.RRF` are needed) and `<destination_path>` is the directory where the QuickUMLS data files should be installed. This process takes between 5 and 30 minutes, depending on the speed of the CPU and of the drive where the UMLS and QuickUMLS files are stored (on a system with an Intel i7 6700K CPU and a 7200 RPM hard drive, initialization takes 8.5 minutes). `python -m quickumls.install` supports the following optional arguments:
17+
- `-L` / `--lowercase`: if used, all concept terms are folded to lowercase before being processed. This option typically increases recall, but it might reduce precision;
18+
- `-U` / `--normalize-unicode`: if used, expressions with non-ASCII characters are converted to the closest combination of ASCII characters.
19+
- `-E` / `--language`: Specify the language to consider for UMLS concepts; by default, English is used. For a complete list of languages, please see [this table provided by NLM](https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html#LAT).
2120

22-
#### How To get the System Initialized
2321

24-
1. Download and compile Simstring by running `bash setup_simstring.sh <python_version>`, where `<python_version>` is either "`2`" or "`3`".
25-
2. Initialize the system by running `python install.py <umls_installation_path> <destination_path>`, where `<umls_installation_path>` is where the installation files are (in particular, we need `MRCONSO.RRF` and `MRSTY.RRF`) and `<destination_path>` is the directory where the QuickUmls data files should be installed. This process will take between 5 and 30 minutes depending how fast the CPU and the drive where UMLS and QuickUMLS files are stored are (on a system with a Intel i7 6700K CPU and a 7200RPM hard drive, initialization takes 8.5 minutes).
26-
27-
`install.py` supports the following optional arguments:
28-
- `-L` / `--lowercase`: if used, all concept terms are folded to lowercase before being processed. This option typically increases recall, but it might reduce precision;
29-
- `-U` / `--normalize-unicode`: if used, expressions with non-ASCII characters are converted to the closest combination of ASCII characters.
30-
- `-E` / `--language`: Specify the language to consider for UMLS concepts; by default, English is used. For a complete list of languages, please see [this table provided by NLM](https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html#LAT).
22+
<sup>†</sup>: If the installation fails on macOS when using Anaconda, install `leveldb` first by running `conda install -c conda-forge python-leveldb`.
3123

3224
## APIs
3325

3426
A QuickUMLS object can be instantiated as follows:
3527

3628
```python
29+
from quickumls import QuickUMLS
30+
3731
matcher = QuickUMLS(quickumls_fp, overlapping_criteria, threshold,
3832
similarity_name, window, accepted_semtypes)
3933
```
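The constructor takes several options, so passing them by keyword avoids depending on their positional order. The following is a minimal sketch, not part of this commit: `build_matcher_kwargs` and `make_matcher` are hypothetical helpers, the accepted values and defaults are the ones listed in the `QuickUMLS.__init__` docstring in `quickumls/core.py`, and the bounds checked for `threshold` and `window` are assumptions.

```python
VALID_OVERLAPPING_CRITERIA = ("score", "length")
VALID_SIMILARITY_NAMES = ("dice", "jaccard", "cosine", "overlap")


def build_matcher_kwargs(overlapping_criteria="score", threshold=0.7,
                         similarity_name="jaccard", window=5):
    """Validate matcher options and return them as keyword arguments.

    Hypothetical helper: defaults mirror the QuickUMLS.__init__ docstring.
    """
    if overlapping_criteria not in VALID_OVERLAPPING_CRITERIA:
        raise ValueError(
            "overlapping_criteria must be one of {}".format(
                VALID_OVERLAPPING_CRITERIA))
    if similarity_name not in VALID_SIMILARITY_NAMES:
        raise ValueError(
            "similarity_name must be one of {}".format(VALID_SIMILARITY_NAMES))
    # Assumption: similarity scores lie in (0, 1], so the threshold should too.
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    # Assumption: a window of fewer than one token cannot match anything.
    if window < 1:
        raise ValueError("window must be at least 1")
    return {
        "overlapping_criteria": overlapping_criteria,
        "threshold": threshold,
        "similarity_name": similarity_name,
        "window": window,
    }


def make_matcher(quickumls_fp, **options):
    """Instantiate QuickUMLS with validated keyword arguments."""
    # Imported lazily so the validation helper works without QuickUMLS installed.
    from quickumls import QuickUMLS
    return QuickUMLS(quickumls_fp, **build_matcher_kwargs(**options))
```

With QuickUMLS installed and initialized, `make_matcher("/path/to/quickumls/files", threshold=0.8)` would return a matcher whose `match` method can be called as described in this README.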
@@ -57,22 +51,22 @@ matcher.match(text, best_match=True, ignore_syntax=False)
5751
Set `best_match` to `False` if you want to return overlapping candidates, `ignore_syntax` to `True` to disable all heuristics introduced in (Soldaini and Goharian, 2016).
5852

5953

60-
<h2 id="client_server">[NEW] Server / Client Support</h2>
54+
## Server / Client Support
6155

6256
Starting with v.1.2, QuickUMLS supports a client-server configuration. That is, you can start one QuickUMLS server and query it from multiple scripts using a client.
6357

64-
To start the server, run `server.py`:
58+
To start the server, run `python -m quickumls.server`:
6559

6660
```bash
67-
python server.py /path/to/quickumls/files {-P QuickUMLS port} {-H QuickUMLS host} {QuickUMLS options}
61+
python -m quickumls.server /path/to/quickumls/files {-P QuickUMLS port} {-H QuickUMLS host} {QuickUMLS options}
6862
```
6963

70-
Host and port are optional; by default, QuickUMLS runs on `localhost:4645`. You can also pass any QuickUMLS option mentioned above to the server. To obtain a list of options for the server, run `python server.py -h`.
64+
Host and port are optional; by default, QuickUMLS runs on `localhost:4645`. You can also pass any QuickUMLS option mentioned above to the server. To obtain a list of options for the server, run `python -m quickumls.server -h`.
7165

72-
To load the client, import `get_quickumls_client` from `client.py`:
66+
To load the client, import `get_quickumls_client` from `quickumls`:
7367

7468
```python
75-
from client import get_quickumls_client
69+
from quickumls import get_quickumls_client
7670
matcher = get_quickumls_client()
7771
text = "The ulna has dislocated posteriorly from the trochlea of the humerus."
7872
matcher.match(text, best_match=True, ignore_syntax=False)
@@ -84,7 +78,7 @@ The API of the client is the same of a QuickUMLS object.
8478
In case you wish to run the server in the background, you can do so as follows:
8579

8680
```bash
87-
nohup python server.py /path/to/QuickUMLS {server options} > /dev/null 2>&1 & echo $! > nohup.pid
81+
nohup python -m quickumls.server /path/to/QuickUMLS {server options} > /dev/null 2>&1 & echo $! > nohup.pid
8882

8983
```
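For scripts that need to launch the server themselves, the same command can be assembled in Python. This is a sketch under assumptions rather than something this commit provides: `build_server_command` and `start_server_in_background` are hypothetical helpers, only the `-H` and `-P` flags shown in this README are used, and `start_new_session=True` is a POSIX-only stand-in that is roughly comparable to `nohup`.

```python
import subprocess
import sys


def build_server_command(quickumls_fp, host=None, port=None):
    """Return the argv list for `python -m quickumls.server` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "quickumls.server", quickumls_fp]
    if host is not None:
        cmd += ["-H", str(host)]
    if port is not None:
        cmd += ["-P", str(port)]
    return cmd


def start_server_in_background(quickumls_fp, pid_file="nohup.pid", **kwargs):
    """Start the server detached from the terminal and record its PID."""
    proc = subprocess.Popen(
        build_server_command(quickumls_fp, **kwargs),
        stdout=subprocess.DEVNULL,   # discard output, like `> /dev/null 2>&1`
        stderr=subprocess.DEVNULL,
        start_new_session=True,      # POSIX only: run in a new session
    )
    with open(pid_file, "w") as f:
        f.write(str(proc.pid))       # same role as `echo $! > nohup.pid`
    return proc
```

Omitting `host` and `port` leaves the server on its documented default, `localhost:4645`.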
9084

__init__.py

Whitespace-only changes.

quickumls/__init__.py

+3
@@ -0,0 +1,3 @@
1+
from .core import QuickUMLS
2+
from .client import get_quickumls_client
3+
from .about import *

quickumls/about.py

+12
@@ -0,0 +1,12 @@
1+
# inspired from:
2+
# https://python-packaging-user-guide.readthedocs.org/en/latest/single_source_version/
3+
# https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py
4+
# https://github.com/explosion/spaCy/blob/master/spacy/about.py
5+
6+
__title__ = 'quickumls'
7+
__version__ = '1.3.0r4'
8+
__author__ = 'Luca Soldaini'
9+
__email__ = '[email protected]'
10+
__license__ = 'MIT'
11+
__uri__ = "https://github.com/Georgetown-IR-Lab/QuickUMLS"
12+
__copyright__ = '2014-2019, Georgetown University Information Retrieval Lab'
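The links in the header comments point to the single-source-version pattern, in which packaging code reads metadata from this file instead of importing the package. The `setup.py` for this commit is not shown here, so the following is only a sketch of how such a file might be consumed; `parse_about` and `read_about` are hypothetical names.

```python
def parse_about(source):
    """Execute about-style source code and return its dunder metadata as a dict."""
    namespace = {}
    exec(source, namespace)
    # exec() injects __builtins__ into the namespace; drop it from the result.
    return {
        key: value for key, value in namespace.items()
        if key.startswith("__") and key.endswith("__") and key != "__builtins__"
    }


def read_about(path):
    """Read metadata from an about.py file without importing its package."""
    with open(path, encoding="utf-8") as f:
        return parse_about(f.read())
```

Reading the file this way avoids importing `quickumls` (and therefore its dependencies) before they are installed.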

client.py renamed to quickumls/client.py

+2-6
@@ -1,9 +1,5 @@
1-
try:
2-
from network import MinimalClient
3-
from quickumls import QuickUMLS
4-
except ImportError:
5-
from .network import MinimalClient
6-
from .quickumls import QuickUMLS
1+
from .network import MinimalClient
2+
from .core import QuickUMLS
73

84

95
def get_quickumls_client(host='localhost', port=4645):
File renamed without changes.

quickumls.py renamed to quickumls/core.py

+19-22
@@ -14,12 +14,8 @@
1414
from unidecode import unidecode
1515

1616
# project modules
17-
try:
18-
import toolbox
19-
import constants
20-
except ImportError:
21-
from . import toolbox
22-
from . import constants
17+
from . import toolbox
18+
from . import constants
2319

2420

2521
class QuickUMLS(object):
@@ -32,25 +28,25 @@ def __init__(
3228
accepted_semtypes=constants.ACCEPTED_SEMTYPES,
3329
verbose=False):
3430
"""Instantiate QuickUMLS object
35-
31+
3632
This is the main interface through which text can be processed.
3733
3834
Args:
3935
quickumls_fp (str): Path to which QuickUMLS was installed
40-
overlapping_criteria (str, optional):
41-
One of "score" or "length". Choose how results are ranked.
36+
overlapping_criteria (str, optional):
37+
One of "score" or "length". Choose how results are ranked.
4238
Choose "score" for best matching score first or "length" for longest match first. Defaults to 'score'.
4339
threshold (float, optional): Minimum similarity between strings. Defaults to 0.7.
4440
window (int, optional): Maximum amount of tokens to consider for matching. Defaults to 5.
45-
similarity_name (str, optional): One of "dice", "jaccard", "cosine", or "overlap".
41+
similarity_name (str, optional): One of "dice", "jaccard", "cosine", or "overlap".
4642
Similarity measure to be used. Defaults to 'jaccard'.
4743
min_match_length (int, optional): TODO: ??. Defaults to 3.
4844
accepted_semtypes (List[str], optional): Set of UMLS semantic types concepts should belong to.
4945
Semantic types are identified by the letter "T" followed by three numbers
50-
(e.g., "T131", which identifies the type "Hazardous or Poisonous Substance").
46+
(e.g., "T131", which identifies the type "Hazardous or Poisonous Substance").
5147
Defaults to constants.ACCEPTED_SEMTYPES.
5248
verbose (bool, optional): TODO:??. Defaults to False.
53-
49+
5450
Raises:
5551
ValueError: Raises a ValueError if QuickUMLS was installed for a language that is not currently supported TODO: verify this?
5652
OSError: Raises an OSError if the required Spacy model was not installed.
@@ -123,10 +119,6 @@ def __init__(
123119

124120
self.accepted_semtypes = accepted_semtypes
125121

126-
self.ss_db = toolbox.SimstringDBReader(
127-
simstring_fp, similarity_name, threshold
128-
)
129-
self.cuisem_db = toolbox.CuiSemTypesDB(cuisem_fp)
130122
try:
131123
self.nlp = spacy.load(spacy_lang)
132124
except OSError:
@@ -139,10 +131,15 @@ def __init__(
139131
constants.SPACY_LANGUAGE_MAP.get(self.language_flag, 'xx')
140132
)
141133
raise OSError(msg)
134+
135+
self.ss_db = toolbox.SimstringDBReader(
136+
simstring_fp, similarity_name, threshold
137+
)
138+
self.cuisem_db = toolbox.CuiSemTypesDB(cuisem_fp)
142139

143140
def get_info(self):
144141
"""Computes a summary of the matcher options.
145-
142+
146143
Returns:
147144
Dict: Dictionary containing information on the QuickUMLS instance.
148145
"""
@@ -372,10 +369,10 @@ def _make_token_sequences(self, parsed):
372369
for j in xrange(
373370
i + 1, min(i + self.window, len(parsed)) + 1):
374371
span = parsed[i:j]
375-
372+
376373
if not self._is_longer_than_min(span):
377374
continue
378-
375+
379376
yield (span.start_char, span.end_char, span.text)
380377

381378
def _print_verbose_status(self, parsed, matches):
@@ -396,18 +393,18 @@ def match(self, text, best_match=True, ignore_syntax=False):
396393
"""Perform UMLS concept resolution for the given string.
397394
398395
[extended_summary]
399-
396+
400397
Args:
401398
text (str): Text on which to run the algorithm
402399
403400
best_match (bool, optional): Whether to return only the top match or all overlapping candidates. Defaults to True.
404401
ignore_syntax (bool, optional): Whether to disable the heuristics introduced in the paper (Soldaini and Goharian, 2016). TODO: clarify. Defaults to False.
405-
402+
406403
Returns:
407404
List: List of all matches in the text
408405
TODO: Describe format
409406
"""
410-
407+
411408
parsed = self.nlp(u'{}'.format(text))
412409

413410
if ignore_syntax:
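The `_make_token_sequences` hunk above enumerates every span of up to `window` consecutive tokens. A standalone sketch of that enumeration over a plain list of strings follows; it is an illustration, not the project's code. The `min_length` character check is an assumed stand-in for `_is_longer_than_min`, whose body is not shown in this diff, and token indices are yielded in place of spaCy character offsets.

```python
def make_token_windows(tokens, window=5, min_length=3):
    """Yield (start, end, text) for each span of at most `window` tokens.

    Spans whose joined text is shorter than `min_length` characters are skipped
    (an assumption standing in for `_is_longer_than_min`).
    """
    for i in range(len(tokens)):
        # j is the exclusive end index; at most `window` tokens per span.
        for j in range(i + 1, min(i + window, len(tokens)) + 1):
            text = " ".join(tokens[i:j])
            if len(text) < min_length:
                continue
            yield (i, j, text)
```

For a text of n tokens this produces at most n × `window` candidate spans, which is why `window` bounds the cost of matching.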

install.py renamed to quickumls/install.py

+33-28
@@ -10,8 +10,8 @@
1010
from six.moves import input
1111

1212
# project modules
13-
from toolbox import countlines, CuiSemTypesDB, SimstringDBWriter, mkdir
14-
from constants import HEADERS_MRCONSO, HEADERS_MRSTY, LANGUAGES
13+
from .toolbox import countlines, CuiSemTypesDB, SimstringDBWriter, mkdir
14+
from .constants import HEADERS_MRCONSO, HEADERS_MRSTY, LANGUAGES
1515

1616
try:
1717
from unidecode import unidecode
@@ -116,7 +116,36 @@ def parse_and_encode_ngrams(extracted_it, simstring_dir, cuisty_dir):
116116
cuisty_db.insert(term, cui, stys, preferred)
117117

118118

119-
def driver(opts):
119+
def parse_args():
120+
ap = argparse.ArgumentParser()
121+
ap.add_argument(
122+
'umls_installation_path',
123+
help=('Location of UMLS installation files (`MRCONSO.RRF` and '
124+
'`MRSTY.RRF` files)')
125+
)
126+
ap.add_argument(
127+
'destination_path',
128+
help='Location where the necessary QuickUMLS files are installed'
129+
)
130+
ap.add_argument(
131+
'-L', '--lowercase', action='store_true',
132+
help='Consider only lowercase version of tokens'
133+
)
134+
ap.add_argument(
135+
'-U', '--normalize-unicode', action='store_true',
136+
help='Normalize unicode strings to their closest ASCII representation'
137+
)
138+
ap.add_argument(
139+
'-E', '--language', default='ENG', choices=LANGUAGES,
140+
help='Extract concepts of the specified language'
141+
)
142+
opts = ap.parse_args()
143+
return opts
144+
145+
146+
def main():
147+
opts = parse_args()
148+
120149
if not os.path.exists(opts.destination_path):
121150
msg = ('Directory "{}" does not exists; should I create it? [y/N] '
122151
''.format(opts.destination_path))
@@ -172,28 +201,4 @@ def driver(opts):
172201

173202

174203
if __name__ == '__main__':
175-
ap = argparse.ArgumentParser()
176-
ap.add_argument(
177-
'umls_installation_path',
178-
help=('Location of UMLS installation files (`MRCONSO.RRF` and '
179-
'`MRSTY.RRF` files)')
180-
)
181-
ap.add_argument(
182-
'destination_path',
183-
help='Location where the necessary QuickUMLS files are installed'
184-
)
185-
ap.add_argument(
186-
'-L', '--lowercase', action='store_true',
187-
help='Consider only lowercase version of tokens'
188-
)
189-
ap.add_argument(
190-
'-U', '--normalize-unicode', action='store_true',
191-
help='Normalize unicode strings to their closest ASCII representation'
192-
)
193-
ap.add_argument(
194-
'-E', '--language', default='ENG', choices=LANGUAGES,
195-
help='Extract concepts of the specified language'
196-
)
197-
opts = ap.parse_args()
198-
199-
driver(opts)
204+
main()
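Moving argument parsing out of the `if __name__ == '__main__':` block and into `parse_args()` and `main()` makes the installer callable as a function, which is what `python -m quickumls.install` and packaging entry points generally rely on. The sketch below mirrors the options in this diff with one deliberate difference: `parse_args` accepts an optional `argv` list, which the commit's version does not, so that it can be exercised without touching `sys.argv`. The `LANGUAGES` tuple is an illustrative subset; the real list lives in `quickumls/constants.py`.

```python
import argparse

# Illustrative subset only; the project imports LANGUAGES from its constants module.
LANGUAGES = ("ENG", "SPA", "FRE")


def parse_args(argv=None):
    """Parse installer options; `argv=None` falls back to sys.argv[1:]."""
    ap = argparse.ArgumentParser()
    ap.add_argument('umls_installation_path')
    ap.add_argument('destination_path')
    ap.add_argument('-L', '--lowercase', action='store_true')
    ap.add_argument('-U', '--normalize-unicode', action='store_true')
    ap.add_argument('-E', '--language', default='ENG', choices=LANGUAGES)
    return ap.parse_args(argv)


def main(argv=None):
    """Entry point: parse options, then hand them to the install logic."""
    opts = parse_args(argv)
    # The real installer would build the Simstring and CUI databases here.
    return opts
```

Calling `main(["/path/to/umls", "/path/to/dest", "-L"])` returns a namespace with `lowercase` set to `True` and `language` left at its default of `'ENG'`.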
File renamed without changes.

server.py renamed to quickumls/server.py

+13-9
@@ -1,12 +1,8 @@
1-
try:
2-
from quickumls import QuickUMLS
3-
from network import run_server
4-
except ImportError:
5-
from .quickumls import QuickUMLS
6-
from .MinimalServer import run_server
7-
81
from argparse import ArgumentParser
92

3+
from .core import QuickUMLS
4+
from .network import run_server
5+
106

117
def run_quickumls_server(opts):
128
matcher = QuickUMLS(
@@ -22,7 +18,7 @@ def run_quickumls_server(opts):
2218
run_server(matcher, host=opts.host, port=opts.port, buffersize=4096)
2319

2420

25-
if __name__ == '__main__':
21+
def parse_args():
2622
ap = ArgumentParser(
2723
prog='QuickUMLS server',
2824
description=(
@@ -76,5 +72,13 @@ def run_quickumls_server(opts):
7672
help='return verbose information while running'
7773
)
7874

79-
opts = ap.parse_args()
75+
return ap.parse_args()
76+
77+
78+
def main():
79+
opts = parse_args()
8080
run_quickumls_server(opts)
81+
82+
83+
if __name__ == '__main__':
84+
main()

toolbox.py renamed to quickumls/toolbox.py

+2-4
@@ -14,10 +14,8 @@
1414
import leveldb
1515

1616
# project imports
17-
try:
18-
from simstring import simstring
19-
except ImportError:
20-
from .simstring import simstring
17+
from quickumls_simstring import simstring
18+
2119

2220
# Python version specific imports
2321
if six.PY2:

requirements.txt

+1
@@ -3,3 +3,4 @@ numpy>=1.8.2
33
spacy>=1.6.0
44
unidecode>=0.4.19
55
nltk>=3.3
6+
quickumls_simstring>=1.1.5r1
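Because `MANIFEST.in` now ships `requirements.txt` in the source distribution, packaging code can read its dependency list from that file at install time. The `setup.py` from this commit is not shown here, so this is only a sketch of one common approach; `parse_requirements` is a hypothetical helper, and it handles only simple `name>=version` lines like the ones above.

```python
def parse_requirements(text):
    """Return the requirement specifiers found in requirements.txt content."""
    requirements = []
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if line:
            requirements.append(line)
    return requirements
```

The resulting list could then be passed to `setup(install_requires=...)`.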
