Commit 562812c

Merge pull request #7 from Georgetown-IR-Lab/v1.2
V1.2
2 parents 2f86356 + a5cfe16 commit 562812c

8 files changed

+416
-80
lines changed

README.md

+57-13
Original file line number | Diff line number | Diff line change
@@ -1,4 +1,6 @@
1-
**We recommend to download the latest tested version form the [release section](https://github.com/Georgetown-IR-Lab/QuickUMLS/releases)**.
1+
**We recommend to download the latest tested version from the [releases section](https://github.com/Georgetown-IR-Lab/QuickUMLS/releases)**.
2+
3+
**NEW: v.1.2 now includes client/server support!** Start a QuickUMLS server once, avoid loading QuickUMLS each time your experiments run! See <a href="#client_server">below</a> for more info.
24

35
# QuickUMLS
46

@@ -13,43 +15,85 @@ This project should be compatible with both Python 2 and 3 and run on any UNIX s
1315
#### Before Starting
1416

1517
1. Make sure that your Python installation includes C headers (e.g., on Ubuntu, make sure `python3-dev` or `python-dev` is installed).
16-
2. This software requires all packages listed in the requirements.txt file. You can install all of them by running `pip install -r requirements.txt`.
18+
2. This software requires all packages listed in the `requirements.txt` file. You can install all of them by running `pip install -r requirements.txt`.
1719
3. Note that, in order to use `spacy`, you are required to download its corpus. You can do that by running `python -m spacy.en.download`.
1820
4. This system requires you to have a valid UMLS installation on disk. The installation can be removed once the system has been initialized.
1921

20-
#### To get the System Running
22+
#### How to Get the System Initialized
2123

2224
1. Download and compile Simstring by running `bash setup_simstring.sh <python_version>`, where `<python_version>` is either "`2`" or "`3`".
23-
2. Initialize the system by running `python install.py <umls_installation_path> <destination_path>`, where `<umls_installation_path>` is where the installation files are (in particular, we need `MRCONSO.RRF` and `MRSTY.RRF`) and `<destination_path>` is the directory where the QuickUmls data files should be installed. This process will take between 5 and 30 minutes depending how fast is the drive where UMLS and QuickUMLS files are stored.
25+
2. Initialize the system by running `python install.py <umls_installation_path> <destination_path>`, where `<umls_installation_path>` is where the installation files are (in particular, we need `MRCONSO.RRF` and `MRSTY.RRF`) and `<destination_path>` is the directory where the QuickUMLS data files should be installed. This process takes between 5 and 30 minutes, depending on the speed of the CPU and of the drive where the UMLS and QuickUMLS files are stored (on a system with an Intel i7 6700K CPU and a 7200RPM hard drive, initialization takes 8.5 minutes).
26+
27+
`install.py` supports the following optional arguments:
28+
- `-L` / `--lowercase`: if used, all concept terms are folded to lowercase before being processed. This option typically increases recall, but it might reduce precision.
29+
- `-U` / `--normalize-unicode`: if used, expressions with non-ASCII characters are converted to the closest combination of ASCII characters.
30+
- `-E` / `--language`: specifies the language to consider for UMLS concepts; by default, English is used. For a complete list of languages, please see [this table provided by NLM](https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html#LAT).
2431

2532
## APIs
2633

2734
A QuickUMLS object can be instantiated as follows:
2835

2936
```python
30-
>>> matcher = QuickUMLS(quickumls_fp, overlapping_criteria, threshold,
31-
similarity_name, window, accepted_semtypes)
37+
matcher = QuickUMLS(quickumls_fp, overlapping_criteria, threshold,
38+
similarity_name, window, accepted_semtypes)
3239
```
3340

3441
Where:
3542

3643
- `quickumls_fp` is the directory where the QuickUMLS data files are installed.
37-
- `overlapping_criteria` (default: "score") is the criteria used to deal with overlapping concepts; choose "score" if the matching score of the concepts should be consider first, "length" if the longest should be considered first instead.
38-
- `threshold` (default: 0.7) is the minimum similarity value between strings.
39-
- `similarity_name` (default: "jaccard") is the name of similarity to use. Choose between "dice", "jaccard", "cosine", or "overlap".
40-
- `window` (default: 5) is the maximum number of tokens to consider for matching.s
41-
- `accepted_semtypes` (default: see `constants.py`) is the set of UMLS semantic types concepts should belong to. Semantic types are identified by the letter "T" followed by three numbers (e.g., "T131", which identifies the type *"Hazardous or Poisonous Substance"*). See [here](https://metamap.nlm.nih.gov/Docs/SemanticTypes_2013AA.txt) for the full list.
44+
- `overlapping_criteria` (optional, default: "score") is the criterion used to deal with overlapping concepts; choose "score" if the matching score of the concepts should be considered first, "length" if the longest should be considered first instead.
45+
- `threshold` (optional, default: 0.7) is the minimum similarity value between strings.
46+
- `similarity_name` (optional, default: "jaccard") is the name of similarity to use. Choose between "dice", "jaccard", "cosine", or "overlap".
47+
- `window` (optional, default: 5) is the maximum number of tokens to consider for matching.
48+
- `accepted_semtypes` (optional, default: see `constants.py`) is the set of UMLS semantic types concepts should belong to. Semantic types are identified by the letter "T" followed by three numbers (e.g., "T131", which identifies the type *"Hazardous or Poisonous Substance"*). See [here](https://metamap.nlm.nih.gov/Docs/SemanticTypes_2013AA.txt) for the full list.
4249

4350
To use the matcher, simply call
4451

4552
```python
46-
>>> text = "The ulna has dislocated posteriorly from the trochlea of the humerus."
47-
>>> matcher.match(text, best_match=True, ignore_syntax=False)
53+
text = "The ulna has dislocated posteriorly from the trochlea of the humerus."
54+
matcher.match(text, best_match=True, ignore_syntax=False)
4855
```
4956

5057
Set `best_match` to `False` if you want to return overlapping candidates, `ignore_syntax` to `True` to disable all heuristics introduced in (Soldaini and Goharian, 2016).
5158

5259

60+
<h2 id="client_server">[NEW] Server / Client Support</h2>
61+
62+
Starting with v.1.2, QuickUMLS includes support for use in a client-server configuration. That is, you can start one QuickUMLS server and query it from multiple scripts using a client.
63+
64+
To start the server, run `server.py`:
65+
66+
```bash
67+
python server.py /path/to/quickumls/files {-P QuickUMLS port} {-H QuickUMLS host} {QuickUMLS options}
68+
```
69+
70+
Host and port are optional; by default, QuickUMLS runs on `localhost:4645`. You can also pass any QuickUMLS option mentioned above to the server. To obtain a list of options for the server, run `python server.py -h`.
71+
72+
To load the client, import `get_quickumls_client` from `client.py`:
73+
74+
```python
75+
from client import get_quickumls_client
76+
matcher = get_quickumls_client()
77+
text = "The ulna has dislocated posteriorly from the trochlea of the humerus."
78+
matcher.match(text, best_match=True, ignore_syntax=False)
79+
```
80+
81+
The API of the client is the same as that of a QuickUMLS object.
82+
83+
84+
In case you wish to run the server in the background, you can do so as follows:
85+
86+
```bash
87+
nohup python server.py /path/to/QuickUMLS {server options} > /dev/null 2>&1 & echo $! > nohup.pid
88+
89+
```
90+
91+
When you are done, don't forget to stop the server by running:
92+
```bash
93+
kill -9 `cat nohup.pid`
94+
rm nohup.pid
95+
```
96+
5397
## References
5498

5599
- Okazaki, Naoaki, and Jun'ichi Tsujii. "*Simple and efficient algorithm for approximate dictionary matching.*" COLING 2010.
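
The `similarity_name` and `threshold` parameters described in the README diff above can be illustrated with a small, self-contained sketch. It computes the four named measures ("jaccard", "dice", "cosine", "overlap") over sets of character trigrams using their standard set-based definitions. The helper names `char_ngrams`, `similarity`, and `passes_threshold` are hypothetical, and the n-gram extraction is deliberately simplified (no padding); this is not QuickUMLS or Simstring code, and it assumes Python 3 division.

```python
import math


def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text` (simplified: no padding)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a, b, name="jaccard", n=3):
    """Set-based similarity between the character n-gram sets of two strings."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x or not y:
        # Strings shorter than n have no n-grams in this simplified sketch.
        return 0.0
    common = len(x & y)
    if name == "jaccard":
        return common / len(x | y)
    if name == "dice":
        return 2 * common / (len(x) + len(y))
    if name == "cosine":
        return common / math.sqrt(len(x) * len(y))
    if name == "overlap":
        return common / min(len(x), len(y))
    raise ValueError("unknown similarity: {}".format(name))


def passes_threshold(a, b, threshold=0.7, name="jaccard"):
    """True if the two strings are at least `threshold` similar."""
    return similarity(a, b, name) >= threshold
```

Raising `threshold` keeps only closer string matches, while lowering it admits more approximate ones; the measures differ mainly in how they normalize the number of shared n-grams.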

client.py

+12
@@ -0,0 +1,12 @@
1+
try:
2+
    from network import MinimalClient
3+
    from quickumls import QuickUMLS
4+
except ImportError:
5+
    from .network import MinimalClient
6+
    from .quickumls import QuickUMLS
7+
8+
9+
def get_quickumls_client(host='localhost', port=4645):
10+
    '''Return a client for a QuickUMLS server running on host at port'''
11+
    client = MinimalClient(QuickUMLS, host=host, port=port, buffersize=4096)
12+
    return client

constants.py

+28
@@ -45,3 +45,31 @@
4545
    u'\u3030', u'\u30a0', u'\ufe31', u'\ufe32', u'\ufe58', u'\ufe63',
4646
    u'\uff0d'
4747
}
48+
49+
LANGUAGES = {
50+
    'BAQ',  # Basque
51+
    'CHI',  # Chinese
52+
    'CZE',  # Czech
53+
    'DAN',  # Danish
54+
    'DUT',  # Dutch
55+
    'ENG',  # English
56+
    'EST',  # Estonian
57+
    'FIN',  # Finnish
58+
    'FRE',  # French
59+
    'GER',  # German
60+
    'GRE',  # Greek
61+
    'HEB',  # Hebrew
62+
    'HUN',  # Hungarian
63+
    'ITA',  # Italian
64+
    'JPN',  # Japanese
65+
    'KOR',  # Korean
66+
    'LAV',  # Latvian
67+
    'NOR',  # Norwegian
68+
    'POL',  # Polish
69+
    'POR',  # Portuguese
70+
    'RUS',  # Russian
71+
    'SCR',  # Croatian
72+
    'SPA',  # Spanish
73+
    'SWE',  # Swedish
74+
    'TUR',  # Turkish
75+
}

docs/_config.yml

-5
This file was deleted.

docs/index.md

-54
This file was deleted.

install.py

+21-8
@@ -10,7 +10,7 @@
1010

1111
# project modules
1212
from toolbox import countlines, CuiSemTypesDB, SimstringDBWriter, mkdir
13-
from constants import HEADERS_MRCONSO, HEADERS_MRSTY
13+
from constants import HEADERS_MRCONSO, HEADERS_MRSTY, LANGUAGES
1414

1515
try:
1616
    from unidecode import unidecode
@@ -29,12 +29,12 @@ def get_semantic_types(path, headers):
2929
    return sem_types
3030

3131

32-
def get_mrconso_iterator(path, headers):
32+
def get_mrconso_iterator(path, headers, lang='ENG'):
3333
    with codecs.open(path, encoding='utf-8') as f:
3434
        for i, ln in enumerate(f):
3535
            content = dict(zip(headers, ln.strip().split('|')))
3636

37-
            if content['lat'] != 'ENG':
37+
            if content['lat'] != lang:
3838
                continue
3939

4040
            yield content
@@ -52,19 +52,23 @@ def extract_from_mrconso(
5252

5353
    start = time.time()
5454

55-
    mrconso_iterator = get_mrconso_iterator(mrconso_path, mrconso_header)
55+
    mrconso_iterator = get_mrconso_iterator(
56+
        mrconso_path, mrconso_header, opts.language
57+
    )
5658

5759
    total = countlines(mrconso_path)
5860

5961
    processed = set()
62+
    i = 0
6063

61-
    for i, content in enumerate(mrconso_iterator, start=1):
64+
    for content in mrconso_iterator:
65+
        i += 1
6266

6367
        if i % 100000 == 0:
6468
            delta = time.time() - start
6569
            status = (
6670
                '{:,} in {:.2f} s ({:.2%}, {:.1e} s / term)'
67-
                ''.format(i, delta, i / total, delta / i)
71+
                ''.format(i, delta, i / total, delta / i if i > 0 else 0)
6872
            )
6973
            print(status)
7074

@@ -85,6 +89,13 @@ def extract_from_mrconso(
8589

8690
        yield (concept_text, cui, sem_types[cui], preferred)
8791

92+
    delta = time.time() - start
93+
    status = (
94+
        '\nCOMPLETED: {:,} in {:.2f} s ({:.1e} s / term)'
95+
        # three fields, three arguments: count, elapsed time, time per term
        ''.format(i, delta, delta / i if i > 0 else 0)
96+
    )
97+
    print(status)
98+
8899

89100
def parse_and_encode_ngrams(extracted_it, simstring_dir, cuisty_dir):
90101
    # Create destination directories for the two databases
@@ -154,8 +165,6 @@ def driver(opts):
154165

155166
    parse_and_encode_ngrams(mrconso_iterator, simstring_dir, cuisty_dir)
156167

157-
    print('Completed!')
158-
159168

160169
if __name__ == '__main__':
161170
    ap = argparse.ArgumentParser()
@@ -176,6 +185,10 @@ def driver(opts):
176185
        '-U', '--normalize-unicode', action='store_true',
177186
        help='Normalize unicode strings to their closest ASCII representation'
178187
    )
188+
    ap.add_argument(
189+
        '-E', '--language', default='ENG', choices=LANGUAGES,
190+
        help='Extract concepts of the specified language'
191+
    )
179192
    opts = ap.parse_args()
180193

181194
    driver(opts)
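
The language option added in this diff combines two pieces: `argparse` restricts `-E` / `--language` to a known set of codes, and the iterator skips rows whose `lat` field differs from the chosen code. The following is a minimal sketch of that interaction on in-memory data, not the project's code. The three-column `HEADERS` list, the reduced `LANGUAGES` set, the helper names `iter_rows` and `parse_language`, and the sample rows (including the `CUI_A` / `CUI_B` identifiers) are all assumptions made up for illustration; the real `MRCONSO.RRF` has many more fields.

```python
import argparse

# Assumption: a small subset of the language codes, for illustration only.
LANGUAGES = {'ENG', 'FRE', 'GER', 'SPA'}

# Assumption: hypothetical three-column header, not the real MRCONSO layout.
HEADERS = ['cui', 'lat', 'str']


def iter_rows(lines, headers, lang='ENG'):
    """Yield pipe-delimited rows as dicts, keeping only rows in `lang`."""
    for ln in lines:
        content = dict(zip(headers, ln.strip().split('|')))
        if content['lat'] != lang:
            continue
        yield content


def parse_language(argv):
    """Parse -E / --language; argparse rejects codes outside LANGUAGES."""
    ap = argparse.ArgumentParser()
    ap.add_argument(
        '-E', '--language', default='ENG', choices=sorted(LANGUAGES),
        help='Extract concepts of the specified language'
    )
    return ap.parse_args(argv).language


# Made-up sample rows with placeholder identifiers.
sample = [
    'CUI_A|ENG|humerus',
    'CUI_A|FRE|humérus',
    'CUI_B|ENG|ulna',
]

french = list(iter_rows(sample, HEADERS, parse_language(['-E', 'FRE'])))
english = list(iter_rows(sample, HEADERS, parse_language([])))
```

With no `-E` argument the default `'ENG'` applies, so the behavior matches the hard-coded English filter that the diff replaces. An unknown code makes `argparse` exit with a usage error before any file is read.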
