Commit a3469bd

Add abbreviation expansion toggle
1 parent f8d42e8 commit a3469bd

5 files changed, +22 -8 lines changed

.dockerignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+*data

.env

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# set to 1 to expand abbreviations via ALLIE (http://allie.dbcls.jp)
+
+EXPAND_ABBREVIATIONS=1

README.md

Lines changed: 11 additions & 0 deletions
@@ -40,6 +40,17 @@ It is also possible to ingest the daily update files provided by MEDLINE
 (`ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/`). **BY DEFAULT, ALL UPDATE
 FILES WILL BE APPLIED IN THIS MODE**
 
+## Abbreviation expansion
+Abbreviation expansion is done via the ALLIE (http://allie.dbcls.jp) database.
+By default, abbreviations are kept as-is from PubMed, but by changing the setting in `.env`
+to
+
+```
+EXPAND_ABBREVIATIONS=1
+```
+
+The ALLIE database will be downloaded and installed into a postgres table. As the PubMed abstracts are ingested, this database is queried and any abbreviations found within the abstract are replaced with the long form, and the result is stored within the `abstract_long_form` field.
+
 ## Caveats
 - The intended use is for testing of query logic, and the JVM options set for
 Elasticsearch are set with this in mind.
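
The replacement step described in the README text above can be sketched roughly as follows. This is a minimal illustration, not the code from `index_pubmed.py`: the function name `expand_abbreviations` and the shape of `pairs` (a list of `(short_form, long_form)` tuples, standing in for rows looked up in the ALLIE table) are assumptions made for the example.

```python
import re


def expand_abbreviations(abstract, pairs):
    """Return `abstract` with every known short form replaced by its long form.

    `pairs` is an iterable of (short_form, long_form) tuples. Hypothetical
    input shape, chosen for this sketch only.
    """
    mapping = dict(pairs)
    if not mapping:
        return abstract

    # Try longer short forms first so "TNF-alpha" wins over "TNF".
    alternatives = sorted(mapping, key=len, reverse=True)

    # Match a short form only as a whole token, never inside a longer word.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(s) for s in alternatives) + r")(?!\w)"
    )

    # A single pass avoids re-expanding text that was just inserted.
    return pattern.sub(lambda match: mapping[match.group(0)], abstract)


abstract = "TNF levels rose; TNF-alpha was unchanged in the TNFR group."
pairs = [
    ("TNF", "tumor necrosis factor"),
    ("TNF-alpha", "tumor necrosis factor alpha"),
]
record = {
    "abstract": abstract,
    "abstract_long_form": expand_abbreviations(abstract, pairs),
}
```

Here `TNFR` is left alone because the short form `TNF` is followed by another word character. Whether the real ingestion code applies the same token-boundary rule is not stated in the commit.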

docker-compose.yml

Lines changed: 2 additions & 2 deletions
@@ -30,12 +30,12 @@ services:
     build: .
     networks:
       - kmnet
-    # command: "tail -f /dev/null"
-    command: "wait-for-it -s es01:9200 -s km_postgres:5432 -- python index_pubmed.py bulk --n_min 1 --n_max 5"
+    command: "wait-for-it -s es01:9200 -s km_postgres:5432 -- python index_pubmed.py bulk --n_min 1 --n_max 1"
     depends_on:
       - es01
     environment:
       - PYTHONUNBUFFERED=1
+      - EXPAND_ABBREVIATIONS=${EXPAND_ABBREVIATIONS}
 
   postgres:
     container_name: km_postgres

index_pubmed.py

Lines changed: 5 additions & 6 deletions
@@ -13,7 +13,7 @@
 import urllib.request as urllib
 import pickle
 
-DO_ABBREVIATIONS = True
+EXPAND_ABBREVIATIONS = True if os.environ['EXPAND_ABBREVIATIONS'] == '1' else False
 
 es = Elasticsearch(['es01:9200'])
 
@@ -280,7 +280,7 @@ def get_metadata_from_xml(self, filepath):
 
         temp["metadata_update"] = datetime.datetime.now()
 
-        if DO_ABBREVIATIONS:
+        if EXPAND_ABBREVIATIONS:
             print("Checking for abbreviations")
             self.cur.execute("SELECT DISTINCT(short_form, long_form), short_form, long_form FROM alice_abbreviations WHERE pubmed_id=%(pmid)s",
                              {"pmid" : temp["PMID"]})
@@ -396,10 +396,9 @@ def main():
     parser.add_argument('--n_min', default=1, type=int, help='Minimum file number to process.')
     parser.add_argument('--n_max', default=1, type=int, help='Maximum file number to process.')
 
-
-    # TODO: pass + do in abbreviation embiggening
-    #if DO_ABBREVIATIONS:
-        #download_allie()
+    if EXPAND_ABBREVIATIONS:
+        print("Downloading ALLIE abbreviation expansion database...")
+        download_allie()
 
     if not es.indices.exists("pubmed_abstracts"):
         es.indices.create("pubmed_abstracts")
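
Note on the new `EXPAND_ABBREVIATIONS` line in `index_pubmed.py`: indexing `os.environ` directly raises `KeyError` when the variable is not set at all, for example when the script is run outside docker-compose. A more forgiving read might look like the sketch below. The helper name `env_flag` and its `environ` parameter are hypothetical and not part of this commit.

```python
import os


def env_flag(name, environ=None):
    """Return True only when the variable `name` is set to "1".

    A missing or empty variable counts as False instead of raising KeyError.
    `environ` can be any mapping; it defaults to os.environ and exists only
    to make the helper easy to test (an assumption of this sketch).
    """
    env = os.environ if environ is None else environ
    return env.get(name, "").strip() == "1"


# Mirrors the toggle from the commit, tolerating an unset variable.
EXPAND_ABBREVIATIONS = env_flag("EXPAND_ABBREVIATIONS")
```

This keeps the commit's rule that only the value `1` enables expansion. Treating an unset variable as "off" matches the README's statement that abbreviations are kept as-is by default.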
