Commit eab9b91

quick readme updates
1 parent 6aa6f8c commit eab9b91

1 file changed

+44
-1
lines changed

README.md

Lines changed: 44 additions & 1 deletion
@@ -1,6 +1,49 @@
1+
# KM Indexer
2+
3+
This workflow aims to convert MEDLINE/PubMed citation records into an indexed,
4+
text-searchable Elasticsearch database. This can be done on a subset of the
5+
available data, for search logic testing within the context of a KinderMiner
6+
application.
7+
8+
It can be run on the annual baseline files and/or the MEDLINE-provided daily
9+
updates, by providing either the `bulk` or `update` argument when invoking the
10+
`index_pubmed.py` script.
11+
12+
The indexed Elasticsearch data is persisted in a local directory that is
13+
mounted into the running container.
14+
15+
Port `9200` on the local host is forwarded to the running Elasticsearch container.
16+
17+
Documents use the PMID as the `_id` primary key.
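
To illustrate what using the PMID as `_id` implies, here is a minimal sketch that serialises records into the newline-delimited JSON expected by the Elasticsearch bulk API. The function name `to_bulk_lines` and the `pmid` and `title` field names are assumptions made for this example, not taken from `index_pubmed.py`; the index name `pubmed_abstracts` comes from the query shown in this README.

```python
import json


def to_bulk_lines(docs, index="pubmed_abstracts"):
    """Serialise documents into bulk-API NDJSON, keyed by PMID.

    Hypothetical helper: assumes each document is a dict with a "pmid" key.
    """
    lines = []
    for doc in docs:
        # Action line: "index" creates the document, or replaces an existing
        # document that has the same _id.
        action = {"index": {"_index": index, "_id": str(doc["pmid"])}}
        lines.append(json.dumps(action))
        # Source line: the document body itself.
        lines.append(json.dumps(doc))
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"


payload = to_bulk_lines([
    {"pmid": 12345, "title": "first version"},
    {"pmid": 12345, "title": "revised version"},
])
```

Because both records share one PMID, they address the same `_id`, so the second would overwrite the first rather than create a duplicate. This is likely what makes re-applying a file safe.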
18+
19+
20+
## Usage
21+
Basic usage requires Docker and Docker Compose. Create a local directory for
22+
the Elasticsearch data, then build and run the Docker images:
23+
124
```
225
mkdir es_data
326
docker-compose up --build
427
```
5-
628
Now you can run ES queries against the index from the host machine: `curl -X GET http://localhost:9200/pubmed_abstracts/_count`
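
The same count query can be issued from Python using only the standard library. This is a sketch, assuming the stack is up and reachable on `localhost:9200`; the helper names `count_url`, `parse_count` and `fetch_count` are made up for this example.

```python
import json
from urllib.request import urlopen


def count_url(index="pubmed_abstracts", host="http://localhost:9200"):
    """Build the URL of the _count endpoint for an index."""
    return f"{host}/{index}/_count"


def parse_count(body):
    """Extract the document count from a _count response body."""
    return json.loads(body)["count"]


def fetch_count(index="pubmed_abstracts", host="http://localhost:9200"):
    """Query a running Elasticsearch instance for the number of documents."""
    with urlopen(count_url(index, host)) as resp:
        return parse_count(resp.read())
```

Calling `fetch_count()` requires the containers to be running; the other two helpers work offline.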
29+
30+
### 'bulk' option
31+
Running the `index_pubmed.py` script with the `bulk` option downloads and indexes
32+
the dumps provided in the MEDLINE Annual Baseline
33+
(`https://www.nlm.nih.gov/databases/download/pubmed_medline.html`).
34+
There are two optional parameters, `n_min` and `n_max`, which are the minimum
35+
and maximum file numbers to process. By default, only the first file is fetched
36+
and processed.
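
As a rough sketch of how `n_min` and `n_max` could select files, the function below expands a range into file names. The name `baseline_file_names`, the `pubmed24n` prefix, the four-digit numbering and the inclusive upper bound are all assumptions made for this example; check the baseline listing and `index_pubmed.py` for the actual behaviour.

```python
def baseline_file_names(n_min=1, n_max=1, prefix="pubmed24n"):
    """Return the baseline file names numbered n_min through n_max inclusive.

    Hypothetical helper: the prefix and naming pattern are assumptions.
    The defaults select only the first file, as described above.
    """
    if n_min < 1 or n_max < n_min:
        raise ValueError("expected 1 <= n_min <= n_max")
    return [f"{prefix}{n:04d}.xml.gz" for n in range(n_min, n_max + 1)]


first_only = baseline_file_names()
small_subset = baseline_file_names(n_min=2, n_max=4)
```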
37+
38+
### Update files
39+
It is also possible to ingest the daily update files provided by MEDLINE
40+
(`ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/`). **BY DEFAULT, ALL UPDATE
41+
FILES WILL BE APPLIED IN THIS MODE.**
42+
43+
## Caveats
44+
- The intended use is for testing of query logic, and the JVM options for
45+
Elasticsearch are chosen with this in mind.
46+
- There is rudimentary checkpointing applied when running in the update files
47+
mode, but it is non-persistent across container restarts. This means that if you
48+
need to restart the container, the data within Elasticsearch will still be there,
49+
but the update files will all be re-downloaded and re-applied.
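
One possible way around the last caveat is to keep the checkpoint in a file on a mounted directory. This workflow does not currently do this; the code below is only a sketch, and the `Checkpoint` class and the file path are hypothetical.

```python
import json
from pathlib import Path


class Checkpoint:
    """Track processed update files in a JSON file that survives restarts."""

    def __init__(self, path):
        self.path = Path(path)
        # Load previously recorded file names, if a checkpoint already exists.
        if self.path.exists():
            self.done = set(json.loads(self.path.read_text()))
        else:
            self.done = set()

    def pending(self, names):
        """Return the names that have not been processed yet, in order."""
        return [n for n in names if n not in self.done]

    def mark_done(self, name):
        """Record a processed file and write the checkpoint back to disk."""
        self.done.add(name)
        # Write to a temporary file first, then rename it over the old one,
        # so an interrupted write does not leave a truncated checkpoint.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(sorted(self.done)))
        tmp.replace(self.path)
```

On restart, a new `Checkpoint` pointed at the same path would report only the files not yet marked as done, so only those would need to be downloaded again.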
