# KM Indexer

This workflow converts MEDLINE/PubMed citation records into an indexed,
text-searchable Elasticsearch database. It can be run on a subset of the
available data to test search logic within the context of a KinderMiner
application.

It can be run on the annual baseline files and/or the MEDLINE-provided daily
updates by passing either the `bulk` or `update` argument when invoking the
`index_pubmed.py` script.

The indexed Elasticsearch data is persisted in a local directory mounted into
the running container.

Port `9200` on the local host is forwarded to the running Elasticsearch
container.

Documents use the PMID as the `_id` primary key.

## Usage
Basic usage requires Docker and Docker Compose. Create a local directory for
the Elasticsearch data, then build and run the Docker images:

```
mkdir es_data
docker-compose up --build
```

You can now run Elasticsearch queries from the host machine: `curl -X GET http://localhost:9200/pubmed_abstracts/_count`
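
The same count query can also be issued from Python. The following is a minimal sketch using only the standard library. The helper names `count_url` and `fetch_count` are illustrative and not part of this repository, and the sketch assumes the stack is running and the `pubmed_abstracts` index exists.

```python
import json
from urllib.request import urlopen


def count_url(host="localhost", port=9200, index="pubmed_abstracts"):
    """Build the URL of the Elasticsearch _count endpoint for an index."""
    return f"http://{host}:{port}/{index}/_count"


def fetch_count(url, timeout=10):
    """Return the document count reported by the _count endpoint."""
    with urlopen(url, timeout=timeout) as response:
        body = json.load(response)
    # The _count response is a JSON object with a top-level "count" field.
    return body["count"]
```

Calling `fetch_count(count_url())` against a running instance should return the number of indexed documents as an integer.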

### 'bulk' option
Running the `index_pubmed.py` script with the `bulk` option downloads and
indexes the dumps provided in the MEDLINE Annual Baseline
(`https://www.nlm.nih.gov/databases/download/pubmed_medline.html`).
There are two optional parameters, `n_min` and `n_max`, which are the minimum
and maximum file numbers to process. By default, only the first file is fetched
and processed.
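
To make the effect of `n_min` and `n_max` concrete, the sketch below lists the baseline file names that a given range would cover. It assumes the range is inclusive, that both values default to the first file, and that files follow the `pubmedYYnNNNN.xml.gz` naming pattern of the baseline directory. The helper name `baseline_filenames` and the example `prefix` value are hypothetical, and `index_pubmed.py` may build the names differently.

```python
def baseline_filenames(n_min=1, n_max=1, prefix="pubmed24n"):
    """List baseline file names for file numbers n_min..n_max (inclusive).

    Assumptions: the range is inclusive and file numbers are zero-padded to
    four digits. The prefix is only an example; it changes with each annual
    baseline release.
    """
    if n_min < 1 or n_max < n_min:
        raise ValueError("expected 1 <= n_min <= n_max")
    return [f"{prefix}{n:04d}.xml.gz" for n in range(n_min, n_max + 1)]
```

With the defaults this yields a single name, matching the behaviour of processing only the first file. A call such as `baseline_filenames(2, 4)` yields three names.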

### Update files
It is also possible to ingest the daily update files provided by MEDLINE
(`ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/`). **By default, all update
files are applied in this mode.**

## Caveats
- The intended use is testing query logic, and the JVM options for
  Elasticsearch are set with this in mind.
- Rudimentary checkpointing is applied when running in the update files
  mode, but it does not persist across container restarts. This means that if
  you need to restart the container, the data within Elasticsearch will still
  be there, but all the update files will be downloaded and applied again.
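
One way to avoid re-applying every update file after a restart would be to keep the checkpoint in a file under a mounted directory. The sketch below is not how `index_pubmed.py` currently works. The helpers `load_done`, `mark_done`, and `pending` are hypothetical, as is the idea of storing the checkpoint as a JSON list of file names.

```python
import json
from pathlib import Path


def load_done(path):
    """Return the set of update file names recorded in the checkpoint file."""
    path = Path(path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def mark_done(path, filename):
    """Record one processed update file in the checkpoint file."""
    path = Path(path)
    done = load_done(path)
    done.add(filename)
    # Write to a temporary file first, then rename it over the old
    # checkpoint so a crash mid-write does not leave a truncated file.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(sorted(done)))
    tmp.replace(path)


def pending(path, available):
    """Return the names in `available` not yet recorded, in sorted order."""
    done = load_done(path)
    return sorted(name for name in available if name not in done)
```

If the checkpoint path pointed into a directory mounted from the host, as `es_data` is, the record of applied files would survive a restart and only the files returned by `pending` would need to be fetched.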