Commit 73b7ee0

Update README.md
1 parent 830395f commit 73b7ee0

1 file changed

+48
-39
lines changed

README.md

Lines changed: 48 additions & 39 deletions
@@ -6,6 +6,12 @@
66
[![integration tests](https://github.com/VectorInstitute/aieng-template/actions/workflows/integration_tests.yml/badge.svg)](https://github.com/VectorInstitute/pmc-data-extraction/actions/workflows/integration_tests.yml)
77
[![license](https://img.shields.io/github/license/VectorInstitute/aieng-template.svg)](https://github.com/VectorInstitute/pmc-data-extraction/blob/main/LICENSE.md)
88

9+
<div align="center">
10+
<img src="https://github.com/VectorInstitute/pmc-data-extraction/blob/0a969136344a07267bb558d01f3fe76b36b93e1a/media/open-pmc-pipeline.png?raw=true"
11+
alt="Open-PMC Pipeline"
12+
width="1000" />
13+
</div>
14+
915
A toolkit to download, augment, and benchmark Open-PMC, a large dataset of image-text pairs extracted from open-access scientific articles on PubMed Central.
1016

1117
For more details, see the following resources:
@@ -34,110 +40,112 @@ for dependency management. Please make sure it is installed.
3440
Then, follow the instructions below to set up your virtual environment.
3541

3642
1. Create a venv with python3.10 and activate it.
37-
bash
43+
```bash
3844
python --version # must print 3.10
3945
python -m venv <your-venv-name>
4046
source <your-venv-name>/bin/activate
41-
47+
```
4248

4349
2. Navigate to the root directory of the pmc-data-extraction repository and install dependencies.
4450
Two of the required dependencies are [mmlearn](https://github.com/VectorInstitute/mmlearn) and [open_clip](https://github.com/mlfoundations/open_clip).
45-
You have the option to either install them with pip or from source.
51+
You have the option to either install them with `pip` or from source.
4652

47-
To install mmlearn and open_clip with pip, run
48-
bash
53+
To install `mmlearn` and `open_clip` with `pip`, run
54+
```bash
4955
cd path/to/pmc-data-extraction
5056
pip install --upgrade pip
5157
poetry install --no-root --with test,open_clip,mmlearn --all-extras
52-
58+
```
5359
then skip to step 6: Check Installations.
5460

55-
To install mmlearn and open_clip from source, run
56-
bash
61+
To install `mmlearn` and `open_clip` from source, run
62+
```bash
5763
cd path/to/pmc-data-extraction
5864
pip install --upgrade pip
5965
poetry install --no-root --with test --all-extras
66+
```
67+
The above command assumes that you will install the `mmlearn` and `open_clip` packages from source using the submodules found in `pmc-data-extraction/openpmcvl/experiment`.
6068

61-
The above command assumes that you would install mmlearn or open_clip packages from source using the submodules found in pmc-data-extraction/Open-PMCvl/experiment.
62-
63-
3. Clone mmlearn and open_clip submodules.
64-
bash
69+
3. Clone `mmlearn` and `open_clip` submodules.
70+
```bash
6571
git submodule init
6672
git submodule update
73+
```
74+
You should see the source files inside `pmc-data-extraction/openpmcvl/experiment/open_clip` and `pmc-data-extraction/openpmcvl/experiment/mmlearn`.
6775

68-
You should see the source files inside pmc-data-extraction/Open-PMCvl/experiment/open_clip and pmc-data-extraction/Open-PMCvl/experiment/mmlearn.
69-
70-
4. Install mmlearn from source.
71-
bash
72-
cd Open-PMCvl/experiment/mmlearn
76+
4. Install `mmlearn` from source.
77+
```bash
78+
cd openpmcvl/experiment/mmlearn
7379
python3 -m pip install -e .
80+
```
7481

75-
76-
5. Install open_clip from source.
77-
bash
82+
5. Install `open_clip` from source.
83+
```bash
7884
cd ../open_clip
7985
make install
8086
make install-training
81-
87+
```
8288

8389
6. Check installations.
84-
bash
90+
```bash
8591
pip freeze | grep mmlearn
8692
pip freeze | grep open_clip
8793
python
8894
> import mmlearn
8995
> import open_clip
9096
> mmlearn.__file__
9197
> open_clip.__file__
98+
```
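The checks in step 6 can also be scripted. The following is a minimal sketch and not part of the repository: the function name `check_environment` is hypothetical, and it uses only the standard library to confirm the Python 3.10 requirement from step 1 and to report where `mmlearn` and `open_clip` would be imported from.

```python
import importlib.util
import sys


def check_environment(packages=("mmlearn", "open_clip"), required=(3, 10)):
    """Return a report of the interpreter version and package locations.

    `check_environment` is a hypothetical helper, not part of the repository.
    """
    # True only when the running interpreter matches the required major.minor.
    report = {"python_ok": sys.version_info[:2] == tuple(required)}
    for name in packages:
        # find_spec returns None when a top-level package cannot be found.
        spec = importlib.util.find_spec(name)
        # origin is the path of the package's __init__.py when it is a regular package.
        report[name] = spec.origin if spec is not None else None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A value of `None` for a package suggests it is not installed in the active environment, or that the submodule directory exists but contains no source files.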
9299

100+
**Note:** The `mmlearn` and `open_clip` submodules exist only on the main branch of this repository. If you switch to a branch where these submodules don't exist, your Python interpreter won't be able to find these packages and you will face errors.
93101

94-
**Note:** Since these submodules (mmlearn and open_clip) are only part of the main branch in a single repository, if you change your branch to a branch where these submodules don't exist, your python interpretor won't be able to find these packages and you will face errors.
95102

96103
## Download and parse image-caption pairs from Pubmed Articles
97-
The codebase used to download Pubmed articles and parse image-text pairs from them is stored in Open-PMCvl/foundation.
104+
The codebase used to download PubMed articles and parse image-text pairs from them is stored in `openpmcvl/foundation`.
98105
This codebase relies heavily on the [Build PMC-OA](https://github.com/WeixiongLin/Build-PMC-OA) codebase [[1]](#1).
99106
To download and parse articles with licenses that allow commercial use, run
100-
bash
107+
```bash
101108
# activate virtual environment
102109
source /path/to/your/venv/bin/activate
103110
# navigate to root directory of the package
104-
cd Open-PMCvl/foundation
111+
cd openpmcvl/foundation
105112
# download all 11 volumes with commercially usable license
106113
python -u src/fetch_oa.py --num-retries 5 --extraction-dir path/to/download/directory/commercial --license-type comm --volumes 0 1 2 3 4 5 6 7 8 9 10 11
107-
114+
```
108115
To download and parse open-access articles whose licenses do not allow commercial use, run
109-
bash
116+
```bash
110117
python -u src/fetch_oa.py --num-retries 5 --extraction-dir path/to/download/directory/noncommercial --license-type noncomm --volumes 1 2 3 4 5 6 7 8 9 10 11
111-
118+
```
112119
To download and parse open-access articles with licenses other than those mentioned above, run
113-
bash
120+
```bash
114121
python -u src/fetch_oa.py --num-retries 5 --extraction-dir path/to/download/directory/other --license-type other --volumes 0 1 2 3 4 5 6 7 8 9 10 11
122+
```
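The three download commands above differ only in the license type, the output directory, and the volume list. As an illustration, the sketch below assembles them in Python without running anything. The helper name `build_fetch_command` and the `JOBS` table are hypothetical. The flags and volume lists are copied from the commands above, and the commands are assumed to be run from `openpmcvl/foundation`.

```python
import shlex


def build_fetch_command(license_type, extraction_dir, volumes, num_retries=5):
    """Assemble the argument list for src/fetch_oa.py (hypothetical helper)."""
    return [
        "python", "-u", "src/fetch_oa.py",
        "--num-retries", str(num_retries),
        "--extraction-dir", extraction_dir,
        "--license-type", license_type,
        "--volumes", *[str(v) for v in volumes],
    ]


# License type -> (output directory, volumes), mirroring the README commands.
JOBS = {
    "comm": ("path/to/download/directory/commercial", range(0, 12)),
    "noncomm": ("path/to/download/directory/noncommercial", range(1, 12)),
    "other": ("path/to/download/directory/other", range(0, 12)),
}

if __name__ == "__main__":
    for license_type, (out_dir, volumes) in JOBS.items():
        # Print each command so it can be inspected before being executed.
        print(shlex.join(build_fetch_command(license_type, out_dir, volumes)))
```

Each printed line can be pasted into a shell, or the list can be passed to `subprocess.run` once the paths have been replaced with real directories.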
115123

116124

117125
## Run Benchmarking Experiments
118-
We use mmlearn to run benchmarking experiments.
119-
Many experiments can be run with our dataset and mmlearn.
126+
We use `mmlearn` to run benchmarking experiments.
127+
Many experiments can be run with our dataset and `mmlearn`.
120128
A simple example of training with our dataset is given below:
121-
bash
129+
```bash
122130
# navigate to root directory of the repository
123131
cd pmc-data-extraction
124132
# set pythonpath
125133
export PYTHONPATH="./"
126134
# run training experiment
127135
mmlearn_run \
128-
'hydra.searchpath=[pkg://Open-PMCvl.experiment.configs]' \
136+
'hydra.searchpath=[pkg://openpmcvl.experiment.configs]' \
129137
+experiment=pmcoa2_matched \
130138
experiment_name=pmcoa2_matched_train \
131139
dataloader.train.batch_size=256 \
132140
task.encoders.text.pretrained=False \
133141
task.encoders.rgb.pretrained=False
134-
142+
```
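The training command above is `mmlearn_run` followed by a list of Hydra overrides. The sketch below shows one way to build that list in Python so that values such as the batch size can be varied. The helper `build_train_command` is hypothetical. The override keys are copied from the command above, and `PYTHONPATH` is assumed to be set to the repository root as shown there.

```python
import shlex


def build_train_command(batch_size=256, experiment="pmcoa2_matched", pretrained=False):
    """Assemble the mmlearn_run training command (hypothetical helper)."""
    overrides = [
        # Tells Hydra where to find the experiment configs shipped with openpmcvl.
        "hydra.searchpath=[pkg://openpmcvl.experiment.configs]",
        f"+experiment={experiment}",
        f"experiment_name={experiment}_train",
        f"dataloader.train.batch_size={batch_size}",
        f"task.encoders.text.pretrained={pretrained}",
        f"task.encoders.rgb.pretrained={pretrained}",
    ]
    return ["mmlearn_run", *overrides]


if __name__ == "__main__":
    # shlex.join quotes the bracketed searchpath override for safe shell use.
    print(shlex.join(build_train_command(batch_size=128)))
```

Building the command as a list keeps each override a separate argument, so the bracketed search path needs no manual quoting when passed to `subprocess.run`.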
135143

136144
Four downstream evaluation experiments can be run with checkpoints generated during training: cross-modal retrieval, zero-shot classification, linear probing, and patient-to-patient retrieval.
137145
An example of cross-modal retrieval on the MIMIC-IV-CXR dataset is given below:
138-
bash
146+
```bash
139147
mmlearn_run \
140-
'hydra.searchpath=[pkg://Open-PMCvl.experiment.configs]' \
148+
'hydra.searchpath=[pkg://openpmcvl.experiment.configs]' \
141149
+experiment=pmcoa2_matched \
142150
experiment_name=pmcoa2_matched_retrieval_mimic \
143151
job_type=eval \
@@ -148,9 +156,10 @@ mmlearn_run \
148156
datasets.test.mimic.transform.job_type=eval \
149157
dataloader.test.batch_size=64 \
150158
resume_from_checkpoint="path/to/model/checkpoint"
159+
```
160+
For more comprehensive examples of shell scripts that run various experiments with Open-PMC, refer to `openpmcvl/experiment/scripts`.
161+
For more information about `mmlearn`, please refer to the package's [official codebase](https://github.com/VectorInstitute/mmlearn).
151162

152-
For more comprehensive examples of shell scripts that run various experiments with Open-PMC, refer to Open-PMCvl/experiment/scripts.
153-
For more information about mmlearn, please refer to the package's [official codebase](https://github.com/VectorInstitute/mmlearn).
154163

155164
## Citation
156165
If you find the code useful for your research, please consider citing
