A toolkit to download, augment, and benchmark Open-PMC, a large dataset of image-text pairs extracted from open-access scientific articles on PubMed Central.
For more details, see the following resources:
@@ -34,110 +40,112 @@ for dependency management. Please make sure it is installed.
Then, follow below instructions to set up your virtual environment.

1. Create a venv with python3.10 and activate it.
-bash
+```bash
python --version # must print 3.10
python -m venv <your-venv-name>
source <your-venv-name>/bin/activate
-
+```

2. Navigate to the root directory of pmc-data-extraction repository and install dependencies.
Two of the required dependencies are [mmlearn](https://github.com/VectorInstitute/mmlearn) and [open_clip](https://github.com/mlfoundations/open_clip).
-You have the option to either install them with pip or from source.
+You have the option to either install them with `pip` or from source.

-To install mmlearn and open_clip with pip, run
-bash
+To install `mmlearn` and `open_clip` with `pip`, run
To install `mmlearn` and `open_clip` from source, run
+```bash
cd path/to/pmc-data-extraction
pip install --upgrade pip
poetry install --no-root --with test --all-extras
+```
+The above command assumes that you would install `mmlearn` or `open_clip` packages from source using the submodules found in `pmc-data-extraction/openpmcvl/experiment`.

-The above command assumes that you would install mmlearn or open_clip packages from source using the submodules found in pmc-data-extraction/Open-PMCvl/experiment.
-
-3. Clone mmlearn and open_clip submodules.
-bash
+3. Clone `mmlearn` and `open_clip` submodules.
+```bash
git submodule init
git submodule update
+```
+You should see the source files inside `pmc-data-extraction/openpmcvl/experiment/open_clip` and `pmc-data-extraction/openpmcvl/experiment/mmlearn`.

-You should see the source files inside pmc-data-extraction/Open-PMCvl/experiment/open_clip and pmc-data-extraction/Open-PMCvl/experiment/mmlearn.
-
-4. Install mmlearn from source.
-bash
-cd Open-PMCvl/experiment/mmlearn
+4. Install `mmlearn` from source.
+```bash
+cd openpmcvl/experiment/mmlearn
python3 -m pip install -e .
+```

-
-5. Install open_clip from source.
-bash
+5. Install `open_clip` from source.
+```bash
cd ../open_clip
make install
make install-training
-
+```

6. Check installations.
-bash
+```bash
pip freeze | grep mmlearn
pip freeze | grep open_clip
python
> import mmlearn
> import open_clip
> mmlearn.__file__
> open_clip.__file__
+```
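
The interactive check in step 6 can also be scripted. The following is a minimal sketch and not part of the repository: the helper name `locate_packages` is hypothetical, and it relies only on the standard-library `importlib.util.find_spec` to report where each package would be loaded from, without importing it.

```python
import importlib.util


def locate_packages(names):
    """Map each top-level package name to the file it would load from, or None if not found."""
    locations = {}
    for name in names:
        spec = importlib.util.find_spec(name)
        # spec is None when the package cannot be found on sys.path.
        # spec.origin is usually the path of the package's __init__.py
        # (it may be None for namespace packages).
        locations[name] = spec.origin if spec is not None else None
    return locations


if __name__ == "__main__":
    for name, origin in locate_packages(["mmlearn", "open_clip"]).items():
        print(f"{name}: {origin or 'NOT FOUND'}")
```

For an editable install such as the one in step 4, the reported path is expected to point into the corresponding submodule directory rather than into `site-packages`.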

+**Note:** Since these submodules (`mmlearn` and `open_clip`) are only part of the main branch in a single repository, if you change your branch to a branch where these submodules don't exist, your Python interpreter won't be able to find these packages and you will face errors.

-**Note:** Since these submodules (mmlearn and open_clip) are only part of the main branch in a single repository, if you change your branch to a branch where these submodules don't exist, your python interpretor won't be able to find these packages and you will face errors.

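The failure mode described in the note can be detected before importing anything. As a sketch under stated assumptions: the helper `missing_submodules` is hypothetical, the two paths are the ones named in step 3, and an uninitialized submodule is treated as a directory that is either absent or empty.

```python
from pathlib import Path

# Submodule locations relative to the repository root, as named in step 3.
SUBMODULE_DIRS = (
    "openpmcvl/experiment/mmlearn",
    "openpmcvl/experiment/open_clip",
)


def missing_submodules(repo_root, subdirs=SUBMODULE_DIRS):
    """Return the submodule paths that are absent or empty under repo_root."""
    root = Path(repo_root)
    missing = []
    for rel in subdirs:
        path = root / rel
        # An uninitialized submodule is either no directory at all or an empty one.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    # Assumes the script is run from the root of pmc-data-extraction.
    for rel in missing_submodules("."):
        print(f"{rel} is missing or empty; run `git submodule init` and `git submodule update`")
```

Running this after switching branches gives an early hint that the packages may no longer be importable from source.
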
## Download and parse image-caption pairs from Pubmed Articles
-The codebase used to download Pubmed articles and parse image-text pairs from them is stored in Open-PMCvl/foundation.
+The codebase used to download Pubmed articles and parse image-text pairs from them is stored in `openpmcvl/foundation`.
This codebase heavily relies on [Build PMC-OA](https://github.com/WeixiongLin/Build-PMC-OA) codebase[[1]](#1).
To download and parse articles with licenses that allow commercial use, run
-bash
+```bash
# activate virtual environment
source /path/to/your/venv/bin/activate
# navigate to root directory of the package
-cd Open-PMCvl/foundation
+cd openpmcvl/foundation
# download all 11 volumes with commercially usable license
Four downstream evaluation experiments can be run with checkpoints generated during training: cross-modal retrieval, zero-shot classification, linear probing, and patient-to-patient retrieval.
An example of cross-modal retrieval on the MIMIC-IV-CXR dataset is given below: