Commit ae91449

Merge pull request #11 from VectorInstitute/add_extended_pipeline: Add extended pipeline

2 parents d8b5d60 + a671e51

26 files changed: +3986 -3 lines

openpmcvl/granular/README.md

Lines changed: 107 additions & 3 deletions
# **Granular Pipeline**

Our goal is to create a fine-grained dataset of biomedical subfigure-subcaption pairs from the raw dataset of PMC figure-caption pairs. This enlarges the dataset, and may also improve its quality, since the sub-pairs are more focused and less ambiguous. We assume that a dataset of PMC figure-caption pairs, e.g. PMC-17M, has already been downloaded and is formatted as a directory of JSONL files plus a directory of .jpg image files. Note that all .sh files require you to pass in the JSONL file numbers from the PMC dataset as arguments.
Sample command:

```bash
sbatch openpmcvl/granular/pipeline/preprocess.sh 0 1 2 3 4 5 6 7 8 9 10 11
```
## **1. Preprocess**
> **Code:** `preprocess.py & preprocess.sh` <br>
> **Input:** Directory of figures and PMC metadata in JSONL format <br>
> **Output:** Filtered figure-caption pairs in JSONL format (`${num}_meta.jsonl`) <br>

- Filter out figure-caption pairs whose images are missing, corrupted, or not .jpg files.
- Keep only figure-caption pairs whose captions contain target biomedical keywords.

Each datapoint contains the following fields:
- `id`: A unique identifier for the figure-caption pair.
- `PMC_ID`: The PMC ID of the article.
- `caption`: The caption of the figure.
- `image_path`: The path to the image file.
- `width`: The width of the image in pixels.
- `height`: The height of the image in pixels.
- `media_id`: The ID of the media file.
- `media_url`: The URL of the media file.
- `media_name`: The name of the media file.
- `keywords`: The target biomedical keywords found in the caption.
- `is_medical`: Whether the caption contains any target biomedical keywords.

This script saves the output both as a directory of processed JSONL files and as a single merged JSONL file. The former is used in the next step of the pipeline.
<br><br>
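A minimal sketch of the two filters above. The keyword list, the `keep` flag, and the JPEG magic-byte corruption check are illustrative assumptions; the real filtering logic lives in `preprocess.py`.

```python
from pathlib import Path

# Hypothetical keyword list for illustration; the real list lives in preprocess.py.
KEYWORDS = ("mri", "ct scan", "x-ray", "ultrasound", "histology")

def check_pair(record: dict, image_dir: Path) -> dict:
    """Apply the two preprocess filters to one figure-caption record."""
    path = image_dir / record["media_name"]
    # Filter 1: the image must be an existing .jpg file that is not corrupted.
    ok = path.suffix.lower() == ".jpg" and path.is_file()
    if ok:
        # Cheap corruption check: every JPEG starts with the SOI marker FF D8.
        with path.open("rb") as f:
            ok = f.read(2) == b"\xff\xd8"
    # Filter 2: the caption must mention at least one target biomedical keyword.
    found = [k for k in KEYWORDS if k in record.get("caption", "").lower()]
    return {**record, "image_path": str(path), "keywords": found,
            "is_medical": bool(found), "keep": ok and bool(found)}
```

Records with `keep` set to False would simply be dropped from the output JSONL.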
## **2. Subfigure Extraction**
> **Code:** `subfigure.py & subfigure.sh` <br>
> **Input:** Filtered figure-caption pairs in JSONL format (`${num}_meta.jsonl`) <br>
> **Output:** Directory of subfigure .jpg files, and subfigure metadata in JSONL format (`${num}_subfigures.jsonl`) <br>

- Break down compound figures into subfigures.
- Keep the original figure for non-compound figures, or when an exception occurs.

Each datapoint contains the following fields:

When a subfigure is successfully detected and separated:
- `id`: Unique identifier for the subfigure (format: `{source_figure_id}_{subfigure_number}.jpg`)
- `source_fig_id`: ID of the original compound figure
- `PMC_ID`: PMC ID of the source article
- `media_name`: Original filename of the compound figure
- `position`: Coordinates of the subfigure bounding box `[(x1, y1), (x2, y2)]`
- `score`: Detection confidence score
- `subfig_path`: Path to the saved subfigure image

When subfigure extraction fails:
- `id`: Generated ID that would have been used
- `source_fig_id`: ID of the original figure
- `PMC_ID`: PMC ID of the source article
- `media_name`: Original filename

This script saves extracted subfigures as .jpg files in the target directory. Metadata for each subfigure is stored in separate JSONL files, with unique IDs that link back to the original figure-caption pairs in the source JSONL files.
<br><br>
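The two record shapes above can be sketched as a single builder function. The `detection` dict's keys ("box", "score") and the "subfigures/" output directory are illustrative assumptions, not the pipeline's real names.

```python
from typing import Optional

def subfigure_record(source: dict, n: int, detection: Optional[dict]) -> dict:
    """Build one subfigure metadata record following the schema above.

    `detection` is None when extraction failed for this figure, in which
    case only the identifying fields are emitted.
    """
    record = {
        # ID format from the schema: {source_figure_id}_{subfigure_number}.jpg
        "id": f"{source['id']}_{n}.jpg",
        "source_fig_id": source["id"],
        "PMC_ID": source["PMC_ID"],
        "media_name": source["media_name"],
    }
    if detection is not None:
        (x1, y1), (x2, y2) = detection["box"]
        record["position"] = [(x1, y1), (x2, y2)]
        record["score"] = detection["score"]
        record["subfig_path"] = f"subfigures/{record['id']}"  # assumed output dir
    return record
```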
## **3. Subcaption Extraction**
> **Code:** `subcaption.ipynb | subcaption.py & subcaption.sh` <br>
> **Input:** PMC metadata in JSONL format <br>
> **Output:** PMC metadata in JSONL format with subcaptions <br>

- Extract subcaptions from captions.
- Keep the original caption if it cannot be split into subcaptions.

While this script works, it is slow because it issues API calls one at a time. The notebook (`subcaption.ipynb`) uses batch API calls to speed this up, so it is highly recommended to use the notebook instead of the script.
<br><br>
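To illustrate the splitting task, here is a deliberately simplified, regex-only sketch that handles "(A)"-style panel markers. The real pipeline delegates this to API calls and handles far more label styles; the `Subfigure-A` key format mirrors the labels used later in alignment.

```python
import re

# Matches panel markers like "(A)" or "(B)" followed by optional whitespace.
MARKER = re.compile(r"\(([A-H])\)\s*")

def split_caption(caption: str) -> dict:
    """Split a compound caption on "(A)"-style markers.

    Returns a {"Subfigure-X": text} mapping, or the whole caption under
    "full" when no markers are found (mirroring the fallback above).
    """
    parts = MARKER.split(caption)
    if len(parts) < 3:  # no markers found: keep the original caption
        return {"full": caption.strip()}
    out = {}
    preamble = parts[0].strip()  # any text before the first marker
    if preamble:
        out["preamble"] = preamble
    # re.split with a capturing group interleaves labels and segments.
    for label, text in zip(parts[1::2], parts[2::2]):
        out[f"Subfigure-{label}"] = text.strip()
    return out
```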
## **4. Classification**
> **Code:** `classify.py & classify.sh` <br>
> **Input:** Subfigure metadata in JSONL format (`${num}_subfigures.jsonl`) <br>
> **Output:** Subfigure metadata in JSONL format (`${num}_subfigures_classified.jsonl`) <br>

- Classify subfigures and include metadata about their class.

This script preserves all subfigures and adds the following fields to each datapoint:
- `is_medical_subfigure`: Whether the subfigure is a medical subfigure.
- `medical_class_rank`: The model's confidence in the medical classification.
<br><br>
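A sketch of how the two flags above could be attached to a record, assuming the classifier returns per-class scores and `medical_class_rank` is the rank of the best medical class among the model's predictions. The class taxonomy and score format are assumptions for illustration; the real logic lives in `classify.py`.

```python
def annotate_classification(record: dict, class_scores: dict,
                            medical_classes: set) -> dict:
    """Attach is_medical_subfigure / medical_class_rank to one record.

    `class_scores` maps class name -> model score; `medical_classes` names
    the classes counted as medical (both are illustrative assumptions).
    """
    # Rank classes by descending score; rank 0 is the top prediction.
    ranked = sorted(class_scores, key=class_scores.get, reverse=True)
    best_medical = min(
        (ranked.index(c) for c in medical_classes if c in ranked),
        default=None,
    )
    return {
        **record,
        "is_medical_subfigure": best_medical == 0,  # top-1 prediction is medical
        "medical_class_rank": best_medical,
    }
```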
## **5. Alignment**
> **Code:** `align.py & align.sh` <br>
> **Input:** Subfigure metadata in JSONL format (`${num}_subfigures_classified.jsonl`) <br>
> **Output:** Aligned subfigure metadata in JSONL format (`${num}_aligned.jsonl`) <br>

- Find the label associated with each subfigure.
- If no label is found, it means either:
  - The image is a standalone figure (not part of a compound figure), or
  - The OCR model failed to detect the subfigure label (e.g. "A", "B", etc.)

Non-biomedical subfigures are removed. The following fields are added to each datapoint:
- `label`: The label associated with the subfigure (e.g. "Subfigure-A").
- `label_position`: The position of the label in the subfigure.

The outputs from steps 3 and 5 contain labeled subcaptions and labeled subfigures, respectively. By matching these labels (e.g. "Subfigure-A"), we can create the final subfigure-subcaption pairs. Any cases where labels are missing or captions could not be split are handled in subsequent steps. Refer to the notebook for more details.
<br><br>
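The final matching step can be sketched as a join on the shared label. The record shapes are illustrative; only the "Subfigure-A"-style labels come from the pipeline described above.

```python
def match_pairs(subfigures: list, subcaptions: dict) -> list:
    """Join aligned subfigures to subcaptions on their shared label.

    `subfigures` holds step-5 records (each with a "label" such as
    "Subfigure-A"); `subcaptions` maps those labels to caption text from
    step 3. Records with missing or unmatched labels are skipped here and
    left for later handling, as noted above.
    """
    pairs = []
    for sub in subfigures:
        label = sub.get("label")
        if label and label in subcaptions:
            pairs.append({**sub, "subcaption": subcaptions[label]})
    return pairs
```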

openpmcvl/granular/__init__.py

Whitespace-only changes.

openpmcvl/granular/checkpoints/__init__.py

Whitespace-only changes.

openpmcvl/granular/config/__init__.py

Whitespace-only changes.
Lines changed: 34 additions & 0 deletions
1+
MODEL:
2+
TYPE: YOLOv3
3+
BACKBONE: darknet53
4+
ANCHORS: [[6, 7], [9, 10], [10, 14],
5+
[13, 11], [16, 15], [15, 20],
6+
[21, 19], [24, 24], [34, 31]]
7+
ANCH_MASK: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
8+
N_CLASSES: 15
9+
TRAIN:
10+
LR: 0.001
11+
MOMENTUM: 0.9
12+
DECAY: 0.0005
13+
BURN_IN: 1000
14+
MAXITER: 20000
15+
STEPS: (400000, 450000)
16+
BATCHSIZE: 4
17+
SUBDIVISION: 16
18+
IMGSIZE: 608
19+
LOSSTYPE: l2
20+
IGNORETHRE: 0.7
21+
AUGMENTATION:
22+
RANDRESIZE: True
23+
JITTER: 0.3
24+
RANDOM_PLACING: True
25+
HUE: 0.1
26+
SATURATION: 1.5
27+
EXPOSURE: 1.5
28+
LRFLIP: False
29+
RANDOM_DISTORT: True
30+
TEST:
31+
CONFTHRE: 0.8
32+
NMSTHRE: 0.1
33+
IMGSIZE: 416
34+
NUM_GPUS: 1

openpmcvl/granular/models/__init__.py

Whitespace-only changes.
