|
1 | | -# Granular Package |
| 1 | +# **Granular Pipeline** |
| 2 | +Our goal is to create a fine-grained dataset of biomedical subfigure-subcaption pairs from a raw dataset of PMC figure-caption pairs. We assume that such a dataset (e.g. PMC-17M) is already downloaded and formatted as a directory of JSONL files plus a directory of .jpg image files. Note that every .sh script takes the JSONL numbers from the PMC dataset as arguments. |
2 | 3 |
|
3 | | -This package contains tools to extract sub-figures and sub-captions from downloaded image-caption pairs. |
4 | | -This enlarges the dataset, and may increase the quality of the data as well since the sub-pairs will be more focused and less confusing. |
| 4 | +Sample command: |
| 5 | +```bash |
| 6 | +sbatch openpmcvl/granular/pipeline/preprocess.sh 0 1 2 3 4 5 6 7 8 9 10 11 |
| 7 | +``` |
| 8 | + |
| 9 | + |
| 10 | +## **1. Preprocess** |
| 11 | +> **Code:** `preprocess.py & preprocess.sh` <br> |
| 12 | +> **Input:** Directory of figures and PMC metadata in JSONL format <br> |
| 13 | +> **Output:** Filtered figure-caption pairs in JSONL format (`${num}_meta.jsonl`) <br> |
| 14 | +
|
| 15 | +- Filter out figure-caption pairs that are not .jpg images, missing, or corrupted. |
| 16 | +- Filter for figure-caption pairs that contain target biomedical keywords. |
| 17 | + |
| 18 | +Each datapoint contains the following fields: |
| 19 | +- `id`: A unique identifier for the figure-caption pair. |
| 20 | +- `PMC_ID`: The PMC ID of the article. |
| 21 | +- `caption`: The caption of the figure. |
| 22 | +- `image_path`: The path to the image file. |
| 23 | +- `width`: The width of the image in pixels. |
| 24 | +- `height`: The height of the image in pixels. |
| 25 | +- `media_id`: The ID of the media file. |
| 26 | +- `media_url`: The URL of the media file. |
| 27 | +- `media_name`: The name of the media file. |
| 28 | +- `keywords`: The keywords found in the caption. |
| 29 | +- `is_medical`: Whether the caption contains any target biomedical keywords. |
| 30 | +<br><br> |
| 31 | + |
| 32 | +This script saves the output both as a directory of processed JSONL files and a merged JSONL file. The former is used in the next step of the pipeline. |
| 33 | +<br><br> |
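
As a rough illustration of the keyword filter described above, the sketch below fills in the `keywords` and `is_medical` fields and keeps only matching records. The keyword list and the function names are hypothetical, and the plain substring match is an assumption; the actual logic lives in `preprocess.py`.

```python
import json

# Hypothetical keyword list for illustration only; the real target
# keywords are defined by the pipeline, not here.
TARGET_KEYWORDS = ["mri", "x-ray", "histology", "ultrasound"]


def annotate_caption(record, keywords=TARGET_KEYWORDS):
    """Return a copy of `record` with `keywords` and `is_medical` filled in."""
    caption = record.get("caption", "").lower()
    # Simple case-insensitive substring match (an assumption of this sketch).
    found = [kw for kw in keywords if kw in caption]
    return {**record, "keywords": found, "is_medical": bool(found)}


def filter_jsonl_lines(lines, keywords=TARGET_KEYWORDS):
    """Parse JSONL lines and keep records whose caption matches a keyword."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = annotate_caption(json.loads(line), keywords)
        if record["is_medical"]:
            kept.append(record)
    return kept
```

In practice, `lines` would be the open file handle of one `${num}` JSONL file from the PMC dataset.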
| 34 | + |
| 35 | + |
| 36 | +## **2. Subfigure Extraction** |
| 37 | +> **Code:** `subfigure.py & subfigure.sh` <br> |
| 38 | +> **Input:** Filtered figure-caption pairs in JSONL format (`${num}_meta.jsonl`) <br> |
| 39 | +> **Output:** Directory of subfigure jpg files, and subfigure metadata in JSONL format (`${num}_subfigures.jsonl`) <br> |
| 40 | +
|
| 41 | +- Break down compound figures into subfigures. |
| 42 | +- Keep the original figure if it is not a compound figure or if an exception occurs. |
| 43 | + |
| 44 | +Each datapoint contains the following fields: |
| 45 | + |
| 46 | +When a subfigure is successfully detected and separated: |
| 47 | +- `id`: Unique identifier for the subfigure (format: {source_figure_id}_{subfigure_number}.jpg) |
| 48 | +- `source_fig_id`: ID of the original compound figure |
| 49 | +- `PMC_ID`: PMC ID of the source article |
| 50 | +- `media_name`: Original filename of the compound figure |
| 51 | +- `position`: Coordinates of subfigure bounding box [(x1,y1), (x2,y2)] |
| 52 | +- `score`: Detection confidence score |
| 53 | +- `subfig_path`: Path to saved subfigure image |
| 54 | + |
| 55 | +When subfigure extraction fails: |
| 56 | +- `id`: Generated ID that would have been used |
| 57 | +- `source_fig_id`: ID of the original figure |
| 58 | +- `PMC_ID`: PMC ID of the source article |
| 59 | +- `media_name`: Original filename |
| 60 | + |
| 61 | +This script saves extracted subfigures as .jpg files in the target directory. Metadata for each subfigure is stored in separate JSONL files, with unique IDs that link back to the original figure-caption pairs in the source JSONL files. |
| 62 | +<br><br> |
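
The two record shapes listed above can be sketched as follows. This is a minimal illustration, not the code in `subfigure.py`: the function names, the `detections` input format, the zero-based subfigure numbering, and the default `subfig_dir` are all assumptions.

```python
def build_subfigure_records(figure, detections, subfig_dir="subfigures"):
    """Build one metadata record per detected subfigure.

    `detections` is assumed to be a list of (box, score) pairs, where
    box is ((x1, y1), (x2, y2)).
    """
    records = []
    for idx, (box, score) in enumerate(detections):
        # ID format from the field list: {source_figure_id}_{subfigure_number}.jpg
        subfig_id = f"{figure['id']}_{idx}.jpg"
        records.append({
            "id": subfig_id,
            "source_fig_id": figure["id"],
            "PMC_ID": figure["PMC_ID"],
            "media_name": figure["media_name"],
            "position": [list(box[0]), list(box[1])],
            "score": score,
            "subfig_path": f"{subfig_dir}/{subfig_id}",
        })
    return records


def fallback_record(figure):
    """Reduced record for a figure whose extraction failed."""
    return {
        "id": f"{figure['id']}_0.jpg",  # the ID that would have been used (assumed index 0)
        "source_fig_id": figure["id"],
        "PMC_ID": figure["PMC_ID"],
        "media_name": figure["media_name"],
    }
```

Because both shapes share `source_fig_id`, either one can be joined back to the figure-caption pair in `${num}_meta.jsonl`.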
| 63 | + |
| 64 | + |
| 65 | +## **3. Subcaption Extraction** |
| 66 | +> **Code:** `subcaption.ipynb | subcaption.py & subcaption.sh` <br> |
| 67 | +> **Input:** PMC metadata in JSONL format <br> |
| 68 | +> **Output:** PMC metadata in JSONL format with subcaptions <br> |
| 69 | +
|
| 70 | +- Extract subcaptions from captions. |
| 71 | +- Keep original caption if the caption cannot be split into subcaptions. |
| 72 | + |
| 73 | +While this script works, it is slow because it makes API calls one at a time. The notebook (`subcaption.ipynb`) uses batch API calls to speed this up, and using it instead of the script is highly recommended. |
| 74 | +<br><br> |
| 75 | + |
| 76 | + |
| 77 | +## **4. Classification** |
| 78 | +> **Code:** `classify.py & classify.sh` <br> |
| 79 | +> **Input:** Subfigure metadata in JSONL format (`${num}_subfigures.jsonl`) <br> |
| 80 | +> **Output:** Subfigure metadata in JSONL format (`${num}_subfigures_classified.jsonl`) <br> |
| 81 | +
|
| 82 | +- Classify subfigures and include metadata about their class. |
| 83 | + |
| 84 | +The following fields are added to each datapoint: |
| 85 | +- `is_medical_subfigure`: Whether the subfigure is a medical subfigure. |
| 86 | +- `medical_class_rank`: The model's confidence in the medical classification. |
| 87 | + |
| 88 | +This script preserves all subfigures: nothing is filtered out at this step, and the two fields above only annotate each record. |
| 89 | +<br><br> |
| 90 | + |
| 91 | + |
| 92 | +## **5. Alignment** |
| 93 | +> **Code:** `align.py & align.sh` <br> |
| 94 | +> **Input:** Subfigure metadata in JSONL format (`${num}_subfigures_classified.jsonl`) <br> |
| 95 | +> **Output:** Aligned subfigure metadata in JSONL format (`${num}_aligned.jsonl`) <br> |
| 96 | +
|
| 97 | +- Find the label associated with each subfigure. |
| 98 | +- If no label is found, it means either: |
| 99 | + - The image is a standalone figure (not part of a compound figure) |
| 100 | + - The OCR model failed to detect the subfigure label (e.g. "A", "B", etc.) |
| 101 | + |
| 102 | +Non-biomedical subfigures are removed at this step. The following fields are added to each datapoint: |
| 103 | +- `label`: The label associated with the subfigure (e.g. "Subfigure-A"). |
| 104 | +- `label_position`: The position of the label in the subfigure. |
| 105 | + |
| 106 | + |
| 107 | +The outputs of steps 3 and 5 contain labeled subcaptions and labeled subfigures, respectively. Matching these labels (e.g. "Subfigure-A") yields the final subfigure-subcaption pairs. Cases where labels are missing or captions could not be split are handled in subsequent steps. Refer to the notebook for more details. |
| 108 | +<br><br> |
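
The label matching described above might look like the sketch below. It assumes the subcaptions of one figure are available as a dictionary keyed by label and that each subfigure record carries the `label` field from step 5; the function name and the output field names are hypothetical.

```python
def pair_by_label(subfigures, subcaptions):
    """Pair subfigures with subcaptions that share the same label.

    subfigures:  list of dicts with at least `id` and (optionally) `label`
    subcaptions: dict mapping a label such as "Subfigure-A" to its text
                 (an assumed layout for the step 3 output)
    Returns (pairs, unmatched_ids).
    """
    pairs, unmatched = [], []
    for subfig in subfigures:
        label = subfig.get("label")
        if label is not None and label in subcaptions:
            pairs.append({
                "subfig_id": subfig["id"],
                "label": label,
                "subcaption": subcaptions[label],
            })
        else:
            # Missing label or no matching subcaption: left for later steps.
            unmatched.append(subfig["id"])
    return pairs, unmatched
```

Returning the unmatched IDs separately keeps the cases with missing labels or unsplit captions available for the later handling mentioned above.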