From 21c3c3a04114b6f1697d40f5cb11fbc2f96f9da7 Mon Sep 17 00:00:00 2001
From: Negin <58723916+Negiiiin@users.noreply.github.com>
Date: Wed, 12 Mar 2025 23:22:11 -0400
Subject: [PATCH 1/9] Update README.md

Added HF links and table of contents
---
 README.md | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index c49efa1..17d4afc 100644
--- a/README.md
+++ b/README.md
@@ -8,6 +8,19 @@
A toolkit to download, augment, and benchmark OpenPMC-VL, a large dataset of image-text pairs extracted from open-access scientific articles on PubMed Central.

+## Table of Contents
+
+1. [Hugging Face Dataset and Checkpoint](#hugging-face-dataset-and-checkpoint)
+2. [Installing Dependencies](#installing-dependencies)
+3. [Download and Parse Image-Caption Pairs](#download-and-parse-image-caption-pairs-from-pubmed-articles)
+4. [Run Benchmarking Experiments](#run-benchmarking-experiments)
+5. [References](#references)
+
+## Hugging Face Dataset and Checkpoint
+
+- **Dataset:** [Open_PMC Dataset on Hugging Face](https://huggingface.co/datasets/vector-institute/open_pmc)
+- **Checkpoint:** [Open_PMC_CLIP Model Checkpoint on Hugging Face](https://huggingface.co/vector-institute/open_pmc_clip)
+
## Installing dependencies

We use
@@ -75,7 +88,6 @@ python
**Note:** Since these submodules (`mmlearn` and `open_clip`) are only part of the main branch in a single repository, if you switch to a branch where these submodules don't exist, your Python interpreter won't be able to find these packages and you will run into errors.

-
## Download and parse image-caption pairs from PubMed Articles
The codebase used to download PubMed articles and parse image-text pairs from them is stored in `openpmcvl/foundation`.
This codebase relies heavily on the [Build PMC-OA](https://github.com/WeixiongLin/Build-PMC-OA) codebase [[1]](#1).
@@ -97,7 +109,6 @@
To download and parse open-access articles with licenses other than those mentioned above, run
python -u src/fetch_oa.py --num-retries 5 --extraction-dir path/to/download/directory/other --license-type other --volumes 0 1 2 3 4 5 6 7 8 9 10 11
```
-
## Run Benchmarking Experiments
We use `mmlearn` to run benchmarking experiments.
Many experiments can be run with our dataset and `mmlearn`.
@@ -136,8 +147,6 @@ mmlearn_run \
For more comprehensive examples of shell scripts that run various experiments with OpenPMC-VL, refer to `openpmcvl/experiment/scripts`.
For more information about `mmlearn`, please refer to the package's [official codebase](https://github.com/VectorInstitute/mmlearn).
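As a quick check that the Hugging Face artifacts linked above resolve, a minimal sketch along the following lines mirrors both repos locally. It assumes a recent `huggingface_hub` CLI, and it uses the `open-pmc`/`open-pmc-clip` repo slugs that later commits in this series settle on:

```bash
# Minimal sketch: mirror the Open-PMC dataset and checkpoint for local inspection.
# Assumes a recent huggingface_hub CLI; the slugs follow the final README links
# (earlier commits in this series link to open_pmc / open_pmc_clip instead).
pip install -U "huggingface_hub[cli]"
huggingface-cli download --repo-type dataset vector-institute/open-pmc --local-dir ./open-pmc-data
huggingface-cli download vector-institute/open-pmc-clip --local-dir ./open-pmc-clip
```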
-
-
## References
[1] PMC-OA paper:
```latex

From 7c3b12e6a561ec48691889454fe08ea6ad890a5d Mon Sep 17 00:00:00 2001
From: Negin <58723916+Negiiiin@users.noreply.github.com>
Date: Thu, 20 Mar 2025 11:49:06 -0400
Subject: [PATCH 2/9] Update README.md

---
 README.md | 88 +++++++++++++++++++++++++++++--------------------------
 1 file changed, 46 insertions(+), 42 deletions(-)

diff --git a/README.md b/README.md
index 17d4afc..d7cda16 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# OpenPMC-VL
+# OpenPMC

----------------------------------------------------------------------------------------

@@ -6,7 +6,12 @@
[![integration tests](https://github.com/VectorInstitute/aieng-template/actions/workflows/integration_tests.yml/badge.svg)](https://github.com/VectorInstitute/pmc-data-extraction/actions/workflows/integration_tests.yml)
[![license](https://img.shields.io/github/license/VectorInstitute/aieng-template.svg)](https://github.com/VectorInstitute/pmc-data-extraction/blob/main/LICENSE.md)

-A toolkit to download, augment, and benchmark OpenPMC-VL, a large dataset of image-text pairs extracted from open-access scientific articles on PubMed Central.
+A toolkit to download, augment, and benchmark OpenPMC, a large dataset of image-text pairs extracted from open-access scientific articles on PubMed Central.
+
+For more details, see the following resources:
+- **arXiv Paper:** [PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents](http://arxiv.org/abs/2503.14377)
+- **Dataset on Hugging Face:** [Open_PMC Dataset on Hugging Face](https://huggingface.co/datasets/vector-institute/open_pmc)
+- **Model Checkpoint on Hugging Face:** [Open_PMC_CLIP Model Checkpoint on Hugging Face](https://huggingface.co/vector-institute/open_pmc_clip)

## Table of Contents

@@ -29,54 +34,54 @@
for dependency management. Please make sure it is installed.
Then, follow the instructions below to set up your virtual environment.

1. Create a venv with python3.10 and activate it.
-```bash
+bash
python --version  # must print 3.10
python -m venv <path/to/venv>
source <path/to/venv>/bin/activate
-```
+
2. Navigate to the root directory of the pmc-data-extraction repository and install dependencies.
Two of the required dependencies are [mmlearn](https://github.com/VectorInstitute/mmlearn) and [open_clip](https://github.com/mlfoundations/open_clip).
-You have the option to either install them with `pip` or from source.
+You have the option to either install them with pip or from source.

-To install `mmlearn` and `open_clip` with `pip`, run
-```bash
+To install mmlearn and open_clip with pip, run
+bash
cd path/to/pmc-data-extraction
pip install --upgrade pip
poetry install --no-root --with test,open_clip,mmlearn --all-extras
-```
+
then skip to step 6: Check Installations.

-To install `mmlearn` and `open_clip` from source, run
-```bash
+To install mmlearn and open_clip from source, run
+bash
cd path/to/pmc-data-extraction
pip install --upgrade pip
poetry install --no-root --with test --all-extras
-```
-The above command assumes that you would install `mmlearn` or `open_clip` packages from source using the submodules found in `pmc-data-extraction/openpmcvl/experiment`.
-3. Clone `mmlearn` and `open_clip` submodules.
-```bash
+
+The above command assumes that you would install mmlearn or open_clip packages from source using the submodules found in pmc-data-extraction/openpmcvl/experiment.
+
+3. Clone mmlearn and open_clip submodules.
+bash
git submodule init
git submodule update
-```
-You should see the source files inside `pmc-data-extraction/openpmcvl/experiment/open_clip` and `pmc-data-extraction/openpmcvl/experiment/mmlearn`.
-4. Install `mmlearn` from source.
-```bash
+
+You should see the source files inside pmc-data-extraction/openpmcvl/experiment/open_clip and pmc-data-extraction/openpmcvl/experiment/mmlearn.
+
+4. Install mmlearn from source.
+bash
cd openpmcvl/experiment/mmlearn
python3 -m pip install -e .
-```
-5. Install `open_clip` from source.
-```bash
+
+5. Install open_clip from source.
+bash
cd ../open_clip
make install
make install-training
-```
+
6. Check installations.
-```bash
+bash
pip freeze | grep mmlearn
pip freeze | grep open_clip
python
@@ -84,36 +89,36 @@
> import open_clip
> mmlearn.__file__
> open_clip.__file__
-```
-**Note:** Since these submodules (`mmlearn` and `open_clip`) are only part of the main branch in a single repository, if you switch to a branch where these submodules don't exist, your Python interpreter won't be able to find these packages and you will run into errors.
+
+**Note:** Since these submodules (mmlearn and open_clip) are only part of the main branch in a single repository, if you switch to a branch where these submodules don't exist, your Python interpreter won't be able to find these packages and you will run into errors.

## Download and parse image-caption pairs from PubMed Articles
-The codebase used to download PubMed articles and parse image-text pairs from them is stored in `openpmcvl/foundation`.
+The codebase used to download PubMed articles and parse image-text pairs from them is stored in openpmcvl/foundation.
This codebase relies heavily on the [Build PMC-OA](https://github.com/WeixiongLin/Build-PMC-OA) codebase [[1]](#1).
To download and parse articles with licenses that allow commercial use, run
-```bash
+bash
# activate virtual environment
source /path/to/your/venv/bin/activate
# navigate to root directory of the package
cd openpmcvl/foundation
# download all 11 volumes with commercially usable license
python -u src/fetch_oa.py --num-retries 5 --extraction-dir path/to/download/directory/commercial --license-type comm --volumes 0 1 2 3 4 5 6 7 8 9 10 11
-```
+
To download and parse open-access articles which do not allow commercial use, run
-```bash
+bash
python -u src/fetch_oa.py --num-retries 5 --extraction-dir path/to/download/directory/noncommercial --license-type noncomm --volumes 1 2 3 4 5 6 7 8 9 10 11
-```
+
To download and parse open-access articles with licenses other than those mentioned above, run
-```bash
+bash
python -u src/fetch_oa.py --num-retries 5 --extraction-dir path/to/download/directory/other --license-type other --volumes 0 1 2 3 4 5 6 7 8 9 10 11
-```
+

## Run Benchmarking Experiments
-We use `mmlearn` to run benchmarking experiments.
-Many experiments can be run with our dataset and `mmlearn`.
+We use mmlearn to run benchmarking experiments.
+Many experiments can be run with our dataset and mmlearn.
A simple example of training with our dataset is given below:
-```bash
+bash
# navigate to root directory of the repository
cd pmc-data-extraction
# set pythonpath
export PYTHONPATH="./"
# run training experiment
mmlearn_run \
  'hydra.searchpath=[pkg://openpmcvl.experiment.configs]' \
  +experiment=pmcoa2_matched \
  experiment_name=pmcoa2_matched_train \
  dataloader.train.batch_size=256 \
  task.encoders.text.pretrained=False \
  task.encoders.rgb.pretrained=False
-```
+
Four downstream evaluation experiments can be run with checkpoints generated during training: cross-modal retrieval, zero-shot classification, linear probing, and patient-to-patient retrieval.
An example of cross-modal retrieval on the MIMIC-IV-CXR dataset is given below:
-```bash
+bash
mmlearn_run \
  'hydra.searchpath=[pkg://openpmcvl.experiment.configs]' \
  +experiment=pmcoa2_matched \
  experiment_name=pmcoa2_matched_retrieval_mimic \
  job_type=eval \
  datasets=mimic \
  datasets.mimic.split=test \
  +datasets/transforms@datasets.test.mimic.transform=med_clip_vision_transform \
  datasets.test.mimic.transform.job_type=eval \
  dataloader.test.batch_size=64 \
  resume_from_checkpoint="path/to/model/checkpoint"
-```
-For more comprehensive examples of shell scripts that run various experiments with OpenPMC-VL, refer to `openpmcvl/experiment/scripts`.
-For more information about `mmlearn`, please refer to the package's [official codebase](https://github.com/VectorInstitute/mmlearn).
+
+For more comprehensive examples of shell scripts that run various experiments with OpenPMC, refer to openpmcvl/experiment/scripts.
+For more information about mmlearn, please refer to the package's [official codebase](https://github.com/VectorInstitute/mmlearn).

## References
[1] PMC-OA paper:
-```latex
+latex
@article{lin2023pmc,
  title={PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents},
  author={Lin, Weixiong and Zhao, Ziheng and Zhang, Xiaoman and Wu, Chaoyi and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
  journal={arXiv preprint arXiv:2303.07240},
  year={2023}
}
-```

From e9b848d37d45261320af574a2e409b8c82b9d092 Mon Sep 17 00:00:00 2001
From: Negin <58723916+Negiiiin@users.noreply.github.com>
Date: Tue, 25 Mar 2025 13:01:11 -0400
Subject: [PATCH 3/9] Update README.md

Fixed HF URLs
---
 README.md | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index d7cda16..238bab6 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# OpenPMC
+# Open-PMC

----------------------------------------------------------------------------------------

@@ -6,12 +6,12 @@
[![integration tests](https://github.com/VectorInstitute/aieng-template/actions/workflows/integration_tests.yml/badge.svg)](https://github.com/VectorInstitute/pmc-data-extraction/actions/workflows/integration_tests.yml)
[![license](https://img.shields.io/github/license/VectorInstitute/aieng-template.svg)](https://github.com/VectorInstitute/pmc-data-extraction/blob/main/LICENSE.md)

-A toolkit to download, augment, and benchmark OpenPMC, a large dataset of image-text pairs extracted from open-access scientific articles on PubMed Central.
+A toolkit to download, augment, and benchmark Open-PMC, a large dataset of image-text pairs extracted from open-access scientific articles on PubMed Central.

For more details, see the following resources:
- **arXiv Paper:** [PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents](http://arxiv.org/abs/2503.14377)
-- **Dataset on Hugging Face:** [Open_PMC Dataset on Hugging Face](https://huggingface.co/datasets/vector-institute/open_pmc)
-- **Model Checkpoint on Hugging Face:** [Open_PMC_CLIP Model Checkpoint on Hugging Face](https://huggingface.co/vector-institute/open_pmc_clip)
+- **Dataset on Hugging Face:** [Open_PMC Dataset on Hugging Face](https://huggingface.co/datasets/vector-institute/open-pmc)
+- **Model Checkpoint on Hugging Face:** [Open_PMC_CLIP Model Checkpoint on Hugging Face](https://huggingface.co/vector-institute/open-pmc-clip)

## Table of Contents

@@ -58,18 +58,18 @@
cd path/to/pmc-data-extraction
pip install --upgrade pip
poetry install --no-root --with test --all-extras

-The above command assumes that you would install mmlearn or open_clip packages from source using the submodules found in pmc-data-extraction/openpmcvl/experiment.
+The above command assumes that you would install mmlearn or open_clip packages from source using the submodules found in pmc-data-extraction/Open-PMCvl/experiment.

3. Clone mmlearn and open_clip submodules.
bash
git submodule init
git submodule update

-You should see the source files inside pmc-data-extraction/openpmcvl/experiment/open_clip and pmc-data-extraction/openpmcvl/experiment/mmlearn.
+You should see the source files inside pmc-data-extraction/Open-PMCvl/experiment/open_clip and pmc-data-extraction/Open-PMCvl/experiment/mmlearn.

4. Install mmlearn from source.
bash
-cd openpmcvl/experiment/mmlearn
+cd Open-PMCvl/experiment/mmlearn
python3 -m pip install -e .

@@ -94,14 +94,14 @@
**Note:** Since these submodules (mmlearn and open_clip) are only part of the main branch in a single repository, if you switch to a branch where these submodules don't exist, your Python interpreter won't be able to find these packages and you will run into errors.

## Download and parse image-caption pairs from PubMed Articles
-The codebase used to download PubMed articles and parse image-text pairs from them is stored in openpmcvl/foundation.
+The codebase used to download PubMed articles and parse image-text pairs from them is stored in Open-PMCvl/foundation.
This codebase relies heavily on the [Build PMC-OA](https://github.com/WeixiongLin/Build-PMC-OA) codebase [[1]](#1).
To download and parse articles with licenses that allow commercial use, run
bash
# activate virtual environment
source /path/to/your/venv/bin/activate
# navigate to root directory of the package
-cd openpmcvl/foundation
+cd Open-PMCvl/foundation
# download all 11 volumes with commercially usable license
python -u src/fetch_oa.py --num-retries 5 --extraction-dir path/to/download/directory/commercial --license-type comm --volumes 0 1 2 3 4 5 6 7 8 9 10 11

@@ -125,7 +125,7 @@
cd pmc-data-extraction
export PYTHONPATH="./"
# run training experiment
mmlearn_run \
-  'hydra.searchpath=[pkg://openpmcvl.experiment.configs]' \
+  'hydra.searchpath=[pkg://Open-PMCvl.experiment.configs]' \
  +experiment=pmcoa2_matched \
  experiment_name=pmcoa2_matched_train \
  dataloader.train.batch_size=256 \
  task.encoders.text.pretrained=False \
  task.encoders.rgb.pretrained=False

@@ -137,7 +137,7 @@
An example of cross-modal retrieval on the MIMIC-IV-CXR dataset is given below:
bash
mmlearn_run \
-  'hydra.searchpath=[pkg://openpmcvl.experiment.configs]' \
+  'hydra.searchpath=[pkg://Open-PMCvl.experiment.configs]' \
  +experiment=pmcoa2_matched \
  experiment_name=pmcoa2_matched_retrieval_mimic \
  job_type=eval \
  datasets=mimic \
  datasets.mimic.split=test \
  +datasets/transforms@datasets.test.mimic.transform=med_clip_vision_transform \
  datasets.test.mimic.transform.job_type=eval \
  dataloader.test.batch_size=64 \
  resume_from_checkpoint="path/to/model/checkpoint"

-For more comprehensive examples of shell scripts that run various experiments with Open-PMC, refer to openpmcvl/experiment/scripts.
+For more comprehensive examples of shell scripts that run various experiments with Open-PMC, refer to Open-PMCvl/experiment/scripts.
For more information about mmlearn, please refer to the package's [official codebase](https://github.com/VectorInstitute/mmlearn).
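Since the retrieval command above expects a `resume_from_checkpoint` path, the released checkpoint can stand in for a locally trained one. A minimal sketch for fetching it follows; it assumes a recent `huggingface_hub` CLI, and because the file layout inside the repo is not documented here, it mirrors the whole repo and inspects it:

```bash
# Sketch: mirror the released Open-PMC-CLIP checkpoint repo locally.
# Assumes a recent huggingface_hub CLI (pip install "huggingface_hub[cli]").
huggingface-cli download vector-institute/open-pmc-clip --local-dir ./open-pmc-clip
# Inspect the contents to find the checkpoint file to pass as
# resume_from_checkpoint in the mmlearn_run command above.
ls -R ./open-pmc-clip
```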
## References From 830395fb0b58f4f7afd4502b5c2f0bbcfd6792b6 Mon Sep 17 00:00:00 2001 From: Negin <58723916+Negiiiin@users.noreply.github.com> Date: Tue, 25 Mar 2025 13:30:27 -0400 Subject: [PATCH 4/9] Update README.md Added citation --- README.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/README.md b/README.md index 238bab6..8ebe4cb 100644 --- a/README.md +++ b/README.md @@ -152,6 +152,17 @@ mmlearn_run \ For more comprehensive examples of shell scripts that run various experiments with Open-PMC, refer to Open-PMCvl/experiment/scripts. For more information about mmlearn, please refer to the package's [official codebase](https://github.com/VectorInstitute/mmlearn). +## Citation +If you find the code useful for your research, please consider citing +```bib +@article{baghbanzadeh2025advancing, + title={Advancing Medical Representation Learning Through High-Quality Data}, + author={Baghbanzadeh, Negin and Fallahpour, Adibvafa and Parhizkar, Yasaman and Ogidi, Franklin and Roy, Shuvendu and Ashkezari, Sajad and Khazaie, Vahid Reza and Colacci, Michael and Etemad, Ali and Afkanpour, Arash and others}, + journal={arXiv preprint arXiv:2503.14377}, + year={2025} +} +``` + ## References [1] PMC-OA paper: latex From 73b7ee006ec84e9e2ba1cc152cf49b9f9dc96b6a Mon Sep 17 00:00:00 2001 From: Negin <58723916+Negiiiin@users.noreply.github.com> Date: Tue, 25 Mar 2025 14:53:00 -0400 Subject: [PATCH 5/9] Update README.md --- README.md | 87 ++++++++++++++++++++++++++++++------------------------- 1 file changed, 48 insertions(+), 39 deletions(-) diff --git a/README.md b/README.md index 8ebe4cb..d8d4465 100644 --- a/README.md +++ b/README.md @@ -6,6 +6,12 @@ [![integration tests](https://github.com/VectorInstitute/aieng-template/actions/workflows/integration_tests.yml/badge.svg)](https://github.com/VectorInstitute/pmc-data-extraction/actions/workflows/integration_tests.yml) [![license](https://img.shields.io/github/license/VectorInstitute/aieng-template.svg)](https://github.com/VectorInstitute/pmc-data-extraction/blob/main/LICENSE.md) +
+[Figure: Open-PMC Pipeline]
+
A toolkit to download, augment, and benchmark Open-PMC, a large dataset of image-text pairs extracted from open-access scientific articles on PubMed Central.

For more details, see the following resources:
@@ -34,54 +40,54 @@
for dependency management. Please make sure it is installed.
Then, follow the instructions below to set up your virtual environment.

1. Create a venv with python3.10 and activate it.
-bash
+```bash
python --version  # must print 3.10
python -m venv <path/to/venv>
source <path/to/venv>/bin/activate
-
+```
2. Navigate to the root directory of the pmc-data-extraction repository and install dependencies.
Two of the required dependencies are [mmlearn](https://github.com/VectorInstitute/mmlearn) and [open_clip](https://github.com/mlfoundations/open_clip).
-You have the option to either install them with pip or from source.
+You have the option to either install them with `pip` or from source.

-To install mmlearn and open_clip with pip, run
-bash
+To install `mmlearn` and `open_clip` with `pip`, run
+```bash
cd path/to/pmc-data-extraction
pip install --upgrade pip
poetry install --no-root --with test,open_clip,mmlearn --all-extras
-
+```
then skip to step 6: Check Installations.

-To install mmlearn and open_clip from source, run
-bash
+To install `mmlearn` and `open_clip` from source, run
+```bash
cd path/to/pmc-data-extraction
pip install --upgrade pip
poetry install --no-root --with test --all-extras
+```
+The above command assumes that you would install `mmlearn` or `open_clip` packages from source using the submodules found in `pmc-data-extraction/openpmcvl/experiment`.
-The above command assumes that you would install mmlearn or open_clip packages from source using the submodules found in pmc-data-extraction/Open-PMCvl/experiment.
-
-3. Clone mmlearn and open_clip submodules.
-bash
+3. Clone `mmlearn` and `open_clip` submodules.
+```bash
git submodule init
git submodule update
+```
+You should see the source files inside `pmc-data-extraction/openpmcvl/experiment/open_clip` and `pmc-data-extraction/openpmcvl/experiment/mmlearn`.
-You should see the source files inside pmc-data-extraction/Open-PMCvl/experiment/open_clip and pmc-data-extraction/Open-PMCvl/experiment/mmlearn.
-
-4. Install mmlearn from source.
-bash
-cd Open-PMCvl/experiment/mmlearn
+4. Install `mmlearn` from source.
+```bash
+cd openpmcvl/experiment/mmlearn
python3 -m pip install -e .
+```
-
-5. Install open_clip from source.
-bash
+5. Install `open_clip` from source.
+```bash
cd ../open_clip
make install
make install-training
-
+```
6. Check installations.
-bash
+```bash
pip freeze | grep mmlearn
pip freeze | grep open_clip
python
@@ -89,55 +95,57 @@
> import open_clip
> mmlearn.__file__
> open_clip.__file__
+```
+**Note:** Since these submodules (`mmlearn` and `open_clip`) are only part of the main branch in a single repository, if you switch to a branch where these submodules don't exist, your Python interpreter won't be able to find these packages and you will run into errors.

-**Note:** Since these submodules (mmlearn and open_clip) are only part of the main branch in a single repository, if you switch to a branch where these submodules don't exist, your Python interpreter won't be able to find these packages and you will run into errors.
## Download and parse image-caption pairs from PubMed Articles
+The codebase used to download PubMed articles and parse image-text pairs from them is stored in `openpmcvl/foundation`.
This codebase relies heavily on the [Build PMC-OA](https://github.com/WeixiongLin/Build-PMC-OA) codebase [[1]](#1).
To download and parse articles with licenses that allow commercial use, run
-bash
+```bash
# activate virtual environment
source /path/to/your/venv/bin/activate
# navigate to root directory of the package
-cd Open-PMCvl/foundation
+cd openpmcvl/foundation
# download all 11 volumes with commercially usable license
python -u src/fetch_oa.py --num-retries 5 --extraction-dir path/to/download/directory/commercial --license-type comm --volumes 0 1 2 3 4 5 6 7 8 9 10 11
-
+```
To download and parse open-access articles which do not allow commercial use, run
-bash
+```bash
python -u src/fetch_oa.py --num-retries 5 --extraction-dir path/to/download/directory/noncommercial --license-type noncomm --volumes 1 2 3 4 5 6 7 8 9 10 11
-
+```
To download and parse open-access articles with licenses other than those mentioned above, run
-bash
+```bash
python -u src/fetch_oa.py --num-retries 5 --extraction-dir path/to/download/directory/other --license-type other --volumes 0 1 2 3 4 5 6 7 8 9 10 11
+```

## Run Benchmarking Experiments
-We use mmlearn to run benchmarking experiments.
-Many experiments can be run with our dataset and mmlearn.
+We use `mmlearn` to run benchmarking experiments.
+Many experiments can be run with our dataset and `mmlearn`.
A simple example of training with our dataset is given below:
-bash
+```bash
# navigate to root directory of the repository
cd pmc-data-extraction
# set pythonpath
export PYTHONPATH="./"
# run training experiment
mmlearn_run \
-  'hydra.searchpath=[pkg://Open-PMCvl.experiment.configs]' \
+  'hydra.searchpath=[pkg://openpmcvl.experiment.configs]' \
  +experiment=pmcoa2_matched \
  experiment_name=pmcoa2_matched_train \
  dataloader.train.batch_size=256 \
  task.encoders.text.pretrained=False \
  task.encoders.rgb.pretrained=False
-
+```
Four downstream evaluation experiments can be run with checkpoints generated during training: cross-modal retrieval, zero-shot classification, linear probing, and patient-to-patient retrieval.
An example of cross-modal retrieval on the MIMIC-IV-CXR dataset is given below:
-bash
+```bash
mmlearn_run \
-  'hydra.searchpath=[pkg://Open-PMCvl.experiment.configs]' \
+  'hydra.searchpath=[pkg://openpmcvl.experiment.configs]' \
  +experiment=pmcoa2_matched \
  experiment_name=pmcoa2_matched_retrieval_mimic \
  job_type=eval \
  datasets=mimic \
  datasets.mimic.split=test \
  +datasets/transforms@datasets.test.mimic.transform=med_clip_vision_transform \
  datasets.test.mimic.transform.job_type=eval \
  dataloader.test.batch_size=64 \
  resume_from_checkpoint="path/to/model/checkpoint"
+```
+For more comprehensive examples of shell scripts that run various experiments with Open-PMC, refer to `openpmcvl/experiment/scripts`.
+For more information about `mmlearn`, please refer to the package's [official codebase](https://github.com/VectorInstitute/mmlearn).
-
-For more comprehensive examples of shell scripts that run various experiments with Open-PMC, refer to Open-PMCvl/experiment/scripts.
-For more information about mmlearn, please refer to the package's [official codebase](https://github.com/VectorInstitute/mmlearn).
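The shell scripts referenced above typically wrap these commands for submission to a cluster scheduler; a minimal sketch of such a wrapper is shown below. The SLURM directives are placeholders, not the project's actual settings:

```bash
#!/bin/bash
#SBATCH --job-name=pmcoa2_matched_train
#SBATCH --gres=gpu:1           # placeholder resource request
#SBATCH --time=24:00:00        # placeholder time limit

# Sketch of a scheduler wrapper around the training command shown above;
# adjust the venv path and resources to your environment.
source /path/to/your/venv/bin/activate
cd pmc-data-extraction
export PYTHONPATH="./"
mmlearn_run \
  'hydra.searchpath=[pkg://openpmcvl.experiment.configs]' \
  +experiment=pmcoa2_matched \
  experiment_name=pmcoa2_matched_train \
  dataloader.train.batch_size=256
```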
## Citation
If you find the code useful for your research, please consider citing

From bab5c4c042e0488d457ff5c78eef5465b1ab0435 Mon Sep 17 00:00:00 2001
From: Negin <58723916+Negiiiin@users.noreply.github.com>
Date: Tue, 25 Mar 2025 16:13:25 -0400
Subject: [PATCH 6/9] Update README.md

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index d8d4465..ed71be4 100644
--- a/README.md
+++ b/README.md
@@ -15,9 +15,9 @@
A toolkit to download, augment, and benchmark Open-PMC, a large dataset of image-text pairs extracted from open-access scientific articles on PubMed Central.

For more details, see the following resources:
-- **arXiv Paper:** [PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents](http://arxiv.org/abs/2503.14377)
-- **Dataset on Hugging Face:** [Open_PMC Dataset on Hugging Face](https://huggingface.co/datasets/vector-institute/open-pmc)
-- **Model Checkpoint on Hugging Face:** [Open_PMC_CLIP Model Checkpoint on Hugging Face](https://huggingface.co/vector-institute/open-pmc-clip)
+- **arXiv Paper:** [Advancing Medical Representation Learning Through High-Quality Data](http://arxiv.org/abs/2503.14377)
+- **Dataset on Hugging Face:** [Hugging Face](https://huggingface.co/datasets/vector-institute/open-pmc)
+- **Model Checkpoint on Hugging Face:** [Hugging Face](https://huggingface.co/vector-institute/open-pmc-clip)

From a5a6038c8ab62ca7254944e389e4d61aa83c35b8 Mon Sep 17 00:00:00 2001
From: Negin <58723916+Negiiiin@users.noreply.github.com>
Date: Tue, 25 Mar 2025 16:22:37 -0400
Subject: [PATCH 7/9] Update README.md

---
 README.md | 30 +++++++-----------------------
 1 file changed, 7 insertions(+), 23 deletions(-)

diff --git a/README.md b/README.md
index ed71be4..f675d15 100644
--- a/README.md
+++ b/README.md
@@ -15,22 +15,16 @@
A toolkit to download, augment, and benchmark Open-PMC, a large dataset of image-text pairs extracted from open-access scientific articles on PubMed Central.

For more details, see the following resources:
-- **arXiv Paper:** [Advancing Medical Representation Learning Through High-Quality Data](http://arxiv.org/abs/2503.14377)
-- **Dataset on Hugging Face:** [Hugging Face](https://huggingface.co/datasets/vector-institute/open-pmc)
-- **Model Checkpoint on Hugging Face:** [Hugging Face](https://huggingface.co/vector-institute/open-pmc-clip)
+- **[arXiv Paper]:** [http://arxiv.org/abs/2503.14377](http://arxiv.org/abs/2503.14377)
+- **[Dataset]:** [Hugging Face](https://huggingface.co/datasets/vector-institute/open-pmc)
+- **[Model Checkpoint]:** [Hugging Face](https://huggingface.co/vector-institute/open-pmc-clip)

## Table of Contents

-1. [Hugging Face Dataset and Checkpoint](#hugging-face-dataset-and-checkpoint)
-2. [Installing Dependencies](#installing-dependencies)
-3. [Download and Parse Image-Caption Pairs](#download-and-parse-image-caption-pairs-from-pubmed-articles)
-4. [Run Benchmarking Experiments](#run-benchmarking-experiments)
-5. [References](#references)
-
-## Hugging Face Dataset and Checkpoint
-
-- **Dataset:** [Open_PMC Dataset on Hugging Face](https://huggingface.co/datasets/vector-institute/open_pmc)
-- **Checkpoint:** [Open_PMC_CLIP Model Checkpoint on Hugging Face](https://huggingface.co/vector-institute/open_pmc_clip)
+1. [Installing Dependencies](#installing-dependencies)
+2. [Download and Parse Image-Caption Pairs](#download-and-parse-image-caption-pairs-from-pubmed-articles)
+3. [Run Benchmarking Experiments](#run-benchmarking-experiments)
+4. [Citation](#citation)

## Installing dependencies

@@ -171,13 +165,3 @@
If you find the code useful for your research, please consider citing
  year={2025}
}
```
-
-## References
-[1] PMC-OA paper:
-latex
-@article{lin2023pmc,
-  title={PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents},
-  author={Lin, Weixiong and Zhao, Ziheng and Zhang, Xiaoman and Wu, Chaoyi and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
-  journal={arXiv preprint arXiv:2303.07240},
-  year={2023}
-}

From 9325acaa27588077f8bcb79282c87812a5a97daf Mon Sep 17 00:00:00 2001
From: Negin <58723916+Negiiiin@users.noreply.github.com>
Date: Tue, 25 Mar 2025 16:23:15 -0400
Subject: [PATCH 8/9] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index f675d15..7464a1f 100644
--- a/README.md
+++ b/README.md
@@ -160,7 +160,7 @@
```bib
@article{baghbanzadeh2025advancing,
  title={Advancing Medical Representation Learning Through High-Quality Data},
-  author={Baghbanzadeh, Negin and Fallahpour, Adibvafa and Parhizkar, Yasaman and Ogidi, Franklin and Roy, Shuvendu and Ashkezari, Sajad and Khazaie, Vahid Reza and Colacci, Michael and Etemad, Ali and Afkanpour, Arash and others},
+  author={Baghbanzadeh, Negin and Fallahpour, Adibvafa and Parhizkar, Yasaman and Ogidi, Franklin and Roy, Shuvendu and Ashkezari, Sajad and Khazaie, Vahid Reza and Colacci, Michael and Etemad, Ali and Afkanpour, Arash and Dolatabadi, Elham},
  journal={arXiv preprint arXiv:2503.14377},
  year={2025}
}

From 573864053c53e7227a9d41b342c9758a6f7c435f Mon Sep 17 00:00:00 2001
From: Negin <58723916+Negiiiin@users.noreply.github.com>
Date: Tue, 25 Mar 2025 16:46:16 -0400
Subject: [PATCH 9/9] Update README.md

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 7464a1f..0b0604e 100644
--- a/README.md
+++ b/README.md
@@ -15,9 +15,9 @@
A toolkit to download, augment, and benchmark Open-PMC, a large dataset of image-text pairs extracted from open-access scientific articles on PubMed Central.

For more details, see the following resources:
-- **[arXiv Paper]:** [http://arxiv.org/abs/2503.14377](http://arxiv.org/abs/2503.14377)
-- **[Dataset]:** [Hugging Face](https://huggingface.co/datasets/vector-institute/open-pmc)
-- **[Model Checkpoint]:** [Hugging Face](https://huggingface.co/vector-institute/open-pmc-clip)
+- **arXiv Paper:** [http://arxiv.org/abs/2503.14377](http://arxiv.org/abs/2503.14377)
+- **Dataset:** [https://huggingface.co/datasets/vector-institute/open-pmc](https://huggingface.co/datasets/vector-institute/open-pmc)
+- **Model Checkpoint:** [https://huggingface.co/vector-institute/open-pmc-clip](https://huggingface.co/vector-institute/open-pmc-clip)

## Table of Contents