PMC-LaMP: PubMed Central Language Model Pipeline

PMC-LaMP is a tool for creating custom retrieval-augmented generation (RAG) chatbots based on PubMed Central medical literature. The pipeline lets you search for scientific articles on a specific topic, download them, build a vector index, and run a local chatbot that answers questions based on that literature.

Prerequisites

Python 3.8+
Linux/Unix bash environment
Internet connection
At least 8GB of RAM
Available disk space proportional to the number of articles (approximately 1GB per 10,000 articles)

Important Note for Windows Users

This tool requires a bash environment to run. Windows users must use one of the following options:

Install Windows Subsystem for Linux (WSL) - recommended
Use a Linux virtual machine
Use Git Bash (may have limited functionality)

All commands in this README assume a Linux/Unix bash environment.

Installation

Clone this repository:

git clone https://github.com/yourusername/PMC-LaMP.git
cd PMC-LaMP

Create and activate a virtual environment using uv (uv must already be installed):
```
uv venv
source .venv/bin/activate
```
Install the required dependencies:
```
uv pip install -r requirements.txt
```
Set up environment variables by creating a .env file in the project root directory with the following content:
```
SERVER_IP=127.0.0.1
```
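
For reference, the snippet below is a minimal sketch of how a KEY=VALUE file such as .env can be read with only the Python standard library. It is an assumption that the application consumes SERVER_IP in a similar way; the project may rely on a helper library instead, and `load_env_file` is a hypothetical name that does not come from this repository.

```python
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


if __name__ == "__main__":
    settings = load_env_file()
    # Falling back to localhost is an assumed default, matching the example above.
    print(settings.get("SERVER_IP", "127.0.0.1"))
```

Running this from the project root is a quick way to confirm that the file is where you expect it and that the value has no stray whitespace or quotes.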

One-Command Setup and Run (Recommended)

The easiest way to use PMC-LaMP is with our interactive script that guides you through the entire process:

python guided_pmc_lamp.py

This interactive script will:

Check your environment and dependencies
Ask for your medical topic keyword
Guide you through obtaining PMCIDs
Download the articles with real-time progress display
Generate the FAISS index with progress tracking
Configure the application
Start the chatbot servers

You can also provide a topic keyword directly:

python guided_pmc_lamp.py --keyword "crohn's"
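
Keywords that contain an apostrophe or spaces need quoting, because bash otherwise treats the apostrophe as the start of a quoted string. If you build the command from another Python program, the sketch below shows one way to quote the keyword safely with the standard library. `build_guided_command` is a hypothetical helper, not part of this repository; only the `--keyword` option comes from the README.

```python
import shlex


def build_guided_command(keyword):
    """Return a shell-safe command line for the guided script."""
    # shlex.quote leaves plain words untouched and quotes everything else.
    return "python guided_pmc_lamp.py --keyword " + shlex.quote(keyword)


if __name__ == "__main__":
    print(build_guided_command("crohn's"))
```

The printed command can be pasted into a bash prompt as-is.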

User-Friendly Features

Real-time progress tracking - See exactly how many articles are downloading and indexing
Percentage completion - Track overall progress during lengthy operations
Interactive prompts - Guided experience with clear instructions
Error handling - Helpful error messages if something goes wrong
Smart defaults - Automatic detection and use of existing files

The script will walk you through each step and handle all the technical details automatically.

Manual Pipeline Workflow

For advanced users who prefer to run each step manually, follow the detailed pipeline below:

Step 1: Collect PMCIDs for your topic

Go to https://pmc.ncbi.nlm.nih.gov/ and search for your topic
On the left column, apply filters:
- Select the 'Open Access' filter under article attributes
- Optionally, limit the publication date range
Export the results:
- Click 'Send to:' (located under the right side of the search bar)
- Under 'Choose Destination', select 'File'
- In the 'Format' dropdown, select 'PMCID List'
- Click the 'Create File' button to download the file
Create a directory called pmcids in the project root if it doesn't exist
Move the downloaded file to the pmcids directory and rename it if desired (format: {keyword}_pmc_result.txt)
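
Before starting a long download, it can help to sanity-check the exported list. The following is a minimal sketch that assumes the file contains one identifier per line in the form `PMC` followed by digits; that layout and the helper name `read_pmcids` are assumptions for illustration rather than something this repository defines.

```python
import re
from pathlib import Path

# Assumed identifier shape: the prefix "PMC" followed by one or more digits.
PMCID_PATTERN = re.compile(r"^PMC\d+$")


def read_pmcids(path):
    """Return unique, well-formed PMCIDs from a text file, in file order."""
    seen = set()
    pmcids = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        candidate = line.strip()
        # Ignore blank lines, malformed entries, and repeated identifiers.
        if PMCID_PATTERN.match(candidate) and candidate not in seen:
            seen.add(candidate)
            pmcids.append(candidate)
    return pmcids
```

Calling `len(read_pmcids(...))` on your file in the pmcids directory gives the number of articles the download step will attempt, which you can compare with the disk space estimate in the prerequisites.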

Step 2: Download Articles from PubMed Central

Run the article download script with your PMCID list:

bash fetch_pmc_articles.sh pmcids/{keyword}_pmc_result.txt

This downloads the articles in BioC JSON format to the fulltext_articles/{keyword}_pmc_articles/ directory.
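
To inspect a downloaded article, the sketch below pulls the passage text out of a BioC JSON file. It assumes the usual BioC layout of a collection holding `documents`, each with a list of `passages` that carry a `text` field, and it also accepts a collection wrapped in a list. The function names are hypothetical, and the indexing script in this repository may extract text differently.

```python
import json
from pathlib import Path


def extract_passages(bioc):
    """Collect non-empty passage texts from a parsed BioC JSON object."""
    # Some files wrap the collection in a list; normalize to a list.
    collections = bioc if isinstance(bioc, list) else [bioc]
    texts = []
    for collection in collections:
        for document in collection.get("documents", []):
            for passage in document.get("passages", []):
                text = passage.get("text", "").strip()
                if text:
                    texts.append(text)
    return texts


def load_article_text(path):
    """Read one downloaded article file and join its passages."""
    with Path(path).open(encoding="utf-8") as handle:
        return "\n\n".join(extract_passages(json.load(handle)))
```

If `load_article_text` returns an empty string for a file, that download is likely incomplete or the article has no full text available.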

Step 3: Generate FAISS Index

Generate a vector index from the downloaded articles:

python index_generator.py --document_path fulltext_articles/{keyword}_pmc_articles/ --input_type json

Optional parameters:

--max_files: Maximum number of files to process (default: 250000)
--group_size: Number of documents to process per group (default: 1000)
--chunk_size: Size of text chunks for indexing (default: 1000)
--chunk_overlap: Overlap between chunks (default: 20)

The index will be saved to indexes/faiss_index/.
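
To make the chunking and grouping parameters concrete, here is a minimal sketch of overlapping chunks and fixed-size groups. It assumes sizes are measured in characters and uses a plain sliding window; index_generator.py may use a different splitter or unit, so treat this as an illustration of the idea rather than the script's actual logic. Both function names are hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=20):
    """Split text into character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def grouped(items, group_size=1000):
    """Yield successive lists of at most group_size items."""
    for start in range(0, len(items), group_size):
        yield items[start:start + group_size]
```

With these defaults, each chunk after the first repeats the last 20 characters of the previous one, so a sentence cut at a boundary keeps some context on both sides. Processing documents in groups bounds how much is held in memory at once, which is why lowering --group_size helps on machines with less RAM.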

Step 4: Configure the Chatbot

Update the config.py file to point to your newly created index:

FAISS_INDEX = "indexes/faiss_index"
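
A quick check that the configured path contains an index can save a confusing startup error. The sketch below assumes the index was written in the layout that LangChain's FAISS `save_local` produces by default (`index.faiss` plus `index.pkl`). Whether this project uses that layout is an assumption, and `missing_index_files` is a hypothetical helper.

```python
from pathlib import Path


def missing_index_files(index_dir="indexes/faiss_index"):
    """Return the expected index files that are absent from index_dir."""
    # Assumed file names; adjust if the index is stored under other names.
    expected = ("index.faiss", "index.pkl")
    root = Path(index_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_index_files()
    print("Index looks complete" if not missing else f"Missing: {missing}")
```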

Step 5: Start the Chatbot

Start the API server:
```
python app.py
```
In a separate terminal, start the Streamlit interface:
```
streamlit run PMC-LaMP.py
```
Navigate to the Chatbot page from the sidebar and start asking questions!

Advanced Configuration

Changing Models

In config.py, you can change the models used:

READER_MODEL: The LLM used for generating responses
EMBEDDING_MODEL: The model used for text embeddings
RERANKER_MODEL: The model used for reranking search results

Troubleshooting

If the article download fails, check your internet connection and try again.
If you encounter memory issues during index generation, try reducing the --group_size parameter.
If the chatbot doesn't start, check that both the API server and Streamlit interface are running.
For Windows users, ensure you have properly set up WSL or a Linux VM before attempting to run the application.
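
To confirm that both processes are listening, a small port probe can help. This is a minimal sketch using only the standard library; `is_port_open` is a hypothetical helper. Streamlit serves on port 8501 by default, while the API server's port depends on how app.py is configured, so pass whichever port it reports at startup.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `is_port_open("127.0.0.1", 8501)` should return True while the Streamlit interface is running.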

Contact

For questions or feedback, please contact: valiant@vanderbilt.edu
