PMC-LaMP is a tool for creating custom RAG-enabled chatbots based on PubMed Central medical literature. This pipeline allows you to search for scientific articles on a specific topic, download them, create a vector index, and generate a local chatbot that can answer questions based on the literature.
- Python 3.8+
- Linux/Unix bash environment
- Internet connection
- At least 8GB of RAM
- Available disk space proportional to the number of articles (approximately 1GB per 10,000 articles)
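As a rough planning aid, the disk-space rule of thumb above can be turned into a quick estimate. This is a minimal sketch: the function name `estimate_disk_gb` is hypothetical and not part of PMC-LaMP, and the figure it returns is only as accurate as the "approximately 1GB per 10,000 articles" guideline.

```python
# Rule of thumb from the requirements list: about 1 GB per 10,000 articles.
ARTICLES_PER_GB = 10_000


def estimate_disk_gb(num_articles: int) -> float:
    """Return the approximate disk space (in GB) needed for num_articles."""
    if num_articles < 0:
        raise ValueError("num_articles must be non-negative")
    return num_articles / ARTICLES_PER_GB


# Example: a PMCID list with 25,000 entries.
print(f"{estimate_disk_gb(25_000):.1f} GB")
```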
This tool requires a bash environment to run. Windows users must use one of the following options:
- Install Windows Subsystem for Linux (WSL) - recommended
- Use a Linux virtual machine
- Use Git Bash (may have limited functionality)
All commands in this README assume a Linux/Unix bash environment.
- Clone this repository:
git clone https://github.com/yourusername/PMC-LaMP.git
cd PMC-LaMP
- Create a virtual environment and activate it using uv:
uv venv
source .venv/bin/activate
- Install the required dependencies:
uv pip install -r requirements.txt
- Set up environment variables: create a .env file in the root directory and add:
SERVER_IP=127.0.0.1
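To check that the `.env` file is readable before starting the servers, a small standard-library parser is enough. This is a sketch under assumptions: the helper `read_env_file` is hypothetical, the way PMC-LaMP itself loads `SERVER_IP` is not described in this README, and the `127.0.0.1` fallback is simply the value shown above.

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


settings = read_env_file()
# Assumption: fall back to the loopback address when SERVER_IP is not set.
server_ip = settings.get("SERVER_IP", "127.0.0.1")
print(f"SERVER_IP = {server_ip}")
```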
The easiest way to use PMC-LaMP is with our interactive script that guides you through the entire process:
python guided_pmc_lamp.py
This interactive script will:
- Check your environment and dependencies
- Ask for your medical topic keyword
- Guide you through obtaining PMCIDs
- Download the articles with real-time progress display
- Generate the FAISS index with progress tracking
- Configure the application
- Start the chatbot servers
You can also provide a topic keyword directly:
python guided_pmc_lamp.py --keyword "crohn's"
- Real-time progress tracking - See exactly how many articles are downloading and indexing
- Percentage completion - Track overall progress during lengthy operations
- Interactive prompts - Guided experience with clear instructions
- Error handling - Helpful error messages if something goes wrong
- Smart defaults - Automatic detection and use of existing files
The script will walk you through each step and handle all the technical details automatically.
For advanced users who prefer to run each step manually, follow the detailed pipeline below:
- Go to https://pmc.ncbi.nlm.nih.gov/ and search for your topic
- On the left column, apply filters:
  - Select the 'Open Access' filter under article attributes
  - Optionally, limit the publication date range
- Export the results:
  - Click 'Send to:' (located under the right side of the search bar)
  - Under 'Choose Destination', select 'File'
  - In the 'Format' dropdown, select 'PMCID List'
  - Click the 'Create File' button to download the file
- Create a directory called pmcids in the project root if it doesn't exist
- Move the downloaded file to the pmcids directory and rename it if desired (format: {keyword}_pmc_result.txt)
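Before downloading, it can help to sanity-check the exported list. The sketch below assumes the file holds one identifier per line in the form `PMC` followed by digits, and that bare numeric IDs should be given the `PMC` prefix; both points are assumptions about the export format, and `load_pmcids` is a hypothetical helper rather than part of this repository.

```python
import re
from pathlib import Path

# Assumed ID shape: the prefix "PMC" followed by one or more digits.
PMCID_PATTERN = re.compile(r"^PMC\d+$")


def load_pmcids(path):
    """Return unique, well-formed PMCIDs from a one-ID-per-line file, in file order."""
    seen = set()
    pmcids = []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        candidate = raw.strip().upper()
        if not candidate:
            continue
        # Assumption: a bare number is an ID that is missing its PMC prefix.
        if candidate.isdigit():
            candidate = "PMC" + candidate
        # Keep only well-formed IDs, and only the first occurrence of each.
        if PMCID_PATTERN.match(candidate) and candidate not in seen:
            seen.add(candidate)
            pmcids.append(candidate)
    return pmcids
```

Calling `load_pmcids` on the file in the pmcids directory and printing the length of the result gives a quick count of how many articles the download step will attempt.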
Run the article download script with your PMCID list:
bash fetch_pmc_articles.sh pmcids/{keyword}_pmc_result.txt
This will download BioC JSON formatted articles to the fulltext_articles/{keyword}_pmc_articles/ directory.
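For readers curious what a single download might look like, here is a minimal Python sketch. It assumes the articles come from NCBI's public BioC API for PMC Open Access and that each article is saved as `{pmcid}.json`; neither detail is confirmed by this README, and `fetch_pmc_articles.sh` may work differently. The helper names are hypothetical.

```python
from pathlib import Path
from urllib.request import urlopen

# Assumption: NCBI BioC RESTful endpoint for PMC Open Access articles.
BIOC_URL = (
    "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/"
    "pmcoa.cgi/BioC_json/{pmcid}/unicode"
)


def bioc_url(pmcid):
    """Build the BioC JSON request URL for one PMCID."""
    return BIOC_URL.format(pmcid=pmcid)


def output_path(keyword, pmcid, root="fulltext_articles"):
    """Build the destination path; the {pmcid}.json file name is an assumption."""
    return Path(root) / f"{keyword}_pmc_articles" / f"{pmcid}.json"


def fetch_article(keyword, pmcid, timeout=30):
    """Download one article and write the raw response to disk."""
    target = output_path(keyword, pmcid)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(bioc_url(pmcid), timeout=timeout) as response:
        target.write_bytes(response.read())
    return target
```

Nothing is downloaded until `fetch_article` is called, so the URL and path helpers can be inspected safely first.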
Generate a vector index from the downloaded articles:
python index_generator.py --document_path fulltext_articles/{keyword}_pmc_articles/ --input_type json
Optional parameters:
- --max_files: Maximum number of files to process (default: 250000)
- --group_size: Number of documents to process per group (default: 1000)
- --chunk_size: Size of text chunks for indexing (default: 1000)
- --chunk_overlap: Overlap between chunks (default: 20)
The index will be saved to indexes/faiss_index/.
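To make the `--chunk_size` and `--chunk_overlap` options more concrete, the sketch below splits a string into overlapping windows. It is an illustration only: it measures sizes in characters, which is an assumption, and `index_generator.py` may use a different splitter or unit. The function `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=20):
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Small values make the overlap easy to see.
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

A larger overlap repeats more text between neighbouring chunks, which can help retrieval when a relevant sentence straddles a chunk boundary, at the cost of a larger index.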
Update the config.py file to point to your newly created index:
FAISS_INDEX = "indexes/faiss_index"
- Start the API server:
python app.py
- In a separate terminal, start the Streamlit interface:
streamlit run PMC-LaMP.py
- Navigate to the Chatbot page from the sidebar and start asking questions!
In config.py, you can change the models used:
- READER_MODEL: The LLM used for generating responses
- EMBEDDING_MODEL: The model used for text embeddings
- RERANKER_MODEL: The model used for reranking search results
- If the article download fails, check your internet connection and try again.
- If you encounter memory issues during index generation, try reducing the --group_size parameter.
- If the chatbot doesn't start, check that both the API server and Streamlit interface are running.
- For Windows users, ensure you have properly set up WSL or a Linux VM before attempting to run the application.
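On the memory tip above: processing documents in groups means only one group needs to be held in memory at a time, so a smaller group lowers the peak. The sketch below shows the general batching idea with a hypothetical `iter_groups` helper; it is not taken from `index_generator.py`, whose internals are not shown in this README.

```python
from itertools import islice


def iter_groups(items, group_size=1000):
    """Yield successive lists of at most group_size items from any iterable."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    iterator = iter(items)
    while True:
        # Pull the next group lazily, so only one group is materialised at a time.
        group = list(islice(iterator, group_size))
        if not group:
            return
        yield group


# Example: five documents processed two at a time.
for group in iter_groups(["doc1", "doc2", "doc3", "doc4", "doc5"], group_size=2):
    print(len(group), group)
```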
For questions or feedback, please contact: valiant@vanderbilt.edu