Biomedical GraphRAG
A GraphRAG (Graph Retrieval-Augmented Generation) system for biomedical research. It combines a Neo4j knowledge graph with Qdrant vector search to query and analyze biomedical literature and genomic data.
Article: Building a Biomedical GraphRAG: When Knowledge Graphs Meet Vector Search
Key Features:
- Hybrid Query System: Combines Neo4j graph database with Qdrant vector search for comprehensive biomedical insights
- Data Integration: Processes PubMed papers, gene data, and research citations
- Intelligent Querying: Uses LLM-powered tool selection for graph enrichment and semantic search
- Biomedical Schema: Specialized graph schema for papers, authors, institutions, genes, and MeSH terms
- Async Processing: High-performance async data collection and processing
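
The async collection mentioned in the feature list could look roughly like the sketch below. It is not the project's collector: `fetch_record` and `collect` are hypothetical names, and the network call is simulated with `asyncio.sleep`. The point is only to show bounded concurrency, which is a common way to stay within an external API's rate limits.

```python
import asyncio


async def fetch_record(pmid: str) -> dict:
    """Hypothetical stand-in for a network call to a literature API."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return {"pmid": pmid, "title": f"Paper {pmid}"}


async def collect(pmids: list[str], max_concurrency: int = 3) -> list[dict]:
    """Fetch all records concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(pmid: str) -> dict:
        async with semaphore:
            return await fetch_record(pmid)

    # asyncio.gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded_fetch(p) for p in pmids))


if __name__ == "__main__":
    records = asyncio.run(collect(["31295471", "27562951"]))
    print([r["pmid"] for r in records])
```
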
biomedical-graphrag-pipeline/
├── .github/ # GitHub workflows and templates
├── data/ # Dataset storage (PubMed, Gene data)
├── docs/ # Documentation
├── src/
│ └── biomedical_graphrag/
│ ├── application/ # Application layer
│ │ ├── cli/ # Command-line interfaces
│ │ └── services/ # Business logic services
│ ├── config.py # Configuration management
│ ├── data_sources/ # Data collection modules
│ ├── domain/ # Domain models and entities
│ ├── infrastructure/ # Database and external service adapters
│ └── utils/ # Utility functions
├── static/ # Static assets (images, etc.)
├── tests/ # Test suite
├── LICENSE # MIT License
├── Makefile # Build and development commands
├── pyproject.toml # Project configuration and dependencies
├── README.md # This file
└── uv.lock # Dependency lock file
| Requirement | Description |
|---|---|
| Python 3.13+ | Programming language |
| uv | Package and dependency manager |
| Neo4j | Graph database for knowledge graphs |
| Qdrant | Vector database for embeddings |
| OpenAI | LLM provider for queries and embeddings |
| PubMed | Biomedical literature database |

- Clone the repository:

      git clone git@github.com:benitomartin/biomedical-graphrag.git
      cd biomedical-graphrag

- Create a virtual environment:

      uv venv

- Activate the virtual environment:

      source .venv/bin/activate

- Install the required packages:

      uv sync --all-groups --all-extras

- Create a .env file in the root directory:

      cp env.example .env

Configure API keys, model names, and other settings by editing the .env file:
# OpenAI Configuration
OPENAI__API_KEY=your_openai_api_key_here
OPENAI__MODEL=gpt-4o-mini
OPENAI__TEMPERATURE=0.0
OPENAI__MAX_TOKENS=1500
# Neo4j Configuration
NEO4J__URI=bolt://localhost:7687
NEO4J__USERNAME=neo4j
NEO4J__PASSWORD=your_neo4j_password
NEO4J__DATABASE=neo4j
# Qdrant Configuration
QDRANT__URL=http://localhost:6333
QDRANT__API_KEY=your_qdrant_api_key
QDRANT__COLLECTION_NAME=biomedical_papers
QDRANT__EMBEDDING_MODEL=text-embedding-3-small
QDRANT__EMBEDDING_DIMENSION=1536
# PubMed Configuration (optional)
[email protected]
PUBMED__API_KEY=your_pubmed_api_key
# Data Paths
JSON_DATA__PUBMED_JSON_PATH=data/pubmed_dataset.json
JSON_DATA__GENE_JSON_PATH=data/gene_dataset.json

The system includes data collectors for biomedical and gene datasets:
# Collect PubMed papers and metadata
make pubmed-data-collector-run
# Collect gene information related to the pubmed dataset
make gene-data-collector-run

# Create the knowledge graph from datasets
make create-graph
# Delete all graph data (clean slate)
make delete-graph

# Create vector collection for embeddings
make create-qdrant-collection
# Ingest embeddings into Qdrant
make ingest-qdrant-data
# Delete vector collection
make delete-qdrant-collection

# Run a custom query on the Qdrant vector store
make custom-qdrant-query QUESTION="Which institutions have collaborated most frequently on papers about 'Gene Editing' and 'Immunotherapy'?"
# Or run directly with the CLI
uv run src/biomedical_graphrag/application/cli/query_vectorstore.py --ask "Which institutions have collaborated most frequently on papers about 'Gene Editing' and 'Immunotherapy'?"

# Run example queries on the Neo4j graph using GraphRAG
make example-graph-query
# Run a custom natural language query using hybrid GraphRAG
make custom-graph-query QUESTION="What are the latest research trends in cancer immunotherapy?"
# Or run directly with the CLI
uv run src/biomedical_graphrag/application/cli/fusion_query.py "What are the latest research trends in cancer immunotherapy?"

Qdrant Queries:
- Semantic search across paper abstracts and content
- Similarity-based retrieval using embeddings
- Direct vector similarity queries
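
To make "similarity-based retrieval" concrete, here is a minimal sketch that ranks documents by cosine similarity. It uses brute-force scoring over a tiny in-memory index rather than Qdrant itself, and the names `cosine_similarity`, `top_k`, and `toy_index` are illustrative assumptions. The vectors are 3-dimensional toys; the configuration above sets an embedding dimension of 1536.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: list[float], index: dict[str, list[float]], k: int = 2
) -> list[tuple[str, float]]:
    """Score every document against the query and return the k best matches."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy index: document id -> embedding (hypothetical values)
toy_index = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.9, 0.1, 0.0],
    "paper_c": [0.0, 0.0, 1.0],
}

hits = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(hits)
```

A vector database typically replaces the exhaustive loop in `top_k` with an approximate nearest-neighbor index, which is what keeps this kind of search fast on large collections.
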
Hybrid Queries:
- Combines semantic search (Qdrant) with graph enrichment (Neo4j):
  - Author collaboration networks
  - Citation analysis and paper relationships
  - Gene-paper associations
  - MeSH term relationships
  - Institution affiliations
- LLM-powered automatic tool selection
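
The hybrid flow described above can be sketched as two steps: take the paper IDs returned by semantic search, then enrich them with a graph lookup chosen per question. Everything here is an assumption for illustration: the graph is a plain dictionary instead of Neo4j, the tool names are hypothetical, and `choose_tool` is a keyword rule standing in for the LLM's tool choice.

```python
from typing import Callable

# Hypothetical in-memory stand-in for the graph: paper id -> related entities
GRAPH = {
    "p1": {"authors": ["Author A", "Author B"], "genes": ["TP53"]},
    "p2": {"authors": ["Author B"], "genes": ["HIF1A", "TP53"]},
}


def get_authors(pmids: list[str]) -> dict[str, list[str]]:
    """Enrichment tool: authors for each retrieved paper."""
    return {p: GRAPH.get(p, {}).get("authors", []) for p in pmids}


def get_genes(pmids: list[str]) -> dict[str, list[str]]:
    """Enrichment tool: genes mentioned in each retrieved paper."""
    return {p: GRAPH.get(p, {}).get("genes", []) for p in pmids}


# Registry of enrichment tools the selector can pick from
TOOLS: dict[str, Callable[[list[str]], dict[str, list[str]]]] = {
    "authors": get_authors,
    "genes": get_genes,
}


def choose_tool(question: str) -> str:
    """Keyword rule standing in for an LLM-based tool selector."""
    return "genes" if "gene" in question.lower() else "authors"


def hybrid_query(question: str, vector_hits: list[str]) -> dict:
    """Combine semantic-search hits with graph enrichment from the chosen tool."""
    tool_name = choose_tool(question)
    return {
        "tool": tool_name,
        "hits": vector_hits,
        "enrichment": TOOLS[tool_name](vector_hits),
    }


print(hybrid_query("Which genes appear with TP53?", ["p2"]))
```

Keeping the tools in a registry keyed by name means the selector only has to return a string, which is the shape most LLM tool-calling interfaces produce.
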
- Who collaborates with Jennifer Doudna on CRISPR research?
- Which researchers work with Emmanuelle Charpentier on gene editing or genome engineering papers?
- Who are George Church’s collaborators publishing on synthetic biology and genome sequencing?
- List scientists collaborating with Feng Zhang on neuroscience studies
- Which papers are related to PMID 31295471 based on shared MeSH terms?
- Find papers similar to the CRISPR-Cas9 genome editing study with PMID 31295471
- Show other studies linked by MeSH terms to PMID 27562951
- Which genes are mentioned in the same papers as gag?
- What genes appear together with HIF1A in cancer research?
- Which genes are frequently co-mentioned with TP53?
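
The gene co-mention questions above reduce to counting which genes share papers with a target gene. The sketch below shows that logic on a small dictionary; it is an illustration only, with hypothetical data and a hypothetical `co_mentioned` helper, and it does not reflect the project's actual graph queries.

```python
from collections import Counter

# Hypothetical toy data: paper id -> genes mentioned in that paper
PAPER_GENES = {
    "p1": {"TP53", "MDM2"},
    "p2": {"TP53", "MDM2", "HIF1A"},
    "p3": {"HIF1A", "VEGFA"},
}


def co_mentioned(gene: str, paper_genes: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Count how often other genes appear in the same papers as `gene`."""
    counts = Counter()
    for genes in paper_genes.values():
        if gene in genes:
            # Every other gene in this paper is co-mentioned once
            counts.update(genes - {gene})
    # Most frequently co-mentioned genes first
    return counts.most_common()


print(co_mentioned("TP53", PAPER_GENES))
```

In a graph database the same question is usually expressed as a two-hop traversal from the gene to its papers and back out to other genes, with the count taken over the shared papers.
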

Run all tests:

make tests

Run all quality checks (lint, format, type check, clean):

make all-check
make all-fix

Individual Commands:

- Display all available commands:

      make help

- Check static typing:

      make mypy

- Clean cache and build files:

      make clean

This project is licensed under the MIT License - see the LICENSE file for details.
