benitomartin/biomedical-graphrag

Biomedical GraphRAG

Overview

Biomedical GraphRAG is a GraphRAG (Graph Retrieval-Augmented Generation) system for biomedical research. It combines a Neo4j knowledge graph with Qdrant vector search to query and analyze biomedical literature and genomic data.

Article: Building a Biomedical GraphRAG: When Knowledge Graphs Meet Vector Search

Key Features:

  • Hybrid Query System: Combines Neo4j graph database with Qdrant vector search for comprehensive biomedical insights
  • Data Integration: Processes PubMed papers, gene data, and research citations
  • Intelligent Querying: Uses LLM-powered tool selection for graph enrichment and semantic search
  • Biomedical Schema: Specialized graph schema for papers, authors, institutions, genes, and MeSH terms
  • Async Processing: High-performance async data collection and processing

Project Structure

biomedical-graphrag-pipeline/
├── .github/                    # GitHub workflows and templates
├── data/                       # Dataset storage (PubMed, Gene data)
├── docs/                       # Documentation
├── src/
│   └── biomedical_graphrag/
│       ├── application/        # Application layer
│       │   ├── cli/            # Command-line interfaces
│       │   └── services/       # Business logic services
│       ├── config.py           # Configuration management
│       ├── data_sources/       # Data collection modules
│       ├── domain/             # Domain models and entities
│       ├── infrastructure/     # Database and external service adapters
│       └── utils/              # Utility functions
├── static/                     # Static assets (images, etc.)
├── tests/                      # Test suite
├── LICENSE                     # MIT License
├── Makefile                    # Build and development commands
├── pyproject.toml              # Project configuration and dependencies
├── README.md                   # This file
└── uv.lock                     # Dependency lock file

Prerequisites

Requirement    Description
Python 3.13+   Programming language
uv             Package and dependency manager
Neo4j          Graph database for knowledge graphs
Qdrant         Vector database for embeddings
OpenAI         LLM provider for queries and embeddings
PubMed         Biomedical literature database

Installation

  1. Clone the repository:

    git clone git@github.com:benitomartin/biomedical-graphrag.git
    cd biomedical-graphrag
  2. Create a virtual environment:

    uv venv
  3. Activate the virtual environment:

    source .venv/bin/activate
  4. Install the required packages:

    uv sync --all-groups --all-extras
  5. Create a .env file in the root directory:

     cp env.example .env

Usage

Configuration

Configure API keys, model names, and other settings by editing the .env file:

# OpenAI Configuration
OPENAI__API_KEY=your_openai_api_key_here
OPENAI__MODEL=gpt-4o-mini
OPENAI__TEMPERATURE=0.0
OPENAI__MAX_TOKENS=1500

# Neo4j Configuration
NEO4J__URI=bolt://localhost:7687
NEO4J__USERNAME=neo4j
NEO4J__PASSWORD=your_neo4j_password
NEO4J__DATABASE=neo4j

# Qdrant Configuration
QDRANT__URL=http://localhost:6333
QDRANT__API_KEY=your_qdrant_api_key
QDRANT__COLLECTION_NAME=biomedical_papers
QDRANT__EMBEDDING_MODEL=text-embedding-3-small
QDRANT__EMBEDDING_DIMENSION=1536

# PubMed Configuration (optional)
[email protected]
PUBMED__API_KEY=your_pubmed_api_key

# Data Paths
JSON_DATA__PUBMED_JSON_PATH=data/pubmed_dataset.json
JSON_DATA__GENE_JSON_PATH=data/gene_dataset.json
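
The double underscore in names such as OPENAI__MODEL reads like a nested-settings delimiter (pydantic-settings, for instance, supports this through its `env_nested_delimiter` option). How `config.py` actually parses these values is not shown here, so the following is only a minimal standard-library sketch of the idea. The function name `load_nested_settings` and the example dictionary are assumptions made for illustration.

```python
import os


def load_nested_settings(environ=None, delimiter="__"):
    """Group variables such as OPENAI__MODEL into {"openai": {"model": ...}}.

    Hypothetical helper: the project's own config.py may work differently.
    """
    environ = os.environ if environ is None else environ
    settings = {}
    for name, value in environ.items():
        group, sep, key = name.partition(delimiter)
        if not sep or not group or not key:
            continue  # skip variables that are not GROUP__KEY shaped
        settings.setdefault(group.lower(), {})[key.lower()] = value
    return settings


# Example values copied from the .env template above, plus an unrelated variable.
example_env = {
    "OPENAI__MODEL": "gpt-4o-mini",
    "NEO4J__URI": "bolt://localhost:7687",
    "QDRANT__EMBEDDING_DIMENSION": "1536",
    "PATH": "/usr/bin",
}
settings = load_nested_settings(example_env)
print(settings["openai"]["model"])
print(int(settings["qdrant"]["embedding_dimension"]))
```

All values arrive as strings, so numeric settings such as the embedding dimension need an explicit conversion before use.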

Data Collection

The system includes data collectors for biomedical and gene datasets:

# Collect PubMed papers and metadata
make pubmed-data-collector-run
# Collect gene information related to the pubmed dataset
make gene-data-collector-run
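
The feature list describes data collection as async. The collectors themselves are not shown in this README, so the sketch below only illustrates one common pattern for that kind of work: fetching many records concurrently while capping the number of in-flight requests with a semaphore. `fetch_record` is a hypothetical stand-in that sleeps instead of calling PubMed, and `collect` and `max_concurrency` are likewise illustrative names.

```python
import asyncio


async def fetch_record(pmid: str) -> dict:
    """Hypothetical stand-in for a network call; no real request is made."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return {"pmid": pmid, "title": f"Paper {pmid}"}


async def collect(pmids: list[str], max_concurrency: int = 3) -> list[dict]:
    """Fetch all records concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(pmid: str) -> dict:
        async with semaphore:
            return await fetch_record(pmid)

    # asyncio.gather returns results in the same order as its inputs.
    return await asyncio.gather(*(bounded_fetch(p) for p in pmids))


# PMIDs taken from the sample queries in this README.
records = asyncio.run(collect(["31295471", "27562951"]))
print([r["pmid"] for r in records])
```

Capping concurrency keeps a collector within an upstream API's rate limits while still overlapping the waiting time of individual requests.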

Infrastructure Setup

Neo4j Graph Database

# Create the knowledge graph from datasets
make create-graph

# Delete all graph data (clean slate)
make delete-graph

Qdrant Vector Database

# Create vector collection for embeddings
make create-qdrant-collection

# Ingest embeddings into Qdrant
make ingest-qdrant-data

# Delete vector collection
make delete-qdrant-collection

Query Commands

Qdrant Vector Search

# Run a custom query on the Qdrant vector store
make custom-qdrant-query QUESTION="Which institutions have collaborated most frequently on papers about 'Gene Editing' and 'Immunotherapy'?"

# Or run directly with the CLI
uv run src/biomedical_graphrag/application/cli/query_vectorstore.py --ask "Which institutions have collaborated most frequently on papers about 'Gene Editing' and 'Immunotherapy'?"

Hybrid Neo4j + Qdrant Queries

# Run example queries on the Neo4j graph using GraphRAG
make example-graph-query

# Run a custom natural language query using hybrid GraphRAG
make custom-graph-query QUESTION="What are the latest research trends in cancer immunotherapy?"

# Or run directly with the CLI
uv run src/biomedical_graphrag/application/cli/fusion_query.py "What are the latest research trends in cancer immunotherapy?"

Available Query Types

Qdrant Queries:

  • Semantic search across paper abstracts and content
  • Similarity-based retrieval using embeddings
  • Direct vector similarity queries
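
To make "similarity-based retrieval" concrete, here is a minimal pure-Python sketch of ranking stored vectors against a query vector. It assumes cosine similarity, which is one of the metrics Qdrant supports; the metric configured for this project's collection is not stated here. The three-dimensional vectors and the names `paper_a` to `paper_c` are invented for illustration, whereas real embeddings from text-embedding-3-small would have 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, vectors, k=2):
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    return [key for _, key in sorted(scored, reverse=True)[:k]]


# Toy vectors, invented for illustration only.
toy_vectors = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.9, 0.1, 0.0],
    "paper_c": [0.0, 0.0, 1.0],
}
nearest = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print(nearest)
```

A vector database performs the same ranking with approximate nearest-neighbor indexes, so it does not have to score every stored vector.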

Hybrid Queries:

  • Combines semantic search (Qdrant) with graph enrichment (Neo4j):
    • Author collaboration networks
    • Citation analysis and paper relationships
    • Gene-paper associations
    • MeSH term relationships
    • Institution affiliations
  • LLM-powered automatic tool selection
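
The hybrid flow described above (semantic hits first, graph enrichment second) can be sketched with in-memory stand-ins. This is not the project's implementation: `TOY_GRAPH` replaces Neo4j lookups, the `hits` list replaces Qdrant results, and every identifier and value below is hypothetical.

```python
# Toy graph keyed by paper ID; a stand-in for Neo4j lookups (hypothetical data).
TOY_GRAPH = {
    "pmid_1": {"authors": ["Author A", "Author B"], "genes": ["GENE_X"]},
    "pmid_2": {"authors": ["Author B"], "genes": ["GENE_X", "GENE_Y"]},
}


def enrich_hits(hits, graph):
    """Attach graph context to each (paper_id, score) hit, keeping rank order."""
    enriched = []
    for paper_id, score in hits:
        context = graph.get(paper_id, {})  # papers missing from the graph get empty context
        enriched.append(
            {
                "paper_id": paper_id,
                "score": score,
                "authors": context.get("authors", []),
                "genes": context.get("genes", []),
            }
        )
    return enriched


def build_context(enriched):
    """Flatten enriched hits into plain text that a prompt could include."""
    return "\n".join(
        f"{item['paper_id']} (score {item['score']:.2f}): "
        f"authors={', '.join(item['authors']) or 'n/a'}; "
        f"genes={', '.join(item['genes']) or 'n/a'}"
        for item in enriched
    )


# Pretend these came back from a vector search, best match first.
hits = [("pmid_2", 0.91), ("pmid_1", 0.84), ("pmid_9", 0.50)]
enriched = enrich_hits(hits, TOY_GRAPH)
print(build_context(enriched))
```

Keeping retrieval and enrichment as separate steps means the vector ranking decides which papers are considered, while the graph only adds relationships for those papers.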

Sample Queries

  • Who collaborates with Jennifer Doudna on CRISPR research?

  • Which researchers work with Emmanuelle Charpentier on gene editing or genome engineering papers?

  • Who are George Church’s collaborators publishing on synthetic biology and genome sequencing?

  • List scientists collaborating with Feng Zhang on neuroscience studies

  • Which papers are related to PMID 31295471 based on shared MeSH terms?

  • Find papers similar to the CRISPR-Cas9 genome editing study with PMID 31295471

  • Show other studies linked by MeSH terms to PMID 27562951

  • Which genes are mentioned in the same papers as gag?

  • What genes appear together with HIF1A in cancer research?

  • Which genes are frequently co-mentioned with TP53?
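
Co-mention questions like the last three reduce to counting which genes share papers with a target gene. In the project this would presumably be answered by a graph query. The sketch below shows the same counting logic in plain Python over an invented paper-to-genes mapping: `GENE_A`, `GENE_B`, and `GENE_C` are placeholders, and the mapping is not real biomedical data.

```python
from collections import Counter

# Invented mapping of papers to mentioned genes, for illustration only.
PAPER_GENES = {
    "paper_1": {"TP53", "GENE_A"},
    "paper_2": {"TP53", "GENE_A", "GENE_B"},
    "paper_3": {"GENE_B", "GENE_C"},
}


def co_mentioned(target, paper_genes):
    """Count how often each other gene appears in the same paper as target."""
    counts = Counter()
    for genes in paper_genes.values():
        if target in genes:
            counts.update(genes - {target})
    return counts.most_common()  # highest co-mention count first


result = co_mentioned("TP53", PAPER_GENES)
print(result)
```

The same pattern applies to the MeSH-based questions: replace the gene sets with MeSH term sets and count papers that share terms with a given PMID.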

Testing

Run all tests:

make tests

Quality Checks

Run all quality checks (lint, format, type check, clean):

make all-check
make all-fix

Individual Commands:

  • Display all available commands:

    make help
  • Check static typing:

    make mypy
  • Clean cache and build files:

    make clean

License

This project is licensed under the MIT License - see the LICENSE file for details.
