Pharmaceutical Pipeline Knowledge Graph

Extract structured knowledge from pharmaceutical pipeline PDFs and build a queryable Neo4j graph database.

This project demonstrates how to use Vision LLMs + GraphRAG to transform unstructured pharma documents into a knowledge graph that enables complex competitive intelligence queries impossible with traditional vector-only RAG.

🎯 Why This Matters

Pharmaceutical companies release quarterly pipeline updates as PDF presentations. These contain rich information about:

Molecules in development (brands, generics, mechanisms)
Clinical trials (phases, indications, timelines)
Partnerships and licensing deals
Competitive landscapes by therapeutic area

The Problem: This data is locked in complex slide layouts (tables, timelines, nested information, images) that are hard to analyze at scale.

The Solution: Extract entities and relationships into a Neo4j graph, enabling queries like:

"Which therapeutic mechanisms are being pursued by multiple companies, and which companies are competing in the same mechanism space?"
"Which individual molecules are most frequently used as components in combination therapies, and what roles do they play?"
"Which molecules are being developed for multiple diseases across different therapeutic areas, indicating broad platform potential?"

⚡ Quick Build - Full Pipeline

Build the complete knowledge graph in ~4 minutes:

# 1. Setup (one-time)
python scripts/setup_db.py

# 2. Build full graph (recommended for first run)
python scripts/ingest_document.py --all                    # ~30 seconds (102 pages)
python scripts/extract_entities.py --all --parallel 30      # ~3-4 minutes (344 molecules)
python scripts/embed_pages.py --all                         # ~5 seconds (102 embeddings) [OPTIONAL]
python scripts/postprocess_phase3b.py --all                 # ~30 seconds (454 relationships) [RECOMMENDED]



# Done! Query your knowledge graph in Neo4j Browser

A dump of the database is also available here (560 MB).

📁 Project Structure

pharma-pipeline/
├── data/                          # PDF documents
├── documents_metadata.csv         # Document metadata
├── pipeline/                      # Core pipeline modules
│   ├── config.py                 # Configuration management
│   ├── pdf_processor.py          # PDF → text + images
│   ├── neo4j_loader.py           # Load data into Neo4j
│   ├── llm_extractor.py          # LLM-based extraction (Phase 2)
│   └── embedder.py               # Page embeddings (Phase 2.5)
├── scripts/                       # CLI scripts
│   ├── setup_db.py               # Initialize Neo4j schema
│   ├── ingest_document.py        # Phase 1: Ingest documents (lexical graph)
│   ├── extract_entities.py       # Phase 2: Extract entities (parallel)
│   ├── embed_pages.py            # Phase 2.5: Create embeddings (optional)
│   ├── postprocess_phase3b.py    # Phase 3b: Post-processing relationships (recommended)
│   └── test_extraction.py        # Test extraction on single page
├── logs/                          # Log files
├── failed_extractions/            # Failed extraction records
└── requirements.txt               # Python dependencies

🚀 Quick Start

1. Prerequisites

Python 3.10+
Neo4j 5.x (running locally or Neo4j Aura)
OpenAI API key with GPT-5-mini access (or GPT-4o/GPT-4o-mini)
OpenAI usage tier 5 recommended for optimal parallel processing (30+ concurrent requests)

2. Installation

# Clone/navigate to project
cd pharma-pipeline

# Install dependencies with uv
uv pip install -r requirements.txt

3. Configuration

Create a .env file in the project root:

# Neo4j Configuration
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password
NEO4J_DATABASE=neo4j

# OpenAI Configuration
OPENAI_API_KEY=sk-...
EXTRACT_MODEL=gpt-5-mini           # For entity extraction
EMBEDDING_MODEL=text-embedding-3-small  # For page embeddings (optional)

# Optional: Logging
LOG_LEVEL=INFO

Model Recommendations:

gpt-5-mini: Best balance of speed/cost/quality for extraction (recommended)
text-embedding-3-small: Fast, cheap embeddings (1536 dims) - to demonstrate VectorRAG vs GraphRAG

See env.template for full configuration options.
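
As a rough illustration of how these variables might be consumed, the sketch below reads them into a small settings object. It is not the project's pipeline/config.py: the `Settings` class and `load_settings` function are hypothetical names, the defaults simply mirror the example .env above, and the variables are assumed to be already exported into the process environment.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values shown in the example .env."""
    neo4j_uri: str
    neo4j_user: str
    neo4j_password: str
    neo4j_database: str
    openai_api_key: str
    extract_model: str
    embedding_model: str
    log_level: str


def load_settings(env=None):
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    # Secrets have no sensible default, so fail early when they are absent.
    missing = [key for key in ("NEO4J_PASSWORD", "OPENAI_API_KEY") if not env.get(key)]
    if missing:
        raise KeyError("missing required settings: " + ", ".join(missing))
    return Settings(
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7687"),
        neo4j_user=env.get("NEO4J_USER", "neo4j"),
        neo4j_password=env["NEO4J_PASSWORD"],
        neo4j_database=env.get("NEO4J_DATABASE", "neo4j"),
        openai_api_key=env["OPENAI_API_KEY"],
        extract_model=env.get("EXTRACT_MODEL", "gpt-5-mini"),
        embedding_model=env.get("EMBEDDING_MODEL", "text-embedding-3-small"),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Passing a plain dict instead of relying on `os.environ` keeps the function easy to exercise in tests.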

4. Initialize Database

Run once to create constraints and indexes:

python scripts/setup_db.py

5. Ingest Documents (Phase 1: Lexical Graph)

Single document:

python scripts/ingest_document.py --document-id abbvie_pipeline_2024

All documents:

python scripts/ingest_document.py --all

Force re-ingest:

python scripts/ingest_document.py --document-id abbvie_pipeline_2024 --force

6. Extract Entities (Phase 2: Entity Extraction)

All documents with page-level parallel processing (RECOMMENDED):

python scripts/extract_entities.py --all --parallel 30

Single document:

python scripts/extract_entities.py --document-id abbvie_pipeline_2024

Extract specific pages:

python scripts/extract_entities.py --document-id abbvie_pipeline_2024 --pages 5,6,7

Performance:

With --parallel 30, expect ~25-30 pages/minute depending on page density
Multi-document processing: All 102 pages across 5 documents process simultaneously
Recent run: 102 pages in 3.8 minutes = 26.75 pages/min with 0 failures ✨
Single document: Same performance as before
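
The page-level parallelism described above can be pictured as a bounded worker pool that receives pages from all documents at once. The following is a minimal sketch under that assumption, not the code in scripts/extract_entities.py: `extract_page` is a placeholder for the real LLM call, and the page ID format is made up for the demo.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def extract_page(page_id):
    """Placeholder for one LLM extraction call (hypothetical)."""
    return {"page_id": page_id, "molecules": []}


def extract_all(page_ids, parallel=30, extract=extract_page):
    """Run `extract` over every page with at most `parallel` calls in flight."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = {pool.submit(extract, pid): pid for pid in page_ids}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                results[pid] = future.result()
            except Exception as exc:
                # One bad page should not abort the batch; record it and move on.
                failures[pid] = repr(exc)
    return results, failures


if __name__ == "__main__":
    # Pages from several documents share one queue, so the pool stays busy.
    pages = [f"doc{d}_page{p}" for d in range(5) for p in range(1, 5)]
    done, failed = extract_all(pages, parallel=8)
    print(len(done), "pages extracted,", len(failed), "failed")
```

Threads suit this workload because each call spends most of its time waiting on the network rather than on the CPU.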

7. Generate Page Embeddings (Phase 2.5: Optional)

Create embeddings for all pages:

python scripts/embed_pages.py --all

Single document:

python scripts/embed_pages.py --document-id abbvie_pipeline_2024

Force re-embed (if you change embedding model):

python scripts/embed_pages.py --all --force

Control parallelism:

python scripts/embed_pages.py --all --parallel 50

Why embeddings?

Optional step for comparing GraphRAG vs traditional vector-only RAG
Fast: ~1000 pages/minute (vs 10-30 for extraction)
Cheap: ~$0.00001 per page (vs ~$0.03 for extraction)
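
Taking the approximate per-page figures above at face value, a quick calculation shows what they imply for the 102-page corpus. The numbers are the README's estimates, not measured billing data.

```python
PAGES = 102
EXTRACTION_COST_PER_PAGE = 0.03     # approximate, from the figures above
EMBEDDING_COST_PER_PAGE = 0.00001   # approximate, from the figures above

extraction_total = PAGES * EXTRACTION_COST_PER_PAGE
embedding_total = PAGES * EMBEDDING_COST_PER_PAGE
ratio = EXTRACTION_COST_PER_PAGE / EMBEDDING_COST_PER_PAGE

print(f"extraction: ~${extraction_total:.2f}")
print(f"embeddings: ~${embedding_total:.5f}")
print(f"extraction costs roughly {ratio:,.0f}x more per page")
```

At this scale the embedding step is effectively free next to extraction, which is why it is cheap to include for a vector-only baseline.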

8. Post-Processing: Entity Resolution & Enrichment (Phase 3b: Recommended)

Run all post-processing steps:

python scripts/postprocess_phase3b.py --all

What it does:

✅ Creates DEVELOPS relationships: (Company)-[:DEVELOPS]->(Molecule|Combination)
✅ Creates SAME_AS relationships: Links brand ↔ generic names and case variants
✅ Creates INVOLVES relationships: Links partnerships to companies
✅ Normalizes TherapeuticArea: Merges singular/plural variants
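
To make the case-variant part of the SAME_AS step concrete, here is a minimal sketch that groups names differing only by letter case and emits the pairs that would be linked. It is an illustration, not the logic in scripts/postprocess_phase3b.py: `same_as_pairs` is a hypothetical helper, and brand-to-generic linking (which would need the `brandName` and `genericName` properties) is left out.

```python
from collections import defaultdict
from itertools import combinations


def same_as_pairs(names):
    """Return sorted pairs of distinct names that are equal ignoring case."""
    groups = defaultdict(set)
    for name in names:
        # casefold() is a more aggressive lower() intended for comparisons.
        groups[name.strip().casefold()].add(name.strip())
    pairs = []
    for variants in groups.values():
        pairs.extend(combinations(sorted(variants), 2))
    return sorted(pairs)


if __name__ == "__main__":
    for left, right in same_as_pairs(["Opdivo", "OPDIVO", "nivolumab"]):
        print(f"({left})-[:SAME_AS]->({right})")
```

Each pair would then become one `SAME_AS` relationship between the two `Molecule` nodes.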

Preview changes (dry run):

python scripts/postprocess_phase3b.py --dry-run --all

Run specific steps:

# Only DEVELOPS relationships
python scripts/postprocess_phase3b.py --develops

# Only SAME_AS relationships
python scripts/postprocess_phase3b.py --same-as

# SAME_AS + Partnership linking
python scripts/postprocess_phase3b.py --same-as --partnerships

Verify existing relationships:

python scripts/postprocess_phase3b.py --verify

9. Query Your Graph with Claude Desktop (MCP Server) ⭐ RECOMMENDED

Use Natural Language to Query Your Graph!

Instead of writing Cypher queries, connect Claude Desktop to your Neo4j database using the Model Context Protocol (MCP) and ask questions in natural language.

What is MCP?

The Model Context Protocol allows LLM applications (like Claude Desktop) to connect to external tools and data sources. The Neo4j MCP server gives Claude the ability to:

Read your graph schema
Execute Cypher queries
Answer complex questions about your pharmaceutical pipeline

Quick Setup

Install Neo4j MCP Server (if not already installed)

# Follow instructions at:
# https://github.com/neo4j-contrib/mcp-neo4j

Configure Claude Desktop

Add to your Claude Desktop config file:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "neo4j": {
      "command": "neo4j-mcp",
      "env": {
        "NEO4J_URI": "bolt://localhost:7687",
        "NEO4J_USERNAME": "neo4j",
        "NEO4J_PASSWORD": "your_password",
        "NEO4J_DATABASE": "neo4j"
      }
    }
  }
}
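
If the config file already lists other MCP servers, the entry should be merged in rather than overwriting the file. The helper below is a hypothetical sketch of that merge (the name `add_neo4j_server` is not part of any tool); it writes the same `neo4j` entry as the JSON above and leaves other servers untouched.

```python
import json
from pathlib import Path


def add_neo4j_server(config_path, uri, username, password, database="neo4j"):
    """Add or replace the "neo4j" entry under "mcpServers" in the config file."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers["neo4j"] = {
        "command": "neo4j-mcp",
        "env": {
            "NEO4J_URI": uri,
            "NEO4J_USERNAME": username,
            "NEO4J_PASSWORD": password,
            "NEO4J_DATABASE": database,
        },
    }
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Back up the existing file before running anything like this, since a malformed config can prevent Claude Desktop from loading its servers.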

Restart Claude Desktop

Ask Questions!

Try these in Claude:

"What molecules is AbbVie developing?"   

"Which companies are developing JAK inhibitors?"

"Find partnerships involving Bristol Myers Squibb"

"What therapeutic areas have the most molecules in development?"


📊 Schema Overview

1. Lexical Graph (Phase 1)

Purpose: Capture document structure and enable provenance tracking.

The lexical graph models the physical structure of PDF documents:

Companies publish Documents (quarterly pipeline updates)
Documents contain Pages with extracted text and images
Pages link sequentially via NEXT relationships for navigation

graph LR
%% Nodes
Company["Company<br/>name: STRING | KEY"]
Document["Document<br/>id: STRING | KEY<br/>name: STRING<br/>company: STRING<br/>date: DATE<br/>documentType: STRING<br/>location: STRING"]
Page["Page<br/>id: STRING | KEY<br/>pageNumber: INTEGER<br/>extractedText: STRING<br/>extractedImage: STRING<br/>embedding: LIST"]

%% Relationships
Company -->|PUBLISHED| Document
Document -->|HAS_PAGE| Page
Page -->|NEXT| Page


%% Styling 
classDef node_0_color fill:#e3f2fd,stroke:#1976d2,stroke-width:3px,color:#000,font-size:12px
class Company node_0_color

classDef node_1_color fill:#f3e5f5,stroke:#7b1fa2,stroke-width:3px,color:#000,font-size:12px
class Document node_1_color

classDef node_2_color fill:#e8f5e8,stroke:#388e3c,stroke-width:3px,color:#000,font-size:12px
class Page node_2_color
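
The sequential `NEXT` chain in the diagram can be derived from an ordered list of page IDs. The sketch below shows one possible way to do that; the helper name, the page ID format, and the Cypher statement are assumptions for illustration and are not taken from pipeline/neo4j_loader.py.

```python
def next_links(page_ids):
    """Return [current, following] pairs for the consecutive pages of one document."""
    return [[a, b] for a, b in zip(page_ids, page_ids[1:])]


# Parameterized Cypher that would turn each pair into a NEXT relationship.
NEXT_QUERY = """
UNWIND $links AS link
MATCH (a:Page {id: link[0]}), (b:Page {id: link[1]})
MERGE (a)-[:NEXT]->(b)
"""

if __name__ == "__main__":
    pages = ["abbvie_pipeline_2024_p1", "abbvie_pipeline_2024_p2", "abbvie_pipeline_2024_p3"]
    for current, following in next_links(pages):
        print(current, "-[:NEXT]->", following)
```

Using `MERGE` keeps the load idempotent, so re-ingesting a document does not duplicate the chain.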

2. Entity Graph (Phase 2-3)

Purpose: Capture business entities, relationships, and competitive intelligence.

The entity graph models pharmaceutical pipeline data extracted from documents using Vision LLMs:

Drug entities: Molecules, combinations, mechanisms, targets, modalities
Disease context: Diseases, therapeutic areas, indications, biomarkers
Clinical development: Trials, milestones, geographic regions
Business relationships: Partnerships between companies

graph TD
%% Nodes
Company["Company<br/>name: STRING | KEY"]
Molecule["Molecule<br/>name: STRING | KEY<br/>genericName: STRING<br/>brandName: STRING<br/>internalCode: STRING"]
Combination["Combination<br/>id: STRING | KEY<br/>name: STRING<br/>description: STRING"]
Modality["Modality<br/>name: STRING | KEY"]
Mechanism["Mechanism<br/>name: STRING | KEY"]
MechanismCategory["MechanismCategory<br/>name: STRING | KEY"]
Target["Target<br/>name: STRING | KEY"]
Disease["Disease<br/>name: STRING | KEY"]
TherapeuticArea["TherapeuticArea<br/>name: STRING | KEY"]
Indication["Indication<br/>id: STRING | KEY<br/>description: STRING<br/>lineOfTherapy: STRING<br/>setting: STRING"]
Biomarker["Biomarker<br/>name: STRING | KEY<br/>expression: STRING"]
ClinicalTrial["ClinicalTrial<br/>id: STRING | KEY<br/>nctId: STRING<br/>studyName: STRING<br/>phase: STRING<br/>status: STRING"]
Milestone["Milestone<br/>id: STRING | KEY<br/>type: STRING<br/>date: DATE<br/>status: STRING"]
Geography["Geography<br/>region: STRING | KEY<br/>fullName: STRING"]
Partnership["Partnership<br/>id: STRING | KEY<br/>partnerName: STRING<br/>type: STRING<br/>description: STRING"]

%% Relationships
Company -->|DEVELOPS| Molecule
Company -->|DEVELOPS| Combination
Molecule -->|HAS_MODALITY| Modality
Molecule -->|HAS_MECHANISM| Mechanism
Mechanism -->|IN_CATEGORY| MechanismCategory
Mechanism -->|TARGETS| Target
Molecule -->|TREATS| Indication
Combination -->|INCLUDES| Molecule
Indication -->|FOR_DISEASE| Disease
Disease -->|IN_THERAPEUTIC_AREA| TherapeuticArea
Indication -->|REQUIRES_BIOMARKER| Biomarker
Molecule -->|IN_TRIAL| ClinicalTrial
ClinicalTrial -->|IN_REGION| Geography
Milestone -->|IN_REGION| Geography
Company -->|PARTNERS_WITH| Partnership
Partnership -->|FOR_MOLECULE| Molecule
Molecule -->|SAME_AS| Molecule


%% Styling 
classDef node_0_color fill:#e3f2fd,stroke:#1976d2,stroke-width:3px,color:#000,font-size:12px
class Company node_0_color

classDef node_1_color fill:#f3e5f5,stroke:#7b1fa2,stroke-width:3px,color:#000,font-size:12px
class Molecule node_1_color

classDef node_2_color fill:#e8f5e8,stroke:#388e3c,stroke-width:3px,color:#000,font-size:12px
class Combination node_2_color

classDef node_3_color fill:#fff3e0,stroke:#f57c00,stroke-width:3px,color:#000,font-size:12px
class Modality node_3_color

classDef node_4_color fill:#fce4ec,stroke:#c2185b,stroke-width:3px,color:#000,font-size:12px
class Mechanism node_4_color

classDef node_5_color fill:#e0f2f1,stroke:#00695c,stroke-width:3px,color:#000,font-size:12px
class MechanismCategory node_5_color

classDef node_6_color fill:#f1f8e9,stroke:#689f38,stroke-width:3px,color:#000,font-size:12px
class Target node_6_color

classDef node_7_color fill:#fff8e1,stroke:#ffa000,stroke-width:3px,color:#000,font-size:12px
class Disease node_7_color

classDef node_8_color fill:#e8eaf6,stroke:#3f51b5,stroke-width:3px,color:#000,font-size:12px
class TherapeuticArea node_8_color

classDef node_9_color fill:#efebe9,stroke:#5d4037,stroke-width:3px,color:#000,font-size:12px
class Indication node_9_color

classDef node_10_color fill:#fafafa,stroke:#424242,stroke-width:3px,color:#000,font-size:12px
class Biomarker node_10_color

classDef node_11_color fill:#e1f5fe,stroke:#0277bd,stroke-width:3px,color:#000,font-size:12px
class ClinicalTrial node_11_color

classDef node_12_color fill:#f9fbe7,stroke:#827717,stroke-width:3px,color:#000,font-size:12px
class Milestone node_12_color

classDef node_13_color fill:#fff1f0,stroke:#d32f2f,stroke-width:3px,color:#000,font-size:12px
class Geography node_13_color

classDef node_14_color fill:#f4e6ff,stroke:#6a1b9a,stroke-width:3px,color:#000,font-size:12px
class Partnership node_14_color

🔄 Pipeline Phases

Phase 1: Lexical Graph ✅ COMPLETE

✅ Parse PDF documents
✅ Extract page text and images
✅ Create Document → Page structure in Neo4j
✅ Sequential page navigation with NEXT relationships
Result: 102 pages loaded across 5 documents (AbbVie, Bristol Myers Squibb, Bayer, J&J, Pfizer)

Phase 2: Entity Extraction ✅ COMPLETE

✅ Vision LLM (GPT-5-mini) with text + image input
✅ Structured output with Pydantic validation
✅ Extract: Molecules, Diseases, Trials, Partnerships, Mechanisms, Targets, etc.
✅ Page-level parallel processing (30 concurrent pages across ALL documents)
✅ Full provenance tracking (EXTRACTED_FROM relationships)
Result: 344 molecules, 503 indications, 582 treatment relationships in 3.8 minutes (26.75 pages/min)
Optimization: Pages from multiple documents processed simultaneously for 2-4x faster batch processing
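
Validating structured output means checking each extracted record against the schema before it reaches the graph. The project uses Pydantic for this; to stay dependency-free, the sketch below swaps in a standard-library dataclass with manual checks. The field names follow the Molecule node in the schema overview, while the `{"molecules": [...]}` response shape and the function name are assumptions.

```python
import json
from dataclasses import dataclass
from typing import Optional

ALLOWED_KEYS = {"name", "genericName", "brandName", "internalCode"}


@dataclass
class MoleculeRecord:
    name: str
    genericName: Optional[str] = None
    brandName: Optional[str] = None
    internalCode: Optional[str] = None

    def __post_init__(self):
        # The name is the node key, so it must be a non-empty string.
        if not isinstance(self.name, str) or not self.name.strip():
            raise ValueError("molecule name must be a non-empty string")
        self.name = self.name.strip()


def parse_molecules(raw_json):
    """Split one page's LLM response into valid records and rejected items."""
    payload = json.loads(raw_json)
    valid, rejected = [], []
    for item in payload.get("molecules", []):
        try:
            unknown = set(item) - ALLOWED_KEYS
            if unknown:
                raise ValueError(f"unexpected fields: {sorted(unknown)}")
            valid.append(MoleculeRecord(**item))
        except (TypeError, ValueError) as exc:
            rejected.append((item, str(exc)))
    return valid, rejected
```

Rejected items could be written somewhere like the failed_extractions/ directory for later inspection, though how the project actually records failures is not shown here.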

Phase 3: Post-Processing & Optimization ✅ COMPLETE

Page Embeddings (OPTIONAL)

✅ Generate embeddings for page text using OpenAI embeddings API
✅ Derive direct Company→Molecule/Combination relationships from provenance
✅ Normalize entity names across documents (e.g., "Opdivo" vs "nivolumab")
✅ Create SAME_AS relationships or merge duplicate nodes
Post-processing with ontologies (MONDO, PubChem, UniProt, etc.) (NEXT)

Phase 4: Query Layer & RAG 🔄 NEXT

Build Cypher query templates for strategic questions
Text-to-Cypher agent (LangChain)
Hybrid vector + graph RAG
Streamlit exploration UI
Competitive intelligence dashboard

📝 Document Metadata

Edit documents_metadata.csv to add new documents:

documentId,company,name,date,documentType,location,url
abbvie_pipeline_2024,AbbVie,AbbVie Pipeline Update,2024-02-02,Pipeline Update,data/AbbVie.pdf,
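
A short sketch of reading this file with the standard library follows. Which columns are mandatory is an assumption here (everything except `url`, which is empty in the sample row), and `read_metadata` is a hypothetical helper rather than the project's loader.

```python
import csv
import io
from datetime import date

REQUIRED = ("documentId", "company", "name", "date", "documentType", "location")


def read_metadata(handle):
    """Parse metadata rows, checking required columns and ISO dates."""
    rows = []
    # Data starts on line 2 because line 1 holds the header.
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        missing = [col for col in REQUIRED if not (row.get(col) or "").strip()]
        if missing:
            raise ValueError(f"line {line_no}: missing {missing}")
        row["date"] = date.fromisoformat(row["date"])
        rows.append(row)
    return rows


SAMPLE = """documentId,company,name,date,documentType,location,url
abbvie_pipeline_2024,AbbVie,AbbVie Pipeline Update,2024-02-02,Pipeline Update,data/AbbVie.pdf,
"""

if __name__ == "__main__":
    for row in read_metadata(io.StringIO(SAMPLE)):
        print(row["documentId"], row["company"], row["date"].isoformat())
```

Failing on the first incomplete row surfaces typos in the CSV before any PDF is processed.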

🔍 Example Queries

Entity Queries (Phase 2)

// Top molecules by number of indications
MATCH (m:Molecule)-[:TREATS]->(i:Indication)
WITH m, count(DISTINCT i) as indications
RETURN m.name, m.brandName, indications
ORDER BY indications DESC
LIMIT 10

// Find competitive landscape for a disease
MATCH (d:Disease {name: "Non-Small Cell Lung Cancer"})<-[:FOR_DISEASE]-(i:Indication)<-[:TREATS]-(m:Molecule)
OPTIONAL MATCH (m)-[:IN_TRIAL]->(ct:ClinicalTrial)
RETURN m.name, m.brandName, 
       count(DISTINCT i) as indications,
       collect(DISTINCT ct.phase) as trial_phases
ORDER BY indications DESC

// Molecules with mechanisms and targets
MATCH (m:Molecule)-[:HAS_MECHANISM]->(mech:Mechanism)-[:TARGETS]->(t:Target)
RETURN m.name, mech.name, collect(t.name) as targets
LIMIT 10

// Partnership analysis
MATCH (p:Partnership)-[:FOR_MOLECULE]->(m:Molecule)
MATCH (partner:Company)-[:PARTNERS_WITH]->(p)
RETURN m.name, collect(partner.name) as partners
ORDER BY size(partners) DESC
LIMIT 10
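
These queries can also be run from Python with the official Neo4j driver. The sketch below wraps the first query in a function; it assumes the `neo4j` package in a 5.x release recent enough to provide `execute_query`, and the function name and parameterized limit are illustrative additions.

```python
TOP_MOLECULES = """
MATCH (m:Molecule)-[:TREATS]->(i:Indication)
WITH m, count(DISTINCT i) AS indications
RETURN m.name AS name, m.brandName AS brandName, indications
ORDER BY indications DESC
LIMIT $limit
"""


def top_molecules(uri, user, password, database="neo4j", limit=10):
    """Return the molecules with the most indications as a list of dicts."""
    # Imported lazily so this module can be loaded without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        records, _summary, _keys = driver.execute_query(
            TOP_MOLECULES, limit=limit, database_=database
        )
        return [record.data() for record in records]


if __name__ == "__main__":
    for row in top_molecules("bolt://localhost:7687", "neo4j", "your_password"):
        print(row["name"], row["brandName"], row["indications"])
```

Passing the limit as a query parameter instead of formatting it into the string lets Neo4j reuse the query plan and avoids injection issues.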


neo4j-field/pharma-pipeline-KG-creation
