Transform pharmaceutical pipeline PDFs into queryable Neo4j graphs. Vision LLM extraction of molecules, trials, mechanisms, and partnerships. Two-layer architecture with full provenance tracking. Demo GraphRAG superiority over vector-only RAG for complex business queries.

neo4j-field/pharma-pipeline-KG-creation

Pharmaceutical Pipeline Knowledge Graph

Python 3.10+ | Neo4j 5.x | License: MIT | OpenAI

Extract structured knowledge from pharmaceutical pipeline PDFs and build a queryable Neo4j graph database.

This project demonstrates how to use Vision LLMs + GraphRAG to transform unstructured pharma documents into a knowledge graph that enables complex competitive intelligence queries impossible with traditional vector-only RAG.

🎯 Why This Matters

Pharmaceutical companies release quarterly pipeline updates as PDF presentations. These contain rich information about:

  • Molecules in development (brands, generics, mechanisms)
  • Clinical trials (phases, indications, timelines)
  • Partnerships and licensing deals
  • Competitive landscapes by therapeutic area

The Problem: This data is locked in complex slide layouts (tables, timelines, nested information, images) that are hard to analyze at scale.

The Solution: Extract entities and relationships into a Neo4j graph, enabling queries like:

  • "Which therapeutic mechanisms are being pursued by multiple companies, and which companies are competing in the same mechanism space?"
  • "Which individual molecules are most frequently used as components in combination therapies, and what roles do they play?"
  • "Which molecules are being developed for multiple diseases across different therapeutic areas, indicating broad platform potential?"

⚡ Quick Build - Full Pipeline

Build the complete knowledge graph in roughly 4-5 minutes (the sum of the step timings below):

# 1. Setup (one-time)
python scripts/setup_db.py

# 2. Build full graph (recommended for first run)
python scripts/ingest_document.py --all                    # ~30 seconds (102 pages)
python scripts/extract_entities.py --all --parallel 30      # ~3-4 minutes (344 molecules)
python scripts/embed_pages.py --all                         # ~5 seconds (102 embeddings) [OPTIONAL]
python scripts/postprocess_phase3b.py --all                 # ~30 seconds (454 relationships) [RECOMMENDED]



# Done! Query your knowledge graph in Neo4j Browser

A dump of the database is also available here (560 MB).
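The build steps above can also be chained from a single Python script. The following is a minimal sketch and not part of the repository: `build_commands` and `run_pipeline` are hypothetical helper names, and only the script paths and flags are taken from this README.

```python
import subprocess
import sys


def build_commands(parallel: int = 30, with_embeddings: bool = True) -> list[list[str]]:
    """Return the Quick Build steps as argv lists, in execution order."""
    py = sys.executable  # reuse the interpreter that runs this script
    commands = [
        [py, "scripts/setup_db.py"],
        [py, "scripts/ingest_document.py", "--all"],
        [py, "scripts/extract_entities.py", "--all", "--parallel", str(parallel)],
    ]
    if with_embeddings:  # the embedding step is optional
        commands.append([py, "scripts/embed_pages.py", "--all"])
    commands.append([py, "scripts/postprocess_phase3b.py", "--all"])
    return commands


def run_pipeline(parallel: int = 30, with_embeddings: bool = True) -> None:
    """Run each step in turn; check=True stops at the first failing step."""
    for argv in build_commands(parallel, with_embeddings):
        print("Running:", " ".join(argv[1:]))
        subprocess.run(argv, check=True)
```

Calling `run_pipeline()` from the project root would run the steps in order and raise `subprocess.CalledProcessError` as soon as one of them exits with a non-zero status.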


📁 Project Structure

pharma-pipeline/
├── data/                          # PDF documents
├── documents_metadata.csv         # Document metadata
├── pipeline/                      # Core pipeline modules
│   ├── config.py                  # Configuration management
│   ├── pdf_processor.py           # PDF → text + images
│   ├── neo4j_loader.py            # Load data into Neo4j
│   ├── llm_extractor.py           # LLM-based extraction (Phase 2)
│   └── embedder.py                # Page embeddings (Phase 2.5)
├── scripts/                       # CLI scripts
│   ├── setup_db.py                # Initialize Neo4j schema
│   ├── ingest_document.py         # Phase 1: Ingest documents (lexical graph)
│   ├── extract_entities.py        # Phase 2: Extract entities (parallel)
│   ├── embed_pages.py             # Phase 2.5: Create embeddings (optional)
│   ├── postprocess_phase3b.py     # Phase 3: PostProcessing relationships (recommended)
│   └── test_extraction.py         # Test extraction on single page
├── logs/                          # Log files
├── failed_extractions/            # Failed extraction records
└── requirements.txt               # Python dependencies

🚀 Quick Start

1. Prerequisites

  • Python 3.10+
  • Neo4j 5.x (running locally or Neo4j Aura)
  • OpenAI API key with GPT-5-mini access (or GPT-4o/GPT-4o-mini)
  • OpenAI usage Tier 5 recommended for optimal parallel processing (30+ concurrent requests)

2. Installation

# Clone/navigate to project
cd pharma-pipeline

# Install dependencies with uv
uv pip install -r requirements.txt

3. Configuration

Create a .env file in the project root:

# Neo4j Configuration
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password
NEO4J_DATABASE=neo4j

# OpenAI Configuration
OPENAI_API_KEY=sk-...
EXTRACT_MODEL=gpt-5-mini           # For entity extraction
EMBEDDING_MODEL=text-embedding-3-small  # For page embeddings (optional)

# Optional: Logging
LOG_LEVEL=INFO

Model Recommendations:

  • gpt-5-mini: Best balance of speed/cost/quality for extraction (recommended)
  • text-embedding-3-small: Fast, cheap embeddings (1536 dims) - to demonstrate VectorRAG vs GraphRAG

See env.template for full configuration options.
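As an illustration of how these settings might be read, here is a standard-library sketch. It is an assumption, not the contents of pipeline/config.py (which may well use a library such as python-dotenv): `parse_env` and `Settings` are hypothetical names, and the defaults are simply the values from the example .env above.

```python
from dataclasses import dataclass


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop trailing inline comments such as "gpt-5-mini   # For entity extraction".
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


@dataclass(frozen=True)
class Settings:
    neo4j_uri: str
    neo4j_user: str
    neo4j_password: str
    neo4j_database: str
    openai_api_key: str
    extract_model: str
    embedding_model: str
    log_level: str

    @classmethod
    def from_mapping(cls, env: dict[str, str]) -> "Settings":
        # Secrets have no default: a missing key raises KeyError at startup.
        return cls(
            neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7687"),
            neo4j_user=env.get("NEO4J_USER", "neo4j"),
            neo4j_password=env["NEO4J_PASSWORD"],
            neo4j_database=env.get("NEO4J_DATABASE", "neo4j"),
            openai_api_key=env["OPENAI_API_KEY"],
            extract_model=env.get("EXTRACT_MODEL", "gpt-5-mini"),
            embedding_model=env.get("EMBEDDING_MODEL", "text-embedding-3-small"),
            log_level=env.get("LOG_LEVEL", "INFO"),
        )
```

One limitation of this sketch: a value that itself contains the sequence " #" (for example in a password) would be cut off at that point, so such values would need quoting support that is not implemented here.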

4. Initialize Database

Run once to create constraints and indexes:

python scripts/setup_db.py

5. Ingest Documents (Phase 1: Lexical Graph)

Single document:

python scripts/ingest_document.py --document-id abbvie_pipeline_2024

All documents:

python scripts/ingest_document.py --all

Force re-ingest:

python scripts/ingest_document.py --document-id abbvie_pipeline_2024 --force

6. Extract Entities (Phase 2: Entity Extraction)

All documents with page-level parallel processing (RECOMMENDED):

python scripts/extract_entities.py --all --parallel 30

Single document:

python scripts/extract_entities.py --document-id abbvie_pipeline_2024

Extract specific pages:

python scripts/extract_entities.py --document-id abbvie_pipeline_2024 --pages 5,6,7

Performance:

  • With --parallel 30, expect ~25-30 pages/minute depending on page density
  • Multi-document processing: All 102 pages across 5 documents process simultaneously
  • Recent run: 102 pages in about 3.8 minutes (26.75 pages/min) with 0 failures ✨
  • Single-document runs: throughput is unchanged; the speedup applies to multi-document batches
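Page-level parallelism across documents can be pictured as one shared concurrency limit over a flat list of pages. The sketch below uses `asyncio.Semaphore` for that; it is an assumption about the approach rather than the code in scripts/extract_entities.py, and `extract_all` and the `extract_page` callback are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable


async def extract_all(
    pages: list[tuple[str, int]],
    extract_page: Callable[[str, int], Awaitable[dict]],
    parallel: int = 30,
) -> tuple[list[dict], list[tuple[str, int, str]]]:
    """Run extract_page over (document_id, page_number) pairs, at most `parallel` at a time."""
    semaphore = asyncio.Semaphore(parallel)

    async def worker(document_id: str, page_number: int) -> dict:
        async with semaphore:  # caps in-flight calls across all documents
            return await extract_page(document_id, page_number)

    outcomes = await asyncio.gather(
        *(worker(doc, num) for doc, num in pages),
        return_exceptions=True,  # one failed page must not cancel the rest
    )
    results, failures = [], []
    for (doc, num), outcome in zip(pages, outcomes):
        if isinstance(outcome, BaseException):
            failures.append((doc, num, repr(outcome)))
        else:
            results.append(outcome)
    return results, failures
```

Because the limit is shared by every page regardless of which document it belongs to, a short document never leaves worker slots idle while a long one is still being processed.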

7. Generate Page Embeddings (Phase 2.5: Optional)

Create embeddings for all pages:

python scripts/embed_pages.py --all

Single document:

python scripts/embed_pages.py --document-id abbvie_pipeline_2024

Force re-embed (if you change embedding model):

python scripts/embed_pages.py --all --force

Control parallelism:

python scripts/embed_pages.py --all --parallel 50

Why embeddings?

  • Optional step for comparing GraphRAG vs traditional vector-only RAG
  • Fast: ~1000 pages/minute (vs 10-30 for extraction)
  • Cheap: ~$0.00001 per page (vs ~$0.03 for extraction)
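Applied to the 102-page corpus, the approximate per-page figures above give a sense of scale. The inputs are the rough numbers quoted in this README, not measured billing data.

```python
PAGES = 102                          # pages across the 5 documents
EXTRACTION_COST_PER_PAGE = 0.03      # USD, approximate figure quoted above
EMBEDDING_COST_PER_PAGE = 0.00001    # USD, approximate figure quoted above
EMBEDDING_PAGES_PER_MIN = 1000       # approximate figure quoted above

extraction_cost = PAGES * EXTRACTION_COST_PER_PAGE
embedding_cost = PAGES * EMBEDDING_COST_PER_PAGE
embedding_seconds = PAGES / EMBEDDING_PAGES_PER_MIN * 60

print(f"extraction: ~${extraction_cost:.2f}")      # ~$3.06
print(f"embeddings: ~${embedding_cost:.5f}")       # ~$0.00102
print(f"embedding time: ~{embedding_seconds:.0f} s")  # ~6 s
```

At these rates the embedding step adds about a tenth of a cent and a few seconds to a build whose extraction step costs about three dollars.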

8. Post-Processing: Entity Resolution & Enrichment (Phase 3b: Recommended)

Run all post-processing steps:

python scripts/postprocess_phase3b.py --all

What it does:

  • ✅ Creates DEVELOPS relationships: (Company)-[:DEVELOPS]->(Molecule|Combination)
  • ✅ Creates SAME_AS relationships: Links brand ↔ generic names and case variants
  • ✅ Creates INVOLVES relationships: Links partnerships to companies
  • ✅ Normalizes TherapeuticArea: Merges singular/plural variants
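To make the SAME_AS step concrete, the sketch below pairs molecule records whose known names overlap case-insensitively, which covers both brand ↔ generic links and case variants. It is an illustrative assumption, not the logic in scripts/postprocess_phase3b.py; `same_as_pairs` is a hypothetical name and the records are made-up sample data using the Molecule properties from the schema.

```python
from itertools import combinations


def same_as_pairs(molecules: list[dict]) -> set[tuple[str, str]]:
    """Return unordered name pairs that probably denote the same molecule.

    Two records are paired when any of their known names (name, genericName,
    brandName) match after case folding.
    """
    def aliases(m: dict) -> set[str]:
        keys = ("name", "genericName", "brandName")
        return {m[k].strip().casefold() for k in keys if m.get(k)}

    pairs = set()
    for a, b in combinations(molecules, 2):
        if a["name"] != b["name"] and aliases(a) & aliases(b):
            pairs.add(tuple(sorted((a["name"], b["name"]))))
    return pairs


molecules = [
    {"name": "Opdivo", "genericName": "nivolumab", "brandName": "Opdivo"},
    {"name": "nivolumab"},
    {"name": "Nivolumab"},
    {"name": "Rinvoq", "genericName": "upadacitinib"},
]
for left, right in sorted(same_as_pairs(molecules)):
    print(f"({left})-[:SAME_AS]-({right})")
```

The pairwise comparison is quadratic in the number of molecules, which is acceptable at the scale reported here (344 molecules) but would need an index keyed by alias for much larger graphs.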

Preview changes (dry run):

python scripts/postprocess_phase3b.py --dry-run --all

Run specific steps:

# Only DEVELOPS relationships
python scripts/postprocess_phase3b.py --develops

# Only SAME_AS relationships
python scripts/postprocess_phase3b.py --same-as

# SAME_AS + Partnership linking
python scripts/postprocess_phase3b.py --same-as --partnerships

Verify existing relationships:

python scripts/postprocess_phase3b.py --verify

9. Query Your Graph with Claude Desktop (MCP Server) ⭐ RECOMMENDED

Use Natural Language to Query Your Graph!

Instead of writing Cypher queries, connect Claude Desktop to your Neo4j database using the Model Context Protocol (MCP) and ask questions in natural language.

What is MCP?

The Model Context Protocol allows LLM applications (like Claude Desktop) to connect to external tools and data sources. The Neo4j MCP server gives Claude the ability to:

  • Read your graph schema
  • Execute Cypher queries
  • Answer complex questions about your pharmaceutical pipeline

Quick Setup

  1. Install Neo4j MCP Server (if not already installed)

    # Follow instructions at:
    # https://github.com/neo4j-contrib/mcp-neo4j
  2. Configure Claude Desktop

    Add to your Claude Desktop config file:

    macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
    Windows: %APPDATA%\Claude\claude_desktop_config.json

    {
      "mcpServers": {
        "neo4j": {
          "command": "neo4j-mcp",
          "env": {
            "NEO4J_URI": "bolt://localhost:7687",
            "NEO4J_USERNAME": "neo4j",
            "NEO4J_PASSWORD": "your_password",
            "NEO4J_DATABASE": "neo4j"
          }
        }
      }
    }
  3. Restart Claude Desktop

  4. Ask Questions!

    Try these in Claude:

    "What molecules is AbbVie developing?"   
    
    "Which companies are developing JAK inhibitors?"
    
    "Find partnerships involving Bristol Myers Squibb"
    
    "What therapeutic areas have the most molecules in development?"
    

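If the Claude Desktop config file already lists other MCP servers, the "neo4j" entry from step 2 has to be merged in rather than pasted over the file. A small sketch of that merge follows; `add_neo4j_server` and `update_config_file` are hypothetical helpers, and the entry mirrors the JSON shown in step 2.

```python
import json
from pathlib import Path


def add_neo4j_server(config: dict, uri: str, username: str, password: str,
                     database: str = "neo4j") -> dict:
    """Return a copy of a Claude Desktop config with the "neo4j" MCP server entry set."""
    merged = json.loads(json.dumps(config))  # deep copy of JSON-compatible data
    servers = merged.setdefault("mcpServers", {})
    servers["neo4j"] = {
        "command": "neo4j-mcp",
        "env": {
            "NEO4J_URI": uri,
            "NEO4J_USERNAME": username,
            "NEO4J_PASSWORD": password,
            "NEO4J_DATABASE": database,
        },
    }
    return merged


def update_config_file(path: Path, **connection: str) -> None:
    """Merge the entry into the file at `path`, creating the file if it does not exist."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.write_text(json.dumps(add_neo4j_server(existing, **connection), indent=2))
```

For example, `update_config_file(Path.home() / "Library/Application Support/Claude/claude_desktop_config.json", uri="bolt://localhost:7687", username="neo4j", password="your_password")` would update the macOS config while leaving other server entries untouched.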
Learn More:

📊 Schema Overview

1. Lexical Graph (Phase 1)

Purpose: Capture document structure and enable provenance tracking.

The lexical graph models the physical structure of PDF documents:

  • Companies publish Documents (quarterly pipeline updates)
  • Documents contain Pages with extracted text and images
  • Pages link sequentially via NEXT relationships for navigation

graph LR
%% Nodes
Company["Company<br/>name: STRING | KEY"]
Document["Document<br/>id: STRING | KEY<br/>name: STRING<br/>company: STRING<br/>date: DATE<br/>documentType: STRING<br/>location: STRING"]
Page["Page<br/>id: STRING | KEY<br/>pageNumber: INTEGER<br/>extractedText: STRING<br/>extractedImage: STRING<br/>embedding: LIST"]

%% Relationships
Company -->|PUBLISHED| Document
Document -->|HAS_PAGE| Page
Page -->|NEXT| Page


%% Styling 
classDef node_0_color fill:#e3f2fd,stroke:#1976d2,stroke-width:3px,color:#000,font-size:12px
class Company node_0_color

classDef node_1_color fill:#f3e5f5,stroke:#7b1fa2,stroke-width:3px,color:#000,font-size:12px
class Document node_1_color

classDef node_2_color fill:#e8f5e8,stroke:#388e3c,stroke-width:3px,color:#000,font-size:12px
class Page node_2_color
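The Document → Page → NEXT structure in the diagram can be prepared as plain rows before loading. The sketch below is an assumption about one way to do it, not the code in pipeline/neo4j_loader.py: `lexical_rows` is a hypothetical helper, the page id format is invented for illustration, and the labels, relationship types, and property names come from the diagram.

```python
def lexical_rows(document_id: str, page_texts: list[str]) -> tuple[list[dict], list[dict]]:
    """Build Page rows and NEXT pairs for one document (page numbers start at 1)."""
    pages = [
        # The "<documentId>_p<n>" id format is an assumption for this sketch.
        {"id": f"{document_id}_p{n}", "pageNumber": n, "extractedText": text}
        for n, text in enumerate(page_texts, start=1)
    ]
    # Each page points to its successor; the last page has no NEXT edge.
    next_pairs = [{"src": a["id"], "dst": b["id"]} for a, b in zip(pages, pages[1:])]
    return pages, next_pairs


# Cypher to load the rows; MERGE keeps a re-run from duplicating nodes.
LOAD_PAGES = """
MATCH (d:Document {id: $documentId})
UNWIND $pages AS row
MERGE (p:Page {id: row.id})
SET p.pageNumber = row.pageNumber, p.extractedText = row.extractedText
MERGE (d)-[:HAS_PAGE]->(p)
"""

LINK_PAGES = """
UNWIND $pairs AS pair
MATCH (a:Page {id: pair.src}), (b:Page {id: pair.dst})
MERGE (a)-[:NEXT]->(b)
"""
```

Passing the rows as `$pages` and `$pairs` parameters keeps the load to two queries per document instead of one query per page.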

2. Entity Graph (Phase 2-3)

Purpose: Capture business entities, relationships, and competitive intelligence.

The entity graph models pharmaceutical pipeline data extracted from documents using Vision LLMs:

  • Drug entities: Molecules, combinations, mechanisms, targets, modalities
  • Disease context: Diseases, therapeutic areas, indications, biomarkers
  • Clinical development: Trials, milestones, geographic regions
  • Business relationships: Partnerships between companies

graph TD
%% Nodes
Company["Company<br/>name: STRING | KEY"]
Molecule["Molecule<br/>name: STRING | KEY<br/>genericName: STRING<br/>brandName: STRING<br/>internalCode: STRING"]
Combination["Combination<br/>id: STRING | KEY<br/>name: STRING<br/>description: STRING"]
Modality["Modality<br/>name: STRING | KEY"]
Mechanism["Mechanism<br/>name: STRING | KEY"]
MechanismCategory["MechanismCategory<br/>name: STRING | KEY"]
Target["Target<br/>name: STRING | KEY"]
Disease["Disease<br/>name: STRING | KEY"]
TherapeuticArea["TherapeuticArea<br/>name: STRING | KEY"]
Indication["Indication<br/>id: STRING | KEY<br/>description: STRING<br/>lineOfTherapy: STRING<br/>setting: STRING"]
Biomarker["Biomarker<br/>name: STRING | KEY<br/>expression: STRING"]
ClinicalTrial["ClinicalTrial<br/>id: STRING | KEY<br/>nctId: STRING<br/>studyName: STRING<br/>phase: STRING<br/>status: STRING"]
Milestone["Milestone<br/>id: STRING | KEY<br/>type: STRING<br/>date: DATE<br/>status: STRING"]
Geography["Geography<br/>region: STRING | KEY<br/>fullName: STRING"]
Partnership["Partnership<br/>id: STRING | KEY<br/>partnerName: STRING<br/>type: STRING<br/>description: STRING"]

%% Relationships
Company -->|DEVELOPS| Molecule
Company -->|DEVELOPS| Combination
Molecule -->|HAS_MODALITY| Modality
Molecule -->|HAS_MECHANISM| Mechanism
Mechanism -->|IN_CATEGORY| MechanismCategory
Mechanism -->|TARGETS| Target
Molecule -->|TREATS| Indication
Combination -->|INCLUDES| Molecule
Indication -->|FOR_DISEASE| Disease
Disease -->|IN_THERAPEUTIC_AREA| TherapeuticArea
Indication -->|REQUIRES_BIOMARKER| Biomarker
Molecule -->|IN_TRIAL| ClinicalTrial
ClinicalTrial -->|IN_REGION| Geography
Milestone -->|IN_REGION| Geography
Company -->|PARTNERS_WITH| Partnership
Partnership -->|FOR_MOLECULE| Molecule
Molecule -->|SAME_AS| Molecule


%% Styling 
classDef node_0_color fill:#e3f2fd,stroke:#1976d2,stroke-width:3px,color:#000,font-size:12px
class Company node_0_color

classDef node_1_color fill:#f3e5f5,stroke:#7b1fa2,stroke-width:3px,color:#000,font-size:12px
class Molecule node_1_color

classDef node_2_color fill:#e8f5e8,stroke:#388e3c,stroke-width:3px,color:#000,font-size:12px
class Combination node_2_color

classDef node_3_color fill:#fff3e0,stroke:#f57c00,stroke-width:3px,color:#000,font-size:12px
class Modality node_3_color

classDef node_4_color fill:#fce4ec,stroke:#c2185b,stroke-width:3px,color:#000,font-size:12px
class Mechanism node_4_color

classDef node_5_color fill:#e0f2f1,stroke:#00695c,stroke-width:3px,color:#000,font-size:12px
class MechanismCategory node_5_color

classDef node_6_color fill:#f1f8e9,stroke:#689f38,stroke-width:3px,color:#000,font-size:12px
class Target node_6_color

classDef node_7_color fill:#fff8e1,stroke:#ffa000,stroke-width:3px,color:#000,font-size:12px
class Disease node_7_color

classDef node_8_color fill:#e8eaf6,stroke:#3f51b5,stroke-width:3px,color:#000,font-size:12px
class TherapeuticArea node_8_color

classDef node_9_color fill:#efebe9,stroke:#5d4037,stroke-width:3px,color:#000,font-size:12px
class Indication node_9_color

classDef node_10_color fill:#fafafa,stroke:#424242,stroke-width:3px,color:#000,font-size:12px
class Biomarker node_10_color

classDef node_11_color fill:#e1f5fe,stroke:#0277bd,stroke-width:3px,color:#000,font-size:12px
class ClinicalTrial node_11_color

classDef node_12_color fill:#f9fbe7,stroke:#827717,stroke-width:3px,color:#000,font-size:12px
class Milestone node_12_color

classDef node_13_color fill:#fff1f0,stroke:#d32f2f,stroke-width:3px,color:#000,font-size:12px
class Geography node_13_color

classDef node_14_color fill:#f4e6ff,stroke:#6a1b9a,stroke-width:3px,color:#000,font-size:12px
class Partnership node_14_color

🔄 Pipeline Phases

Phase 1: Lexical Graph ✅ COMPLETE

  • ✅ Parse PDF documents
  • ✅ Extract page text and images
  • ✅ Create Document → Page structure in Neo4j
  • ✅ Sequential page navigation with NEXT relationships
  • Result: 102 pages loaded across 5 documents (AbbVie, Bristol Myers Squibb, Bayer, J&J, Pfizer)

Phase 2: Entity Extraction ✅ COMPLETE

  • ✅ Vision LLM (GPT-5-mini) with text + image input
  • ✅ Structured output with Pydantic validation
  • ✅ Extract: Molecules, Diseases, Trials, Partnerships, Mechanisms, Targets, etc.
  • ✅ Page-level parallel processing (30 concurrent pages across ALL documents)
  • ✅ Full provenance tracking (EXTRACTED_FROM relationships)
  • Result: 344 molecules, 503 indications, 582 treatment relationships in 3.8 minutes (26.75 pages/min)
  • Optimization: Pages from multiple documents processed simultaneously for 2-4x faster batch processing
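Validating structured output means rejecting malformed items before they reach the graph. The project uses Pydantic for this; the sketch below uses standard-library dataclasses as a dependency-free stand-in. The reply shape `{"molecules": [...]}` and the names `MoleculeRecord` and `parse_molecules` are assumptions, while the field names follow the Molecule node in the schema.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MoleculeRecord:
    name: str
    genericName: Optional[str] = None
    brandName: Optional[str] = None
    internalCode: Optional[str] = None


def parse_molecules(llm_json: str) -> tuple[list[MoleculeRecord], list[str]]:
    """Validate a JSON reply of the assumed form {"molecules": [...]}.

    Returns (valid records, error messages) so that one malformed item
    does not discard the rest of the page.
    """
    payload = json.loads(llm_json)
    records, errors = [], []
    for i, item in enumerate(payload.get("molecules", [])):
        name = item.get("name") if isinstance(item, dict) else None
        if not isinstance(name, str) or not name.strip():
            errors.append(f"molecules[{i}]: missing or empty 'name'")
            continue
        # Keep optional fields only when they are strings.
        optional = {
            key: item[key]
            for key in ("genericName", "brandName", "internalCode")
            if isinstance(item.get(key), str)
        }
        records.append(MoleculeRecord(name=name.strip(), **optional))
    return records, errors
```

Collecting errors per item, rather than raising on the first one, fits a pipeline that records failures (for example in failed_extractions/) and keeps going.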

Phase 3: Post-Processing & Optimization ✅ COMPLETE

Page Embeddings (OPTIONAL)

  • ✅ Generate embeddings for page text using OpenAI embeddings API

Entity Resolution & Enrichment (RECOMMENDED)

  • ✅ Derive direct Company→Molecule/Combination relationships from provenance
  • ✅ Normalize entity names across documents (e.g., "Opdivo" vs "nivolumab")
  • ✅ Create SAME_AS relationships or merge duplicate nodes
  • Post-processing with ontologies (MONDO, PubChem, UniProt, etc.) (NEXT)
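Deriving Company→Molecule/Combination relationships from provenance amounts to walking from a company to its documents' pages and on to whatever was extracted from those pages. The query below is a sketch under stated assumptions: the direction of EXTRACTED_FROM (entity to page) is a guess, `derive_develops` is a hypothetical helper, and `session` is expected to behave like a session from the official neo4j Python driver.

```python
# Assumed provenance shape: (entity)-[:EXTRACTED_FROM]->(Page); the direction is a guess.
DERIVE_DEVELOPS = """
MATCH (c:Company)-[:PUBLISHED]->(:Document)-[:HAS_PAGE]->(p:Page)
MATCH (e)-[:EXTRACTED_FROM]->(p)
WHERE e:Molecule OR e:Combination
MERGE (c)-[:DEVELOPS]->(e)
RETURN count(*) AS processed
"""


def derive_develops(session) -> int:
    """Run the derivation and return the number of matched rows.

    The count is rows processed, not relationships created: MERGE leaves
    existing DEVELOPS relationships in place.
    """
    record = session.run(DERIVE_DEVELOPS).single()
    return record["processed"]
```

A caveat of this approach is that a pipeline deck can mention a partner's or competitor's molecule, so relationships derived purely from provenance are an approximation that may need review.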

Phase 4: Query Layer & RAG 🔄 NEXT

  • Build Cypher query templates for strategic questions
  • Text-to-Cypher agent (LangChain)
  • Hybrid vector + graph RAG
  • Streamlit exploration UI
  • Competitive intelligence dashboard

📝 Document Metadata

Edit documents_metadata.csv to add new documents:

documentId,company,name,date,documentType,location,url
abbvie_pipeline_2024,AbbVie,AbbVie Pipeline Update,2024-02-02,Pipeline Update,data/AbbVie.pdf,
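A file in this format can be read with the standard `csv` module. The sketch below is illustrative rather than the ingestion script's actual parser; `DocumentMeta` and `read_metadata` are hypothetical names, and the column names are the ones in the header row above.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocumentMeta:
    document_id: str
    company: str
    name: str
    published: date
    document_type: str
    location: str
    url: str


def read_metadata(handle) -> list[DocumentMeta]:
    """Parse rows from an open documents_metadata.csv-style file object."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append(DocumentMeta(
            document_id=row["documentId"].strip(),
            company=row["company"].strip(),
            name=row["name"].strip(),
            published=date.fromisoformat(row["date"].strip()),  # raises on bad dates
            document_type=row["documentType"].strip(),
            location=row["location"].strip(),
            url=(row.get("url") or "").strip(),  # url may be left empty
        ))
    return rows


sample = io.StringIO(
    "documentId,company,name,date,documentType,location,url\n"
    "abbvie_pipeline_2024,AbbVie,AbbVie Pipeline Update,2024-02-02,Pipeline Update,data/AbbVie.pdf,\n"
)
docs = read_metadata(sample)
print(docs[0].document_id, docs[0].published)
```

Parsing the date eagerly surfaces a malformed row when the metadata is loaded, rather than later during ingestion.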

🔍 Example Queries

Entity Queries (Phase 2)

// Top molecules by number of indications
MATCH (m:Molecule)-[:TREATS]->(i:Indication)
WITH m, count(DISTINCT i) as indications
RETURN m.name, m.brandName, indications
ORDER BY indications DESC
LIMIT 10

// Find competitive landscape for a disease
MATCH (d:Disease {name: "Non-Small Cell Lung Cancer"})<-[:FOR_DISEASE]-(i:Indication)<-[:TREATS]-(m:Molecule)
OPTIONAL MATCH (m)-[:IN_TRIAL]->(ct:ClinicalTrial)
RETURN m.name, m.brandName, 
       count(DISTINCT i) as indications,
       collect(DISTINCT ct.phase) as trial_phases
ORDER BY indications DESC

// Molecules with mechanisms and targets
MATCH (m:Molecule)-[:HAS_MECHANISM]->(mech:Mechanism)-[:TARGETS]->(t:Target)
RETURN m.name, mech.name, collect(t.name) as targets
LIMIT 10

// Partnership analysis
// Direction follows the schema: (Company)-[:PARTNERS_WITH]->(Partnership)
MATCH (p:Partnership)-[:FOR_MOLECULE]->(m:Molecule)
MATCH (partner:Company)-[:PARTNERS_WITH]->(p)
RETURN m.name, collect(DISTINCT partner.name) as partners
ORDER BY size(partners) DESC
LIMIT 10
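Queries like these can also be run from Python with the value passed as a parameter instead of being written into the query text. The sketch below adapts the competitive-landscape query; `competitive_landscape` is a hypothetical helper, and `session` is assumed to be a session from the official neo4j Python driver (for example one opened with `GraphDatabase.driver(uri, auth=(user, password)).session(database="neo4j")`).

```python
COMPETITIVE_LANDSCAPE = """
MATCH (d:Disease {name: $disease})<-[:FOR_DISEASE]-(i:Indication)<-[:TREATS]-(m:Molecule)
OPTIONAL MATCH (m)-[:IN_TRIAL]->(ct:ClinicalTrial)
RETURN m.name AS molecule, m.brandName AS brand,
       count(DISTINCT i) AS indications,
       collect(DISTINCT ct.phase) AS trial_phases
ORDER BY indications DESC
"""


def competitive_landscape(session, disease: str) -> list[dict]:
    """Run the disease query with a bound parameter and return plain dicts."""
    result = session.run(COMPETITIVE_LANDSCAPE, disease=disease)
    return [record.data() for record in result]
```

Binding `$disease` lets the same query text be reused for any disease name and avoids quoting problems with names that contain apostrophes or other special characters.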
