Extract structured knowledge from pharmaceutical pipeline PDFs and build a queryable Neo4j graph database.
This project demonstrates how to use Vision LLMs + GraphRAG to transform unstructured pharma documents into a knowledge graph that enables complex competitive intelligence queries impossible with traditional vector-only RAG.
Pharmaceutical companies release quarterly pipeline updates as PDF presentations. These contain rich information about:
- Molecules in development (brands, generics, mechanisms)
- Clinical trials (phases, indications, timelines)
- Partnerships and licensing deals
- Competitive landscapes by therapeutic area
The Problem: This data is locked in complex slide layouts (tables, timelines, nested information, images) that are hard to analyze at scale.
The Solution: Extract entities and relationships into a Neo4j graph, enabling queries like:
- "Which therapeutic mechanisms are being pursued by multiple companies, and which companies are competing in the same mechanism space?"
- "Which individual molecules are most frequently used as components in combination therapies, and what roles do they play?"
- "Which molecules are being developed for multiple diseases across different therapeutic areas, indicating broad platform potential?"
Build the complete knowledge graph in ~4 minutes:
# 1. Setup (one-time)
python scripts/setup_db.py
# 2. Build full graph (recommended for first run)
python scripts/ingest_document.py --all # ~30 seconds (102 pages)
python scripts/extract_entities.py --all --parallel 30 # ~3-4 minutes (344 molecules)
python scripts/embed_pages.py --all # ~5 seconds (102 embeddings) [OPTIONAL]
python scripts/postprocess_phase3b.py --all # ~30 seconds (454 relationships) [RECOMMENDED]
# Done! Query your knowledge graph in Neo4j Browser
A dump of the database is also available (560 MB).
pharma-pipeline/
├── data/                         # PDF documents
├── documents_metadata.csv        # Document metadata
├── pipeline/                     # Core pipeline modules
│   ├── config.py                 # Configuration management
│   ├── pdf_processor.py          # PDF → text + images
│   ├── neo4j_loader.py           # Load data into Neo4j
│   ├── llm_extractor.py          # LLM-based extraction (Phase 2)
│   └── embedder.py               # Page embeddings (Phase 2.5)
├── scripts/                      # CLI scripts
│   ├── setup_db.py               # Initialize Neo4j schema
│   ├── ingest_document.py        # Phase 1: Ingest documents (lexical graph)
│   ├── extract_entities.py       # Phase 2: Extract entities (parallel)
│   ├── embed_pages.py            # Phase 2.5: Create embeddings (optional)
│   ├── postprocess_phase3b.py    # Phase 3: Post-processing relationships (recommended)
│   └── test_extraction.py        # Test extraction on single page
├── logs/                         # Log files
├── failed_extractions/           # Failed extraction records
└── requirements.txt              # Python dependencies
- Python 3.10+
- Neo4j 5.x (running locally or Neo4j Aura)
- OpenAI API key with GPT-5-mini access (or GPT-4o/GPT-4o-mini)
  - OpenAI usage tier 5 recommended for optimal parallel processing (30+ concurrent requests)
# Clone/navigate to project
cd pharma-pipeline
# Install dependencies with uv
uv pip install -r requirements.txt
Create a .env file in the project root:
# Neo4j Configuration
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password
NEO4J_DATABASE=neo4j
# OpenAI Configuration
OPENAI_API_KEY=sk-...
EXTRACT_MODEL=gpt-5-mini # For entity extraction
EMBEDDING_MODEL=text-embedding-3-small # For page embeddings (optional)
# Optional: Logging
LOG_LEVEL=INFO
Model Recommendations:
- gpt-5-mini: Best balance of speed/cost/quality for extraction (recommended)
- text-embedding-3-small: Fast, cheap embeddings (1536 dims) - to demonstrate VectorRAG vs GraphRAG
See env.template for full configuration options.
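The configuration variables above could be read into a typed object along these lines. This is a minimal sketch, not the contents of `pipeline/config.py`: the `Settings` class and `load_settings` function are hypothetical names, the defaults simply mirror the example `.env` values, and it assumes the variables are already in the process environment (for example exported in the shell or loaded by a tool such as python-dotenv).

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in the .env example."""
    neo4j_uri: str
    neo4j_user: str
    neo4j_password: str
    neo4j_database: str
    openai_api_key: str
    extract_model: str
    embedding_model: str
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping, failing early if a secret is missing."""
    required = ("NEO4J_PASSWORD", "OPENAI_API_KEY")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return Settings(
        # Defaults below are assumptions copied from the example .env values.
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7687"),
        neo4j_user=env.get("NEO4J_USER", "neo4j"),
        neo4j_password=env["NEO4J_PASSWORD"],
        neo4j_database=env.get("NEO4J_DATABASE", "neo4j"),
        openai_api_key=env["OPENAI_API_KEY"],
        extract_model=env.get("EXTRACT_MODEL", "gpt-5-mini"),
        embedding_model=env.get("EMBEDDING_MODEL", "text-embedding-3-small"),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Passing the mapping explicitly keeps the function easy to exercise without touching the real environment.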
Run once to create constraints and indexes:
python scripts/setup_db.py
Single document:
python scripts/ingest_document.py --document-id abbvie_pipeline_2024
All documents:
python scripts/ingest_document.py --all
Force re-ingest:
python scripts/ingest_document.py --document-id abbvie_pipeline_2024 --force
All documents with page-level parallel processing (RECOMMENDED):
python scripts/extract_entities.py --all --parallel 30
Single document:
python scripts/extract_entities.py --document-id abbvie_pipeline_2024
Extract specific pages:
python scripts/extract_entities.py --document-id abbvie_pipeline_2024 --pages 5,6,7
Performance:
- With --parallel 30, expect ~25-30 pages/minute depending on page density
- Multi-document processing: all 102 pages across 5 documents share one work queue, with up to 30 pages in flight at once
- Recent run: 102 pages in 3.8 minutes (about 26.75 pages/min) with 0 failures
- Single document: Same performance as before
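Page-level parallelism with a fixed concurrency limit can be sketched with an `asyncio.Semaphore`. This is an illustration of the general technique only; how `extract_entities.py` schedules its work is not shown here, and `extract_all` and the `extract_page` callback are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def extract_all(
    page_ids: Iterable[str],
    extract_page: Callable[[str], Awaitable[dict]],
    parallel: int = 30,
) -> tuple[dict[str, dict], dict[str, str]]:
    """Run extract_page over every page with at most `parallel` calls in flight."""
    semaphore = asyncio.Semaphore(parallel)
    results: dict[str, dict] = {}
    failures: dict[str, str] = {}

    async def worker(page_id: str) -> None:
        async with semaphore:  # blocks while `parallel` workers are already running
            try:
                results[page_id] = await extract_page(page_id)
            except Exception as exc:  # record the failure and keep going
                failures[page_id] = repr(exc)

    # Pages from all documents go into one pool, so a short document
    # does not leave concurrency slots idle.
    await asyncio.gather(*(worker(page_id) for page_id in page_ids))
    return results, failures
```

Collecting failures instead of raising lets a batch finish even if individual pages fail, which fits the idea of keeping failed extraction records for later retries.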
Create embeddings for all pages:
python scripts/embed_pages.py --all
Single document:
python scripts/embed_pages.py --document-id abbvie_pipeline_2024
Force re-embed (if you change embedding model):
python scripts/embed_pages.py --all --force
Control parallelism:
python scripts/embed_pages.py --all --parallel 50
Why embeddings?
- Optional step for comparing GraphRAG vs traditional vector-only RAG
- Fast: ~1000 pages/minute (vs 10-30 for extraction)
- Cheap: ~$0.00001 per page (vs ~$0.03 for extraction)
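To put the per-page figures in context, the rough arithmetic for the 102-page corpus looks like this. The per-page costs are the approximate values quoted above, so the totals are estimates rather than billed amounts.

```python
PAGES = 102
EXTRACTION_COST_PER_PAGE = 0.03     # USD, approximate figure quoted above
EMBEDDING_COST_PER_PAGE = 0.00001   # USD, approximate figure quoted above


def estimate_cost(pages: int, cost_per_page: float) -> float:
    """Linear cost estimate: pages times cost per page."""
    return pages * cost_per_page


extraction_total = estimate_cost(PAGES, EXTRACTION_COST_PER_PAGE)
embedding_total = estimate_cost(PAGES, EMBEDDING_COST_PER_PAGE)
print(f"extraction ~ ${extraction_total:.2f}, embeddings ~ ${embedding_total:.5f}")
```

Under these assumptions, extraction costs on the order of three dollars for the full corpus, while embedding all pages costs a fraction of a cent.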
Run all post-processing steps:
python scripts/postprocess_phase3b.py --all
What it does:
- ✅ Creates DEVELOPS relationships: (Company)-[:DEVELOPS]->(Molecule|Combination)
- ✅ Creates SAME_AS relationships: links brand names to generic names, and links case variants
- ✅ Creates INVOLVES relationships: links partnerships to companies
- ✅ Normalizes TherapeuticArea: merges singular/plural variants
Preview changes (dry run):
python scripts/postprocess_phase3b.py --dry-run --all
Run specific steps:
# Only DEVELOPS relationships
python scripts/postprocess_phase3b.py --develops
# Only SAME_AS relationships
python scripts/postprocess_phase3b.py --same-as
# SAME_AS + Partnership linking
python scripts/postprocess_phase3b.py --same-as --partnerships
Verify existing relationships:
python scripts/postprocess_phase3b.py --verify
Use Natural Language to Query Your Graph!
Instead of writing Cypher queries, connect Claude Desktop to your Neo4j database using the Model Context Protocol (MCP) and ask questions in natural language.
The Model Context Protocol allows LLM applications (like Claude Desktop) to connect to external tools and data sources. The Neo4j MCP server gives Claude the ability to:
- Read your graph schema
- Execute Cypher queries
- Answer complex questions about your pharmaceutical pipeline
1. Install Neo4j MCP Server (if not already installed)
   # Follow instructions at:
   # https://github.com/neo4j-contrib/mcp-neo4j
2. Configure Claude Desktop
   Add to your Claude Desktop config file:
   macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
   Windows: %APPDATA%\Claude\claude_desktop_config.json
   {
     "mcpServers": {
       "neo4j": {
         "command": "neo4j-mcp",
         "env": {
           "NEO4J_URI": "bolt://localhost:7687",
           "NEO4J_USERNAME": "neo4j",
           "NEO4J_PASSWORD": "your_password",
           "NEO4J_DATABASE": "neo4j"
         }
       }
     }
   }
3. Restart Claude Desktop
4. Ask Questions!
   Try these in Claude:
   - "What molecules is AbbVie developing?"
   - "Which companies are developing JAK inhibitors?"
   - "Find partnerships involving Bristol Myers Squibb"
   - "What therapeutic areas have the most molecules in development?"
Learn More:
- Neo4j MCP Server Documentation
- Official Neo4j MCP GitHub
- MCP Inspector - Test MCP servers
Purpose: Capture document structure and enable provenance tracking.
The lexical graph models the physical structure of PDF documents:
- Companies publish Documents (quarterly pipeline updates)
- Documents contain Pages with extracted text and images
- Pages link sequentially via NEXT relationships for navigation
graph LR
%% Nodes
Company["Company<br/>name: STRING | KEY"]
Document["Document<br/>id: STRING | KEY<br/>name: STRING<br/>company: STRING<br/>date: DATE<br/>documentType: STRING<br/>location: STRING"]
Page["Page<br/>id: STRING | KEY<br/>pageNumber: INTEGER<br/>extractedText: STRING<br/>extractedImage: STRING<br/>embedding: LIST"]
%% Relationships
Company -->|PUBLISHED| Document
Document -->|HAS_PAGE| Page
Page -->|NEXT| Page
%% Styling
classDef node_0_color fill:#e3f2fd,stroke:#1976d2,stroke-width:3px,color:#000,font-size:12px
class Company node_0_color
classDef node_1_color fill:#f3e5f5,stroke:#7b1fa2,stroke-width:3px,color:#000,font-size:12px
class Document node_1_color
classDef node_2_color fill:#e8f5e8,stroke:#388e3c,stroke-width:3px,color:#000,font-size:12px
class Page node_2_color
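The sequential NEXT links in the lexical graph can be sketched as follows. This is an assumption-laden illustration, not the code in `neo4j_loader.py`: the page id format (`<documentId>_p<pageNumber>`), the function name, and the Cypher statement are all hypothetical.

```python
def next_page_pairs(document_id: str, page_numbers: list[int]) -> list[dict[str, str]]:
    """Return one {src, dst} row per consecutive pair of pages."""
    ordered = sorted(page_numbers)
    return [
        # Hypothetical id format: "<documentId>_p<pageNumber>".
        {"src": f"{document_id}_p{a}", "dst": f"{document_id}_p{b}"}
        for a, b in zip(ordered, ordered[1:])
    ]


# Hypothetical Cypher that would consume the rows above as the $pairs parameter.
# MERGE keeps the operation idempotent if ingestion is re-run.
NEXT_QUERY = """
UNWIND $pairs AS pair
MATCH (a:Page {id: pair.src}), (b:Page {id: pair.dst})
MERGE (a)-[:NEXT]->(b)
"""
```

A document with n pages produces n - 1 NEXT relationships, so following NEXT from the first page visits every page in order.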
Purpose: Capture business entities, relationships, and competitive intelligence.
The entity graph models pharmaceutical pipeline data extracted from documents using Vision LLMs:
- Drug entities: Molecules, combinations, mechanisms, targets, modalities
- Disease context: Diseases, therapeutic areas, indications, biomarkers
- Clinical development: Trials, milestones, geographic regions
- Business relationships: Partnerships between companies
graph TD
%% Nodes
Company["Company<br/>name: STRING | KEY"]
Molecule["Molecule<br/>name: STRING | KEY<br/>genericName: STRING<br/>brandName: STRING<br/>internalCode: STRING"]
Combination["Combination<br/>id: STRING | KEY<br/>name: STRING<br/>description: STRING"]
Modality["Modality<br/>name: STRING | KEY"]
Mechanism["Mechanism<br/>name: STRING | KEY"]
MechanismCategory["MechanismCategory<br/>name: STRING | KEY"]
Target["Target<br/>name: STRING | KEY"]
Disease["Disease<br/>name: STRING | KEY"]
TherapeuticArea["TherapeuticArea<br/>name: STRING | KEY"]
Indication["Indication<br/>id: STRING | KEY<br/>description: STRING<br/>lineOfTherapy: STRING<br/>setting: STRING"]
Biomarker["Biomarker<br/>name: STRING | KEY<br/>expression: STRING"]
ClinicalTrial["ClinicalTrial<br/>id: STRING | KEY<br/>nctId: STRING<br/>studyName: STRING<br/>phase: STRING<br/>status: STRING"]
Milestone["Milestone<br/>id: STRING | KEY<br/>type: STRING<br/>date: DATE<br/>status: STRING"]
Geography["Geography<br/>region: STRING | KEY<br/>fullName: STRING"]
Partnership["Partnership<br/>id: STRING | KEY<br/>partnerName: STRING<br/>type: STRING<br/>description: STRING"]
%% Relationships
Company -->|DEVELOPS| Molecule
Company -->|DEVELOPS| Combination
Molecule -->|HAS_MODALITY| Modality
Molecule -->|HAS_MECHANISM| Mechanism
Mechanism -->|IN_CATEGORY| MechanismCategory
Mechanism -->|TARGETS| Target
Molecule -->|TREATS| Indication
Combination -->|INCLUDES| Molecule
Indication -->|FOR_DISEASE| Disease
Disease -->|IN_THERAPEUTIC_AREA| TherapeuticArea
Indication -->|REQUIRES_BIOMARKER| Biomarker
Molecule -->|IN_TRIAL| ClinicalTrial
ClinicalTrial -->|IN_REGION| Geography
Milestone -->|IN_REGION| Geography
Company -->|PARTNERS_WITH| Partnership
Partnership -->|FOR_MOLECULE| Molecule
Molecule -->|SAME_AS| Molecule
%% Styling
classDef node_0_color fill:#e3f2fd,stroke:#1976d2,stroke-width:3px,color:#000,font-size:12px
class Company node_0_color
classDef node_1_color fill:#f3e5f5,stroke:#7b1fa2,stroke-width:3px,color:#000,font-size:12px
class Molecule node_1_color
classDef node_2_color fill:#e8f5e8,stroke:#388e3c,stroke-width:3px,color:#000,font-size:12px
class Combination node_2_color
classDef node_3_color fill:#fff3e0,stroke:#f57c00,stroke-width:3px,color:#000,font-size:12px
class Modality node_3_color
classDef node_4_color fill:#fce4ec,stroke:#c2185b,stroke-width:3px,color:#000,font-size:12px
class Mechanism node_4_color
classDef node_5_color fill:#e0f2f1,stroke:#00695c,stroke-width:3px,color:#000,font-size:12px
class MechanismCategory node_5_color
classDef node_6_color fill:#f1f8e9,stroke:#689f38,stroke-width:3px,color:#000,font-size:12px
class Target node_6_color
classDef node_7_color fill:#fff8e1,stroke:#ffa000,stroke-width:3px,color:#000,font-size:12px
class Disease node_7_color
classDef node_8_color fill:#e8eaf6,stroke:#3f51b5,stroke-width:3px,color:#000,font-size:12px
class TherapeuticArea node_8_color
classDef node_9_color fill:#efebe9,stroke:#5d4037,stroke-width:3px,color:#000,font-size:12px
class Indication node_9_color
classDef node_10_color fill:#fafafa,stroke:#424242,stroke-width:3px,color:#000,font-size:12px
class Biomarker node_10_color
classDef node_11_color fill:#e1f5fe,stroke:#0277bd,stroke-width:3px,color:#000,font-size:12px
class ClinicalTrial node_11_color
classDef node_12_color fill:#f9fbe7,stroke:#827717,stroke-width:3px,color:#000,font-size:12px
class Milestone node_12_color
classDef node_13_color fill:#fff1f0,stroke:#d32f2f,stroke-width:3px,color:#000,font-size:12px
class Geography node_13_color
classDef node_14_color fill:#f4e6ff,stroke:#6a1b9a,stroke-width:3px,color:#000,font-size:12px
class Partnership node_14_color
- ✅ Parse PDF documents
- ✅ Extract page text and images
- ✅ Create Document → Page structure in Neo4j
- ✅ Sequential page navigation with NEXT relationships
- Result: 102 pages loaded across 5 documents (AbbVie, Bristol Myers Squibb, Bayer, J&J, Pfizer)
- ✅ Vision LLM (GPT-5-mini) with text + image input
- ✅ Structured output with Pydantic validation
- ✅ Extract: Molecules, Diseases, Trials, Partnerships, Mechanisms, Targets, etc.
- ✅ Page-level parallel processing (30 concurrent pages across ALL documents)
- ✅ Full provenance tracking (EXTRACTED_FROM relationships)
- Result: 344 molecules, 503 indications, 582 treatment relationships in 3.8 minutes (26.75 pages/min)
- Optimization: Pages from multiple documents processed simultaneously for 2-4x faster batch processing
- ✅ Generate embeddings for page text using OpenAI embeddings API
- ✅ Derive direct Company → Molecule/Combination relationships from provenance
- ✅ Normalize entity names across documents (e.g., "Opdivo" vs "nivolumab")
- ✅ Create SAME_AS relationships or merge duplicate nodes
- Post-processing with ontologies (MONDO, PubChem, UniProt, etc.) (NEXT)
- Build Cypher query templates for strategic questions
- Text-to-Cypher agent (LangChain)
- Hybrid vector + graph RAG
- Streamlit exploration UI
- Competitive intelligence dashboard
Edit documents_metadata.csv to add new documents:
documentId,company,name,date,documentType,location,url
abbvie_pipeline_2024,AbbVie,AbbVie Pipeline Update,2024-02-02,Pipeline Update,data/AbbVie.pdf,
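A metadata file in this shape can be read with the standard `csv` module. The sketch below is only an illustration of the column layout shown above; `DocumentMeta` and `read_metadata` are hypothetical names, and it assumes dates are always in ISO `YYYY-MM-DD` form.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date
from typing import TextIO


@dataclass(frozen=True)
class DocumentMeta:
    document_id: str
    company: str
    name: str
    published: date
    document_type: str
    location: str
    url: str  # empty string when the column is blank


def read_metadata(handle: TextIO) -> list[DocumentMeta]:
    """Parse rows keyed by the header line of documents_metadata.csv."""
    documents = []
    for row in csv.DictReader(handle):
        documents.append(
            DocumentMeta(
                document_id=row["documentId"],
                company=row["company"],
                name=row["name"],
                published=date.fromisoformat(row["date"]),
                document_type=row["documentType"],
                location=row["location"],
                url=row["url"] or "",
            )
        )
    return documents


# Parse the sample row from above without touching the file system.
SAMPLE = (
    "documentId,company,name,date,documentType,location,url\n"
    "abbvie_pipeline_2024,AbbVie,AbbVie Pipeline Update,2024-02-02,"
    "Pipeline Update,data/AbbVie.pdf,\n"
)
docs = read_metadata(io.StringIO(SAMPLE))
print(docs[0].document_id, docs[0].published.year)
```

To read the real file, the same function could be called with `open("documents_metadata.csv", newline="", encoding="utf-8")`.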
// Top molecules by number of indications
MATCH (m:Molecule)-[:TREATS]->(i:Indication)
WITH m, count(DISTINCT i) as indications
RETURN m.name, m.brandName, indications
ORDER BY indications DESC
LIMIT 10
// Find competitive landscape for a disease
MATCH (d:Disease {name: "Non-Small Cell Lung Cancer"})<-[:FOR_DISEASE]-(i:Indication)<-[:TREATS]-(m:Molecule)
OPTIONAL MATCH (m)-[:IN_TRIAL]->(ct:ClinicalTrial)
RETURN m.name, m.brandName,
count(DISTINCT i) as indications,
collect(DISTINCT ct.phase) as trial_phases
ORDER BY indications DESC
// Molecules with mechanisms and targets
MATCH (m:Molecule)-[:HAS_MECHANISM]->(mech:Mechanism)-[:TARGETS]->(t:Target)
RETURN m.name, mech.name, collect(t.name) as targets
LIMIT 10
// Partnership analysis
MATCH (p:Partnership)-[:FOR_MOLECULE]->(m:Molecule)
// Direction follows the schema: (Company)-[:PARTNERS_WITH]->(Partnership)
MATCH (partner:Company)-[:PARTNERS_WITH]->(p)
RETURN m.name, collect(partner.name) as partners
ORDER BY size(partners) DESC
LIMIT 10
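Queries like these can also be run from Python. The sketch below wraps the first query in a function that accepts any session-like object exposing `run(query, **parameters)`, which is how sessions in the official `neo4j` driver behave; the function name and the `$limit` parameter are illustrative additions, not part of this project.

```python
from typing import Any, Protocol


class SupportsRun(Protocol):
    """Anything with a neo4j-style run(query, **parameters) method."""
    def run(self, query: str, **parameters: Any) -> Any: ...


TOP_MOLECULES = """
MATCH (m:Molecule)-[:TREATS]->(i:Indication)
WITH m, count(DISTINCT i) AS indications
RETURN m.name AS name, m.brandName AS brandName, indications
ORDER BY indications DESC
LIMIT $limit
"""


def top_molecules(session: SupportsRun, limit: int = 10) -> list[dict]:
    """Return the molecules with the most indications as plain dicts."""
    result = session.run(TOP_MOLECULES, limit=limit)
    return [record.data() for record in result]


# Usage with the official driver (requires `pip install neo4j` and a running
# database; connection values are the placeholders from the .env example):
#
#   from neo4j import GraphDatabase
#   with GraphDatabase.driver("bolt://localhost:7687",
#                             auth=("neo4j", "your_password")) as driver:
#       with driver.session(database="neo4j") as session:
#           for row in top_molecules(session, limit=5):
#               print(row["name"], row["indications"])
```

Passing the limit as a query parameter rather than formatting it into the string lets the same query text be reused with different limits.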