A hands-on comparison of two fundamentally different approaches to biomedical Named Entity Recognition (NER) on the same scientific paper — a fine-tuned clinical transformer vs GPT-4 zero-shot prompting.
This project extracts biomedical entities from a tissue engineering research paper (arXiv:2110.03526) using two distinct NLP strategies and compares their outputs across a normalized entity schema.
| Model | Approach | Entities Extracted |
|---|---|---|
| d4data/biomedical-ner-all (HuggingFace) | Fine-tuned on clinical corpora | 288 |
| GPT-4 (OpenAI) | Zero-shot prompting | 427 |
| Agreement (same entity + type) | — | 17 (3.1%) |
The 3.1% agreement rate is the central finding: these models were not built for the same task, and their disagreement reveals that more clearly than their individual outputs do.
```
PDF (arXiv)
       │
       ▼
pdf2image → PIL Images
       │
       ▼
pytesseract OCR → Raw Text
       │
       ▼
Text Cleaning + NLTK Sentence Tokenization (~200 sentences)
       │
       ├──────────────────────────────────────┐
       ▼                                      ▼
HuggingFace NER                        GPT-4 Zero-shot
d4data/biomedical-ner-all              gpt-4 (temperature=0)
(runs locally, free)                   (API calls, paid)
       │                                      │
       ▼                                      ▼
288 entities                           427 entities
15 clinical types                      11 research types
       │                                      │
       └──────────────┬───────────────────────┘
                      ▼
       Normalization to common schema
       (Disease, Gene, Chemical, Anatomy,
        Symptom, Procedure, Lab_value, Other)
                      │
                      ▼
             Agreement analysis
        Overlap: 17 entities (3.1%)
```
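The normalization and agreement steps above can be sketched in pure Python: lowercase each entity string, map each model's type labels onto the shared schema, and count an "agreement" only when both text and normalized type match exactly. The mapping tables below are illustrative examples, not the notebook's actual ones.

```python
# Sketch of the normalization + agreement analysis.
# Type mappings are illustrative, not the notebook's exact tables.
HF_TYPE_MAP = {
    "Disease_disorder": "Disease",
    "Therapeutic_procedure": "Procedure",
    "Sign_symptom": "Symptom",
    "Lab_value": "Lab_value",
    "Biological_structure": "Anatomy",
    "Medication": "Chemical",
}
GPT_TYPE_MAP = {
    "disease": "Disease",
    "gene/protein": "Gene",
    "chemical": "Chemical",
    "anatomy": "Anatomy",
}

def normalize(entities, type_map):
    """Map (text, model_type) pairs to (lowercased text, common type)."""
    return {(text.strip().lower(), type_map.get(etype, "Other"))
            for text, etype in entities}

def agreement(hf_entities, gpt_entities):
    """Entities both models found with the same text AND same normalized type."""
    hf = normalize(hf_entities, HF_TYPE_MAP)
    gpt = normalize(gpt_entities, GPT_TYPE_MAP)
    overlap = hf & gpt
    rate = len(overlap) / len(hf | gpt) if hf | gpt else 0.0
    return overlap, rate

# Example with made-up entities:
hf = [("Scaffold implantation", "Therapeutic_procedure"),
      ("fibrosis", "Disease_disorder")]
gpt = [("fibrosis", "disease"), ("TGF-beta", "gene/protein")]
ov, rate = agreement(hf, gpt)
# ov contains only ("fibrosis", "Disease")
```

Matching on exact lowercased strings is deliberately strict; fuzzy or span-overlap matching would report higher agreement, which is worth keeping in mind when interpreting the 3.1% figure.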
| Type | HuggingFace | GPT-4 | Difference |
|---|---|---|---|
| Procedure | 90 | 1 | -89 |
| Other | 119 | 287 | +168 |
| Gene/Protein | 0 | 69 | +69 |
| Chemical | 2 | 19 | +17 |
| Disease | 26 | 39 | +13 |
| Symptom | 20 | 0 | -20 |
| Lab_value | 10 | 0 | -10 |
| Anatomy | 21 | 12 | -9 |
HuggingFace was fine-tuned on clinical documentation. It excels at procedural language, symptoms, and lab values — the kind of entities found in EHRs and clinical notes. It found zero gene/protein entities because those aren't what clinical NER models are trained to see.
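For reference, running the clinical model locally with `transformers` looks roughly like this. The confidence threshold and post-filter are illustrative choices, not the notebook's exact settings.

```python
def load_clinical_ner():
    # Imported lazily so the pure helper below works without transformers installed.
    from transformers import pipeline
    # Downloads d4data/biomedical-ner-all on first use; runs locally, no API cost.
    return pipeline(
        "ner",
        model="d4data/biomedical-ner-all",
        aggregation_strategy="simple",  # merge subword pieces into whole entities
    )

def extract_entities(results, min_score=0.5):
    """Keep (text, type) pairs above a confidence threshold (threshold is illustrative)."""
    return [(r["word"], r["entity_group"])
            for r in results if r["score"] >= min_score]

# Usage (requires the model download):
#   ner = load_clinical_ner()
#   results = ner("The patient underwent scaffold implantation for cardiac fibrosis.")
#   entities = extract_entities(results)
```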
GPT-4 was trained on broad text including scientific literature. It picks up molecular biology terminology (genes, proteins, chemicals) that the clinical model misses entirely. But 287 entities landed in "Other" — the prompt taxonomy needs refinement for production use.
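The zero-shot side amounts to a carefully worded prompt plus JSON parsing. This is a hedged sketch of that pattern; the notebook's exact prompt wording and taxonomy differ, and a tighter taxonomy is exactly what would shrink the "Other" bucket.

```python
import json

ENTITY_TYPES = ["Disease", "Gene", "Chemical", "Anatomy",
                "Symptom", "Procedure", "Lab_value", "Other"]

def build_prompt(sentence):
    """Zero-shot extraction prompt (wording is illustrative)."""
    return (
        "Extract biomedical entities from the sentence below. "
        f"Allowed types: {', '.join(ENTITY_TYPES)}. "
        'Reply with a JSON list of {"text": ..., "type": ...} objects only.\n\n'
        f"Sentence: {sentence}"
    )

def parse_response(content):
    """Parse the model's JSON reply, dropping malformed items and unknown types."""
    try:
        items = json.loads(content)
    except json.JSONDecodeError:
        return []
    return [(it["text"], it["type"]) for it in items
            if isinstance(it, dict) and it.get("type") in ENTITY_TYPES]

# With the OpenAI client (temperature=0 for reproducibility):
#   resp = client.chat.completions.create(
#       model="gpt-4",
#       temperature=0,
#       messages=[{"role": "user", "content": build_prompt(sentence)}],
#   )
#   entities = parse_response(resp.choices[0].message.content)
```

Constraining the reply to a JSON list and validating each item's type keeps one malformed response from corrupting the downstream counts.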
Same text. Same paper. Different models saw different things. Architecture decisions don't just affect accuracy — they determine what kinds of entities your pipeline is capable of finding.
```
python >= 3.9
pytesseract
pdf2image
transformers
torch
openai
nltk
pandas
```

System dependency: `tesseract-ocr` (install via apt-get on Linux/Colab).
```bash
git clone https://github.com/shreyapatilu/biomedical-ner-comparison.git
cd biomedical-ner-comparison
pip install pytesseract pdf2image transformers torch openai nltk pandas
```

On Ubuntu/Colab, also install the Tesseract binary:

```bash
sudo apt-get install tesseract-ocr
```

Do not hard-code your API key. Use an environment variable:

```bash
export OPENAI_API_KEY="your-key-here"
```

In the notebook, replace "YOUR_API_KEY_HERE" with:

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
```

Run locally with Jupyter:

```bash
jupyter notebook biomedical_NER_comparison_github.ipynb
```

Or open the .ipynb file directly in Google Colab.
```
biomedical-ner-comparison/
├── biomedical_NER_comparison_github.ipynb   # Main notebook
└── README.md
```
| Use case | Recommended approach |
|---|---|
| Clinical NLP, adverse event detection, EHR processing | Fine-tuned clinical model (HuggingFace) |
| Drug discovery, literature mining, molecular biology | GPT-4 with validated prompts |
| Production systems requiring both coverage types | Hybrid pipeline |
This project is part of a personal learning series exploring NLP in healthcare. The full write-up is available on LinkedIn.