BiomarkerExtract

AI-Powered Biomarker Discovery from Scientific Literature

Automated extraction and validation of aging biomarkers using state-of-the-art Large Language Models

Features • Quick Start • Documentation • Results • Citation

Contact

Feel free to contact me via email for any needs: [email protected]

Overview

BiomarkerExtract is a production-ready pipeline for discovering and validating aging biomarkers in scientific literature using Large Language Models (LLMs). Built on Google's LangExtract framework, it supports five LLM providers. In the v0.1 sample run reported under Results, processing cost about $0.003 per paper.

Key Achievements

✅ 79 biomarkers extracted from 46 scientific papers
✅ 93.7% validation rate with scientific evidence
✅ 84.8% high confidence (≥0.90) extractions
✅ $0.003 per paper processing cost
✅ 5 LLM providers supported out-of-the-box

Features

Multi-Provider LLM Support

OpenRouter - 100+ models with single API key (Recommended)
OpenAI - GPT-5.2, GPT-4o, o1
Anthropic - Claude 4.5, Sonnet 4.5
Google - Gemini 3.0, Gemini Pro
Ollama - Local inference (FREE)

Complete Pipeline

Literature Search - PubMed + bioRxiv integration
Biomarker Extraction - LLM-powered entity recognition
Scientific Validation - Automated quality assessment
Multi-Format Export - JSON, CSV, TXT
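The export stage can be illustrated with the standard library alone. The sketch below is not the project's exporter: the record fields (`name`, `category`, `confidence`), the confidence values, and the three helper functions are hypothetical, chosen only to show one way the same records could be written as JSON, CSV, and TXT.

```python
import csv
import io
import json

# Hypothetical records; the field names and confidence values are
# assumptions for illustration, not the project's actual schema or output.
records = [
    {"name": "GDF-15", "category": "Proteomic", "confidence": 0.95},
    {"name": "Horvath clock", "category": "Epigenetic", "confidence": 0.92},
]


def to_json(rows):
    """Serialize the records as pretty-printed JSON."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize the records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_txt(rows):
    """Serialize the records as one human-readable line each."""
    return "\n".join(
        f"{r['name']} ({r['category']}), confidence {r['confidence']:.2f}"
        for r in rows
    )


exports = {"json": to_json(records), "csv": to_csv(records), "txt": to_txt(records)}
```

Keeping each format behind its own small function makes it straightforward to add or drop a format without touching the others.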

Analysis & Visualization

Publication-quality charts
Network analysis
Category distribution
Confidence metrics
Sample size statistics

Quick Start

Installation

# Clone repository
git clone https://github.com/AntonioVFranco/BiomarkerExtract.git
cd BiomarkerExtract

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run installation
bash install_unified.sh

Configuration

# Set your API key (choose one provider)
export OPENROUTER_API_KEY="sk-or-v1-xxxxxxxx"  # Recommended
# OR
export OPENAI_API_KEY="sk-xxxxxxxx"
# OR
export ANTHROPIC_API_KEY="sk-ant-xxxxxxxx"
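Since only one key needs to be set, a script can work out which provider to use from the environment. The following is a minimal sketch under stated assumptions: the environment variable names come from the configuration above, while `detect_provider`, the `"openai"` and `"anthropic"` labels, and the priority order are hypothetical and may not match what the pipeline does internally.

```python
import os

# Variable names are taken from the configuration above; the provider labels
# and the priority order are assumptions made for this sketch.
PROVIDER_ENV_VARS = [
    ("openrouter", "OPENROUTER_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
    ("anthropic", "ANTHROPIC_API_KEY"),
]


def detect_provider(env=None):
    """Return (provider, api_key) for the first key that is set, else None."""
    env = os.environ if env is None else env
    for provider, variable in PROVIDER_ENV_VARS:
        key = env.get(variable, "").strip()
        if key:
            return provider, key
    return None
```

Passing a plain dictionary as `env` makes the function easy to check without touching the real environment, for example `detect_provider({"OPENAI_API_KEY": "sk-test"})`.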

Run Pipeline

# Quick start with OpenRouter (cheapest)
bash run_openrouter.sh

# Or with other providers
bash run_openai.sh      # OpenAI GPT-5.2
bash run_anthropic.sh   # Claude 4.5
bash run_ollama.sh      # Local (FREE)

Python API

from langextract.providers import unified_production_pipeline as upp

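# Search the literature for these terms, then extract and validate biomarkers
results = upp.run_pipeline(
    biomarker_terms=["Horvath clock", "GDF-15", "NAD+"],
    pubmed_email="[email protected]",  # replace with your own email address
    provider="openrouter",
    api_key="your-key",  # the API key for the chosen provider
    max_papers=20,
)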

print(f"Extracted {results['statistics']['biomarkers_extracted']} biomarkers!")
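For orientation, the `biomarker_terms` and `pubmed_email` arguments map naturally onto a PubMed search. The sketch below builds an NCBI E-utilities `esearch` URL with the standard library. It is an assumption about how such a query could be formed, not the code in `pubmed_client.py`; `build_pubmed_query` is a hypothetical helper and the email address is a placeholder. No request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(terms, email, max_papers=20):
    """Build an esearch URL that matches any of the given terms."""
    # Quote each term so multi-word names are searched as phrases,
    # then combine them with OR.
    term = " OR ".join(f'"{t}"' for t in terms)
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_papers,
        "retmode": "json",
        "email": email,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_pubmed_query(["Horvath clock", "GDF-15", "NAD+"], "user@example.org")

# Decode the query string again to confirm the parameters round-trip.
query = parse_qs(urlparse(url).query)
```

`urlencode` takes care of escaping the `+` in `NAD+`, which would otherwise be read as a space in a query string.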

Results

Sample Extraction (v0.1)

Metric	Value
Papers Processed	46
Biomarkers Extracted	79
Validated	74 (93.7%)
High Confidence	67 (84.8%)
Processing Time	8.35 minutes
Total Cost	$0.15
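The percentages and the per-paper cost quoted in this README follow directly from the table. A short check, using only the figures above:

```python
# Figures copied from the table above.
papers = 46
extracted = 79
validated = 74
high_confidence = 67
processing_minutes = 8.35
total_cost_usd = 0.15

validation_rate = round(100 * validated / extracted, 1)             # 93.7
high_confidence_rate = round(100 * high_confidence / extracted, 1)  # 84.8
cost_per_paper = round(total_cost_usd / papers, 4)                  # 0.0033
seconds_per_paper = round(processing_minutes * 60 / papers, 1)      # 10.9

print(f"Validated: {validation_rate}%")
print(f"High confidence: {high_confidence_rate}%")
print(f"Cost per paper: ${cost_per_paper}")
print(f"Time per paper: {seconds_per_paper} s")
```

The per-paper cost of roughly $0.0033 is what the headline figure of $0.003 rounds down from.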

Top Biomarkers Discovered

Horvath clock (12 mentions) - Epigenetic
GDF-15 (9 mentions) - Proteomic
NAD+ levels (3 mentions) - Metabolomic
Hannum clock (2 mentions) - Epigenetic
DunedinPACE (2 mentions) - Epigenetic

Category Distribution

Epigenetic: 43.0%
Proteomic: 34.2%
Cellular: 10.1%
Metabolomic: 6.3%
Genomic: 2.5%
Transcriptomic: 2.5%
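The shares can be converted back into approximate counts. The sketch below assumes the percentages are fractions of the 79 extracted biomarkers; the README does not state the denominator, so treat the counts as an estimate.

```python
# Assumption: each share is a percentage of the 79 extracted biomarkers.
total_extracted = 79

shares_percent = {
    "Epigenetic": 43.0,
    "Proteomic": 34.2,
    "Cellular": 10.1,
    "Metabolomic": 6.3,
    "Genomic": 2.5,
    "Transcriptomic": 2.5,
}

# Round each share to the nearest whole biomarker.
counts = {
    category: round(percent * total_extracted / 100)
    for category, percent in shares_percent.items()
}
listed_total = sum(counts.values())

for category, count in counts.items():
    print(f"{category}: {count}")
print(f"Listed categories cover {listed_total} of {total_extracted}")
```

Under this assumption the six listed categories cover 78 of the 79 extractions, which is consistent with the shares summing to slightly less than 100%.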

Documentation

UNIFIED_README.md - Complete system overview
UNIFIED_CONFIGURATION.md - Provider setup guides
Phase3_README.md - Core biomarker models
Phase4_README.md - Literature pipeline
Option2_Testing_README.md - Testing suite

Examples

See examples_unified.py for 9 complete working examples, including:

Basic extraction
Batch processing
Provider comparison
Custom models
Complete pipeline

Cost Comparison

Processing 1000 papers:

Provider	Cost	Speed	Accuracy
OpenRouter	$3.00 ⭐	Fast	88%
OpenAI GPT-5.2	$25.00	Fast	92%
Anthropic Claude 4.5	$30.00	Medium	90%
Google Gemini 3.0	$10.00	Fast	85%
Ollama	FREE ⭐	Slow*	80%

*Depends on local GPU
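To project a budget for a different corpus size, the table can be scaled linearly. This is a rough sketch: linear scaling is an assumption (real cost depends on paper length and model pricing), `estimate_cost` is a hypothetical helper, and the Ollama entry treats "FREE" as zero API cost without accounting for local hardware.

```python
# Cost per 1000 papers in USD, copied from the table above.
COST_PER_1000_PAPERS_USD = {
    "OpenRouter": 3.00,
    "OpenAI GPT-5.2": 25.00,
    "Anthropic Claude 4.5": 30.00,
    "Google Gemini 3.0": 10.00,
    "Ollama": 0.00,  # no API cost; local hardware is not included
}


def estimate_cost(provider, n_papers):
    """Scale the per-1000-paper cost linearly to n_papers, in USD."""
    return round(COST_PER_1000_PAPERS_USD[provider] * n_papers / 1000, 2)


for provider in COST_PER_1000_PAPERS_USD:
    print(f"{provider}: ${estimate_cost(provider, 200):.2f} for 200 papers")
```

For the 46-paper sample run, `estimate_cost("OpenRouter", 46)` gives 0.14, close to the reported total of $0.15.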

Architecture

BiomarkerExtract/
├── langextract/
│   ├── core/
│   │   └── biomarker_models.py      # 21 Pydantic models
│   ├── literature/
│   │   ├── pubmed_client.py         # PubMed API
│   │   ├── biorxiv_client.py        # bioRxiv API
│   │   ├── pdf_parser.py            # PDF extraction
│   │   └── batch_processor.py       # Parallel processing
│   └── providers/
│       ├── unified_llm_provider.py           # 5 LLM providers
│       └── unified_production_pipeline.py    # End-to-end pipeline
├── tests/
│   └── option2/                     # 30+ tests (93% passing)
├── examples_unified.py              # 9 working examples
└── run_*.sh                         # Quick-start scripts
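The parallel processing role of `batch_processor.py` can be pictured with a thread pool, which suits work that mostly waits on network calls. The sketch below is an illustration only and does not reflect the module's real interface: `extract_biomarkers` is a hypothetical keyword matcher standing in for an LLM call, and `process_batch` is a made-up name.

```python
from concurrent.futures import ThreadPoolExecutor

KNOWN_BIOMARKERS = ["Horvath clock", "GDF-15", "NAD+"]


def extract_biomarkers(paper_text):
    """Stand-in for an LLM extraction call: simple case-insensitive matching."""
    lowered = paper_text.lower()
    return [name for name in KNOWN_BIOMARKERS if name.lower() in lowered]


def process_batch(papers, max_workers=4):
    """Run extraction over all papers concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the inputs.
        return list(executor.map(extract_biomarkers, papers))


papers = [
    "Plasma GDF-15 rises with age.",
    "The Horvath clock and NAD+ decline were measured.",
    "No markers are discussed here.",
]
results = process_batch(papers)
```

Because `executor.map` keeps results aligned with inputs, each entry in `results` can be paired back with its paper by index.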

Testing

# Run complete test suite
cd tests/option2
bash full_test.sh

# Quick validation
bash quick_test.sh

# Reported result: 93% of tests passing

Statistics

~5,700 lines of production code
21 Pydantic models for data validation
5 LLM providers integrated
30+ unit tests (93% success rate)
3 formats for data export (JSON, CSV, TXT)
Publication-quality visualizations included
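To give a sense of what a validated biomarker record involves, here is a minimal sketch. It deliberately uses a standard-library dataclass rather than Pydantic so that it runs without extra dependencies, and it is not one of the project's 21 models: the `Biomarker` class and its fields are hypothetical. The category names and the 0.90 high-confidence threshold are taken from this README.

```python
from dataclasses import dataclass

# Category names as listed under Category Distribution.
CATEGORIES = {
    "Epigenetic",
    "Proteomic",
    "Cellular",
    "Metabolomic",
    "Genomic",
    "Transcriptomic",
}
HIGH_CONFIDENCE_THRESHOLD = 0.90


@dataclass(frozen=True)
class Biomarker:
    """Hypothetical record; validation runs when the object is created."""

    name: str
    category: str
    confidence: float

    def __post_init__(self):
        if not self.name.strip():
            raise ValueError("name must not be empty")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

    @property
    def is_high_confidence(self):
        return self.confidence >= HIGH_CONFIDENCE_THRESHOLD


marker = Biomarker(name="GDF-15", category="Proteomic", confidence=0.95)
```

Rejecting bad values at construction time means downstream stages such as export and charting never see a malformed record.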

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Based on Google's LangExtract framework.

Citation

If you use BiomarkerExtract in your research, please cite:

@software{biomarkerextract2026,
  author = {Franco, Antonio V.},
  title = {BiomarkerExtract: AI-Powered Biomarker Discovery from Scientific Literature},
  year = {2026},
  version = {0.1},
  url = {https://github.com/AntonioVFranco/BiomarkerExtract}
}

Acknowledgments

Built on Google's LangExtract
Inspired by aging research and longevity science
Powered by state-of-the-art Large Language Models

Connect

⭐ Star this repo if you find it useful! ⭐

Report Bug • Request Feature • Documentation
