Specialized fork of LangExtract for extracting aging biomarkers from scientific literature. Integrates the PubMed and bioRxiv APIs with domain-specific ontologies (GO, KEGG, UniProt) for automated biomarker validation in longevity research.

BiomarkerExtract

AI-Powered Biomarker Discovery from Scientific Literature

Automated extraction and validation of aging biomarkers using state-of-the-art Large Language Models

Features | Quick Start | Documentation | Results | Citation


Contact

Feel free to contact me via email for any needs: [email protected]


Overview

BiomarkerExtract is a production-ready pipeline for discovering and validating aging biomarkers in scientific literature with Large Language Models. Built on Google's LangExtract framework, it supports five LLM providers. In the v0.1 sample run reported under Results, it processed 46 papers at about $0.003 per paper.

Key Achievements

  • 79 biomarkers extracted from 46 scientific papers
  • 93.7% validation rate with scientific evidence
  • 84.8% high confidence (≥0.90) extractions
  • $0.003 per paper processing cost
  • 5 LLM providers supported out-of-the-box
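
The high-confidence figure is a threshold applied to per-extraction confidence scores. The sketch below shows that calculation, assuming each extraction is a dict with a `confidence` float. That record shape is hypothetical, and the repository's own models may differ.

```python
from typing import Iterable

HIGH_CONFIDENCE = 0.90  # threshold quoted in Key Achievements


def high_confidence_share(extractions: Iterable[dict],
                          threshold: float = HIGH_CONFIDENCE) -> float:
    """Return the fraction of extractions whose confidence meets the threshold."""
    items = list(extractions)
    if not items:
        return 0.0
    hits = sum(1 for item in items if item["confidence"] >= threshold)
    return hits / len(items)


# Toy data for illustration only; these are not results from the project.
sample = [
    {"name": "GDF-15", "confidence": 0.95},
    {"name": "NAD+ levels", "confidence": 0.90},
    {"name": "Hannum clock", "confidence": 0.72},
    {"name": "DunedinPACE", "confidence": 0.88},
]
print(f"{high_confidence_share(sample):.1%}")
```

The comparison is inclusive (`>=`), matching the "≥0.90" wording above.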

Features

Multi-Provider LLM Support

  • OpenRouter - 100+ models with single API key (Recommended)
  • OpenAI - GPT-5.2, GPT-4o, o1
  • Anthropic - Claude 4.5, Sonnet 4.5
  • Google - Gemini 3.0, Gemini Pro
  • Ollama - Local inference (FREE)

Complete Pipeline

  1. Literature Search - PubMed + bioRxiv integration
  2. Biomarker Extraction - LLM-powered entity recognition
  3. Scientific Validation - Automated quality assessment
  4. Multi-Format Export - JSON, CSV, TXT
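
To illustrate the export step, here is a minimal sketch that writes the same records as JSON, CSV, and plain text using only the standard library. The field names (`name`, `category`, `confidence`) and the confidence values are assumptions made for this example, not the project's actual schema or data.

```python
import csv
import io
import json

# Hypothetical record fields; the project's real schema may differ.
FIELDS = ["name", "category", "confidence"]


def to_json(records):
    """Serialize records as an indented JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Serialize records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_txt(records):
    """Serialize records as one human-readable line per biomarker."""
    return "\n".join(
        f"{r['name']} ({r['category']}): confidence {r['confidence']:.2f}"
        for r in records
    )


# Toy records for illustration only.
records = [
    {"name": "Horvath clock", "category": "Epigenetic", "confidence": 0.95},
    {"name": "GDF-15", "category": "Proteomic", "confidence": 0.91},
]
print(to_txt(records))
```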

Analysis & Visualization

  • Publication-quality charts
  • Network analysis
  • Category distribution
  • Confidence metrics
  • Sample size statistics

Quick Start

Installation

# Clone repository
git clone https://github.com/AntonioVFranco/BiomarkerExtract.git
cd BiomarkerExtract

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run installation
bash install_unified.sh

Configuration

# Set your API key (choose one provider)
export OPENROUTER_API_KEY="sk-or-v1-xxxxxxxx"  # Recommended
# OR
export OPENAI_API_KEY="sk-xxxxxxxx"
# OR
export ANTHROPIC_API_KEY="sk-ant-xxxxxxxx"
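
Since only one key needs to be set, a script can detect which provider is configured. The following sketch checks the three environment variables above in order of preference. The provider name `"openrouter"` appears in the Python API example; `"openai"` and `"anthropic"` are assumed names, and `pick_provider` is a hypothetical helper, not part of the package.

```python
import os

# Environment variable names from the configuration above, in preference order.
# The provider strings for OpenAI and Anthropic are assumptions.
PROVIDER_ENV_VARS = [
    ("openrouter", "OPENROUTER_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
    ("anthropic", "ANTHROPIC_API_KEY"),
]


def pick_provider(env=None):
    """Return (provider, api_key) for the first configured provider, else None."""
    env = os.environ if env is None else env
    for provider, var in PROVIDER_ENV_VARS:
        key = env.get(var, "").strip()
        if key:
            return provider, key
    return None


choice = pick_provider()
print("No API key configured" if choice is None else f"Using {choice[0]}")
```

Reading the key from the environment keeps it out of source files and shell history.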

Run Pipeline

# Quick start with OpenRouter (lowest-cost hosted option)
bash run_openrouter.sh

# Or with other providers
bash run_openai.sh      # OpenAI GPT-5.2
bash run_anthropic.sh   # Claude 4.5
bash run_ollama.sh      # Local (FREE)

Python API

from langextract.providers import unified_production_pipeline as upp

results = upp.run_pipeline(
    biomarker_terms=["Horvath clock", "GDF-15", "NAD+"],
    pubmed_email="[email protected]",
    provider="openrouter",
    api_key="your-key",
    max_papers=20
)

print(f"Extracted {results['statistics']['biomarkers_extracted']} biomarkers!")

Results

Sample Extraction (v0.1)

| Metric | Value |
| --- | --- |
| Papers Processed | 46 |
| Biomarkers Extracted | 79 |
| Validated | 74 (93.7%) |
| High Confidence | 67 (84.8%) |
| Processing Time | 8.35 minutes |
| Total Cost | $0.15 |
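
The percentages and the per-paper cost quoted earlier follow directly from these counts. A short worked check, using only the figures reported in this section:

```python
# Figures from the v0.1 sample extraction.
papers = 46
extracted = 79
validated = 74
high_confidence = 67
total_cost_usd = 0.15
minutes = 8.35

validation_rate = validated / extracted        # share of extractions validated
high_conf_rate = high_confidence / extracted   # share with confidence >= 0.90
cost_per_paper = total_cost_usd / papers       # USD per paper
seconds_per_paper = minutes * 60 / papers      # average wall-clock time per paper

print(f"Validation rate:   {validation_rate:.1%}")
print(f"High confidence:   {high_conf_rate:.1%}")
print(f"Cost per paper:    ${cost_per_paper:.3f}")
print(f"Seconds per paper: {seconds_per_paper:.1f}")
```

Both rates are computed against the 79 extracted biomarkers, not against the 46 papers.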

Top Biomarkers Discovered

  1. Horvath clock (12 mentions) - Epigenetic
  2. GDF-15 (9 mentions) - Proteomic
  3. NAD+ levels (3 mentions) - Metabolomic
  4. Hannum clock (2 mentions) - Epigenetic
  5. DunedinPACE (2 mentions) - Epigenetic

Category Distribution

  • Epigenetic: 43.0%
  • Proteomic: 34.2%
  • Cellular: 10.1%
  • Metabolomic: 6.3%
  • Genomic: 2.5%
  • Transcriptomic: 2.5%

Documentation

Examples

See examples_unified.py for 9 complete working examples, including:

  • Basic extraction
  • Batch processing
  • Provider comparison
  • Custom models
  • Complete pipeline

Cost Comparison

Processing 1000 papers:

| Provider | Cost | Speed | Accuracy |
| --- | --- | --- | --- |
| OpenRouter | $3.00 | Fast | 88% |
| OpenAI GPT-5.2 | $25.00 | Fast | 92% |
| Anthropic Claude 4.5 | $30.00 | Medium | 90% |
| Google Gemini 3.0 | $10.00 | Fast | 85% |
| Ollama | FREE | Slow* | 80% |

*Depends on local GPU
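
For a different corpus size, the table can be scaled as a rough estimate. This sketch assumes cost grows linearly with the number of papers, which the README does not state; `estimate_cost` is a hypothetical helper.

```python
# Cost per 1,000 papers in USD, copied from the comparison table.
COST_PER_1000 = {
    "OpenRouter": 3.00,
    "OpenAI GPT-5.2": 25.00,
    "Anthropic Claude 4.5": 30.00,
    "Google Gemini 3.0": 10.00,
    "Ollama": 0.00,  # listed as FREE (local inference)
}


def estimate_cost(provider: str, n_papers: int) -> float:
    """Estimate API cost in USD, assuming linear scaling with paper count."""
    return COST_PER_1000[provider] * n_papers / 1000


for provider in COST_PER_1000:
    print(f"{provider}: ${estimate_cost(provider, 250):.2f} for 250 papers")
```

Real costs depend on paper length and the model chosen, so treat the output as an order-of-magnitude guide.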


Architecture

BiomarkerExtract/
├── langextract/
│   ├── core/
│   │   └── biomarker_models.py      # 21 Pydantic models
│   ├── literature/
│   │   ├── pubmed_client.py         # PubMed API
│   │   ├── biorxiv_client.py        # bioRxiv API
│   │   ├── pdf_parser.py            # PDF extraction
│   │   └── batch_processor.py       # Parallel processing
│   └── providers/
│       ├── unified_llm_provider.py           # 5 LLM providers
│       └── unified_production_pipeline.py    # End-to-end pipeline
├── tests/
│   └── option2/                     # 30+ tests (93% passing)
├── examples_unified.py              # 9 working examples
└── run_*.sh                         # Quick-start scripts
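
As an illustration of what the literature search stage involves, the sketch below builds a PubMed search URL for the public NCBI E-utilities ESearch endpoint. It is not taken from `pubmed_client.py`. The query construction (quoted terms joined with OR, restricted to title and abstract) is an assumption, and `build_pubmed_search_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(terms, email, max_papers=20):
    """Build an ESearch URL matching any of the terms in title or abstract."""
    query = " OR ".join(f'"{term}"[Title/Abstract]' for term in terms)
    params = {
        "db": "pubmed",
        "term": query,
        "retmax": max_papers,
        "retmode": "json",
        "email": email,  # NCBI asks API users to identify themselves
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Placeholder address for illustration.
url = build_pubmed_search_url(["Horvath clock", "GDF-15"], "you@example.org")
print(parse_qs(urlparse(url).query)["term"][0])
```

The sketch only constructs the URL. Fetching it and parsing the returned ID list is left out so the example runs without network access.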

Testing

# Run complete test suite
cd tests/option2
bash full_test.sh

# Quick validation
bash quick_test.sh

# Expected result: about 93% of tests passing

Statistics

  • ~5,700 lines of production code
  • 21 Pydantic models for data validation
  • 5 LLM providers integrated
  • 30+ unit tests (93% success rate)
  • 3 formats for data export (JSON, CSV, TXT)
  • Publication-quality visualizations included
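
To show the kind of checks a validation model performs, here is a dependency-free sketch built on a standard-library dataclass. The project itself uses Pydantic. `BiomarkerRecord` and its fields are hypothetical and do not correspond to any of the 21 models; the category names come from the Category Distribution list.

```python
from dataclasses import dataclass

# Category names taken from the Category Distribution list.
CATEGORIES = {
    "Epigenetic", "Proteomic", "Cellular",
    "Metabolomic", "Genomic", "Transcriptomic",
}


@dataclass(frozen=True)
class BiomarkerRecord:
    """Hypothetical record; a stand-in for one of the project's Pydantic models."""
    name: str
    category: str
    confidence: float

    def __post_init__(self):
        # Reject records that would corrupt downstream statistics.
        if not self.name.strip():
            raise ValueError("name must not be empty")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


# Toy values for illustration only.
record = BiomarkerRecord(name="GDF-15", category="Proteomic", confidence=0.93)
print(record)
```

Validating at construction time means a malformed LLM response fails immediately instead of surfacing later in the exported files.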

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Based on Google's LangExtract framework.


Citation

If you use BiomarkerExtract in your research, please cite:

@software{biomarkerextract2026,
  author = {Franco, Antonio V.},
  title = {BiomarkerExtract: AI-Powered Biomarker Discovery from Scientific Literature},
  year = {2026},
  version = {0.1},
  url = {https://github.com/AntonioVFranco/BiomarkerExtract}
}

Acknowledgments

  • Built on Google's LangExtract
  • Inspired by aging research and longevity science
  • Powered by state-of-the-art Large Language Models

Star this repo if you find it useful!

Report Bug | Request Feature | Documentation
