AI-Powered Biomarker Discovery from Scientific Literature
Automated extraction and validation of aging biomarkers using state-of-the-art Large Language Models
Feel free to contact me via email for any needs: [email protected]
BiomarkerExtract is a production-ready pipeline for discovering and validating aging biomarkers from scientific literature using Large Language Models. Built on Google's LangExtract framework, it supports five LLM providers and processes papers for about $0.003 each.
- ✅ 79 biomarkers extracted from 46 scientific papers
- ✅ 93.7% validation rate with scientific evidence
- ✅ 84.8% high confidence (≥0.90) extractions
- ✅ $0.003 per paper processing cost
- ✅ 5 LLM providers supported out-of-the-box
- OpenRouter - 100+ models with single API key (Recommended)
- OpenAI - GPT-5.2, GPT-4o, o1
- Anthropic - Claude 4.5, Sonnet 4.5
- Google - Gemini 3.0, Gemini Pro
- Ollama - Local inference (FREE)
- Literature Search - PubMed + bioRxiv integration
- Biomarker Extraction - LLM-powered entity recognition
- Scientific Validation - Automated quality assessment
- Multi-Format Export - JSON, CSV, TXT
- Publication-quality charts
- Network analysis
- Category distribution
- Confidence metrics
- Sample size statistics
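As a rough illustration of the Multi-Format Export feature listed above, the sketch below writes the same records as JSON, CSV, and TXT using only the Python standard library. The record fields (`name`, `category`, `mentions`) and the function names are assumptions made for this example; the project's actual export schema and API may differ.

```python
import csv
import io
import json

# Illustrative records; the field names are assumptions, not the project's schema.
records = [
    {"name": "Horvath clock", "category": "Epigenetic", "mentions": 12},
    {"name": "GDF-15", "category": "Proteomic", "mentions": 9},
]


def to_json(rows):
    """Serialize the records as pretty-printed JSON."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize the records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["name", "category", "mentions"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_txt(rows):
    """Serialize the records as one human-readable line each."""
    return "\n".join(
        f"{r['name']} ({r['mentions']} mentions) - {r['category']}" for r in rows
    )


print(to_txt(records))
```

Keeping each format behind its own small function means a new format can be added without touching the others.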
# Clone repository
git clone https://github.com/AntonioVFranco/BiomarkerExtract.git
cd BiomarkerExtract
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Run installation
bash install_unified.sh

# Set your API key (choose one provider)
export OPENROUTER_API_KEY="sk-or-v1-xxxxxxxx" # Recommended
# OR
export OPENAI_API_KEY="sk-xxxxxxxx"
# OR
export ANTHROPIC_API_KEY="sk-ant-xxxxxxxx"

# Quick start with OpenRouter (cheapest)
bash run_openrouter.sh
# Or with other providers
bash run_openai.sh # OpenAI GPT-5.2
bash run_anthropic.sh # Claude 4.5
bash run_ollama.sh # Local (FREE)

from langextract.providers import unified_production_pipeline as upp

results = upp.run_pipeline(
    biomarker_terms=["Horvath clock", "GDF-15", "NAD+"],
    pubmed_email="[email protected]",
    provider="openrouter",
    api_key="your-key",
    max_papers=20
)
print(f"Extracted {results['statistics']['biomarkers_extracted']} biomarkers!")

| Metric | Value |
|---|---|
| Papers Processed | 46 |
| Biomarkers Extracted | 79 |
| Validated | 74 (93.7%) |
| High Confidence | 67 (84.8%) |
| Processing Time | 8.35 minutes |
| Total Cost | $0.15 |
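
The percentages and the per-paper cost quoted in this README follow directly from the raw counts in the table above. The short sketch below reproduces that arithmetic; the variable names are chosen for this example only.

```python
# Raw figures taken from the results table above.
papers_processed = 46
biomarkers_extracted = 79
validated = 74
high_confidence = 67
total_cost_usd = 0.15
processing_minutes = 8.35

# Derived metrics.
validation_rate = validated / biomarkers_extracted
high_confidence_rate = high_confidence / biomarkers_extracted
cost_per_paper = total_cost_usd / papers_processed
seconds_per_paper = processing_minutes * 60 / papers_processed

print(f"Validation rate:      {validation_rate:.1%}")       # 93.7%
print(f"High-confidence rate: {high_confidence_rate:.1%}")  # 84.8%
print(f"Cost per paper:       ${cost_per_paper:.3f}")       # $0.003
print(f"Time per paper:       {seconds_per_paper:.1f} s")   # 10.9 s
```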
- Horvath clock (12 mentions) - Epigenetic
- GDF-15 (9 mentions) - Proteomic
- NAD+ levels (3 mentions) - Metabolomic
- Hannum clock (2 mentions) - Epigenetic
- DunedinPACE (2 mentions) - Epigenetic
- Epigenetic: 43.0%
- Proteomic: 34.2%
- Cellular: 10.1%
- Metabolomic: 6.3%
- Genomic: 2.5%
- Transcriptomic: 2.5%
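
Each entry in the top-biomarker list above carries a mention count and a category, so mentions can be tallied per category with a few lines of Python. This is only a sketch over the five listed entries; its shares describe mentions among those five biomarkers and are not expected to match the category distribution over all 79 extractions.

```python
from collections import Counter

# (name, mentions, category) for the five most-mentioned biomarkers listed above.
top_biomarkers = [
    ("Horvath clock", 12, "Epigenetic"),
    ("GDF-15", 9, "Proteomic"),
    ("NAD+ levels", 3, "Metabolomic"),
    ("Hannum clock", 2, "Epigenetic"),
    ("DunedinPACE", 2, "Epigenetic"),
]

# Sum the mention counts per category.
mentions_by_category = Counter()
for _name, mentions, category in top_biomarkers:
    mentions_by_category[category] += mentions

total_mentions = sum(mentions_by_category.values())
for category, count in mentions_by_category.most_common():
    print(f"{category}: {count} mentions ({count / total_mentions:.1%})")
```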
- UNIFIED_README.md - Complete system overview
- UNIFIED_CONFIGURATION.md - Provider setup guides
- Phase3_README.md - Core biomarker models
- Phase4_README.md - Literature pipeline
- Option2_Testing_README.md - Testing suite
See examples_unified.py for 9 complete working examples:
- Basic extraction
- Batch processing
- Provider comparison
- Custom models
- Complete pipeline
Processing 1000 papers:
| Provider | Cost | Speed | Accuracy |
|---|---|---|---|
| OpenRouter | $3.00 ⭐ | Fast | 88% |
| OpenAI GPT-5.2 | $25.00 | Fast | 92% |
| Anthropic Claude 4.5 | $30.00 | Medium | 90% |
| Google Gemini 3.0 | $10.00 | Fast | 85% |
| Ollama | FREE ⭐ | Slow* | 80% |
*Depends on local GPU
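
To size a run other than 1000 papers, the table's figures can be scaled linearly. The sketch below assumes cost grows in proportion to paper count, which ignores paper length, retries, and provider pricing changes; `estimate_cost` is a hypothetical helper written for this example, not part of the project.

```python
# Cost per 1000 papers in USD, copied from the comparison table above.
COST_PER_1000_PAPERS_USD = {
    "OpenRouter": 3.00,
    "OpenAI GPT-5.2": 25.00,
    "Anthropic Claude 4.5": 30.00,
    "Google Gemini 3.0": 10.00,
    "Ollama": 0.00,  # local inference; hardware and electricity not counted
}


def estimate_cost(provider: str, n_papers: int) -> float:
    """Linearly extrapolate the per-1000-paper cost to n_papers."""
    return COST_PER_1000_PAPERS_USD[provider] / 1000 * n_papers


for provider in COST_PER_1000_PAPERS_USD:
    print(f"{provider}: ${estimate_cost(provider, 46):.2f} for 46 papers")
```

For OpenRouter this gives roughly $0.14 for 46 papers, in line with the $0.15 total reported for the 46-paper run.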
BiomarkerExtract/
├── langextract/
│ ├── core/
│ │ └── biomarker_models.py # 21 Pydantic models
│ ├── literature/
│ │ ├── pubmed_client.py # PubMed API
│ │ ├── biorxiv_client.py # bioRxiv API
│ │ ├── pdf_parser.py # PDF extraction
│ │ └── batch_processor.py # Parallel processing
│ └── providers/
│ ├── unified_llm_provider.py # 5 LLM providers
│ └── unified_production_pipeline.py # End-to-end pipeline
├── tests/
│ └── option2/ # 30+ tests (93% passing)
├── examples_unified.py # 9 working examples
└── run_*.sh # Quick-start scripts
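
The tree above lists `biomarker_models.py` as holding the Pydantic models used for data validation. To give a feel for that kind of validation without extra dependencies, here is a standard-library stand-in built on `dataclasses` rather than Pydantic. The `BiomarkerRecord` class and its fields are hypothetical and do not reflect the project's actual models; the category names and the 0.90 threshold come from the figures earlier in this README.

```python
from dataclasses import dataclass

# Threshold taken from the "high confidence (>= 0.90)" figure in this README.
HIGH_CONFIDENCE_THRESHOLD = 0.90

# Category names as they appear in the category distribution.
CATEGORIES = {
    "Epigenetic", "Proteomic", "Cellular",
    "Metabolomic", "Genomic", "Transcriptomic",
}


@dataclass(frozen=True)
class BiomarkerRecord:
    """Hypothetical record type; a dataclass stand-in for a Pydantic model."""

    name: str
    category: str
    confidence: float

    def __post_init__(self):
        # Reject records that a validating model would also be expected to reject.
        if not self.name.strip():
            raise ValueError("name must not be empty")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

    @property
    def is_high_confidence(self) -> bool:
        return self.confidence >= HIGH_CONFIDENCE_THRESHOLD


# The confidence value here is made up for illustration.
record = BiomarkerRecord(name="GDF-15", category="Proteomic", confidence=0.93)
print(record.is_high_confidence)
```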
# Run complete test suite
cd tests/option2
bash full_test.sh
# Quick validation
bash quick_test.sh
# Results: 93% tests passing

- ~5,700 lines of production code
- 21 Pydantic models for data validation
- 5 LLM providers integrated
- 30+ unit tests (93% success rate)
- 3 formats for data export (JSON, CSV, TXT)
- Publication-quality visualizations included
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Based on Google's LangExtract framework.
If you use BiomarkerExtract in your research, please cite:
@software{biomarkerextract2026,
author = {Franco, Antonio V.},
title = {BiomarkerExtract: AI-Powered Biomarker Discovery from Scientific Literature},
year = {2026},
version = {0.1},
url = {https://github.com/AntonioVFranco/BiomarkerExtract}
}

- Built on Google's LangExtract
- Inspired by aging research and longevity science
- Powered by state-of-the-art Large Language Models
- Email: [email protected]
- Medium: @AntonioVFranco
- LinkedIn: antoniovfranco
- GitHub: @AntonioVFranco
⭐ Star this repo if you find it useful! ⭐