Phenotype Prioritization System

A Python application for ranking genes by phenotype similarity using the Human Phenotype Ontology (HPO). It combines three phenotype-matching rankers (semantic similarity, Phen2Gene-like, and Exomiser-like) with a machine-learning ranker, with the goal of improving diagnostic yield in genomic interpretation.

🎯 Overview

Phenotype ranking is crucial for tertiary analysis in genomic interpretation. This application:

Integrates patient phenotypic data (HPO terms) with genomic variant information
Scores and ranks genes based on their likelihood of explaining observed phenotypes
Implements multiple algorithms including semantic similarity, Phen2Gene-like, and Exomiser-like approaches
Enhances ranking with ML using gradient boosting (LightGBM) for optimal feature weighting
Provides comprehensive evaluation with statistical analysis and visualizations

✨ Features

Algorithms Implemented

Semantic Similarity Ranker
- Information Content (IC) based semantic similarity
- Resnik and Lin similarity metrics
- Best Match Average (BMA) scoring
Phen2Gene-like Ranker
- IC-weighted phenotype matching
- Phenotype specificity scoring
- Hierarchical term matching
Exomiser-like Ranker
- Multi-component scoring (similarity + frequency + coverage)
- Disease association integration
- SimJ semantic similarity
ML-Enhanced Ranker ⭐ NEW
- 21 comprehensive features extracted from gene-phenotype pairs
- LightGBM/XGBoost learning-to-rank
- Optimal feature weight learning from data
- Non-linear interaction modeling
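
The IC-based measures listed above can be illustrated on a toy ontology. This is a minimal sketch, not the code in phenorank/algorithms/: the term names, gene names, and function names are all hypothetical, IC is estimated from gene annotation frequency, and BMA is taken as the mean of the two directional best-match averages (other definitions exist).

```python
import math

# Toy ontology (hypothetical terms): child -> set of direct parents.
PARENTS = {
    "root": set(),
    "neuro": {"root"},
    "seizure": {"neuro"},
    "focal_seizure": {"seizure"},
    "delay": {"neuro"},
    "cardiac": {"root"},
}

# Toy annotations (hypothetical genes): gene -> directly annotated terms.
GENE_TERMS = {
    "GENE_A": {"focal_seizure", "delay"},
    "GENE_B": {"seizure"},
    "GENE_C": {"cardiac"},
    "GENE_D": {"delay", "cardiac"},
}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = -log(fraction of genes annotated to t or one of its descendants)."""
    counts = {t: 0 for t in PARENTS}
    for terms in GENE_TERMS.values():
        covered = set().union(*(ancestors(t) for t in terms))
        for t in covered:
            counts[t] += 1
    n = len(GENE_TERMS)
    # Every term in this toy ontology is annotated at least once, so c > 0.
    return {t: -math.log(c / n) for t, c in counts.items()}


IC = information_content()


def resnik(a, b):
    """Resnik similarity: IC of the most informative common ancestor."""
    return max(IC[t] for t in ancestors(a) & ancestors(b))


def lin(a, b):
    """Lin similarity: Resnik normalized by the IC of both terms."""
    denom = IC[a] + IC[b]
    return 2 * resnik(a, b) / denom if denom else 0.0


def best_match_average(query, target, sim=resnik):
    """Average the best match per term in both directions, then take the mean."""
    q_to_t = sum(max(sim(q, t) for t in target) for q in query) / len(query)
    t_to_q = sum(max(sim(t, q) for q in query) for t in target) / len(target)
    return (q_to_t + t_to_q) / 2


def rank_genes(query, sim=resnik):
    """Return gene names sorted by descending BMA score against the query terms."""
    scores = {g: best_match_average(query, terms, sim) for g, terms in GENE_TERMS.items()}
    return sorted(scores, key=scores.get, reverse=True)


print(rank_genes({"focal_seizure"}))
```

Because rarer terms carry higher IC, a match on a specific term such as the toy "focal_seizure" contributes more than a match that is only shared through a broad ancestor.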

Evaluation Framework

Comprehensive metrics: MRR, Hit@K, Precision, Recall, NDCG
Statistical significance testing (Wilcoxon signed-rank test)
Rank distribution analysis
Benchmark dataset generation with controlled noise levels
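
One way to generate a benchmark case with a controlled noise level is sketched below. It is an illustration only and may differ from phenorank/evaluation/benchmark.py: the function name `make_case`, its parameters, and the placeholder gene and term IDs are assumptions. A case keeps a sample of the true gene's terms and adds unrelated terms in proportion to `noise`.

```python
import random


def make_case(gene, gene_terms, all_terms, n_keep=3, noise=0.3, rng=None):
    """Build one synthetic patient for `gene`.

    Keeps up to `n_keep` of the gene's own terms, then adds
    round(noise * kept) terms that are NOT annotated to the gene.
    """
    rng = rng or random.Random()
    true_terms = sorted(gene_terms[gene])
    kept = rng.sample(true_terms, min(n_keep, len(true_terms)))
    n_noise = round(noise * len(kept))
    candidates = sorted(set(all_terms) - gene_terms[gene])
    noisy = rng.sample(candidates, min(n_noise, len(candidates)))
    return {"true_gene": gene, "hpo_terms": kept + noisy}


# Placeholder data: these IDs are made up for the example, not real HPO terms.
gene_terms = {
    "GENE_A": {"HP:9000001", "HP:9000002", "HP:9000003", "HP:9000004"},
    "GENE_B": {"HP:9000005", "HP:9000006", "HP:9000007"},
}
all_terms = set().union(*gene_terms.values())

case = make_case("GENE_A", gene_terms, all_terms, n_keep=4, noise=0.5,
                 rng=random.Random(0))
print(case)
```

Passing a seeded `random.Random` keeps the generated cases reproducible between evaluation runs.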

📋 Requirements

Python 3.9+
Poetry (for dependency management)

🚀 Installation

Option 1: Using Poetry (Recommended)

# After cloning the repository, change into its directory
cd Phenotype-prioritization

# Install dependencies with Poetry
poetry install

# Activate the virtual environment
poetry shell

Option 2: Using pip

# Install dependencies
pip install numpy pandas scikit-learn scipy networkx matplotlib seaborn plotly
pip install pronto requests tqdm joblib
pip install lightgbm xgboost optuna

# Install the package in development mode
pip install -e .

📊 Quick Start

1. Run Complete Evaluation

Run the comprehensive evaluation pipeline that trains models, evaluates all algorithms, and generates visualizations:

python run_evaluation.py

This will:

Load HPO ontology and gene-phenotype associations
Generate 100 synthetic test cases
Train the ML-enhanced model
Evaluate all algorithms
Perform statistical analysis
Generate comparison plots and reports

Results will be saved to the results/ directory.

2. Use the Command Line Interface

Rank genes for specific phenotypes:

# Using ML-enhanced algorithm
python -m phenorank.cli rank --phenotypes HP:0001250,HP:0001263,HP:0002376 --algorithm ml

# Using baseline algorithms
python -m phenorank.cli rank --phenotypes HP:0001250,HP:0001263 --algorithm phen2gene

Search for HPO terms:

python -m phenorank.cli search "seizures"

Get information about a gene or HPO term:

python -m phenorank.cli info HP:0001250
python -m phenorank.cli info SCN1A
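
The --phenotypes value is a comma-separated list of HPO IDs, each of the form HP: followed by seven digits. The sketch below shows one way such a string could be validated before ranking. It is a hypothetical helper (`parse_phenotypes` is not a documented phenorank function) and makes no claim about how phenorank/cli.py parses its arguments.

```python
import re

# HPO identifiers are "HP:" followed by exactly seven digits.
HPO_ID = re.compile(r"^HP:\d{7}$")


def parse_phenotypes(arg):
    """Split a comma-separated string into unique, validated HPO IDs."""
    terms = []
    for raw in arg.split(","):
        term = raw.strip().upper()
        if not term:
            continue  # tolerate trailing commas and empty fields
        if not HPO_ID.match(term):
            raise ValueError(f"not a valid HPO ID: {raw!r}")
        if term not in terms:
            terms.append(term)  # keep first occurrence, preserve order
    return terms


print(parse_phenotypes("HP:0001250, hp:0001263,HP:0001250"))
```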

3. Use as a Python Library

from phenorank.data.hpo_loader import HPOLoader
from phenorank.data.gene_loader import GeneLoader
from phenorank.models.phenotype_profile import PhenotypeProfile
from phenorank.ml.ml_ranker import MLRanker

# Load data
hpo_loader = HPOLoader()
hpo_terms = hpo_loader.load_hpo()

gene_loader = GeneLoader()
genes = gene_loader.load_genes()

# Initialize ML ranker and load trained model
ranker = MLRanker(hpo_terms, genes)
ranker.load_model("models/ml_ranker.pkl")

# Create patient phenotype profile
profile = PhenotypeProfile(
    patient_id="Patient001",
    hpo_terms=["HP:0001250", "HP:0001263", "HP:0002376"]  # Seizures, developmental delay, regression
)

# Rank genes
result = ranker.rank_genes(profile)

# Display top 10 genes
for gene_score in result.get_top_n(10):
    print(f"{gene_score.rank}. {gene_score.gene_symbol}: {gene_score.score:.4f}")

📁 Project Structure

Phenotype-prioritization/
├── phenorank/                  # Main package
│   ├── algorithms/             # Ranking algorithms
│   │   ├── base.py            # Base ranker class
│   │   ├── semantic_similarity.py
│   │   ├── phen2gene.py
│   │   └── exomiser_like.py
│   ├── ml/                     # ML-enhanced ranking
│   │   ├── ml_ranker.py       # ML ranker implementation
│   │   └── feature_extractor.py  # Feature engineering
│   ├── models/                 # Data models
│   │   ├── hpo_term.py
│   │   ├── gene.py
│   │   ├── phenotype_profile.py
│   │   └── ranking_result.py
│   ├── data/                   # Data loading
│   │   ├── hpo_loader.py
│   │   └── gene_loader.py
│   ├── evaluation/             # Evaluation framework
│   │   ├── metrics.py
│   │   ├── evaluator.py
│   │   └── benchmark.py
│   ├── utils/                  # Utilities
│   │   ├── logging_config.py
│   │   └── visualization.py
│   └── cli.py                  # Command line interface
├── run_evaluation.py           # Main evaluation script
├── pyproject.toml             # Poetry dependencies
├── README.md                   # This file
├── ALGORITHMS.md              # Algorithm details
└── AI_USAGE.md                # AI tool usage documentation

📈 Evaluation Results

After running run_evaluation.py, you'll find:

Metrics Files

results/comparison_metrics.csv - Detailed metrics for all algorithms
results/statistical_tests.csv - Pairwise statistical significance tests
results/EVALUATION_REPORT.md - Comprehensive summary report

Visualizations

results/metrics_comparison.png - Bar chart comparison
results/radar_comparison.html - Interactive radar chart
results/rank_distributions.png - Distribution comparison
results/rank_dist_<algorithm>.png - Individual distributions

Example Results

Algorithm           MRR   Mean Rank  Hit@1  Hit@10  NDCG@10
MLRanker            0.XX  XX.X       0.XX   0.XX    0.XX
ExomiserLike        0.XX  XX.X       0.XX   0.XX    0.XX
Phen2Gene           0.XX  XX.X       0.XX   0.XX    0.XX
SemanticSimilarity  0.XX  XX.X       0.XX   0.XX    0.XX

Note: Run evaluation to generate actual results

🧬 Data Sources

HPO Ontology

Source: http://purl.obolibrary.org/obo/hp.obo
Description: Human Phenotype Ontology provides standardized vocabulary for phenotypic abnormalities
Auto-download: First run automatically downloads latest version

Gene-Phenotype Associations

Source: http://purl.obolibrary.org/obo/hp/hpoa/genes_to_phenotype.txt
Description: Gene to phenotype annotations from HPO
Fallback: Demo data included for testing
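
For orientation, the annotation file is tab-separated, one gene-term pair per row. The sketch below groups HPO IDs by gene symbol. The column names `gene_symbol` and `hpo_id` are an assumption based on recent HPO releases (older releases used a different header), so check the header of the file you download. The sample rows are placeholders, not real annotations, and `load_gene_terms` is a hypothetical helper rather than the GeneLoader implementation.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows in the assumed layout; IDs and names are made up.
SAMPLE = (
    "ncbi_gene_id\tgene_symbol\thpo_id\thpo_name\tfrequency\tdisease_id\n"
    "1\tGENE_A\tHP:9000001\tplaceholder term 1\t-\tOMIM:000000\n"
    "1\tGENE_A\tHP:9000002\tplaceholder term 2\t-\tOMIM:000000\n"
    "2\tGENE_B\tHP:9000001\tplaceholder term 1\t-\tOMIM:000001\n"
)


def load_gene_terms(handle):
    """Read tab-separated rows and return {gene_symbol: set of HPO IDs}."""
    gene_terms = defaultdict(set)
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        gene_terms[row["gene_symbol"]].add(row["hpo_id"])
    return dict(gene_terms)


gene_terms = load_gene_terms(io.StringIO(SAMPLE))
print(gene_terms)
```

With a downloaded file, the same function can be called on an open file handle instead of the in-memory sample.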

🔬 Algorithm Details

See ALGORITHMS.md for detailed description of:

Semantic similarity calculations
Feature engineering for ML model
Scoring methodologies
Algorithmic improvements

🤖 AI Usage

This project was developed with assistance from AI tools. See AI_USAGE.md for:

Which parts used AI assistance
How AI-generated content was validated
Manual modifications and improvements

📊 Performance Metrics Explained

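MRR (Mean Reciprocal Rank): Average of 1/rank of the true gene across all queries (0-1 scale). Higher is better.
Hit@K: Proportion of queries where the true gene appears in the top K results.
Precision@K: Proportion of the top K results that are relevant.
Recall@K: Proportion of all relevant items that appear in the top K results.
NDCG (Normalized Discounted Cumulative Gain): Measures ranking quality on a 0-1 scale, giving more credit to relevant genes placed near the top. Higher is better.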
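
As a worked example of these definitions, the sketch below computes MRR, Hit@K, and NDCG@K from a list of 1-based ranks of the true gene, one rank per query. It assumes exactly one relevant gene per query, in which case the ideal DCG is 1 and NDCG reduces to 1/log2(rank + 1). The function names are illustrative and are not taken from phenorank/evaluation/metrics.py.

```python
import math


def mrr(ranks):
    """Mean of 1/rank over all queries (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hit_at_k(ranks, k):
    """Fraction of queries whose true gene is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def ndcg_at_k(rank, k):
    """NDCG@k for a single query with exactly one relevant gene."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


def mean_ndcg_at_k(ranks, k):
    """Average NDCG@k over all queries."""
    return sum(ndcg_at_k(r, k) for r in ranks) / len(ranks)


# Ranks of the true gene in four example queries.
ranks = [1, 2, 4, 20]
print(f"MRR     = {mrr(ranks):.3f}")
print(f"Hit@1   = {hit_at_k(ranks, 1):.2f}")
print(f"Hit@10  = {hit_at_k(ranks, 10):.2f}")
print(f"NDCG@10 = {mean_ndcg_at_k(ranks, 10):.3f}")
```

With a single true gene per query, Precision@K is Hit@K divided by K, which is why Hit@K is usually the more informative of the two in this setting.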

🛠️ Development

Running Tests

# Run tests (when implemented)
pytest tests/

# Run with coverage
pytest --cov=phenorank tests/

Code Formatting

# Format code
black phenorank/

# Check style
flake8 phenorank/

📝 Citation

If you use this software in your research, please cite:

Phenotype Prioritization System
ML-Enhanced Gene Ranking for Genomic Interpretation
https://github.com/yourusername/phenotype-prioritization

🤝 Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Make your changes
Submit a pull request

📄 License

This project is provided for educational and research purposes.

🆘 Troubleshooting

Common Issues

HPO download fails: Check internet connection or use demo data
ML training fails: Ensure LightGBM/XGBoost is installed correctly
Memory issues: Reduce number of test cases or use smaller gene set

Getting Help

Check the documentation
Review issues
Contact the development team

🎓 References

Key Papers

Phenotype matching: Robinson et al., "The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease" (2008)
Phen2Gene: Zhao et al., "Phen2Gene: rapid phenotype-driven gene prioritization for rare diseases" (2020)
Exomiser: Smedley et al., "Next-generation diagnostics and disease-gene discovery with the Exomiser" (2015)
Learning to Rank: Liu, "Learning to Rank for Information Retrieval" (2009)
