A Python application for ranking genes by phenotype similarity using the Human Phenotype Ontology (HPO). It combines several established phenotype-matching algorithms with a machine-learning ranker, with the aim of improving diagnostic yield in genomic interpretation.
Phenotype ranking is crucial for tertiary analysis in genomic interpretation. This application:
- Integrates patient phenotypic data (HPO terms) with genomic variant information
- Scores and ranks genes based on their likelihood of explaining observed phenotypes
- Implements multiple algorithms including semantic similarity, Phen2Gene-like, and Exomiser-like approaches
- Enhances ranking with ML using gradient boosting (LightGBM) for optimal feature weighting
- Provides comprehensive evaluation with statistical analysis and visualizations
Semantic Similarity Ranker
- Information Content (IC) based semantic similarity
- Resnik and Lin similarity metrics
- Best Match Average (BMA) scoring
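
The three ideas above can be sketched in a few lines. This is a minimal illustration on a toy ontology, not the package's implementation: the term names, the annotation counts, and every function name here are assumptions made for the example.

```python
import math

# Toy ontology: child -> parents (hypothetical terms, not real HPO IDs).
PARENTS = {
    "root": [],
    "neuro": ["root"],
    "seizure": ["neuro"],
    "focal_seizure": ["seizure"],
    "delay": ["neuro"],
    "cardiac": ["root"],
}
# Toy annotation counts per term, already propagated up to ancestors.
ANNOTATION_COUNTS = {"root": 100, "neuro": 40, "seizure": 10,
                     "focal_seizure": 2, "delay": 20, "cardiac": 30}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content: negative log of the term's annotation frequency."""
    return -math.log(ANNOTATION_COUNTS[term] / ANNOTATION_COUNTS["root"])


def resnik(a, b):
    """IC of the most informative common ancestor of a and b."""
    return max(ic(t) for t in ancestors(a) & ancestors(b))


def lin(a, b):
    """Resnik similarity normalised by the ICs of both terms."""
    denom = ic(a) + ic(b)
    return 2 * resnik(a, b) / denom if denom > 0 else 0.0


def best_match_average(patient_terms, gene_terms, sim=lin):
    """Average the best-match similarity in both directions."""
    p2g = sum(max(sim(p, g) for g in gene_terms) for p in patient_terms) / len(patient_terms)
    g2p = sum(max(sim(g, p) for p in patient_terms) for g in gene_terms) / len(gene_terms)
    return (p2g + g2p) / 2
```

With these toy counts, `best_match_average(["focal_seizure", "delay"], ["seizure"])` is higher than the same query against `["cardiac"]`, because the only ancestor shared with `cardiac` is the root, which carries no information.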
Phen2Gene-like Ranker
- IC-weighted phenotype matching
- Phenotype specificity scoring
- Hierarchical term matching
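
One way to read these three points together is as an IC-weighted overlap that falls back to ancestors when there is no exact match. The sketch below works under that assumption; the IC values, the single-parent toy ontology, and the function names are hypothetical and are not taken from Phen2Gene or from this package.

```python
# Hypothetical IC values and parent links for a toy, single-parent ontology.
IC = {"neuro": 0.9, "seizure": 2.3, "focal_seizure": 3.9, "delay": 1.6}
PARENT = {"focal_seizure": "seizure", "seizure": "neuro",
          "delay": "neuro", "neuro": None}


def lineage(term):
    """Return the term followed by its ancestors, most specific first."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def ic_weighted_score(patient_terms, gene_terms):
    """Credit each patient term with the IC of its most specific match, scaled to [0, 1]."""
    # A gene annotated with a term is implicitly annotated with its ancestors.
    gene_closure = {a for g in gene_terms for a in lineage(g)}
    matched = 0.0
    for term in patient_terms:
        # Walk upwards until a term the gene is annotated with is found.
        for candidate in lineage(term):
            if candidate in gene_closure:
                matched += IC[candidate]
                break
    total = sum(IC[t] for t in patient_terms)
    return matched / total if total else 0.0
```

An exact match earns the full IC of the patient term, while a match that is only reached through a general ancestor earns the smaller IC of that ancestor, so specific phenotypes dominate the score.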
Exomiser-like Ranker
- Multi-component scoring (similarity + frequency + coverage)
- Disease association integration
- SimJ semantic similarity
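
As a rough sketch of these ideas: SimJ is the Jaccard index of two term sets after each has been expanded with its ancestors, and a multi-component score can be a weighted sum of the parts. The toy ontology, the weights, and the function names below are assumptions for illustration, not the values used by Exomiser or by this package.

```python
# Hypothetical single-parent toy ontology.
PARENT = {"focal_seizure": "seizure", "seizure": "neuro",
          "delay": "neuro", "neuro": None}


def closure(terms):
    """Expand a set of terms with all of their ancestors."""
    expanded = set()
    for term in terms:
        while term is not None:
            expanded.add(term)
            term = PARENT[term]
    return expanded


def simj(terms_a, terms_b):
    """Jaccard index of the two ancestor-closed term sets."""
    a, b = closure(terms_a), closure(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_score(similarity, frequency, coverage, weights=(0.6, 0.2, 0.2)):
    """Weighted sum of the three components; the default weights are placeholders."""
    w_sim, w_freq, w_cov = weights
    return w_sim * similarity + w_freq * frequency + w_cov * coverage
```

For example, `simj(["focal_seizure"], ["delay"])` compares `{focal_seizure, seizure, neuro}` with `{delay, neuro}`: one shared term out of four in the union.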
ML-Enhanced Ranker (NEW)
- 21 comprehensive features extracted from gene-phenotype pairs
- LightGBM/XGBoost learning-to-rank
- Optimal feature weight learning from data
- Non-linear interaction modeling
- Comprehensive metrics: MRR, Hit@K, Precision, Recall, NDCG
- Statistical significance testing (Wilcoxon signed-rank test)
- Rank distribution analysis
- Benchmark dataset generation with controlled noise levels
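
To make "controlled noise levels" concrete, here is a minimal sketch of how a synthetic case could be built: sample some of a gene's annotated terms, then swap a fixed fraction for unrelated ones. The function name, its parameters, and the replacement strategy are assumptions; the package's benchmark generator may work differently.

```python
import random


def make_noisy_case(gene_terms, all_terms, noise_level, n_terms=3, rng=None):
    """Sample patient terms from a gene's annotations, swapping a fraction for unrelated terms."""
    rng = rng or random.Random()
    # Sort before sampling so results are reproducible for a seeded rng.
    sampled = rng.sample(sorted(gene_terms), min(n_terms, len(gene_terms)))
    n_noise = round(noise_level * len(sampled))
    unrelated = sorted(set(all_terms) - set(gene_terms))
    noise = rng.sample(unrelated, min(n_noise, len(unrelated)))
    # Replace the first len(noise) sampled terms with the unrelated ones.
    return noise + sampled[len(noise):]
```

A `noise_level` of 0.0 returns only terms annotated to the gene, and 1.0 returns only unrelated terms, provided the ontology has enough of them to draw from.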
- Python 3.9+
- Poetry (for dependency management)
# Clone the repository
cd Phenotype-priortization
# Install dependencies with Poetry
poetry install
# Activate the virtual environment
poetry shell
# Install dependencies
pip install numpy pandas scikit-learn scipy networkx matplotlib seaborn plotly
pip install pronto requests tqdm joblib
pip install lightgbm xgboost optuna
# Install the package in development mode
pip install -e .
Run the comprehensive evaluation pipeline, which trains models, evaluates all algorithms, and generates visualizations:
python run_evaluation.py
This will:
- Load HPO ontology and gene-phenotype associations
- Generate 100 synthetic test cases
- Train the ML-enhanced model
- Evaluate all algorithms
- Perform statistical analysis
- Generate comparison plots and reports
Results will be saved to the results/ directory.
# Using ML-enhanced algorithm
python -m phenorank.cli rank --phenotypes HP:0001250,HP:0001263,HP:0002376 --algorithm ml
# Using baseline algorithms
python -m phenorank.cli rank --phenotypes HP:0001250,HP:0001263 --algorithm phen2gene
# Search HPO terms by keyword
python -m phenorank.cli search "seizures"
# Show information about an HPO term or a gene
python -m phenorank.cli info HP:0001250
python -m phenorank.cli info SCN1A
The same ranking can be run from Python:
from phenorank.data.hpo_loader import HPOLoader
from phenorank.data.gene_loader import GeneLoader
from phenorank.models.phenotype_profile import PhenotypeProfile
from phenorank.ml.ml_ranker import MLRanker
# Load data
hpo_loader = HPOLoader()
hpo_terms = hpo_loader.load_hpo()
gene_loader = GeneLoader()
genes = gene_loader.load_genes()
# Initialize ML ranker and load trained model
ranker = MLRanker(hpo_terms, genes)
ranker.load_model("models/ml_ranker.pkl")
# Create patient phenotype profile
profile = PhenotypeProfile(
    patient_id="Patient001",
    hpo_terms=["HP:0001250", "HP:0001263", "HP:0002376"],  # Seizures, developmental delay, regression
)
# Rank genes
result = ranker.rank_genes(profile)
# Display top 10 genes
for gene_score in result.get_top_n(10):
    print(f"{gene_score.rank}. {gene_score.gene_symbol}: {gene_score.score:.4f}")
Phenotype-priortization/
├── phenorank/                    # Main package
│   ├── algorithms/               # Ranking algorithms
│   │   ├── base.py               # Base ranker class
│   │   ├── semantic_similarity.py
│   │   ├── phen2gene.py
│   │   └── exomiser_like.py
│   ├── ml/                       # ML-enhanced ranking
│   │   ├── ml_ranker.py          # ML ranker implementation
│   │   └── feature_extractor.py  # Feature engineering
│   ├── models/                   # Data models
│   │   ├── hpo_term.py
│   │   ├── gene.py
│   │   ├── phenotype_profile.py
│   │   └── ranking_result.py
│   ├── data/                     # Data loading
│   │   ├── hpo_loader.py
│   │   └── gene_loader.py
│   ├── evaluation/               # Evaluation framework
│   │   ├── metrics.py
│   │   ├── evaluator.py
│   │   └── benchmark.py
│   ├── utils/                    # Utilities
│   │   ├── logging_config.py
│   │   └── visualization.py
│   └── cli.py                    # Command line interface
├── run_evaluation.py             # Main evaluation script
├── pyproject.toml                # Poetry dependencies
├── README.md                     # This file
├── ALGORITHMS.md                 # Algorithm details
└── AI_USAGE.md                   # AI tool usage documentation
After running run_evaluation.py, you'll find:
- results/comparison_metrics.csv - Detailed metrics for all algorithms
- results/statistical_tests.csv - Pairwise statistical significance tests
- results/EVALUATION_REPORT.md - Comprehensive summary report
- results/metrics_comparison.png - Bar chart comparison
- results/radar_comparison.html - Interactive radar chart
- results/rank_distributions.png - Distribution comparison
- results/rank_dist_<algorithm>.png - Individual distributions
| Algorithm | MRR | Mean Rank | Hit@1 | Hit@10 | NDCG@10 |
|---|---|---|---|---|---|
| MLRanker | 0.XX | XX.X | 0.XX | 0.XX | 0.XX |
| ExomiserLike | 0.XX | XX.X | 0.XX | 0.XX | 0.XX |
| Phen2Gene | 0.XX | XX.X | 0.XX | 0.XX | 0.XX |
| SemanticSimilarity | 0.XX | XX.X | 0.XX | 0.XX | 0.XX |
Note: The values above are placeholders. Run the evaluation to generate actual results.
- Source: http://purl.obolibrary.org/obo/hp.obo
- Description: The Human Phenotype Ontology provides a standardized vocabulary for phenotypic abnormalities
- Auto-download: The first run automatically downloads the latest version
- Source: http://purl.obolibrary.org/obo/hp/hpoa/genes_to_phenotype.txt
- Description: Gene to phenotype annotations from HPO
- Fallback: Demo data included for testing
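
For orientation, the annotation file can be read as tab-separated text and reduced to a gene-to-terms mapping. The sketch below assumes a header row with `gene_symbol` and `hpo_id` columns; that layout, the sample rows, and the function name `load_gene_to_hpo` are assumptions for illustration, so check them against the downloaded file before relying on them.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only; the column names are an assumption about the file layout.
SAMPLE = (
    "ncbi_gene_id\tgene_symbol\thpo_id\thpo_name\tfrequency\tdisease_id\n"
    "6323\tSCN1A\tHP:0001250\tSeizure\t-\tOMIM:607208\n"
    "6323\tSCN1A\tHP:0001263\tGlobal developmental delay\t-\tOMIM:607208\n"
)


def load_gene_to_hpo(handle):
    """Map each gene symbol to the set of HPO IDs annotated to it."""
    gene_to_hpo = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        gene_to_hpo[row["gene_symbol"]].add(row["hpo_id"])
    return dict(gene_to_hpo)


# In practice the handle would be an opened copy of genes_to_phenotype.txt.
gene_to_hpo = load_gene_to_hpo(io.StringIO(SAMPLE))
```

Collapsing to sets discards the frequency and disease columns, which is enough for term-overlap scoring but not for frequency-aware scoring.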
See ALGORITHMS.md for detailed description of:
- Semantic similarity calculations
- Feature engineering for ML model
- Scoring methodologies
- Algorithmic improvements
This project was developed with assistance from AI tools. See AI_USAGE.md for:
- Which parts used AI assistance
- How AI-generated content was validated
- Manual modifications and improvements
- MRR (Mean Reciprocal Rank): Average of 1/rank across all queries. Higher is better.
- Hit@K: Proportion of queries where the true gene appears in the top K results.
- Precision@K: Proportion of relevant items in top K results.
- NDCG (Normalized Discounted Cumulative Gain): Measures ranking quality (0-1 scale).
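
These definitions can be written down compactly for the common benchmark setting where each query has exactly one true gene. The sketch below assumes that setting and takes the 1-based rank of the true gene per query as input; the function names are illustrative and need not match those in `phenorank/evaluation/metrics.py`.

```python
import math


def mrr(ranks):
    """Mean reciprocal rank of the true gene across queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hit_at_k(ranks, k):
    """Fraction of queries whose true gene is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def precision_at_k(ranks, k):
    """With one relevant gene per query, precision@k is hit@k divided by k."""
    return hit_at_k(ranks, k) / k


def ndcg_at_k(ranks, k):
    """With one relevant gene the ideal DCG is 1, so NDCG is 1 / log2(rank + 1) inside the top k."""
    gains = (1.0 / math.log2(r + 1) if r <= k else 0.0 for r in ranks)
    return sum(gains) / len(ranks)


ranks = [1, 2, 10]  # rank of the true gene in three example queries
print(mrr(ranks), hit_at_k(ranks, 1), hit_at_k(ranks, 10), ndcg_at_k(ranks, 10))
```

With several relevant genes per query, precision and NDCG would need the full ranked list rather than a single rank.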
# Run tests (when implemented)
pytest tests/
# Run with coverage
pytest --cov=phenorank tests/
# Format code
black phenorank/
# Check style
flake8 phenorank/
If you use this software in your research, please cite:
Phenotype Prioritization System
ML-Enhanced Gene Ranking for Genomic Interpretation
https://github.com/yourusername/phenotype-prioritization
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is provided for educational and research purposes.
- HPO download fails: Check internet connection or use demo data
- ML training fails: Ensure LightGBM/XGBoost is installed correctly
- Memory issues: Reduce number of test cases or use smaller gene set
- Check the documentation
- Review issues
- Contact the development team
- Phenotype matching: Robinson et al., "The Human Phenotype Ontology" (2008)
- Phen2Gene: Zhao et al., "Phen2Gene: rapid phenotype-driven gene prioritization" (2020)
- Exomiser: Smedley et al., "Next-generation diagnostics and disease-gene discovery" (2015)
- Learning to Rank: Liu, "Learning to Rank for Information Retrieval" (2009)