Phenotype Prioritization System

A Python application for ranking genes by phenotype similarity using the Human Phenotype Ontology (HPO). It combines semantic-similarity, Phen2Gene-like, and Exomiser-like rankers with a machine-learning ranker, with the goal of improving diagnostic yield in genomic interpretation.

🎯 Overview

Phenotype ranking is crucial for tertiary analysis in genomic interpretation. This application:

  • Integrates patient phenotypic data (HPO terms) with genomic variant information
  • Scores and ranks genes based on their likelihood of explaining observed phenotypes
  • Implements multiple algorithms including semantic similarity, Phen2Gene-like, and Exomiser-like approaches
  • Enhances ranking with ML, using gradient boosting (LightGBM) to learn feature weights from data
  • Provides comprehensive evaluation with statistical analysis and visualizations

✨ Features

Algorithms Implemented

  1. Semantic Similarity Ranker

    • Information Content (IC) based semantic similarity
    • Resnik and Lin similarity metrics
    • Best Match Average (BMA) scoring
  2. Phen2Gene-like Ranker

    • IC-weighted phenotype matching
    • Phenotype specificity scoring
    • Hierarchical term matching
  3. Exomiser-like Ranker

    • Multi-component scoring (similarity + frequency + coverage)
    • Disease association integration
    • SimJ semantic similarity
  4. ML-Enhanced Ranker ⭐ NEW

    • 21 comprehensive features extracted from gene-phenotype pairs
    • LightGBM/XGBoost learning-to-rank
    • Optimal feature weight learning from data
    • Non-linear interaction modeling
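The IC-based measures listed for the Semantic Similarity Ranker can be illustrated with a minimal sketch. It assumes a toy ontology and toy gene annotations (the term IDs, gene names, and function names below are hypothetical, not real HPO data or the phenorank API), estimates IC from annotation frequency, and uses the BMA variant that averages the two directional means.

```python
import math

# Toy ontology (hypothetical terms, not real HPO IDs): child -> parents
PARENTS = {
    "root": [], "A": ["root"], "B": ["root"],
    "A1": ["A"], "A2": ["A"], "B1": ["B"],
}

# Hypothetical gene annotations, used both for IC estimation and for ranking
GENE_TERMS = {
    "GENE1": {"A1", "A2"},
    "GENE2": {"A1"},
    "GENE3": {"B1"},
    "GENE4": {"B1", "A2"},
}

def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(gene_terms):
    """IC(t) = log(N / n_t), where n_t genes are annotated to t or a descendant."""
    n = len(gene_terms)
    counts = {}
    for terms in gene_terms.values():
        covered = set().union(*(ancestors(t) for t in terms))
        for t in covered:
            counts[t] = counts.get(t, 0) + 1
    return {t: math.log(n / c) for t, c in counts.items()}

IC = information_content(GENE_TERMS)

def resnik(t1, t2):
    """Resnik similarity: IC of the most informative common ancestor."""
    return max(IC[t] for t in ancestors(t1) & ancestors(t2))

def lin(t1, t2):
    """Lin similarity: Resnik score normalised by the ICs of both terms."""
    denom = IC[t1] + IC[t2]
    return 2 * resnik(t1, t2) / denom if denom else 0.0

def best_match_average(query, target, sim=resnik):
    """Mean of the best match per query term and per target term, averaged."""
    q_best = [max(sim(q, t) for t in target) for q in query]
    t_best = [max(sim(t, q) for q in query) for t in target]
    return (sum(q_best) / len(q_best) + sum(t_best) / len(t_best)) / 2

# Rank the toy genes for a patient with a single observed term
query = ["A1"]
ranking = sorted(
    ((gene, best_match_average(query, terms)) for gene, terms in GENE_TERMS.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for gene, score in ranking:
    print(f"{gene}: {score:.4f}")
```

On this toy data GENE2 ranks first because its only annotation matches the query exactly, while GENE1 is penalised in the reverse direction for its extra, less similar term.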

Evaluation Framework

  • Comprehensive metrics: MRR, Hit@K, Precision, Recall, NDCG
  • Statistical significance testing (Wilcoxon signed-rank test)
  • Rank distribution analysis
  • Benchmark dataset generation with controlled noise levels
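One way to read "benchmark dataset generation with controlled noise levels" is sketched below. It is an assumption about the approach, not the project's actual generator: each synthetic patient samples terms from a causal gene's annotations, then receives extra terms from unrelated genes in proportion to a `noise` parameter. The annotations, term IDs, and function names are hypothetical.

```python
import random

# Hypothetical gene -> phenotype annotations (toy IDs, not real HPO terms)
GENE_PHENOTYPES = {
    "GENE1": {"T01", "T02", "T03", "T04"},
    "GENE2": {"T05", "T06", "T07", "T08"},
    "GENE3": {"T09", "T10", "T11", "T12"},
}

def make_case(gene, annotations, n_true=3, noise=0.2, rng=random):
    """Simulate one patient for a known causal gene.

    n_true terms are sampled from the gene's annotations, then
    round(noise * n_true) unrelated terms are appended as noise.
    """
    true_terms = sorted(annotations[gene])
    sampled = rng.sample(true_terms, min(n_true, len(true_terms)))
    all_terms = set().union(*annotations.values())
    unrelated = sorted(all_terms - annotations[gene])
    n_noise = min(round(noise * len(sampled)), len(unrelated))
    return {"true_gene": gene, "hpo_terms": sampled + rng.sample(unrelated, n_noise)}

def make_benchmark(annotations, n_cases=100, noise=0.2, seed=0):
    """Generate n_cases synthetic patients; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    genes = sorted(annotations)
    return [
        make_case(rng.choice(genes), annotations, noise=noise, rng=rng)
        for _ in range(n_cases)
    ]

cases = make_benchmark(GENE_PHENOTYPES, n_cases=5, noise=0.5, seed=42)
for case in cases:
    print(case["true_gene"], case["hpo_terms"])
```

Sorting before sampling keeps the output independent of set iteration order, so the same seed yields the same benchmark across runs.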

📋 Requirements

  • Python 3.9+
  • Poetry (for dependency management)

🚀 Installation

Option 1: Using Poetry (Recommended)

# Clone the repository
git clone https://github.com/schavan10/Phenotype-prioritization.git
cd Phenotype-prioritization

# Install dependencies with Poetry
poetry install

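# Activate the virtual environment
# (on Poetry 2.x, `poetry shell` requires the poetry-plugin-shell plugin; `poetry run <command>` works without it)
poetry shell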

Option 2: Using pip

# Install dependencies
pip install numpy pandas scikit-learn scipy networkx matplotlib seaborn plotly
pip install pronto requests tqdm joblib
pip install lightgbm xgboost optuna

# Install the package in development mode
pip install -e .

📊 Quick Start

1. Run Complete Evaluation

Run the comprehensive evaluation pipeline that trains models, evaluates all algorithms, and generates visualizations:

python run_evaluation.py

This will:

  • Load HPO ontology and gene-phenotype associations
  • Generate 100 synthetic test cases
  • Train the ML-enhanced model
  • Evaluate all algorithms
  • Perform statistical analysis
  • Generate comparison plots and reports

Results will be saved to the results/ directory.

2. Use the Command Line Interface

Rank genes for specific phenotypes:

# Using ML-enhanced algorithm
python -m phenorank.cli rank --phenotypes HP:0001250,HP:0001263,HP:0002376 --algorithm ml

# Using baseline algorithms
python -m phenorank.cli rank --phenotypes HP:0001250,HP:0001263 --algorithm phen2gene

Search for HPO terms:

python -m phenorank.cli search "seizures"

Get information about a gene or HPO term:

python -m phenorank.cli info HP:0001250
python -m phenorank.cli info SCN1A

3. Use as a Python Library

from phenorank.data.hpo_loader import HPOLoader
from phenorank.data.gene_loader import GeneLoader
from phenorank.models.phenotype_profile import PhenotypeProfile
from phenorank.ml.ml_ranker import MLRanker

# Load data
hpo_loader = HPOLoader()
hpo_terms = hpo_loader.load_hpo()

gene_loader = GeneLoader()
genes = gene_loader.load_genes()

# Initialize ML ranker and load trained model
ranker = MLRanker(hpo_terms, genes)
ranker.load_model("models/ml_ranker.pkl")

# Create patient phenotype profile
profile = PhenotypeProfile(
    patient_id="Patient001",
    hpo_terms=["HP:0001250", "HP:0001263", "HP:0002376"]  # Seizures, developmental delay, regression
)

# Rank genes
result = ranker.rank_genes(profile)

# Display top 10 genes
for gene_score in result.get_top_n(10):
    print(f"{gene_score.rank}. {gene_score.gene_symbol}: {gene_score.score:.4f}")

📁 Project Structure

Phenotype-prioritization/
├── phenorank/                  # Main package
│   ├── algorithms/             # Ranking algorithms
│   │   ├── base.py             # Base ranker class
│   │   ├── semantic_similarity.py
│   │   ├── phen2gene.py
│   │   └── exomiser_like.py
│   ├── ml/                     # ML-enhanced ranking
│   │   ├── ml_ranker.py        # ML ranker implementation
│   │   └── feature_extractor.py  # Feature engineering
│   ├── models/                 # Data models
│   │   ├── hpo_term.py
│   │   ├── gene.py
│   │   ├── phenotype_profile.py
│   │   └── ranking_result.py
│   ├── data/                   # Data loading
│   │   ├── hpo_loader.py
│   │   └── gene_loader.py
│   ├── evaluation/             # Evaluation framework
│   │   ├── metrics.py
│   │   ├── evaluator.py
│   │   └── benchmark.py
│   ├── utils/                  # Utilities
│   │   ├── logging_config.py
│   │   └── visualization.py
│   └── cli.py                  # Command line interface
├── run_evaluation.py           # Main evaluation script
├── pyproject.toml              # Poetry dependencies
├── README.md                   # This file
├── ALGORITHMS.md               # Algorithm details
└── AI_USAGE.md                 # AI tool usage documentation

📈 Evaluation Results

After running run_evaluation.py, you'll find:

Metrics Files

  • results/comparison_metrics.csv - Detailed metrics for all algorithms
  • results/statistical_tests.csv - Pairwise statistical significance tests
  • results/EVALUATION_REPORT.md - Comprehensive summary report

Visualizations

  • results/metrics_comparison.png - Bar chart comparison
  • results/radar_comparison.html - Interactive radar chart
  • results/rank_distributions.png - Distribution comparison
  • results/rank_dist_<algorithm>.png - Individual distributions

Example Results

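| Algorithm          | MRR  | Mean Rank | Hit@1 | Hit@10 | NDCG@10 |
|--------------------|------|-----------|-------|--------|---------|
| MLRanker           | 0.XX | XX.X      | 0.XX  | 0.XX   | 0.XX    |
| ExomiserLike       | 0.XX | XX.X      | 0.XX  | 0.XX   | 0.XX    |
| Phen2Gene          | 0.XX | XX.X      | 0.XX  | 0.XX   | 0.XX    |
| SemanticSimilarity | 0.XX | XX.X      | 0.XX  | 0.XX   | 0.XX    |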

Note: The values above are placeholders. Run run_evaluation.py to generate actual results.

🧬 Data Sources

HPO Ontology

  • Source: http://purl.obolibrary.org/obo/hp.obo
  • Description: Human Phenotype Ontology provides standardized vocabulary for phenotypic abnormalities
  • Auto-download: First run automatically downloads latest version
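The download-on-first-run behaviour can be approximated with a small caching helper. This is a sketch under stated assumptions rather than the project's HPOLoader: the `ensure_hpo` name, the `data` cache directory, and the injectable `fetch` argument are all hypothetical; only the URL comes from the source listed above.

```python
from pathlib import Path
from urllib.request import urlretrieve

HPO_URL = "http://purl.obolibrary.org/obo/hp.obo"  # source URL listed above

def ensure_hpo(cache_dir="data", url=HPO_URL, fetch=urlretrieve):
    """Return a local path to hp.obo, downloading only if no cached copy exists.

    `fetch(url, destination)` defaults to urllib's urlretrieve; it is a
    parameter so the download step can be replaced (for example, in tests).
    """
    path = Path(cache_dir) / "hp.obo"
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        # Download to a temporary name first so an interrupted transfer
        # never leaves a truncated hp.obo in the cache.
        tmp = path.with_name("hp.obo.part")
        fetch(url, str(tmp))
        tmp.replace(path)
    return path
```

Calling `ensure_hpo()` a second time returns the cached file without touching the network; deleting the file forces a fresh download of the latest release.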

Gene-Phenotype Associations

🔬 Algorithm Details

See ALGORITHMS.md for detailed description of:

  • Semantic similarity calculations
  • Feature engineering for ML model
  • Scoring methodologies
  • Algorithmic improvements

🤖 AI Usage

This project was developed with assistance from AI tools. See AI_USAGE.md for:

  • Which parts used AI assistance
  • How AI-generated content was validated
  • Manual modifications and improvements

📊 Performance Metrics Explained

  • MRR (Mean Reciprocal Rank): Average of 1/rank across all queries. Higher is better.
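  • Hit@K: Proportion of queries where the true gene appears in the top K results.
  • Precision@K: Proportion of relevant items among the top K results.
  • NDCG (Normalized Discounted Cumulative Gain): Rewards placing relevant genes near the top, discounting each hit logarithmically by its rank. Ranges from 0 to 1; higher is better.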
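The metrics above can be computed in a few lines. The sketch below assumes each test case has exactly one true gene (binary relevance), so the ideal DCG is 1 and NDCG@K reduces to 1/log2(rank + 1); the gene labels and function names are hypothetical and are not taken from phenorank's metrics module.

```python
import math

def reciprocal_rank(ranked, true_gene):
    """1/rank of the true gene (1-based), or 0.0 if it is not in the list."""
    for rank, gene in enumerate(ranked, start=1):
        if gene == true_gene:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(cases):
    """Average reciprocal rank over (ranked_genes, true_gene) pairs."""
    return sum(reciprocal_rank(r, g) for r, g in cases) / len(cases)

def hit_at_k(cases, k):
    """Fraction of cases whose true gene appears in the top k."""
    return sum(g in r[:k] for r, g in cases) / len(cases)

def ndcg_at_k(ranked, true_gene, k):
    """NDCG@k with a single relevant gene: 1/log2(rank + 1) if rank <= k."""
    for rank, gene in enumerate(ranked[:k], start=1):
        if gene == true_gene:
            return 1.0 / math.log2(rank + 1)
    return 0.0

# Three toy cases: true gene ranked 1st, ranked 2nd, and missing
cases = [
    (["G1", "G2", "G3"], "G1"),
    (["G2", "G1", "G3"], "G1"),
    (["G3", "G2", "G4"], "G1"),
]
print(f"MRR:    {mean_reciprocal_rank(cases):.3f}")
print(f"Hit@1:  {hit_at_k(cases, 1):.3f}")
print(f"Hit@3:  {hit_at_k(cases, 3):.3f}")
print(f"NDCG@3: {sum(ndcg_at_k(r, g, 3) for r, g in cases) / len(cases):.3f}")
```

With reciprocal ranks of 1, 0.5, and 0, the MRR for these three cases is 0.5, which shows how a single missed case pulls the average down.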

🛠️ Development

Running Tests

# Run tests (when implemented)
pytest tests/

# Run with coverage
pytest --cov=phenorank tests/

Code Formatting

# Format code
black phenorank/

# Check style
flake8 phenorank/

📝 Citation

If you use this software in your research, please cite:

Phenotype Prioritization System
ML-Enhanced Gene Ranking for Genomic Interpretation
https://github.com/schavan10/Phenotype-prioritization

🀝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

📄 License

This project is provided for educational and research purposes.

🆘 Troubleshooting

Common Issues

  1. HPO download fails: Check internet connection or use demo data
  2. ML training fails: Ensure LightGBM/XGBoost is installed correctly
  3. Memory issues: Reduce number of test cases or use smaller gene set

Getting Help

🎓 References

Key Papers

  1. Phenotype matching: Robinson et al., "The Human Phenotype Ontology" (2008)
  2. Phen2Gene: Zhao et al., "Phen2Gene: rapid phenotype-driven gene prioritization" (2020)
  3. Exomiser: Smedley et al., "Next-generation diagnostics and disease-gene discovery" (2015)
  4. Learning to Rank: Liu, "Learning to Rank for Information Retrieval" (2009)

Resources

