Making free science for everybody around the world 🌍
Stop using arbitrary thresholds! The new Adaptive Threshold Optimizer determines data-driven significance cutoffs for differential expression analysis.
from raptor.threshold_optimizer import optimize_thresholds
import pandas as pd
df = pd.read_csv('deseq2_results.csv')
result = optimize_thresholds(df, goal='discovery')
print(f"Optimal logFC: {result.logfc_threshold:.2f}")
print(f"Significant genes: {result.n_significant}")
print(f"\n{result.methods_text}")  # Publication-ready!

Key Features:
- Multiple p-value adjustment methods (BH, BY, Storey q-value, Holm, Bonferroni)
- Five logFC optimization methods (MAD, mixture model, power-based, percentile, consensus)
- π₀ estimation for true null proportion
- Three analysis goals: discovery, balanced, validation
- Auto-generated publication methods text
- Interactive dashboard integration
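Two of the building blocks listed above, Benjamini-Hochberg (BH) adjustment and π₀ estimation, are standard procedures that fit in a few lines of plain Python. The following is an illustrative sketch only, not RAPTOR's implementation: the function names `benjamini_hochberg` and `estimate_pi0`, the fixed `lam=0.5` cutoff, and the toy p-values are all assumptions made for this example.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return BH-adjusted p-values in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def estimate_pi0(pvalues: Sequence[float], lam: float = 0.5) -> float:
    """Storey-style estimate: share of p-values above lam, rescaled by 1 - lam."""
    m = len(pvalues)
    above = sum(1 for p in pvalues if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


# Toy p-values (hypothetical, for illustration only).
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9]
adj = benjamini_hochberg(pvals)
pi0 = estimate_pi0(pvals)
print([round(a, 3) for a in adj])
print(f"pi0 estimate: {pi0:.2f}")
```

A single fixed `lam` keeps the sketch short. Storey's method is often run over a grid of `lam` values with smoothing, which this example leaves out.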
RAPTOR is a comprehensive framework for benchmarking and optimizing RNA-seq differential expression analysis pipelines. Instead of guessing which pipeline works best for your data, RAPTOR provides evidence-based, ML-powered recommendations through systematic comparison of 8 popular pipelines.
| Challenge | RAPTOR Solution |
|---|---|
| Which pipeline should I use? | ✅ ML recommendations with 87% accuracy |
| What thresholds should I use? | ✅ Adaptive Threshold Optimizer (NEW!) |
| Is my data quality good enough? | ✅ Quality assessment with batch effect detection |
| How do I know results are reliable? | ✅ Ensemble analysis combining multiple pipelines |
| What resources do I need? | ✅ Resource monitoring with predictions |
| How do I present results? | ✅ Automated, publication-ready reports |
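The ensemble row in the table above rests on a simple idea: a gene is more trustworthy when several pipelines agree on it. The sketch below shows plain agreement-based voting under stated assumptions. The helper `consensus_genes` and the gene names are hypothetical, and the sketch counts unweighted votes, so it does not reproduce RAPTOR's `EnsembleAnalyzer` or its `weighted_vote` method.

```python
from collections import Counter
from typing import Dict, Iterable, List


def consensus_genes(calls: Dict[str, Iterable[str]], min_agreement: int = 2) -> List[str]:
    """Return genes called significant by at least min_agreement pipelines."""
    votes: Counter = Counter()
    for genes in calls.values():
        votes.update(set(genes))  # one vote per pipeline per gene
    return sorted(g for g, n in votes.items() if n >= min_agreement)


# Hypothetical significant-gene calls from three DE tools.
calls = {
    "deseq2": ["GENE_A", "GENE_B", "GENE_C"],
    "edger": ["GENE_B", "GENE_C", "GENE_D"],
    "limma": ["GENE_C", "GENE_E"],
}
consensus = consensus_genes(calls, min_agreement=2)
print(consensus)
```

Raising `min_agreement` trades sensitivity for confidence: with three tools, a value of 3 keeps only genes that every tool reports.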
# Install
pip install raptor-rnaseq
# Launch dashboard
raptor dashboard
# Opens at http://localhost:8501
# Upload data → Get ML recommendation → Use 🎯 Threshold Optimizer → Done!

# Profile your data and get ML recommendation
raptor profile --counts counts.csv --metadata metadata.csv --use-ml
# Run recommended pipeline
raptor run --pipeline 3 --data fastq/ --output results/
# Optimize thresholds (NEW!)
raptor optimize-thresholds --input results.csv --goal balanced
# Generate report
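raptor report --results results/ --output report.html

import pandas as pd
from raptor import RNAseqDataProfiler, MLPipelineRecommender
from raptor.threshold_optimizer import optimize_thresholds
# Load the count matrix and sample metadata
counts = pd.read_csv('counts.csv', index_col=0)
metadata = pd.read_csv('metadata.csv')
# Profile your data
profiler = RNAseqDataProfiler(counts, metadata)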
profile = profiler.run_full_profile()
# Get ML recommendation
recommender = MLPipelineRecommender()
recommendation = recommender.recommend(profile)
print(f"Recommended: Pipeline {recommendation['pipeline_id']}")
print(f"Confidence: {recommendation['confidence']:.1%}")
# After running pipeline, optimize thresholds (NEW!)
de_results = pd.read_csv('de_results.csv')
result = optimize_thresholds(de_results, goal='balanced')
print(f"Optimal |logFC|: {result.logfc_threshold:.2f}")
print(result.methods_text)

- Python: 3.8 or higher
- R: 4.0 or higher (for DE analysis)
- RAM: 8GB minimum (16GB recommended)
- Disk: 10GB free space
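Some of the requirements above can be checked before installing. The snippet below is a minimal standard-library sketch, not part of RAPTOR: `check_requirements` is a hypothetical helper, and it covers only the Python version and free disk space. Checking RAM and the R version would need extra tooling and is left out.

```python
import shutil
import sys
from typing import Dict, Tuple

MIN_PYTHON: Tuple[int, int] = (3, 8)  # minimum Python version from the list above
MIN_FREE_DISK_GB = 10                 # minimum free disk space from the list above


def check_requirements(path: str = ".") -> Dict[str, bool]:
    """Report whether the Python version and free disk space meet the minimums."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "python": sys.version_info[:2] >= MIN_PYTHON,
        "disk": free_gb >= MIN_FREE_DISK_GB,
    }


status = check_requirements()
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'below minimum'}")
```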
pip install raptor-rnaseq

With optional dependencies:
# With dashboard support
pip install raptor-rnaseq[dashboard]
# With all features
pip install raptor-rnaseq[all]

# Clone repository
git clone https://github.com/AyehBlk/RAPTOR.git
cd RAPTOR
# Install Python dependencies
pip install -r requirements.txt
# Verify installation
python install.py

conda env create -f environment.yml
conda activate raptor

RAPTOR benchmarks 8 RNA-seq analysis pipelines:
| ID | Pipeline | Aligner | Quantifier | DE Tool | Speed | ML Rank |
|---|---|---|---|---|---|---|
| 1 | STAR-RSEM-DESeq2 | STAR | RSEM | DESeq2 | ⭐⭐ | #2 |
| 2 | HISAT2-StringTie-Ballgown | HISAT2 | StringTie | Ballgown | ⭐⭐⭐ | #5 |
| 3 | Salmon-edgeR ⭐ | Salmon | Salmon | edgeR | ⭐⭐⭐⭐⭐ | #1 |
| 4 | Kallisto-Sleuth | Kallisto | Kallisto | Sleuth | ⭐⭐⭐⭐⭐ | #3 |
| 5 | STAR-HTSeq-limma | STAR | HTSeq | limma-voom | ⭐⭐ | #4 |
| 6 | STAR-featureCounts-NOISeq | STAR | featureCounts | NOISeq | ⭐⭐ | #6 |
| 7 | Bowtie2-RSEM-EBSeq | Bowtie2 | RSEM | EBSeq | ⭐⭐ | #7 |
| 8 | HISAT2-Cufflinks-Cuffdiff | HISAT2 | Cufflinks | Cuffdiff | ⭐ | #8 |
⭐ Pipeline 3 (Salmon-edgeR) is the ML model's most frequently recommended pipeline due to its optimal speed/accuracy balance.
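The table above can also be treated as data when scripting a pipeline choice. The sketch below transcribes the ID, name, Speed (as a star count), and ML Rank columns into a small structure and filters on them. The `Pipeline` type and the `shortlist` helper are hypothetical names for this example and are not part of the RAPTOR API.

```python
from typing import List, NamedTuple


class Pipeline(NamedTuple):
    id: int
    name: str
    speed: int    # number of stars in the Speed column
    ml_rank: int  # value of the ML Rank column


# Values transcribed from the pipeline table.
PIPELINES = [
    Pipeline(1, "STAR-RSEM-DESeq2", 2, 2),
    Pipeline(2, "HISAT2-StringTie-Ballgown", 3, 5),
    Pipeline(3, "Salmon-edgeR", 5, 1),
    Pipeline(4, "Kallisto-Sleuth", 5, 3),
    Pipeline(5, "STAR-HTSeq-limma", 2, 4),
    Pipeline(6, "STAR-featureCounts-NOISeq", 2, 6),
    Pipeline(7, "Bowtie2-RSEM-EBSeq", 2, 7),
    Pipeline(8, "HISAT2-Cufflinks-Cuffdiff", 1, 8),
]


def shortlist(min_speed: int = 1, top_n: int = 3) -> List[Pipeline]:
    """Return the top_n pipelines by ML rank that meet a minimum speed rating."""
    eligible = [p for p in PIPELINES if p.speed >= min_speed]
    return sorted(eligible, key=lambda p: p.ml_rank)[:top_n]


print([p.name for p in shortlist()])             # top three by ML rank
print([p.name for p in shortlist(min_speed=5)])  # only five-star speed
```

The `id` of a shortlisted entry is the number passed to `raptor run --pipeline`.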
RAPTOR/
├── raptor/ # Core Python package
│ ├── profiler.py # Data profiling
│ ├── recommender.py # Rule-based recommendations
│ ├── ml_recommender.py # ML recommendations
│ ├── threshold_optimizer/ # 🆕 Adaptive Threshold Optimizer (v2.1.2)
│ │ ├── __init__.py
│ │ ├── ato.py # Core ATO class
│ │ └── visualization.py # ATO visualizations
│ ├── data_quality_assessment.py
│ ├── ensemble_analysis.py
│ ├── resource_monitoring.py
│ └── ...
├── dashboard/ # Interactive web dashboard
├── pipelines/ # Pipeline configurations (8 pipelines)
├── scripts/ # Workflow scripts (00-10)
├── examples/ # Example scripts & demos
├── tests/ # Test suite
├── docs/ # Documentation
├── config/ # Configuration templates
├── install.py # Master installer
├── launch_dashboard.py # Dashboard launcher
├── requirements.txt # Python dependencies
└── setup.py # Package setup
| Document | Description |
|---|---|
| INSTALLATION.md | Detailed installation guide |
| QUICK_START.md | 5-minute quick start |
| DASHBOARD.md | Interactive dashboard guide |
| THRESHOLD_OPTIMIZER.md | 🆕 Adaptive threshold optimization |
| PROFILE_RECOMMEND.md | Data profiling & recommendations |
| QUALITY_ASSESSMENT.md | Quality scoring & batch effects |
| BENCHMARKING.md | Pipeline benchmarking |
| ENSEMBLE.md | Multi-pipeline ensemble analysis |
| RESOURCE_MONITORING.md | Resource tracking |
| CLOUD_DEPLOYMENT.md | AWS/GCP/Azure deployment |
| PIPELINES.md | Pipeline details & selection guide |
| API.md | Python API reference |
| FAQ.md | Frequently asked questions |
| CHANGELOG.md | Version history |
from raptor.threshold_optimizer import optimize_thresholds
import pandas as pd
# Load DE results
df = pd.read_csv('deseq2_results.csv')
# Optimize thresholds
result = optimize_thresholds(df, goal='balanced')
print(f"Optimal |logFC|: {result.logfc_threshold:.3f}")
print(f"Significant genes: {result.n_significant}")
print(f"π₀ estimate: {result.pi0:.3f}")
# Get publication methods text
print(result.methods_text)
# Save results
result.results_df.to_csv('optimized_results.csv')

from raptor import RNAseqDataProfiler, MLPipelineRecommender
from raptor.threshold_optimizer import optimize_thresholds
import pandas as pd
# 1. Profile data
counts = pd.read_csv('counts.csv', index_col=0)
metadata = pd.read_csv('metadata.csv')
profiler = RNAseqDataProfiler(counts, metadata, use_ml=True)
profile = profiler.profile(quality_check=True)
print(f"Quality Score: {profile['quality_score']}/100")
# 2. Get ML recommendation
recommender = MLPipelineRecommender()
recommendations = recommender.recommend(profile, n=3)
print(f"Recommended: {recommendations[0]['pipeline_name']}")
# 3. [Run recommended pipeline - produces DE results]
# raptor run --pipeline 3 ...
# 4. Optimize thresholds (NEW in v2.1.2)
de_results = pd.read_csv('deseq2_results.csv')
result = optimize_thresholds(
    de_results,
    logfc_col='log2FoldChange',
    pvalue_col='pvalue',
    goal='balanced'
)
print(f"\n🎯 Optimized Thresholds:")
print(f" LogFC: |{result.logfc_threshold:.3f}|")
print(f" Significant: {result.n_significant} genes")
# 5. Save results with methods text
result.results_df.to_csv('final_results.csv')
with open('methods.txt', 'w') as f:
    f.write(result.methods_text)

from raptor.ensemble_analysis import EnsembleAnalyzer
from raptor.threshold_optimizer import optimize_thresholds
# Combine results from multiple pipelines
analyzer = EnsembleAnalyzer()
consensus = analyzer.combine_results(
    results_dict={'deseq2': df1, 'edger': df2, 'limma': df3},
    method='weighted_vote',
    min_agreement=2
)
# Use ATO for uniform thresholds across ensemble
result = optimize_thresholds(consensus['combined'], goal='balanced')
print(f"Consensus DE genes: {result.n_significant}")

| Metric | Value |
|---|---|
| Overall Accuracy | 87% |
| Top-3 Accuracy | 96% |
| Prediction Time | <0.1s |
| Training Data | 10,000+ analyses |
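To make the two accuracy rows concrete, the sketch below assumes the usual definition of top-k accuracy: a prediction counts as correct when the truly best pipeline appears among the first k recommendations. How RAPTOR computes these metrics is not stated here, so the definition, the function name `top_k_accuracy`, and the toy data are assumptions. The toy numbers are not RAPTOR results.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[int]],
    best: Sequence[int],
    k: int,
) -> float:
    """Fraction of datasets whose best pipeline is among the first k predictions."""
    hits = sum(
        1 for ranked, truth in zip(ranked_predictions, best) if truth in ranked[:k]
    )
    return hits / len(best)


# Toy data: one ranked list of pipeline IDs per dataset, plus the true best ID.
ranked = [[3, 1, 4], [1, 3, 5], [4, 3, 1], [3, 4, 1]]
truth = [3, 3, 2, 3]

top1 = top_k_accuracy(ranked, truth, k=1)
top3 = top_k_accuracy(ranked, truth, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

Top-3 accuracy can never fall below top-1 accuracy, since every top-1 hit is also a top-3 hit.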
| Metric | Traditional | With ATO |
|---|---|---|
| Threshold justification | Arbitrary | Data-driven |
| Methods text | Manual | Auto-generated |
| False positives | Higher | Optimized |
| Reproducibility | Variable | Standardized |
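As an illustration of what "data-driven" can mean in the table above, the sketch below derives a logFC cutoff from the median absolute deviation (MAD) of the observed fold changes instead of a fixed value such as 1.0. This is a minimal sketch under assumptions, not the ATO algorithm: the function name `mad_logfc_threshold`, the multiplier `k=2.0`, the 1.4826 normal-consistency scaling, and the toy fold changes are all chosen for the example.

```python
import statistics
from typing import Sequence


def mad_logfc_threshold(logfc: Sequence[float], k: float = 2.0) -> float:
    """Return k times the scaled MAD of the log fold changes around their median."""
    center = statistics.median(logfc)
    deviations = [abs(x - center) for x in logfc]
    mad = statistics.median(deviations)
    return k * 1.4826 * mad  # 1.4826 makes the MAD comparable to a normal SD


# Toy log2 fold changes: mostly near zero, with two strong changes.
logfc = [-0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 2.5, -3.0, 0.05, -0.05]
threshold = mad_logfc_threshold(logfc)
n_significant = sum(1 for x in logfc if abs(x) > threshold)
print(f"|logFC| threshold: {threshold:.3f}, genes above it: {n_significant}")
```

Because the MAD is robust to outliers, the two strong changes barely move the cutoff, which is one reason such estimators suit this kind of thresholding.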
We welcome contributions! RAPTOR is open-source and aims to make free science accessible to everyone.
# Fork and clone
git clone https://github.com/YOUR_USERNAME/RAPTOR.git
# Create feature branch
git checkout -b feature/amazing-feature
# Make changes and test
pytest tests/
# Submit pull request

See CONTRIBUTING.md for guidelines.
If you use RAPTOR in your research, please cite:
@software{bolouki2025raptor,
author = {Bolouki, Ayeh},
title = {RAPTOR: RNA-seq Analysis Pipeline Testing and Optimization Resource},
year = {2025},
version = {2.1.1},
publisher = {Zenodo},
doi = {10.5281/zenodo.17607161},
url = {https://github.com/AyehBlk/RAPTOR}
}

This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) 2025 Ayeh Bolouki
Ayeh Bolouki
- 🏛️ GIGA, University of Liège, Belgium
- 📧 Email: [email protected]
- 🐙 GitHub: @AyehBlk
- 🔬 Research: Computational Biology, Bioinformatics, Multi-omics Analysis

Acknowledgments:
- The Bioconductor community for the R package ecosystem
- All users who provided feedback
⭐ Star this repository if you find RAPTOR useful!
RAPTOR v2.1.2 - Making pipeline selection evidence-based, not guesswork 🦖