Name	Name	Last commit message	Last commit date
parent directory ..
01_quick_count.py	01_quick_count.py
02_quality_assessment.py	02_quality_assessment.py
03_data_profiler.py	03_data_profiler.py
04_recommender.py	04_recommender.py
05_production_pipeline.py	05_production_pipeline.py
07_DE_Import.py	07_DE_Import.py
08_Parameter_Optimization.py	08_Parameter_Optimization.py
09_Ensemble_Analysis.py	09_Ensemble_Analysis.py
10_qc_scenarios.py	10_qc_scenarios.py
11_recommender_scenarios.py	11_recommender_scenarios.py
README.md	README.md
example_biomarker_discovery.py	example_biomarker_discovery.py

RAPTOR Example Scripts

Complete User & Developer Guide (v2.2.0)

Author: Ayeh Bolouki
Package: raptor-rnaseq (PyPI)
Updated: 17 March 2025

Overview

This folder contains example scripts demonstrating the complete RAPTOR RNA-seq analysis workflow. Each script is self-contained with demo mode and can also accept real data.

Script Inventory

Script	Module	What It Does	Needs Tools?
`01_quick_count.py`	M1	FASTQ → quick counts (Salmon/Kallisto)	Yes (Salmon or Kallisto)
`02_quality_assessment.py`	M2	Count matrix → QC report (0-100 score)	No
`03_data_profiler.py`	M3	Count matrix → 32 statistical features	No
`04_recommender.py`	M4	Profile → pipeline recommendation	No
`05_production_pipeline.py`	M5	FASTQ → production counts (STAR/HISAT2)	Yes (STAR, HISAT2, etc.)
`07_DE_Import.py`	M7	R DE results → standardized format	No
`08_Parameter_Optimization.py`	M8	DE results → optimized thresholds	No
`09_Ensemble_Analysis.py`	M9	Multiple DE results → consensus genes	No
`10_qc_scenarios.py`	M2	Stress test: 4 QC scenarios	No
`11_recommender_scenarios.py`	M4	Stress test: 5 recommender scenarios	No

Note: M6 (R-based DE analysis) is performed in R, not Python.
Note: M10 (Biomarker Discovery) is planned for v3.0.0.

Quick Start

Prerequisites

# Install RAPTOR
pip install raptor-rnaseq

# Or editable install (for development)
cd RAPTOR-main
pip install -e .

# Verify
raptor --version
python -c "import raptor; print(raptor.__version__)"

Windows Users

Set UTF-8 encoding before running any example (Windows terminals use cp1252 by default which cannot render Unicode characters in the scripts):

$env:PYTHONIOENCODING = "utf-8"

Run the Full Chain (Demo Mode)

No real data needed — each script generates simulated data internally:

# Stage 1: Fast Profiling (M2 → M3 → M4)
python examples/02_quality_assessment.py                    # Creates results/qc/
python examples/03_data_profiler.py                         # Creates results/profile.json
python examples/04_recommender.py --profile results/profile.json  # Creates results/recommendation.json

# Stage 3: DE Analysis (M7 → M8 → M9)
python examples/07_DE_Import.py --demo                      # Creates results/de_imported/
python examples/08_Parameter_Optimization.py --demo         # Creates results/parameter_optimization/
python examples/09_Ensemble_Analysis.py --demo              # Creates results/ensemble_analysis/

Run with Real Data

# M2: QC with your count matrix
python examples/02_quality_assessment.py --counts my_counts.csv --metadata my_metadata.csv

# M3: Profile your data
python examples/03_data_profiler.py --counts results/qc/counts_clean.csv

# M4: Get recommendation based on your profile
python examples/04_recommender.py --profile results/profile.json

# M7: Import your R analysis results
python examples/07_DE_Import.py --de-file deseq2_results.csv --pipeline deseq2

# M8: Optimize parameters using M7 output
python examples/08_Parameter_Optimization.py --de-result results/de_imported/de_result.pkl

# M9: Combine multiple DE methods
python examples/09_Ensemble_Analysis.py --de-results results/de_imported/de_result.pkl

Workflow Architecture

STAGE 1: Fast Profiling (Python only, ~5 min)
═══════════════════════════════════════════════
  M1: FASTQ files ──→ quick_gene_counts.csv     (requires Salmon/Kallisto)
           │
  M2: counts.csv ──→ qc_results.json             02_quality_assessment.py
           │           counts_clean.csv
  M3: counts.csv ──→ profile.json (32 features)  03_data_profiler.py
           │
  M4: profile.json ──→ recommendation.json        04_recommender.py
           │             (DESeq2/edgeR/limma)


STAGE 2: Production Quantification (requires tools, ~1-2 hrs)
═══════════════════════════════════════════════════════════════
  M5: FASTQ files ──→ gene_counts.csv            05_production_pipeline.py
                       tx_counts.csv


STAGE 3: DE Analysis (R + Python, ~30 min)
══════════════════════════════════════════
  M6: counts ──→ deseq2_results.csv               (run in R)
                  edger_results.csv
                  limma_results.csv
           │
  M7: R output ──→ de_standardized.csv             07_DE_Import.py
           │        de_result.pkl
  M8: de_result ──→ optimized_parameters.json      08_Parameter_Optimization.py
           │
  M9: multiple ──→ consensus_genes.csv             09_Ensemble_Analysis.py
       methods      method_comparison.csv

Module Output Chain

Each module's output feeds into the next:

M2 output (counts_clean.csv) → M3 input
M3 output (profile.json)     → M4 input
M7 output (de_result.pkl)    → M8 input
M7 output (de_result.pkl)    → M9 input (multiple files)

Script Details

02_quality_assessment.py (Module 2)

Purpose: Comprehensive QC scoring with 6 assessment components.

# Default (uses M1 output)
python examples/02_quality_assessment.py

# Custom input
python examples/02_quality_assessment.py --counts my_counts.csv --metadata my_meta.csv --output results/my_qc

# CLI equivalent
raptor qc --counts my_counts.csv --metadata my_meta.csv

Output: results/qc/

qc_results.json — Full QC report with scores per component
counts_clean.csv — Cleaned count matrix (outliers removed)

QC Components:

Library Quality — library size variation, sequencing depth
Gene Detection — detection rate, zero inflation
Outlier Detection — Mahalanobis distance on PCA
Variance Structure — PCA variance explained
Batch Effects — batch-condition confounding (requires metadata)
Biological Signal — DE signal strength

Quality Score Interpretation:

90-100: Excellent — proceed with confidence
70-89: Good — minor issues, safe to proceed
50-69: Acceptable — review warnings before proceeding
<50: Poor — address issues before DE analysis

03_data_profiler.py (Module 3)

Purpose: Extract 32 statistical features for ML-based pipeline recommendation.

# Uses M2 output
python examples/03_data_profiler.py --counts results/qc/counts_clean.csv

# CLI equivalent
raptor profile --counts results/qc/counts_clean.csv

Output: results/profile.json — 32 features including:

Sample characteristics (n_samples, n_genes, n_groups, balance)
Library size metrics (mean, CV, range, depth category)
Gene detection (detection rate, reliable expression)
Expression distribution (mean, variance, skewness on log2 scale)
Dispersion (BCV, common dispersion, overdispersion — critical for M4)
Sparsity (zero proportion, zero inflation index)
Mean-variance relationship (slope, Poisson/NB fit scores)
Quality indicators (outliers, batch effects)

04_recommender.py (Module 4)

Purpose: Recommend the optimal DE analysis pipeline based on data characteristics.

# From profile (recommended)
python examples/04_recommender.py --profile results/profile.json

# From counts (auto-profiles first)
python examples/04_recommender.py --counts results/qc/counts_clean.csv

# CLI equivalent
raptor recommend --profile results/profile.json

Output: results/recommendation.json — Includes:

Primary pipeline (DESeq2, edgeR, limma-voom, or Wilcoxon)
Alternative pipeline
Confidence scores (0-100) for all 5 pipelines
Decision factors and warnings
R code snippets for the recommended pipeline

How the recommender decides:

Data Characteristic	Favors
Small samples (n < 8)	DESeq2 (best shrinkage)
Large samples (n > 20)	limma-voom (fast, powerful)
High dispersion / sparsity	edgeR (robust NB model)
Outlier samples detected	edgeR_robust
Very large samples (n > 50)	Wilcoxon (non-parametric)

07_DE_Import.py (Module 7)

Purpose: Import and standardize DE results from R analysis tools.

# Demo mode
python examples/07_DE_Import.py --demo

# Import DESeq2 results
python examples/07_DE_Import.py --de-file deseq2_results.csv --pipeline deseq2

# Auto-detect pipeline format
python examples/07_DE_Import.py --de-file my_de_results.csv

# CLI equivalent
raptor import-de --de-file deseq2_results.csv

Supported R formats:

Pipeline	Expected Columns
DESeq2	gene_id, baseMean, log2FoldChange, lfcSE, stat, pvalue, padj
edgeR	gene_id, logFC, logCPM, LR, PValue, FDR
limma-voom	gene_id, logFC, AveExpr, t, P.Value, adj.P.Val, B
Wilcoxon	gene_id, logFC, pvalue, padj

Output: results/de_imported/

de_standardized.csv — All genes, standardized column names
de_significant.csv — Significant genes only (FDR < 0.05)
de_summary.json — Summary statistics
de_result.pkl — Serialized DEResult object for M8/M9

08_Parameter_Optimization.py (Module 8)

Purpose: Find optimal FDR and log2FC thresholds for DE analysis.

# Demo mode
python examples/08_Parameter_Optimization.py --demo

# From M7 output
python examples/08_Parameter_Optimization.py --de-result results/de_imported/de_result.pkl

# CLI equivalent
raptor optimize --de-result results/de_imported/de_result.pkl

Optimization methods:

FDR Control — Statistical FDR estimation (Storey & Tibshirani, 2003)
Ground Truth — Uses validated gene lists (if available)
Stability — Cross-validation stability of gene lists
Reproducibility — Agreement between cohorts

Output: results/parameter_optimization/

optimization_results.json — Optimal alpha and LFC thresholds

Example result: alpha=0.03, LFC=0.25 means use results(dds, alpha=0.03, lfcThreshold=0.25) in R.

09_Ensemble_Analysis.py (Module 9)

Purpose: Combine DE results from multiple methods into high-confidence consensus gene lists.

# Demo mode
python examples/09_Ensemble_Analysis.py --demo

# CLI equivalent
raptor ensemble --de-results results/de_imported/de_result.pkl

5 Ensemble methods:

Method	Approach	Best For
Voting	Gene must pass ≥2 methods	High-confidence publication lists
Fisher	Combine p-values (Fisher, 1925)	Maximum sensitivity / exploration
Brown	Correlation-aware p-value combination	Conservative, accounts for method similarity
RRA	Robust Rank Aggregation (Kolde, 2012)	Robust to outlier methods
Weighted	Performance-weighted combination	When you have validation data

Output: results/ensemble_analysis/

voting/consensus_genes.csv
fisher/consensus_genes.csv
brown/consensus_genes.csv
rra/consensus_genes.csv
weighted/consensus_genes.csv
method_comparison.csv — Side-by-side comparison
ensemble_summary.json

10_qc_scenarios.py (Stress Test)

Purpose: Verify RAPTOR QC handles different RNA-seq data quality problems.

python examples/10_qc_scenarios.py

4 Scenarios tested:

Scenario	What It Simulates	Expected Detection
Gold Standard	8 samples, 200 DE genes, balanced batches	Score > 85, all green
Outlier + Library Issues	1 sample at 10%, library sizes vary 5x	Library quality warning
Confounded Batch	Batch1 = Controls, Batch2 = Treatments	Confounding warning
Sparse Data	70% zeros, 4 samples	Zero inflation, low signal

11_recommender_scenarios.py (Stress Test)

Purpose: Verify the recommender adapts to different data characteristics.

python examples/11_recommender_scenarios.py

5 Scenarios tested:

Scenario	Expected Primary	Why
Small samples (n=3)	DESeq2	Best shrinkage for few samples
Large samples (n=20)	limma-voom	Fast, powerful at large n
High sparsity (60% zeros)	edgeR	Handles sparse data best
Outlier present	edgeR_robust	Robust to outlier samples
Confounded batch	Any + warning	Must warn about confounding

CLI Equivalents

Every example script has a CLI equivalent using raptor commands:

# Module 2: Quality Assessment
raptor qc --counts counts.csv --metadata metadata.csv --output results/qc

# Module 3: Data Profiling
raptor profile --counts counts.csv --output results

# Module 4: Pipeline Recommendation
raptor recommend --counts counts.csv
raptor recommend --profile results/profile.json

# Module 5: Pipeline Management
raptor pipeline list
raptor pipeline run --pipeline salmon --sample-sheet samples.csv --index /path/to/index

# Module 7: DE Import
raptor import-de --de-file deseq2_results.csv --pipeline deseq2

# Module 8: Parameter Optimization
raptor optimize --de-result results/de_imported/de_result.pkl

# Module 9: Ensemble Analysis
raptor ensemble --de-results results/de_imported/de_result.pkl

# General
raptor --version
raptor --help

Input File Formats

Count Matrix (CSV)

,Sample1,Sample2,Sample3,Sample4
ENSG00000000001,150,230,180,210
ENSG00000000002,45,62,38,55
ENSG00000000003,0,3,1,2

First column: gene IDs (Ensembl, gene symbol, or any identifier)
Remaining columns: raw integer counts per sample
No negative values
Minimum: 10 genes, 2 samples

Metadata (CSV)

sample_id,condition,batch
Sample1,Control,Batch1
Sample2,Control,Batch1
Sample3,Treatment,Batch2
Sample4,Treatment,Batch2

sample_id: must match column names in count matrix
condition: experimental groups (required for M4 recommendation)
batch: batch information (optional, used for batch effect detection)

DE Results from R (CSV)

See Module 7 section for expected column formats per pipeline.

Output Directory Structure

After running all examples:

results/
├── quick_counts/              # M1 output
│   ├── quick_gene_counts.csv
│   ├── quick_tpm.csv
│   └── sample_info.csv
├── qc/                        # M2 output
│   ├── qc_results.json
│   └── counts_clean.csv
├── profile.json               # M3 output
├── recommendation.json        # M4 output
├── de_imported/               # M7 output
│   ├── de_standardized.csv
│   ├── de_significant.csv
│   ├── de_summary.json
│   └── de_result.pkl
├── parameter_optimization/    # M8 output
│   └── fdr_control/
│       └── optimization_results.json
└── ensemble_analysis/         # M9 output
    ├── voting/consensus_genes.csv
    ├── fisher/consensus_genes.csv
    ├── brown/consensus_genes.csv
    ├── rra/consensus_genes.csv
    ├── weighted/consensus_genes.csv
    ├── method_comparison.csv
    └── ensemble_summary.json

Troubleshooting

Windows: UnicodeEncodeError (cp1252)

UnicodeEncodeError: 'charmap' codec can't encode characters

Fix: Set UTF-8 encoding before running:

$env:PYTHONIOENCODING = "utf-8"

Module Not Found / Import Error

ModuleNotFoundError: No module named 'raptor'

Fix: Install in editable mode:

cd RAPTOR-main
pip install -e .

Count File Not Found

Count file not found: results/quick_counts/quick_gene_counts.csv

Fix: Either run M1 first, or create simulated data:

import pandas as pd, numpy as np, os
os.makedirs('results/quick_counts', exist_ok=True)
np.random.seed(42)
counts = pd.DataFrame(
    np.random.negative_binomial(10, 0.05, (1000, 6)),
    index=[f'Gene_{i}' for i in range(1000)],
    columns=[f'Sample_{i}' for i in range(1, 7)]
)
counts.to_csv('results/quick_counts/quick_gene_counts.csv')

--demo Flag

Scripts M7, M8, M9 require --demo flag when running without real DE results:

python examples/07_DE_Import.py --demo
python examples/08_Parameter_Optimization.py --demo
python examples/09_Ensemble_Analysis.py --demo

For Developers

Adding a New Example Script

Name it XX_description.py where XX is the module number
Include demo mode (--demo flag) that generates simulated data
Use argparse with --help documentation
Import from raptor package (with try/except fallback for demo mode)
Save output to results/ following the directory convention
Print a completion summary with next steps
Test on both Windows (PowerShell) and Linux

Testing All Examples

# Set encoding (Windows only)
$env:PYTHONIOENCODING = "utf-8"

# Run all examples
python examples/02_quality_assessment.py
python examples/03_data_profiler.py
python examples/04_recommender.py --profile results/profile.json
python examples/07_DE_Import.py --demo
python examples/08_Parameter_Optimization.py --demo
python examples/09_Ensemble_Analysis.py --demo

# Run stress tests
python examples/10_qc_scenarios.py
python examples/11_recommender_scenarios.py

Common Example Script Bugs

These patterns caused bugs in v2.2.0 example scripts — avoid in new scripts:

Bug Pattern	Example	Fix
Wrong dataclass field names	`EnsembleResult(method=...)`	Check `ClassName.__dataclass_fields__.keys()`
`to_dict()` on a dict	`result.to_dict() if hasattr(...)`	Check `isinstance(result, dict)` first
Wrong attribute names in display	`getattr(profile, 'lib_size_mean')`	Check actual DataProfile fields
numpy int64 in json.dump	`json.dump(data)` crashes	Always use `json.dump(data, default=str)`
Windows path in assertion	`'results/quick_counts' in str(path)`	Use `'quick_counts' in str(path)`

Version History

v2.2.0 (March 2025): Initial release with all example scripts
v2.2.0-fix (March 2025): Fixed display bugs in 02, 03, 08, 09 example scripts
Scripts 10, 11 added for stress testing QC and recommender modules

RAPTOR: Making free science for everybody around the world

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

RAPTOR Example Scripts

Complete User & Developer Guide (v2.2.0)

Overview

Script Inventory

Quick Start

Prerequisites

Windows Users

Run the Full Chain (Demo Mode)

Run with Real Data

Workflow Architecture

Module Output Chain

Script Details

02_quality_assessment.py (Module 2)

03_data_profiler.py (Module 3)

04_recommender.py (Module 4)

07_DE_Import.py (Module 7)

08_Parameter_Optimization.py (Module 8)

09_Ensemble_Analysis.py (Module 9)

10_qc_scenarios.py (Stress Test)

11_recommender_scenarios.py (Stress Test)

CLI Equivalents

Input File Formats

Count Matrix (CSV)

Metadata (CSV)

DE Results from R (CSV)

Output Directory Structure

Troubleshooting

Windows: UnicodeEncodeError (cp1252)

Module Not Found / Import Error

Count File Not Found

--demo Flag

For Developers

Adding a New Example Script

Testing All Examples

Common Example Script Bugs

Version History

FilesExpand file tree

examples

Directory actions

More options

Directory actions

More options

Latest commit

History

examples

Folders and files

parent directory

README.md

RAPTOR Example Scripts

Complete User & Developer Guide (v2.2.0)

Overview

Script Inventory

Quick Start

Prerequisites

Windows Users

Run the Full Chain (Demo Mode)

Run with Real Data

Workflow Architecture

Module Output Chain

Script Details

02_quality_assessment.py (Module 2)

03_data_profiler.py (Module 3)

04_recommender.py (Module 4)

07_DE_Import.py (Module 7)

08_Parameter_Optimization.py (Module 8)

09_Ensemble_Analysis.py (Module 9)

10_qc_scenarios.py (Stress Test)

11_recommender_scenarios.py (Stress Test)

CLI Equivalents

Input File Formats

Count Matrix (CSV)

Metadata (CSV)

DE Results from R (CSV)

Output Directory Structure

Troubleshooting

Windows: UnicodeEncodeError (cp1252)

Module Not Found / Import Error

Count File Not Found

--demo Flag

For Developers

Adding a New Example Script

Testing All Examples

Common Example Script Bugs

Version History