Protein Feature Enrichment Score (PFES)

Protein Feature Enrichment Score (PFES) is an interpretable framework for missense variant interpretation. Rather than outputting a single pathogenicity probability, PFES quantifies the degree to which a variant's protein characteristics statistically resemble known pathogenic or benign variation, and decomposes that signal into six feature attribute categories: physicochemical properties, 3D structure, domain/region annotations, functional sites, post-translational modifications, and protein-protein interactions.

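This repository contains the data, analysis notebooks, and source code accompanying the manuscript "Human Proteome-wide Mechanistic Interpretation of Missense Mutations through Protein Feature Enrichment Score" (bioRxiv, 2025).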

Repository Structure

missense-pfes/
├── data/                        # Preprocessed protein feature annotations for case/control datasets
├── notebooks/                   # Analysis notebooks reproducing key figures and results
├── results/                     # Output files: precomputed enrichment odds ratios and p-values, scored case and control datasets from the PFES batch scorer, and an example output folder from the PFES Colab notebook
├── src/pfes/                    # PFES batch scorer (installable as a CLI tool)
└── glossary_protein_feature.md  # Definitions of all 103 protein features used in PFES

Quick Start: Query one or more variants

The easiest way to use PFES is through our Google Colab notebook (notebooks/G2P_PFES.ipynb), which requires no local installation.

The notebook returns two outputs for the queried variants:

PFES report — overall PFES, PFES partitioning category with statistical significance, and a breakdown of contributions across six attribute categories with plain-language interpretation of each enriched feature
Protein-wide mutational landscape — a heatmap of PFES across all possible amino acid substitutions in the queried protein, decomposed by attribute category, showing where the variant of interest sits within the full substitution space
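To make "all possible amino acid substitutions" concrete, the sketch below enumerates the substitution grid for a sequence: every position paired with each of the 19 alternative residues. It is only an illustration of the size and shape of that space. The function name and the toy sequence are hypothetical, and the Colab notebook may build its landscape differently.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, single-letter codes


def all_missense_substitutions(sequence):
    """Yield (ResID, RefAA, AltAA) for every single-residue substitution.

    ResID is 1-based, matching the usual residue numbering convention.
    """
    for res_id, ref in enumerate(sequence, start=1):
        for alt in AMINO_ACIDS:
            if alt != ref:  # skip synonymous "substitutions"
                yield res_id, ref, alt


toy_sequence = "MKT"  # hypothetical 3-residue peptide, not a real protein
grid = list(all_missense_substitutions(toy_sequence))
print(len(grid))  # 3 positions x 19 alternatives = 57
```

A protein of length N therefore yields 19 x N candidate substitutions, which is the grid a protein-wide heatmap has to cover.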

PFES Batch Scorer

src/pfes/scorer.py computes PFES scores for a batch of missense variants from a TSV/CSV input file. It fetches protein feature data via the g2papi package and the enrichment table from this repository.

Installation

pip install missense-pfes

Or install from source:

git clone https://github.com/broadinstitute/missense-pfes.git
cd missense-pfes
pip install .

Input format

The input file should be a TSV or CSV with the following columns:

Column	Description
`Gene`	HGNC gene symbol
`UniProt`	UniProt accession
`ResID`	Residue position
`RefAA`	Reference amino acid (single-letter)
`AltAA`	Alternate amino acid (single-letter)
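An input file in this layout can be written with the Python standard library. The following is a minimal sketch, assuming a tab-separated file and the five column names from the table above; the helper names and the validation rules (standard single-letter codes, RefAA different from AltAA, positive ResID) are illustrative choices, not checks performed by the scorer itself.

```python
import csv

# Column names taken from the input-format table above.
INPUT_COLUMNS = ["Gene", "UniProt", "ResID", "RefAA", "AltAA"]
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def validate_variant(variant):
    """Raise ValueError if a variant dict does not look like a missense change."""
    for col in INPUT_COLUMNS:
        if col not in variant:
            raise ValueError(f"Missing column {col!r}: {variant}")
    if variant["RefAA"] not in AMINO_ACIDS or variant["AltAA"] not in AMINO_ACIDS:
        raise ValueError(f"Not a single-letter amino acid code: {variant}")
    if variant["RefAA"] == variant["AltAA"]:
        raise ValueError(f"RefAA equals AltAA, not a missense change: {variant}")
    if int(variant["ResID"]) < 1:
        raise ValueError(f"ResID must be a positive integer: {variant}")


def write_pfes_input(variants, path):
    """Validate every variant first, then write them all as a TSV file."""
    variants = list(variants)
    for variant in variants:
        validate_variant(variant)
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=INPUT_COLUMNS, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for variant in variants:
            writer.writerow({col: variant[col] for col in INPUT_COLUMNS})


# Illustrative variant only: TP53 R175H (UniProt P04637).
example = [{"Gene": "TP53", "UniProt": "P04637", "ResID": 175, "RefAA": "R", "AltAA": "H"}]
```

Calling write_pfes_input(example, "variants.tsv") produces a file that can be passed to pfes with -i. Validating before opening the file avoids leaving a half-written input behind when one row is malformed.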

Usage

usage: pfes [-h] -i INPUT -o OUTPUT [--rerun] [--log LOG] [--workers WORKERS]

Compute PFES scores for a batch of missense variants.

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input TSV or CSV (columns: Gene, UniProt, ResID, RefAA, AltAA).
                        When --rerun is set, use the previous output file.
  -o OUTPUT, --output OUTPUT
                        Output TSV or CSV path (format inferred from extension).
  --rerun               Only process rows where PFES is NaN (fill missing scores).
  --log LOG             Path for error log (default: pfes_errors.log).
  --workers WORKERS     Number of parallel workers (default: 1).

# Basic usage
pfes -i variants.tsv -o scored_variants.tsv

# Parallel processing
pfes -i variants.tsv -o scored_variants.tsv --workers 4

# Specify a custom error log path
pfes -i variants.tsv -o scored_variants.tsv --log my_errors.log

Output retains the five input columns (Gene, UniProt, ResID, RefAA, AltAA) and appends seven score columns: PFES, PFES_Physicochemical, PFES_Structure, PFES_Domain, PFES_Function, PFES_Modification, and PFES_PPI.
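Before passing scored output to downstream analysis, it can help to confirm that all twelve columns are present. The sketch below checks only the header line; it assumes a tab-separated output file, and the function name is hypothetical.

```python
import csv

# The five input columns and seven score columns named in the text above.
INPUT_COLUMNS = ["Gene", "UniProt", "ResID", "RefAA", "AltAA"]
SCORE_COLUMNS = [
    "PFES",
    "PFES_Physicochemical",
    "PFES_Structure",
    "PFES_Domain",
    "PFES_Function",
    "PFES_Modification",
    "PFES_PPI",
]


def missing_output_columns(path):
    """Return the expected column names that are absent from the file header."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"), [])
    return [col for col in INPUT_COLUMNS + SCORE_COLUMNS if col not in header]
```

An empty list from missing_output_columns("scored_variants.tsv") means the header matches the layout described above; for CSV output, the delimiter would need to be changed to a comma.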

Handling failures and missing scores

Scoring requires fetching protein feature data from g2papi at runtime. Individual proteins or variants may fail due to missing data or network issues, in which case their PFES columns are left as NaN in the output. Details of each failure — including the gene, UniProt accession, residue, and error message — are written to a log file (pfes_errors.log by default).

To retry only the failed variants without re-running the entire dataset, pass the previous output file as input with --rerun:

# Fill in missing scores from a previous run
pfes -i scored_variants.tsv -o scored_variants.tsv --rerun

This fills in any rows where PFES is NaN and leaves successfully scored rows untouched.
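The retry step can be wrapped in a small script. This is a sketch under several assumptions: the output is tab-separated, a missing score appears either as an empty field or as a string that parses to NaN, and pfes exits with status 0 even when some variants remain unscored. The function names and the max_attempts parameter are hypothetical; only the pfes flags come from the usage text above.

```python
import csv
import math
import subprocess


def count_missing(path):
    """Count rows whose PFES value is empty, unparseable, or NaN."""
    missing = 0
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            raw = (row.get("PFES") or "").strip()
            try:
                value = float(raw)
            except ValueError:  # empty string or a marker such as "NA"
                value = math.nan
            if math.isnan(value):
                missing += 1
    return missing


def rerun_until_complete(path, max_attempts=3):
    """Invoke `pfes --rerun` until nothing is missing or attempts run out.

    Returns the number of rows still unscored afterwards.
    """
    for _ in range(max_attempts):
        if count_missing(path) == 0:
            break
        # Same command as shown above: previous output is both input and output.
        subprocess.run(["pfes", "-i", path, "-o", path, "--rerun"], check=True)
    return count_missing(path)
```

The attempt limit matters because only transient failures such as network issues can succeed on retry; variants that fail due to missing data will stay NaN however many times the scorer is rerun.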

Data

The data/ directory contains preprocessed protein feature annotations for the pathogenic (case) and control (benign + common population) variant datasets used in the paper. See glossary_protein_feature.md for definitions and sources of all 103 protein features.

File	Description
`data/case_annotation.tsv`	Protein feature annotations for 85,321 pathogenic variants
`data/control_annotation.tsv`	Protein feature annotations for 130,832 control variants

Notebooks

Notebook	Description
`notebooks/enrichment_by_protein_class.ipynb`	Runs Fisher's exact test enrichment analysis within each PANTHER protein functional class and visualizes odds ratios. Generates `results/enrichment_OR_by_protein_class.csv`.
`notebooks/PFES_empirical_stats.ipynb`	Computes empirical PFES distributions and derives partitioning thresholds (PF-Enriched/Neutral/Depleted) from the scored case and control datasets.
`notebooks/G2P_PFES.ipynb`	Colab notebook for querying PFES scores and reports for individual variants of a gene of interest.
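For readers unfamiliar with the enrichment test named above, the sketch below computes an odds ratio and a two-sided Fisher's exact p-value for a single 2x2 table using only the standard library. The table layout (rows: case/control; columns: feature present/absent) and the toy counts are assumptions for illustration, not values from the repository, and the notebook may rely on a library routine such as scipy.stats.fisher_exact instead.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c) for the table [[a, b], [c, d]].

    a: cases with feature      b: cases without feature
    c: controls with feature   d: controls without feature
    """
    return (a * d) / (b * c) if b * c else float("inf")


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value: total probability of all tables with the same
    margins that are no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x cases carrying the feature.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)  # tolerance for float ties
    )


# Toy counts, made up for illustration.
print(odds_ratio(3, 1, 1, 3))  # 9.0
print(fisher_exact_two_sided(3, 1, 1, 3))
```

An odds ratio above 1 indicates the feature is more common among case variants than controls; with counts this small the p-value is far from significant, which is why the test is exact rather than approximate.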

Results

File	Description
`results/case_pfes.tsv`	The scored (overall and per attribute) pathogenic variants, generated by the PFES batch scorer
`results/control_pfes.tsv`	The scored (overall and per attribute) control variants, generated by the PFES batch scorer
`results/pfes_pvalue_lookup.tsv`	Empirical p-value lookup table for variant partitioning, generated by `PFES_empirical_stats.ipynb`
`results/enrichment_OR_by_protein_class.csv`	Odds ratios from Fisher's exact tests for all 103 protein features across 20 PANTHER protein functional classes, generated by `enrichment_by_protein_class.ipynb`
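The empirical p-value lookup listed above is derived in PFES_empirical_stats.ipynb, and its exact definition is not given here. As a general illustration of the idea, the sketch below shows one common form of an upper-tail empirical p-value: the fraction of reference scores at least as large as the observed score, with a +1 correction so the result is never zero. The function name, the choice of tail, the correction, and the reference values are all assumptions, not the notebook's method.

```python
from bisect import bisect_left


def empirical_p_upper(score, sorted_reference):
    """Upper-tail empirical p-value of `score` against a sorted reference sample.

    Computes (number of reference scores >= score, plus 1) / (sample size + 1).
    """
    n_greater_equal = len(sorted_reference) - bisect_left(sorted_reference, score)
    return (n_greater_equal + 1) / (len(sorted_reference) + 1)


# Made-up reference scores, sorted once so each lookup is a binary search.
reference = sorted([0.4, 0.1, 0.3, 0.2])
print(empirical_p_upper(0.35, reference))  # (1 + 1) / (4 + 1) = 0.4
```

Sorting the reference distribution once and answering each query by binary search is what makes a precomputed lookup table practical for large scored datasets.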

Citation

If you use PFES in your work, please cite:

Human Proteome-wide Mechanistic Interpretation of Missense Mutations through Protein Feature Enrichment Score [BioRxiv DOI later]

Contact

Seulki Kwon — skwon@broadinstitute.org
Broad Institute of MIT and Harvard
