Phenotype Embedding Similarity-based Rare Disease Gene Mapping
This repository contains the R code supporting the analysis described in the paper:
PERADIGM: Phenotype Embedding Similarity-based Rare Disease Gene Mapping
PERADIGM is a framework that integrates phenotype embedding and patient similarity to identify rare disease-associated genes using large-scale biobank data. This repository includes code to replicate the key analyses and figures from the study.
- main.R: Main script for the analysis, including:
  - Data loading and preprocessing
  - Running phenotype-gene association tests
  - Generating similarity matrices and embeddings
  - Outputting statistical results
- function.R: Contains all helper functions for:
  - Embedding computation
  - Similarity scoring
  - Regression-based testing
  - Carrier/control selection
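As a rough illustration of the similarity-scoring idea, the sketch below computes pairwise cosine similarity between the rows of an embedding matrix in base R. It is not the code in function.R: the function name cosine_similarity and the toy matrix are hypothetical, and cosine similarity is an assumed choice of measure that may differ from the one used in the paper.

```r
# Hypothetical helper (not part of function.R): pairwise cosine similarity
# between the rows of a numeric matrix, one row per embedding vector.
cosine_similarity <- function(emb) {
  stopifnot(is.matrix(emb), is.numeric(emb))
  norms <- sqrt(rowSums(emb^2))
  if (any(norms == 0)) stop("cannot normalise an all-zero embedding vector")
  unit <- emb / norms   # recycling divides each row by its own Euclidean norm
  tcrossprod(unit)      # equivalent to unit %*% t(unit)
}

# Toy embeddings, made up for illustration only.
emb <- rbind(a = c(1, 0), b = c(0, 2), c = c(3, 3))
sim <- cosine_similarity(emb)
```

Here sim is a symmetric 3 x 3 matrix with ones on the diagonal; orthogonal vectors such as a and b score 0.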
Place your data files using the following directory structure:
data/
├── R_doc/
│   ├── hesin_diag_all_new.RData
│   ├── eid_all.RData
│   ├── cov_adjust.RData
│   └── IC_hesin_500k.csv
├── icd_related/
│   └── ICD10_mapping.csv
├── generate_all_gene_pos/
│   └── gene_info.RData
├── embedding/
│   └── hesin_icd10_descrip_embed.txt
└── hesin_diag.txt # Optional/redundant diagnosis file
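Before running the pipeline, it can help to confirm that the layout above is in place. The following is a minimal sketch in base R; the helper name missing_data_files is hypothetical and not part of the repository, and the optional hesin_diag.txt is deliberately left out of the check.

```r
# Relative paths copied from the directory layout above
# (the optional hesin_diag.txt is intentionally excluded).
expected_files <- c(
  "R_doc/hesin_diag_all_new.RData",
  "R_doc/eid_all.RData",
  "R_doc/cov_adjust.RData",
  "R_doc/IC_hesin_500k.csv",
  "icd_related/ICD10_mapping.csv",
  "generate_all_gene_pos/gene_info.RData",
  "embedding/hesin_icd10_descrip_embed.txt"
)

# Hypothetical helper: return the expected files that are absent under data_dir.
missing_data_files <- function(data_dir = "data", files = expected_files) {
  files[!file.exists(file.path(data_dir, files))]
}

missing <- missing_data_files("data")
if (length(missing) > 0) {
  message("Missing data files:\n", paste0("  ", missing, collapse = "\n"))
}
```

An empty result from missing_data_files() means every required file was found under data/.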
To reproduce the analysis:
- Ensure R and required packages are installed.
- Place the data files in the correct subfolders as shown above.
- Run main.R to initiate the pipeline.
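The steps above could be scripted roughly as follows from an R session started in the repository root. The package list is a placeholder, since this README does not name the required packages; replace it with whatever main.R and function.R actually load.

```r
# Placeholder list: substitute the packages that main.R and function.R load.
required_pkgs <- c("stats", "utils")

# Identify packages that cannot be loaded in this R installation.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  stop("Install these packages first: ", paste(missing_pkgs, collapse = ", "))
}

# Run the pipeline only if the script is present in the working directory.
if (file.exists("main.R")) {
  source("main.R")
} else {
  message("main.R not found; set the working directory to the repository root.")
}
```

The same run can be started from a shell with Rscript main.R.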