Commit ddb8822

Merge pull request #13 from tedyun/tedyun-regle-020
REGLE analysis code release for publication. This will be included in REGLE v0.2.0 release.
2 parents 76d34dc + 6069678 commit ddb8822

4 files changed

Lines changed: 211 additions & 0 deletions

Lines changed: 208 additions & 0 deletions
@@ -0,0 +1,208 @@
# Replicates all main analyses in the REGLE paper

## Analysis of the embeddings

1. See `embedding_interpretability.ipynb`.


## Principal component analysis (PCA) and spline fitting

See `pca_and_spline_fitting.ipynb`.


## GWAS

1. GWAS on all phenotypes via [BOLT-LMM](https://alkesgroup.broadinstitute.org/BOLT-LMM/BOLT-LMM_manual.html):

```[bash]
PHENO_NAME="..."
PHENO_FILE="..."
BOLT_LDSC_DIR="..."
UKB_GENOTYPED_DIR="..."
UKB_IMPUTED_DIR="..."
UKB_BGEN_DIR="..."
bolt \
  --numThreads 64 \
  --LDscoresFile "${BOLT_LDSC_DIR}/LDSCORE.1000G_EUR.tab.gz" \
  --LDscoresMatchBp \
  --covarFile "${PHENO_FILE}" \
  --phenoFile "${PHENO_FILE}" \
  --phenoCol "${PHENO_NAME}" \
  --statsFile /tmp/tmp_result_experiment1 \
  --fam "${UKB_GENOTYPED_DIR}/all_samples.fam" \
  --sampleFile "${UKB_IMPUTED_DIR}/ukb.sample" \
  --predBetasFile /tmp/genotyped_variants.betas \
  --remove "${UKB_GENOTYPED_DIR}/nonoverlapping_samples.txt" \
  --lmmForceNonInf \
  --bgenMinMAF 9.999999747378752e-05 \
  --bgenMinINFO 0.6000000238418579 \
  --bgenFile "${UKB_BGEN_DIR}/ukb_imp_chr10_v3_mininfo_0.6.bgen" \
  --statsFileBgenSnps /tmp/tmp_bgen_result_experiment1 \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr10_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr11_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr12_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr13_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr14_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr15_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr16_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr17_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr18_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr19_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr1_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr20_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr21_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr22_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr2_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr3_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr4_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr5_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr6_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr7_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr8_v2.bed" \
  --bed "${UKB_GENOTYPED_DIR}/ukb_cal_chr9_v2.bed" \
  --qCovarCol age \
  --qCovarCol age_x_age \
  --qCovarCol age_x_sex \
  --qCovarCol bmi \
  --qCovarCol genotyping_array \
  --qCovarCol height_cm \
  --qCovarCol height_cm_x_height_cm \
  --qCovarCol model_fold \
  --qCovarCol occasional_smoker \
  --qCovarCol pc1 \
  --qCovarCol pc10 \
  --qCovarCol pc11 \
  --qCovarCol pc12 \
  --qCovarCol pc13 \
  --qCovarCol pc14 \
  --qCovarCol pc15 \
  --qCovarCol pc2 \
  --qCovarCol pc3 \
  --qCovarCol pc4 \
  --qCovarCol pc5 \
  --qCovarCol pc6 \
  --qCovarCol pc7 \
  --qCovarCol pc8 \
  --qCovarCol pc9 \
  --qCovarCol sex \
  --qCovarCol smoker \
  --qCovarCol smoking_pack_per_year \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr10_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr11_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr12_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr13_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr14_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr15_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr16_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr17_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr18_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr19_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr1_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr20_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr21_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr22_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr2_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr3_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr4_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr5_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr6_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr7_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr8_v2.bim" \
  --bim "${UKB_GENOTYPED_DIR}/ukb_cal_chr9_v2.bim"
```
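
The 22 `--bed` and 22 `--bim` arguments in the BOLT-LMM command follow one naming pattern, so they could be generated instead of typed out. The following is a minimal sketch, not the command used for the paper: `build_plink_args` and `PLINK_ARGS` are hypothetical names, the fallback directory is a placeholder, and chromosomes are emitted in numeric order (the same order for both file types) rather than the lexicographic order of the listing. The BOLT-LMM manual also describes a `{1:22}` range shorthand for file arguments; whether it suits this setup has not been checked here.

```bash
#!/usr/bin/env bash
# Sketch: collect the per-chromosome --bed/--bim arguments in a bash array.
# build_plink_args and PLINK_ARGS are hypothetical names, not part of BOLT-LMM.
build_plink_args() {
  local dir="$1" chr
  PLINK_ARGS=()
  for chr in {1..22}; do
    PLINK_ARGS+=(--bed "${dir}/ukb_cal_chr${chr}_v2.bed")
  done
  for chr in {1..22}; do
    PLINK_ARGS+=(--bim "${dir}/ukb_cal_chr${chr}_v2.bim")
  done
}

# The fallback path is a placeholder used only when UKB_GENOTYPED_DIR is unset.
build_plink_args "${UKB_GENOTYPED_DIR:-/path/to/genotyped}"

# Print one argument per line for inspection.
printf '%s\n' "${PLINK_ARGS[@]}"
```

The array can then be spliced into the call as `bolt ... "${PLINK_ARGS[@]}"` alongside the remaining options.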


## [LDSC](https://github.com/bulik/ldsc)

1. Run munge:

```[bash]
BOLT_GWAS_FILE="..."
LDSC_INPUT_DIR="..."
LDSC_OUTPUT_DIR="..."
source activate ldsc && python /opt/ldsc/munge_sumstats.py \
  --sumstats "${BOLT_GWAS_FILE}" \
  --merge-alleles "${LDSC_INPUT_DIR}/w_hm3.snplist" \
  --out "${LDSC_OUTPUT_DIR}/munge" \
  --chunksize 500000
```

1. Run S-LDSC:

```[bash]
source activate ldsc && python /opt/ldsc/ldsc.py \
  --h2 "${LDSC_OUTPUT_DIR}/munge.sumstats.gz" \
  --ref-ld-chr "${LDSC_INPUT_DIR}/baselineLD." \
  --w-ld-chr "${LDSC_INPUT_DIR}/weight." \
  --out "${LDSC_OUTPUT_DIR}/ldsc"
```

## [GARFIELD](https://www.ebi.ac.uk/birney-srv/GARFIELD/)

1. For each chromosome run:
```[bash]
GARFIELD_INPUT_DIR="..."
GARFIELD_OUTPUT_DIR="..."
ANNOTATION_LIKE_FILE="..."
INPUT_FILE_P="..."
# Assumes CHR is set to the chromosome being processed.
./garfield/garfield-prep-chr \
  -ptags "${GARFIELD_INPUT_DIR}/tags/r01/*" \
  -ctags "${GARFIELD_INPUT_DIR}/tags/r08/*" \
  -maftss "${GARFIELD_INPUT_DIR}/maftssd/*" \
  -pval "${INPUT_FILE_P}" \
  -ann "${GARFIELD_INPUT_DIR}/annotation/*" \
  -excl -1 \
  -chr "${CHR}" \
  -o "${GARFIELD_OUTPUT_DIR}/tmp_prep_out"
```

1. For each chromosome run:
```[bash]
Rscript garfield-Meff-Padj.R \
  -i "${GARFIELD_OUTPUT_DIR}/tmp_prep_out" \
  -o "${GARFIELD_OUTPUT_DIR}/tmp_meff_out"
```

1. To compute enrichment:
```[bash]
Rscript garfield-test.R \
  -i "${GARFIELD_OUTPUT_DIR}/tmp_prep_out" \
  -o "${GARFIELD_OUTPUT_DIR}/tmp_test_out" \
  -l "${ANNOTATION_LIKE_FILE}" \
  -pt 1e-5,1e-8 \
  -b m5,n5,t5 \
  -s 1-1005 \
  -c 0
```

1. Plotting:
```[bash]
# Assumes PVAL_ADJ (the adjusted p-value threshold) is already set.
Rscript garfield-plot.R \
  -i "${GARFIELD_OUTPUT_DIR}/tmp_prep_out" \
  -o "${GARFIELD_OUTPUT_DIR}/tmp_plot_out" \
  -l "${ANNOTATION_LIKE_FILE}" \
  -t " " \
  -f 10 \
  -padj "${PVAL_ADJ}"
```
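
The per-chromosome steps above read `${CHR}` but do not show the loop that sets it. One possible driver is sketched below. Everything in it is an assumption rather than part of the released code: the helper name `for_each_chromosome`, the stand-in function `prep_one_chromosome`, and the 1-22 autosome range.

```bash
#!/usr/bin/env bash
# Sketch: run a per-chromosome step once for every autosome.
# for_each_chromosome is a hypothetical helper; the 1-22 range is an
# assumption and should match the chromosomes present in your inputs.
for_each_chromosome() {
  local CHR
  for CHR in {1..22}; do
    # Shell functions called here see CHR through bash's dynamic scoping.
    "$@" || { echo "step failed for chromosome ${CHR}" >&2; return 1; }
  done
}

# Stand-in for the real step: replace the echo with the garfield-prep-chr
# invocation, which reads ${CHR}.
prep_one_chromosome() {
  echo "would run garfield-prep-chr for chromosome ${CHR}"
}

for_each_chromosome prep_one_chromosome
```

Wrapping the real command in a shell function keeps `${CHR}` visible to it without exporting the variable, and the loop stops at the first chromosome whose step fails.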


## Polygenic risk score (PRS) analysis

Given the effect sizes computed by BOLT-LMM or by "pruning and thresholding" as
described in the paper, we generated each individual's polygenic risk score
(PRS) using [PLINK](https://www.cog-genomics.org/plink/2.0/) as follows.

1. We ran PLINK to compute the PRS with the following command:
```[bash]
plink \
  --bed "${BED_FILE}" \
  --bim "${BIM_FILE}" \
  --fam "${FAM_FILE}" \
  --read-freq "${VARIANT_FREQ_FILE}" \
  --score "${MODEL_FILE}" header sum double-dosage \
  --out "${PLINK_OUT}"
```

1. See `prs_analysis.ipynb` to compute the various PRS metrics we use in the paper
   using (paired) bootstrapping.
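
For reference, a score of the kind produced by the PLINK step is a weighted sum of allele counts. The formula below is a sketch under assumptions this README does not state: $\hat{\beta}_j$ denotes the effect size listed for variant $j$ in `MODEL_FILE`, $g_{ij}$ the number of copies of the scored allele carried by individual $i$, and $M$ the number of variants in the model.

```latex
\mathrm{PRS}_i = \sum_{j=1}^{M} \hat{\beta}_j \, g_{ij}, \qquad g_{ij} \in \{0, 1, 2\}
```

The `sum` modifier asks PLINK to report this sum rather than a per-allele average. See the PLINK documentation for how missing genotypes and dosages are handled.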

regle/analysis/embedding_interpretability.ipynb

Lines changed: 1 addition & 0 deletions