Skip to content

Commit 59e915a

Browse files
authored
Create README.md for QTL analysis (#1729)
1 parent c9c4fa3 commit 59e915a

9 files changed

Lines changed: 365 additions & 3 deletions

all_of_us/rna_seq/AggregateSusieWorkflow.changelog.md

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,9 @@
1-
# aou_9.0.0
1+
# aou_9.0.2
2+
3+
2026-01-29 (Date of Last Commit)
4+
* Added set euo pipefail to tasks
5+
6+
# aou_9.0.1
27
2025-11-17 (Date of Last Commit)
38

49
* Added the AggregateSusieWorkflow script for AoU V9 processing; this workflow was copied from the repository: https://github.com/AoU-Multiomics-Analysis/aggregate_susie/blob/main/merge_susie.wdl

all_of_us/rna_seq/AggregateSusieWorkflow.wdl

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,7 @@ task AggregateSusie{
1010
}
1111

1212
command <<<
13+
set -euo pipefail
1314
export GSUTIL_PARALLEL_PROCESS_COUNT=32
1415
export GSUTIL_PARALLEL_THREAD_COUNT=8
1516
@@ -60,6 +61,7 @@ task AnnotateSusie {
6061

6162
}
6263
command <<<
64+
set -euo pipefail
6365
Rscript /tmp/annotate_susie_data.R \
6466
--OutputPrefix ~{OutputPrefix} \
6567
--GencodeGTF ~{GencodeGTF} \
@@ -106,7 +108,7 @@ workflow AggregateSusieWorkflow {
106108

107109
}
108110

109-
String pipeline_version = "aou_9.0.1"
111+
String pipeline_version = "aou_9.0.2"
110112

111113
call AggregateSusie {
112114
input:

all_of_us/rna_seq/CalculatePhenotypeGroups.changelog.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,8 @@
1+
# aou_9.0.1
2+
2026-01-29 (Date of Last Commit)
3+
4+
* Added set euo pipefail to tasks
5+
16
# aou_9.0.0
27
2025-11-19 (Date of Last Commit)
38

all_of_us/rna_seq/CalculatePhenotypeGroups.wdl

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@ task PrepareSpliceData {
1212
Int num_threads
1313
}
1414
command {
15+
set -euo pipefail
1516
Rscript /tmp/PrepareSpliceData.R \
1617
--SpliceData ${SpliceData} \
1718
--SampleList ${SampleList} \

all_of_us/rna_seq/README.md

Lines changed: 337 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,337 @@
1+
# All of Us RNA-seq eQTL and sQTL Analysis Pipeline
2+
3+
This README describes the end-to-end workflow for preparing genotypes, generating RNA expression and splicing phenotypes, computing covariates, running cis-QTL analysis with TensorQTL, and performing fine-mapping with SuSiE.
4+
5+
All workflows referenced here are implemented as WDLs in **WARP**.
6+
7+
The *original versions* of these workflows were either created by the GTEx Consortium (see their [GTEx GitHub repository](https://github.com/broadinstitute/gtex-pipeline/tree/master?tab=readme-ov-file)) or the lab for **Dr. Stephen Montgomery Lab** at Stanford University, with major contributions from **Evin Padhi** and **Jon Nguyen**. Their work formed the foundation for the integrated analysis pipeline described here. Portions of the logic originated from the publicly available repository:
8+
9+
* **AoU-Multiomics-Analysis**
10+
[https://github.com/AoU-Multiomics-Analysis](https://github.com/AoU-Multiomics-Analysis)
11+
12+
This README explains how the pieces fit together and what each WDL component produces.
13+
14+
---
15+
16+
# Table of Contents
17+
18+
1. [Overview](#overview)
19+
2. [Input Requirements](#input-requirements)
20+
3. [Analysis Flow](#analysis-flow)
21+
22+
* [1. Ancestry Grouping & Sample Lists](#1-ancestry-grouping--sample-lists)
23+
* [2. Genotype Preparation (`Prepare_VCF`)](#2-genotype-preparation-prepare_vcf)
24+
* [3. Genotype Dosage Calculation](#3-genotype-dosage-calculation)
25+
* [4. RNA Alignment, Counts and Splicing BED](#4-rna-alignment-counts-and-splicing-bed)
26+
* [5. RNA Phenotype Preparation (`Prepare_eQTL`)](#5-rna-phenotype-preparation-prepare_eqtl)
27+
* [6. Covariate Creation (`MergeCovariates`)](#6-covariate-creation-mergecovariates)
28+
* [7. cis-eQTL Mapping (TensorQTL)](#7-cis-eqtl-mapping-tensorqtl)
29+
* [8. FDR Recalculation & Fine-Mapping Prep](#8-fdr-recalculation--fine-mapping-prep)
30+
* [9. SuSiE Fine-Mapping (`SusieR`)](#9-susie-fine-mapping-susier)
31+
* [10. Allele Frequency Calculation](#10-allele-frequency-calculation)
32+
* [11. SuSiE Aggregation](#11-susie-aggregation)
33+
4. [sQTL Workflow](#sqtl-workflow)
34+
5. [Acknowledgements](#acknowledgements)
35+
36+
---
37+
38+
# Overview
39+
40+
This pipeline generates all inputs, outputs, and intermediate metadata required for a full cis-expression QTL(eQTL) and cis-splicing QTL (sQTL) analysis:
41+
42+
* Genotype preprocessing and pruning
43+
* PLINK files and genotype principal components
44+
* Dosage matrices per ancestry
45+
* Expression and splicing phenotype matrices
46+
* Phenotype PCs and additional grouping metadata
47+
* Covariate tables
48+
* TensorQTL cis-QTL results
49+
* SuSiE fine-mapping outputs
50+
* Aggregated credible sets
51+
52+
---
53+
54+
# **Input Requirements**
55+
56+
To run the workflows described here, you need:
57+
58+
* A joint-called **VCF** containing the relevant samples
59+
* **Research IDs** partitioned by ancestry or subpopulation
60+
* RNA expression quantifications (for eQTLs)
61+
* BAM/CRAM files for splice junction extraction (for sQTLs)
62+
* Associated metadata (sample-level phenotype table)
63+
64+
---
65+
66+
# **Analysis Flow**
67+
68+
The sections below describe each workflow, its purpose, and expected outputs.
69+
70+
---
71+
72+
## 1. Ancestry Grouping & Sample Lists
73+
74+
Prepare a table listing sample IDs for each ancestry/subpopulation.
75+
76+
Outputs:
77+
78+
* Sample lists per group
79+
* Updated tables of sample metadata
80+
* Input tables needed for downstream WDLs
81+
82+
This step is required before running genotype or phenotype workflows per ancestry.
83+
84+
---
85+
86+
## 2. Genotype Preparation (`Prepare_VCF`)
87+
88+
The [Prepare_VCF](https://dockstore.org/workflows/github.com/AoU-Multiomics-Analysis/prepare_QTL/prepare_VCF:develop?tab=info) WDL performs:
89+
90+
* Variant pruning
91+
* Conversion of the VCF to PLINK (`pgen`, `psam`, `pvar`)
92+
* Computation of **genotype PCs**
93+
94+
Outputs:
95+
96+
* Pruned VCF
97+
* PLINK genotype files
98+
* Genotype principal component matrix
99+
100+
These outputs are used for both eQTL and sQTL pipelines.
101+
102+
---
103+
104+
## 3. Genotype Dosage Calculation
105+
106+
The [CalculateGenotypeDosage](https://github.com/AoU-Multiomics-Analysis/prepare_QTL/blob/main/workflows/calculateGenotypeDosage.wdl) WDL generates genotype dosages per ancestry group.
107+
108+
Outputs:
109+
110+
* Two dosage files per ancestry (variant-by-sample dosage matrices)
111+
112+
Because this step uses only the VCF, it may be integrated with `Prepare_VCF` in future versions.
113+
114+
---
115+
## 4. RNA Alignment, Counts and Splicing BED
116+
The [rnaseq_aou.wdl](./rnaseq_aou.wdl) was modified from the original GTEx pipeline and run with the GENCODE v48 GTF. The resulting counts were used as input for downstream expression QTL analysis.
117+
118+
For splicing QTL (sQTL) analysis, the resulting duplicated-marked aligned BAMs were used as a input to the [leafcutter_bam_to_juc wdl](./leafcutter_bam_to_junc.wdl).
119+
120+
This created a junction file that was used as input for the [leafcutter_cluster.wdl](./leafcutter_cluster.wdl), which produces a BED file for downstream sQTL.
121+
122+
## 5. RNA Phenotype Preparation for eQTL (`Prepare_eQTL`)
123+
124+
The [Prepare_eQTL WDL](https://github.com/AoU-Multiomics-Analysis/prepare_QTL/blob/main/workflows/prepare_eQTL.wdl) processes RNA expression data to generate:
125+
126+
* A **BED-format phenotype matrix**
127+
* **Phenotype PCs** for downstream covariate construction
128+
129+
Inputs typically include:
130+
131+
* Expression quantifications (TPM/CPM or counts)
132+
* Sample metadata
133+
* Gene annotations
134+
135+
Outputs are formatted to match TensorQTL requirements.
136+
137+
---
138+
139+
## 6. Covariate Creation (`MergeCovariates`)
140+
141+
The [MergeCovariates WDL](https://github.com/AoU-Multiomics-Analysis/prepare_QTL/blob/main/workflows/MergeCovariates.wdl) merges:
142+
143+
* Genotype PCs
144+
* Phenotype PCs (expression or splicing)
145+
* Optional grouping variables (for sQTLs)
146+
147+
Outputs:
148+
149+
* A single covariates file for use in TensorQTL
150+
151+
This step ensures consistent ordering and formatting across all inputs.
152+
153+
---
154+
155+
## 7. cis-eQTL Mapping (TensorQTL)
156+
157+
The [TensorQTL cis permutations WDL](https://dockstore.org/workflows/github.com/AoU-Multiomics-Analysis/tensorQTL_cis_permutations:main?tab=info) runs **TensorQTL cis-permutation** mode to compute:
158+
159+
* Nominal associations
160+
* Permutation-based cis-eQTL statistics
161+
* Beta values, empirical p-values, and effect directions
162+
163+
Notes:
164+
165+
* The optional phenotype groups file is **not** required for eQTL analysis
166+
* Results can be written directly into a structured output directory or table
167+
* These outputs form the basis for fine-mapping
168+
169+
---
170+
171+
## 8. FDR Recalculation & Fine-Mapping Prep
172+
173+
After TensorQTL completes:
174+
175+
* Recalculate FDR
176+
* Filter to **FDR ≤ 0.05**
177+
* Format results into a SuSiE-ready table
178+
179+
This step typically involves:
180+
181+
* Computing q-values
182+
* Generating a list of significant gene–variant pairs
183+
* Preparing SuSiE input metadata, including:
184+
185+
* Expression ID
186+
* Genomic window coordinates
187+
* Output prefix names (must match SuSiE input requirements)
188+
189+
---
190+
191+
## 9. SuSiE Fine-Mapping (`SusieR`)
192+
193+
This [SusieR WDL](./susieR_workflow.wdl) performs SuSiE fine-mapping for each cis-window.
194+
195+
Inputs:
196+
197+
* Dosage matrices (from Step 3)
198+
* Significant TensorQTL hits (from Step 7)
199+
* Expression or splicing phenotype metadata
200+
* A consistent `OutputPrefix` per phenotype
201+
202+
Outputs include:
203+
204+
* SuSiE credible sets
205+
* Variant posterior inclusion probabilities
206+
* Fine-mapped credible intervals
207+
208+
Tips:
209+
210+
* Preemptible VMs can be used to reduce cost
211+
* For reproducibility, a pinned Docker SHA is recommended
212+
213+
---
214+
215+
## 10. Allele Frequency Calculation
216+
217+
The [CalculateAF](https://dockstore.org/workflows/github.com/AoU-Multiomics-Analysis/prepare_QTL/calculateAF:main?tab=info) WDL calculates allele frequencies using PLINK.
218+
219+
Outputs:
220+
221+
* Per-variant allele frequencies
222+
* Additional variant summary metrics
223+
224+
This step is optional but useful for interpretation and downstream reporting.
225+
226+
---
227+
228+
## 11. SuSiE Aggregation
229+
230+
The [AggregateSusie WDL](./AggregateSusieWorkflow.wdl) aggregates fine-mapping results across all phenotypes.
231+
232+
Inputs:
233+
234+
* Paths to all SuSiE parquet outputs
235+
236+
* Use the **“SusieParquet”** (fine-mapped) files
237+
* Do **not** use the "Full" parquets (contain all tested variants)
238+
239+
Outputs:
240+
241+
* Combined table of all credible sets
242+
* Aggregated fine-mapping metadata
243+
* Summary tables for downstream QTL interpretation
244+
245+
---
246+
247+
# **sQTL Workflow**
248+
249+
The sQTL pipeline shares genotype components with the eQTL workflow but differs in phenotype preparation and covariate structure.
250+
251+
---
252+
253+
## **1. Leafcutter Junc and Cluster Generation**
254+
255+
Run:
256+
257+
* **Bam2Junc** to extract junctions
258+
* **Cluster** to identify splice clusters
259+
260+
Outputs:
261+
262+
* Junc files
263+
* Cluster definitions
264+
* Leafcutter BED files (cluster-level)
265+
266+
---
267+
268+
## **2. Prepare sQTL Phenotypes (`prepare_sQTL`)**
269+
270+
This WDL:
271+
272+
* Consumes Leafcutter BED files
273+
* Generates a **splicing phenotype BED**
274+
* Computes **phenotype PCs**
275+
276+
Older versions of the preprocessing script also produced:
277+
278+
* **PhenotypeGroups** (required for TensorQTL)
279+
280+
This is included as a separate workflow below.
281+
282+
---
283+
284+
## **3. Calculate Phenotype Groups**
285+
286+
If phenotype groups are not emitted by the updated splicing phenotype WDL, a supplementary WDL can generate them.
287+
288+
Outputs:
289+
290+
* PhenotypeGroups file
291+
292+
---
293+
294+
## **4. Merge Covariates (sQTL)**
295+
296+
Identical to the eQTL covariate merging step, but includes:
297+
298+
* Genotype PCs
299+
* Splicing phenotype PCs
300+
* PhenotypeGroups file
301+
302+
Outputs:
303+
304+
* Covariate file for TensorQTL sQTL analysis
305+
306+
---
307+
308+
## **5. TensorQTL cis-sQTL**
309+
310+
This step uses:
311+
312+
* PLINK genotype files
313+
* Splicing BED phenotype matrix
314+
* Covariates
315+
* PhenotypeGroups
316+
317+
Outputs:
318+
319+
* cis-sQTL nominal and permutation results
320+
* Per-cluster association statistics
321+
322+
Downstream fine-mapping can be performed using the same SuSiE workflow if desired.
323+
324+
---
325+
326+
# **Acknowledgements**
327+
328+
This pipeline builds upon extensive work by the **Stephen Montgomery Lab** at Stanford University.
329+
Special thanks to:
330+
331+
* **Evin Padhi**
332+
* **Jon Nguyen**
333+
334+
for developing foundational versions of many scripts and workflows used in this analysis.
335+
336+
Additional integration, optimization, and workflow migration were performed by the All of Us DRC Multiomics and Pipeline Development teams as part of the WARP workflow suite.
337+

all_of_us/rna_seq/aggregate_rsem_results.changelog.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,8 @@
1+
# aou_9.0.1
2+
2026-01-29 (Date of Last Commit)
3+
4+
* Added set euo pipefail to tasks
5+
16
# aou_9.0.0
27
2025-06-06 (Date of Last Commit)
38

0 commit comments

Comments
 (0)