|
| 1 | +# All of Us RNA-seq eQTL and sQTL Analysis Pipeline |
| 2 | + |
| 3 | +This README describes the end-to-end workflow for preparing genotypes, generating RNA expression and splicing phenotypes, computing covariates, running cis-QTL analysis with TensorQTL, and performing fine-mapping with SuSiE. |
| 4 | + |
| 5 | +All workflows referenced here are implemented as WDLs in **WARP**. |
| 6 | + |
| 7 | +The *original versions* of these workflows were either created by the GTEx Consortium (see their [GTEx GitHub repository](https://github.com/broadinstitute/gtex-pipeline/tree/master?tab=readme-ov-file)) or the lab for **Dr. Stephen Montgomery Lab** at Stanford University, with major contributions from **Evin Padhi** and **Jon Nguyen**. Their work formed the foundation for the integrated analysis pipeline described here. Portions of the logic originated from the publicly available repository: |
| 8 | + |
| 9 | +* **AoU-Multiomics-Analysis** |
| 10 | + [https://github.com/AoU-Multiomics-Analysis](https://github.com/AoU-Multiomics-Analysis) |
| 11 | + |
| 12 | +This README explains how the pieces fit together and what each WDL component produces. |
| 13 | + |
| 14 | +--- |
| 15 | + |
| 16 | +# Table of Contents |
| 17 | + |
| 18 | +1. [Overview](#overview) |
| 19 | +2. [Input Requirements](#input-requirements) |
| 20 | +3. [Analysis Flow](#analysis-flow) |
| 21 | + |
| 22 | + * [1. Ancestry Grouping & Sample Lists](#1-ancestry-grouping--sample-lists) |
| 23 | + * [2. Genotype Preparation (`Prepare_VCF`)](#2-genotype-preparation-prepare_vcf) |
| 24 | + * [3. Genotype Dosage Calculation](#3-genotype-dosage-calculation) |
| 25 | + * [4. RNA Alignment, Counts and Splicing BED](#4-rna-alignment-counts-and-splicing-bed) |
| 26 | + * [5. RNA Phenotype Preparation (`Prepare_eQTL`)](#5-rna-phenotype-preparation-prepare_eqtl) |
| 27 | + * [6. Covariate Creation (`MergeCovariates`)](#6-covariate-creation-mergecovariates) |
| 28 | + * [7. cis-eQTL Mapping (TensorQTL)](#7-cis-eqtl-mapping-tensorqtl) |
| 29 | + * [8. FDR Recalculation & Fine-Mapping Prep](#8-fdr-recalculation--fine-mapping-prep) |
| 30 | + * [9. SuSiE Fine-Mapping (`SusieR`)](#9-susie-fine-mapping-susier) |
| 31 | + * [10. Allele Frequency Calculation](#10-allele-frequency-calculation) |
| 32 | + * [11. SuSiE Aggregation](#11-susie-aggregation) |
| 33 | +4. [sQTL Workflow](#sqtl-workflow) |
| 34 | +5. [Acknowledgements](#acknowledgements) |
| 35 | + |
| 36 | +--- |
| 37 | + |
| 38 | +# Overview |
| 39 | + |
| 40 | +This pipeline generates all inputs, outputs, and intermediate metadata required for a full cis-expression QTL(eQTL) and cis-splicing QTL (sQTL) analysis: |
| 41 | + |
| 42 | +* Genotype preprocessing and pruning |
| 43 | +* PLINK files and genotype principal components |
| 44 | +* Dosage matrices per ancestry |
| 45 | +* Expression and splicing phenotype matrices |
| 46 | +* Phenotype PCs and additional grouping metadata |
| 47 | +* Covariate tables |
| 48 | +* TensorQTL cis-QTL results |
| 49 | +* SuSiE fine-mapping outputs |
| 50 | +* Aggregated credible sets |
| 51 | + |
| 52 | +--- |
| 53 | + |
| 54 | +# **Input Requirements** |
| 55 | + |
| 56 | +To run the workflows described here, you need: |
| 57 | + |
| 58 | +* A joint-called **VCF** containing the relevant samples |
| 59 | +* **Research IDs** partitioned by ancestry or subpopulation |
| 60 | +* RNA expression quantifications (for eQTLs) |
| 61 | +* BAM/CRAM files for splice junction extraction (for sQTLs) |
| 62 | +* Associated metadata (sample-level phenotype table) |
| 63 | + |
| 64 | +--- |
| 65 | + |
| 66 | +# **Analysis Flow** |
| 67 | + |
| 68 | +The sections below describe each workflow, its purpose, and expected outputs. |
| 69 | + |
| 70 | +--- |
| 71 | + |
| 72 | +## 1. Ancestry Grouping & Sample Lists |
| 73 | + |
| 74 | +Prepare a table listing sample IDs for each ancestry/subpopulation. |
| 75 | + |
| 76 | +Outputs: |
| 77 | + |
| 78 | +* Sample lists per group |
| 79 | +* Updated tables of sample metadata |
| 80 | +* Input tables needed for downstream WDLs |
| 81 | + |
| 82 | +This step is required before running genotype or phenotype workflows per ancestry. |
| 83 | + |
| 84 | +--- |
| 85 | + |
| 86 | +## 2. Genotype Preparation (`Prepare_VCF`) |
| 87 | + |
| 88 | +The [Prepare_VCF](https://dockstore.org/workflows/github.com/AoU-Multiomics-Analysis/prepare_QTL/prepare_VCF:develop?tab=info) WDL performs: |
| 89 | + |
| 90 | +* Variant pruning |
| 91 | +* Conversion of the VCF to PLINK (`pgen`, `psam`, `pvar`) |
| 92 | +* Computation of **genotype PCs** |
| 93 | + |
| 94 | +Outputs: |
| 95 | + |
| 96 | +* Pruned VCF |
| 97 | +* PLINK genotype files |
| 98 | +* Genotype principal component matrix |
| 99 | + |
| 100 | +These outputs are used for both eQTL and sQTL pipelines. |
| 101 | + |
| 102 | +--- |
| 103 | + |
| 104 | +## 3. Genotype Dosage Calculation |
| 105 | + |
| 106 | +The [CalculateGenotypeDosage](https://github.com/AoU-Multiomics-Analysis/prepare_QTL/blob/main/workflows/calculateGenotypeDosage.wdl) WDL generates genotype dosages per ancestry group. |
| 107 | + |
| 108 | +Outputs: |
| 109 | + |
| 110 | +* Two dosage files per ancestry (variant-by-sample dosage matrices) |
| 111 | + |
| 112 | +Because this step uses only the VCF, it may be integrated with `Prepare_VCF` in future versions. |
| 113 | + |
| 114 | +--- |
| 115 | +## 4. RNA Alignment, Counts and Splicing BED |
| 116 | +The [rnaseq_aou.wdl](./rnaseq_aou.wdl) was modified from the original GTEx pipeline and run with the GENCODE v48 GTF. The resulting counts were used as input for downstream expression QTL analysis. |
| 117 | + |
| 118 | +For splicing QTL (sQTL) analysis, the resulting duplicated-marked aligned BAMs were used as a input to the [leafcutter_bam_to_juc wdl](./leafcutter_bam_to_junc.wdl). |
| 119 | + |
| 120 | +This created a junction file that was used as input for the [leafcutter_cluster.wdl](./leafcutter_cluster.wdl), which produces a BED file for downstream sQTL. |
| 121 | + |
| 122 | +## 5. RNA Phenotype Preparation for eQTL (`Prepare_eQTL`) |
| 123 | + |
| 124 | +The [Prepare_eQTL WDL](https://github.com/AoU-Multiomics-Analysis/prepare_QTL/blob/main/workflows/prepare_eQTL.wdl) processes RNA expression data to generate: |
| 125 | + |
| 126 | +* A **BED-format phenotype matrix** |
| 127 | +* **Phenotype PCs** for downstream covariate construction |
| 128 | + |
| 129 | +Inputs typically include: |
| 130 | + |
| 131 | +* Expression quantifications (TPM/CPM or counts) |
| 132 | +* Sample metadata |
| 133 | +* Gene annotations |
| 134 | + |
| 135 | +Outputs are formatted to match TensorQTL requirements. |
| 136 | + |
| 137 | +--- |
| 138 | + |
| 139 | +## 6. Covariate Creation (`MergeCovariates`) |
| 140 | + |
| 141 | +The [MergeCovariates WDL](https://github.com/AoU-Multiomics-Analysis/prepare_QTL/blob/main/workflows/MergeCovariates.wdl) merges: |
| 142 | + |
| 143 | +* Genotype PCs |
| 144 | +* Phenotype PCs (expression or splicing) |
| 145 | +* Optional grouping variables (for sQTLs) |
| 146 | + |
| 147 | +Outputs: |
| 148 | + |
| 149 | +* A single covariates file for use in TensorQTL |
| 150 | + |
| 151 | +This step ensures consistent ordering and formatting across all inputs. |
| 152 | + |
| 153 | +--- |
| 154 | + |
| 155 | +## 7. cis-eQTL Mapping (TensorQTL) |
| 156 | + |
| 157 | +The [TensorQTL cis permutations WDL](https://dockstore.org/workflows/github.com/AoU-Multiomics-Analysis/tensorQTL_cis_permutations:main?tab=info) runs **TensorQTL cis-permutation** mode to compute: |
| 158 | + |
| 159 | +* Nominal associations |
| 160 | +* Permutation-based cis-eQTL statistics |
| 161 | +* Beta values, empirical p-values, and effect directions |
| 162 | + |
| 163 | +Notes: |
| 164 | + |
| 165 | +* The optional phenotype groups file is **not** required for eQTL analysis |
| 166 | +* Results can be written directly into a structured output directory or table |
| 167 | +* These outputs form the basis for fine-mapping |
| 168 | + |
| 169 | +--- |
| 170 | + |
| 171 | +## 8. FDR Recalculation & Fine-Mapping Prep |
| 172 | + |
| 173 | +After TensorQTL completes: |
| 174 | + |
| 175 | +* Recalculate FDR |
| 176 | +* Filter to **FDR ≤ 0.05** |
| 177 | +* Format results into a SuSiE-ready table |
| 178 | + |
| 179 | +This step typically involves: |
| 180 | + |
| 181 | +* Computing q-values |
| 182 | +* Generating a list of significant gene–variant pairs |
| 183 | +* Preparing SuSiE input metadata, including: |
| 184 | + |
| 185 | + * Expression ID |
| 186 | + * Genomic window coordinates |
| 187 | + * Output prefix names (must match SuSiE input requirements) |
| 188 | + |
| 189 | +--- |
| 190 | + |
| 191 | +## 9. SuSiE Fine-Mapping (`SusieR`) |
| 192 | + |
| 193 | +This [SusieR WDL](./susieR_workflow.wdl) performs SuSiE fine-mapping for each cis-window. |
| 194 | + |
| 195 | +Inputs: |
| 196 | + |
| 197 | +* Dosage matrices (from Step 3) |
| 198 | +* Significant TensorQTL hits (from Step 7) |
| 199 | +* Expression or splicing phenotype metadata |
| 200 | +* A consistent `OutputPrefix` per phenotype |
| 201 | + |
| 202 | +Outputs include: |
| 203 | + |
| 204 | +* SuSiE credible sets |
| 205 | +* Variant posterior inclusion probabilities |
| 206 | +* Fine-mapped credible intervals |
| 207 | + |
| 208 | +Tips: |
| 209 | + |
| 210 | +* Preemptible VMs can be used to reduce cost |
| 211 | +* For reproducibility, a pinned Docker SHA is recommended |
| 212 | + |
| 213 | +--- |
| 214 | + |
| 215 | +## 10. Allele Frequency Calculation |
| 216 | + |
| 217 | +The [CalculateAF](https://dockstore.org/workflows/github.com/AoU-Multiomics-Analysis/prepare_QTL/calculateAF:main?tab=info) WDL calculates allele frequencies using PLINK. |
| 218 | + |
| 219 | +Outputs: |
| 220 | + |
| 221 | +* Per-variant allele frequencies |
| 222 | +* Additional variant summary metrics |
| 223 | + |
| 224 | +This step is optional but useful for interpretation and downstream reporting. |
| 225 | + |
| 226 | +--- |
| 227 | + |
| 228 | +## 11. SuSiE Aggregation |
| 229 | + |
| 230 | +The [AggregateSusie WDL](./AggregateSusieWorkflow.wdl) aggregates fine-mapping results across all phenotypes. |
| 231 | + |
| 232 | +Inputs: |
| 233 | + |
| 234 | +* Paths to all SuSiE parquet outputs |
| 235 | + |
| 236 | + * Use the **“SusieParquet”** (fine-mapped) files |
| 237 | + * Do **not** use the "Full" parquets (contain all tested variants) |
| 238 | + |
| 239 | +Outputs: |
| 240 | + |
| 241 | +* Combined table of all credible sets |
| 242 | +* Aggregated fine-mapping metadata |
| 243 | +* Summary tables for downstream QTL interpretation |
| 244 | + |
| 245 | +--- |
| 246 | + |
| 247 | +# **sQTL Workflow** |
| 248 | + |
| 249 | +The sQTL pipeline shares genotype components with the eQTL workflow but differs in phenotype preparation and covariate structure. |
| 250 | + |
| 251 | +--- |
| 252 | + |
| 253 | +## **1. Leafcutter Junc and Cluster Generation** |
| 254 | + |
| 255 | +Run: |
| 256 | + |
| 257 | +* **Bam2Junc** to extract junctions |
| 258 | +* **Cluster** to identify splice clusters |
| 259 | + |
| 260 | +Outputs: |
| 261 | + |
| 262 | +* Junc files |
| 263 | +* Cluster definitions |
| 264 | +* Leafcutter BED files (cluster-level) |
| 265 | + |
| 266 | +--- |
| 267 | + |
| 268 | +## **2. Prepare sQTL Phenotypes (`prepare_sQTL`)** |
| 269 | + |
| 270 | +This WDL: |
| 271 | + |
| 272 | +* Consumes Leafcutter BED files |
| 273 | +* Generates a **splicing phenotype BED** |
| 274 | +* Computes **phenotype PCs** |
| 275 | + |
| 276 | +Older versions of the preprocessing script also produced: |
| 277 | + |
| 278 | +* **PhenotypeGroups** (required for TensorQTL) |
| 279 | + |
| 280 | +This is included as a separate workflow below. |
| 281 | + |
| 282 | +--- |
| 283 | + |
| 284 | +## **3. Calculate Phenotype Groups** |
| 285 | + |
| 286 | +If phenotype groups are not emitted by the updated splicing phenotype WDL, a supplementary WDL can generate them. |
| 287 | + |
| 288 | +Outputs: |
| 289 | + |
| 290 | +* PhenotypeGroups file |
| 291 | + |
| 292 | +--- |
| 293 | + |
| 294 | +## **4. Merge Covariates (sQTL)** |
| 295 | + |
| 296 | +Identical to the eQTL covariate merging step, but includes: |
| 297 | + |
| 298 | +* Genotype PCs |
| 299 | +* Splicing phenotype PCs |
| 300 | +* PhenotypeGroups file |
| 301 | + |
| 302 | +Outputs: |
| 303 | + |
| 304 | +* Covariate file for TensorQTL sQTL analysis |
| 305 | + |
| 306 | +--- |
| 307 | + |
| 308 | +## **5. TensorQTL cis-sQTL** |
| 309 | + |
| 310 | +This step uses: |
| 311 | + |
| 312 | +* PLINK genotype files |
| 313 | +* Splicing BED phenotype matrix |
| 314 | +* Covariates |
| 315 | +* PhenotypeGroups |
| 316 | + |
| 317 | +Outputs: |
| 318 | + |
| 319 | +* cis-sQTL nominal and permutation results |
| 320 | +* Per-cluster association statistics |
| 321 | + |
| 322 | +Downstream fine-mapping can be performed using the same SuSiE workflow if desired. |
| 323 | + |
| 324 | +--- |
| 325 | + |
| 326 | +# **Acknowledgements** |
| 327 | + |
| 328 | +This pipeline builds upon extensive work by the **Stephen Montgomery Lab** at Stanford University. |
| 329 | +Special thanks to: |
| 330 | + |
| 331 | +* **Evin Padhi** |
| 332 | +* **Jon Nguyen** |
| 333 | + |
| 334 | +for developing foundational versions of many scripts and workflows used in this analysis. |
| 335 | + |
| 336 | +Additional integration, optimization, and workflow migration were performed by the All of Us DRC Multiomics and Pipeline Development teams as part of the WARP workflow suite. |
| 337 | + |
0 commit comments