# Admixture & Ancestry Estimation Workflows (WDL)

This directory contains three WDL workflows used for preparing genotype data and estimating genetic ancestry proportions. These workflows are designed to operate on genotype data derived from a **combined reference panel of 1000 Genomes Project (1KG) and Human Genome Diversity Project (HGDP)** samples.

## Background & Interpretation Notes

The 1KG + HGDP reference panel provides broad global coverage but has **known limitations**, including uneven population representation and limited resolution for certain ancestries. As a result:

* Some ancestry components may be **overestimated or underestimated**, depending on the method and reference composition.
* Outputs from these workflows should be **interpreted cautiously** and in the context of these reference limitations.
* Results are best suited for **population-level summaries** rather than precise individual-level ancestry inference.

To address specific biases observed with Rye-based admixture estimation, an alternative ADMIXTURE-based workflow is included and was used to generate a **pruned reference for All of Us (AoU) admixture analyses**.

---

## Workflow 1: `convert_vcf_to_plink_bed`

### Description

Converts a merged VCF file into PLINK binary format (`.bed/.bim/.fam`) for downstream ancestry and admixture analyses.

### Required Inputs

* `prefix` (String): Prefix for output files
* `merged_vcf_shards` (File): Merged VCF file
* `merged_vcf_shards_idx` (File): Index file for the merged VCF

### Tasks & Software

* **Task:** `convert_vcf_to_plink_bed`
* **Software:**

  * PLINK (`/app/bin/plink`)
  * Docker image: `mussmann/admixpipe:3.0`

PLINK is run with:

* `--vcf` input
* `--make-bed` to generate binary PLINK files
* `--double-id` and `--allow-extra-chr` for compatibility with reference data
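
The flags above can be sketched as a command line. This is a minimal illustration, not the task's actual command block: it assumes the output prefix is passed through PLINK's `--out` option, and the file names `merged.vcf.gz` and `aou_ref` are hypothetical. The command is only built and printed, never executed.

```python
from pathlib import Path
from typing import List


def plink_bed_command(vcf: str, prefix: str, plink: str = "/app/bin/plink") -> List[str]:
    """Build the argv for a VCF -> PLINK binary conversion (not executed here)."""
    return [
        plink,
        "--vcf", vcf,          # input VCF
        "--make-bed",          # write .bed/.bim/.fam
        "--double-id",         # use the VCF sample ID as both family and individual ID
        "--allow-extra-chr",   # accept non-standard contig names
        "--out", prefix,       # assumption: prefix is forwarded via --out
    ]


def expected_outputs(prefix: str) -> List[Path]:
    """The three files that --make-bed writes for a given prefix."""
    return [Path(f"{prefix}.{ext}") for ext in ("bed", "bim", "fam")]


# Hypothetical inputs, for illustration only.
cmd = plink_bed_command("merged.vcf.gz", "aou_ref")
print(" ".join(cmd))
print([str(p) for p in expected_outputs("aou_ref")])
```

Building the argument list separately from running it makes it easy to check which flags a run would use before submitting the workflow.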

### Outputs

* `*.bed`: PLINK binary genotype file
* `*.bim`: Variant information file
* `*.fam`: Sample metadata file

---

## Workflow 2: `run_admixture_est_rye`

### Description

Generates ancestry (admixture) estimates using the **Rye** tool, based on PCA-derived eigenvalues and eigenvectors. Outputs are analogous to an ADMIXTURE `.Q` file and a PLINK `.fam` file.

This workflow is useful for fast, PCA-informed ancestry estimation but was observed to **overestimate certain populations** when using the AoU reference.

### Required Inputs

* `eigenvalues_file` (File): PCA eigenvalues
* `eigenvec_file` (File): PCA eigenvectors
* `pop2group_file` (File): Population-to-group mapping
* `prefix` (String): Output file prefix
* `pcs` (Int, default = 20): Number of principal components
* `rounds` (Int, default = 200): Optimization rounds
* `iter` (Int, default = 100): Optimization iterations
* `cpus` (Int, default = 16): Number of CPUs
* `docker_image` (String): Rye Docker image

### Tasks & Software

* **Task:** `run_rye`
* **Software:**

  * Rye (`rye.R`)
  * Docker image: `us-central1-docker.pkg.dev/broad-dsde-methods/aou-auxiliary/rye-admixture-estimation-tool:v1.0`

### Outputs

* `*.Q`: Admixture proportion file
* `*.fam`: Sample metadata file

Output files are renamed for consistency:

```
<prefix>-<pcs>.Q
<prefix>-<pcs>.fam
```
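
The naming pattern above can be expressed as a pair of small helpers. This is a sketch under the assumption that the pattern is exactly `<prefix>-<pcs>.<ext>`. The function names and the `aou-ref` prefix are hypothetical and do not come from the workflow itself.

```python
from typing import Tuple


def rye_output_names(prefix: str, pcs: int = 20) -> Tuple[str, str]:
    """Return the (.Q, .fam) file names for a given prefix and PC count."""
    return f"{prefix}-{pcs}.Q", f"{prefix}-{pcs}.fam"


def parse_rye_output(name: str) -> Tuple[str, int, str]:
    """Split '<prefix>-<pcs>.<ext>' back into its parts.

    Splitting from the right keeps prefixes that themselves contain
    hyphens or dots intact.
    """
    stem, ext = name.rsplit(".", 1)
    prefix, pcs = stem.rsplit("-", 1)
    return prefix, int(pcs), ext


q_name, fam_name = rye_output_names("aou-ref", pcs=20)
print(q_name, fam_name)
print(parse_rye_output(q_name))
```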

---

## Workflow 3: `run_admixture`

### Description

Runs the **ADMIXTURE** software directly on PLINK binary files to estimate ancestry proportions.

This workflow was used to generate a **pruned AoU admixture reference**. It was added because **Rye-based estimates were found to overestimate certain populations**, whereas ADMIXTURE tends to **underestimate those same groups**. The two methods therefore give complementary estimates, with ADMIXTURE being the more conservative of the two.

### Required Inputs

* `bed` (File): PLINK `.bed` file
* `bim` (File): PLINK `.bim` file
* `fam` (File): PLINK `.fam` file
* `K_in` (Int, optional, default = 6): Number of ancestry components
* `num_cpus_in` (Int, optional, default = 4): Number of CPU threads
* `mem_gb` (Int, default = 120): Memory allocation in GB

### Tasks & Software

* **Task:** `run_admixture`
* **Software:**

  * ADMIXTURE (`/app/bin/admixture`)
  * Docker image: `mussmann/admixpipe:3.0`

ADMIXTURE is executed with multithreading (`-j`) for performance.
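
As an illustration of the multithreaded invocation, the sketch below builds an ADMIXTURE argument list of the form `admixture -j<threads> <input.bed> <K>`. The defaults mirror `K_in` and `num_cpus_in` above, while the input name `aou_ref.bed` is hypothetical and the task's real command block may differ. Nothing is executed.

```python
from typing import List


def admixture_command(
    bed: str,
    k: int = 6,
    threads: int = 4,
    admixture: str = "/app/bin/admixture",
) -> List[str]:
    """Build the argv for an ADMIXTURE run (not executed here)."""
    if k < 1 or threads < 1:
        raise ValueError("k and threads must be positive integers")
    # ADMIXTURE takes the thread count glued to the flag, e.g. -j4,
    # followed by the input file and the number of components K.
    return [admixture, f"-j{threads}", bed, str(k)]


cmd = admixture_command("aou_ref.bed")  # hypothetical input file
print(" ".join(cmd))
```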

### Outputs

* `*.Q`: Ancestry proportion matrix
* `*.P`: Population allele frequency matrix

Output naming follows ADMIXTURE conventions:

```
<basename>.<K>.Q
<basename>.<K>.P
```
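
A `.Q` file has no sample IDs: it is a whitespace-delimited matrix with one row per sample, in the same order as the input `.fam` file, and K columns that sum to 1. The following sketch pairs each row with its sample ID. The helper names and the toy values are made up for illustration and are not workflow output.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def read_q(path: Path) -> List[List[float]]:
    """Read a .Q matrix: one row per sample, K proportions per row."""
    rows = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            rows.append([float(x) for x in line.split()])
    return rows


def read_fam_ids(path: Path) -> List[str]:
    """Read individual IDs (second column) from a PLINK .fam file."""
    ids = []
    for line in Path(path).read_text().splitlines():
        fields = line.split()
        if fields:
            ids.append(fields[1])
    return ids


def label_q(q_path: Path, fam_path: Path) -> Dict[str, List[float]]:
    """Pair .Q rows with .fam sample IDs, relying on matching row order."""
    q, ids = read_q(q_path), read_fam_ids(fam_path)
    if len(q) != len(ids):
        raise ValueError(f"{len(q)} rows in .Q but {len(ids)} samples in .fam")
    return dict(zip(ids, q))


# Toy data (K = 2, two samples), written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    q_file = Path(tmp) / "toy.2.Q"
    fam_file = Path(tmp) / "toy.fam"
    q_file.write_text("0.7 0.3\n0.1 0.9\n")
    fam_file.write_text("S1 S1 0 0 0 -9\nS2 S2 0 0 0 -9\n")
    labeled = label_q(q_file, fam_file)

for sample, props in labeled.items():
    print(sample, props)
```

The row-count check guards against pairing a `.Q` file with a `.fam` file from a different run, which would otherwise silently mislabel samples.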

---

## Summary

Together, these workflows support:

1. Conversion of VCF data to PLINK format
2. PCA-based admixture estimation using Rye
3. Direct admixture estimation using ADMIXTURE for reference refinement

When interpreting results, users should consider:

* Reference panel composition (1KG + HGDP)
* Method-specific biases (Rye vs. ADMIXTURE)
* The intended use case (population-level inference vs. individual ancestry)
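
One way to look at method-specific bias at the population level is to compare mean ancestry proportions per component between the two methods. The sketch below assumes both matrices cover the same samples and that their columns have already been aligned to the same ancestry groups. That alignment is not automatic, since the order of ADMIXTURE components is arbitrary. The numbers are toy values, not results from these workflows.

```python
from typing import List

Matrix = List[List[float]]


def mean_proportions(q: Matrix) -> List[float]:
    """Mean proportion of each ancestry component across all samples."""
    if not q:
        raise ValueError("empty Q matrix")
    n, k = len(q), len(q[0])
    return [sum(row[j] for row in q) / n for j in range(k)]


def method_difference(q_rye: Matrix, q_admixture: Matrix) -> List[float]:
    """Per-component difference in means (Rye minus ADMIXTURE).

    Positive values mean Rye assigns more of that component on average.
    Assumes columns are aligned between the two matrices.
    """
    rye, adm = mean_proportions(q_rye), mean_proportions(q_admixture)
    if len(rye) != len(adm):
        raise ValueError("matrices have different numbers of components")
    return [r - a for r, a in zip(rye, adm)]


# Toy matrices: two samples, two components.
toy_rye = [[0.8, 0.2], [0.6, 0.4]]
toy_admixture = [[0.7, 0.3], [0.5, 0.5]]

diff = method_difference(toy_rye, toy_admixture)
print([round(d, 3) for d in diff])
```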