Commit c9c4fa3

Lk aou admixture unsupervised (#1759)
Added admixture wdls and docs and cmrg readme
1 parent c35a85a commit c9c4fa3

9 files changed

Lines changed: 431 additions & 2 deletions

.dockstore.yml

Lines changed: 8 additions & 1 deletion
@@ -30,6 +30,10 @@ workflows:
     subclass: WDL
     primaryDescriptorPath: /all_of_us/rna_seq/CalculatePhenotypeGroups.wdl

+  - name: convert_vcf_to_plink_bed
+    subclass: WDL
+    primaryDescriptorPath: /all_of_us/admixture/convert_vcf_to_plink_bed.wdl
+
   - name: CramToUnmappedBams
     subclass: WDL
     primaryDescriptorPath: /pipelines/wdl/reprocessing/cram_to_unmapped_bams/CramToUnmappedBams.wdl
@@ -143,7 +147,10 @@ workflows:
   - name: run_preprocess_admixture_est_rye
     subclass: WDL
     primaryDescriptorPath: /all_of_us/admixture/run_preprocess_admixture_est_rye.wdl
-
+
+  - name: run_admixture
+    subclass: WDL
+    primaryDescriptorPath: /all_of_us/admixture/run_admixture.wdl
   - name: run_admixture_est_rye
     subclass: WDL
     primaryDescriptorPath: /all_of_us/admixture/run_admixture_est_rye.wdl

all_of_us/admixture/README.md

Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
# Admixture & Ancestry Estimation Workflows (WDL)

This directory contains three WDL workflows used for preparing genotype data and estimating genetic ancestry proportions. These workflows are designed to operate on genotype data derived from a **combined reference panel of 1000 Genomes Project (1KG) and Human Genome Diversity Project (HGDP)** samples.

## Background & Interpretation Notes

The 1KG + HGDP reference panel provides broad global coverage but has **known limitations**, including uneven population representation and limited resolution for certain ancestries. As a result:

* Some ancestry components may be **overestimated or underestimated**, depending on the method and reference composition.
* Outputs from these workflows should be **interpreted cautiously** and in the context of these reference limitations.
* Results are best suited for **population-level summaries** rather than precise individual-level ancestry inference.

To address specific biases observed with Rye-based admixture estimation, an alternative ADMIXTURE-based workflow is included and was used to generate a **pruned reference for All of Us (AoU) admixture analyses**.

---

## Workflow 1: `convert_vcf_to_plink_bed`

### Description

Converts a merged VCF file into PLINK binary format (`.bed/.bim/.fam`) for downstream ancestry and admixture analyses.

### Required Inputs

* `prefix` (String): Prefix for output files
* `merged_vcf_shards` (File): Merged VCF file
* `merged_vcf_shards_idx` (File): Index file for the merged VCF

### Tasks & Software

* **Task:** `convert_vcf_to_plink_bed`
* **Software:**

  * PLINK (`/app/bin/plink`)
  * Docker image: `mussmann/admixpipe:3.0`

PLINK is run with:

* `--vcf` input
* `--make-bed` to generate binary PLINK files
* `--double-id` and `--allow-extra-chr` for compatibility with reference data

### Outputs

* `*.bed`: PLINK binary genotype file
* `*.bim`: Variant information file
* `*.fam`: Sample metadata file

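As a rough illustration of the conversion step, the following Python sketch assembles the same PLINK argument list the task runs and derives the three expected output paths. The helper names `plink_command` and `expected_outputs` and the example file names are hypothetical and not part of this repository; the binary path and flags are the ones listed above.

```python
from pathlib import Path

# Path of the PLINK binary inside the mussmann/admixpipe:3.0 image, per this README.
PLINK_BIN = "/app/bin/plink"


def plink_command(vcf: str, prefix: str) -> list[str]:
    """Build the PLINK argument list described above (hypothetical helper)."""
    return [
        PLINK_BIN,
        "--double-id",        # use the VCF sample ID as both family and individual ID
        "--vcf", vcf,
        "--make-bed",         # write <prefix>.bed/.bim/.fam
        "--allow-extra-chr",  # accept non-standard contig names
        "--out", prefix,
    ]


def expected_outputs(prefix: str) -> list[Path]:
    """Return the three files PLINK writes for --make-bed."""
    return [Path(f"{prefix}.{ext}") for ext in ("bed", "bim", "fam")]


if __name__ == "__main__":
    # Example file names are placeholders.
    print(" ".join(plink_command("merged.vcf.gz", "aou_ref")))
    print([p.name for p in expected_outputs("aou_ref")])
```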
---

## Workflow 2: `run_admixture_est_rye`

### Description

Generates ancestry (admixture) estimates using the **Rye** tool, based on PCA-derived eigenvalues and eigenvectors. Outputs are analogous to ADMIXTURE `Q` and `fam` files.

This workflow is useful for fast, PCA-informed ancestry estimation but was observed to **overestimate certain populations** when using the AoU reference.

### Required Inputs

* `eigenvalues_file` (File): PCA eigenvalues
* `eigenvec_file` (File): PCA eigenvectors
* `pop2group_file` (File): Population-to-group mapping
* `prefix` (String): Output file prefix
* `pcs` (Int, default = 20): Number of principal components
* `rounds` (Int, default = 200): Optimization rounds
* `iter` (Int, default = 100): Optimization iterations
* `cpus` (Int, default = 16)
* `docker_image` (String): Rye Docker image

### Tasks & Software

* **Task:** `run_rye`
* **Software:**

  * Rye (`rye.R`)
  * Docker image:
    `us-central1-docker.pkg.dev/broad-dsde-methods/aou-auxiliary/rye-admixture-estimation-tool:v1.0`

### Outputs

* `*.Q`: Admixture proportion file
* `*.fam`: Sample metadata file

Output files are renamed for consistency:

```
<prefix>-<pcs>.Q
<prefix>-<pcs>.fam
```

---

## Workflow 3: `run_admixture`

### Description

Runs the **ADMIXTURE** software directly on PLINK binary files to estimate ancestry proportions.

This workflow was used to generate a **pruned AoU admixture reference**, specifically because **Rye-based estimates were found to overestimate certain populations**, while ADMIXTURE tends to **underestimate those same groups**, providing a complementary and more conservative estimate.

### Required Inputs

* `bed` (File): PLINK `.bed` file
* `bim` (File): PLINK `.bim` file
* `fam` (File): PLINK `.fam` file
* `K_in` (Int, optional): Number of ancestry components (default = 6)
* `num_cpus_in` (Int, optional): Number of CPU threads (default = 4)
* `mem_gb` (Int, default = 120): Memory allocation

### Tasks & Software

* **Task:** `run_admixture`
* **Software:**

  * ADMIXTURE (`/app/bin/admixture`)
  * Docker image: `mussmann/admixpipe:3.0`

ADMIXTURE is executed with multithreading (`-j`) for performance.

### Outputs

* `*.Q`: Ancestry proportion matrix
* `*.P`: Population allele frequency matrix

Output naming follows ADMIXTURE conventions:

```
<basename>.<K>.Q
<basename>.<K>.P
```
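
To illustrate the population-level use recommended in the interpretation notes, this sketch pairs each row of an ADMIXTURE `.Q` file with the sample ID from the matching `.fam` file and averages each ancestry component across samples. It assumes the standard layouts: `.Q` holds one whitespace-separated row of K proportions per sample, in `.fam` order, and the second `.fam` column is the individual ID. The function names are illustrative and not part of this repository.

```python
from pathlib import Path


def read_q(q_path: Path) -> list[list[float]]:
    """Read a .Q file: one row per sample, K whitespace-separated proportions."""
    return [[float(x) for x in line.split()]
            for line in q_path.read_text().splitlines() if line.strip()]


def read_fam_ids(fam_path: Path) -> list[str]:
    """Return the individual ID (second column) of each .fam row."""
    return [line.split()[1]
            for line in fam_path.read_text().splitlines() if line.strip()]


def label_rows(fam_path: Path, q_path: Path) -> dict[str, list[float]]:
    """Attach sample IDs to Q rows, relying on the shared row order."""
    ids, rows = read_fam_ids(fam_path), read_q(q_path)
    if len(ids) != len(rows):
        raise ValueError("sample count differs between .fam and .Q")
    return dict(zip(ids, rows))


def mean_proportions(q_rows: list[list[float]]) -> list[float]:
    """Average each ancestry component over all samples."""
    n = len(q_rows)
    return [sum(col) / n for col in zip(*q_rows)]
```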

---

## Summary

Together, these workflows support:

1. Conversion of VCF data to PLINK format
2. PCA-based admixture estimation using Rye
3. Direct admixture estimation using ADMIXTURE for reference refinement

When interpreting results, users should consider:

* Reference panel composition (1KG + HGDP)
* Method-specific biases (Rye vs. ADMIXTURE)
* The intended use case (population-level inference vs. individual ancestry)
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
# aou_9.0.0
2025-11-24 (Date of Last Commit)

* Added convert_vcf_to_plink wdl
* Added pipeline_version string
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
version 1.0


workflow convert_vcf_to_plink_bed {
  input {
    String prefix
    File merged_vcf_shards
    File merged_vcf_shards_idx
  }
  String pipeline_version = "aou_9.0.1"

  call convert_vcf_to_plink_bed {
    input:
      prefix=prefix,
      vcf=merged_vcf_shards,
      vcf_idx=merged_vcf_shards_idx
  }

  output {
    File bed = convert_vcf_to_plink_bed.bed
    File bim = convert_vcf_to_plink_bed.bim
    File fam = convert_vcf_to_plink_bed.fam
  }
}
task convert_vcf_to_plink_bed {
  input {
    String prefix
    File vcf
    File vcf_idx
  }
  parameter_meta {
  }
  command <<<
    set -e
    /app/bin/plink --double-id --vcf ~{vcf} --make-bed --allow-extra-chr --out ~{prefix}
  >>>

  output {
    File bed = "~{prefix}.bed"
    File bim = "~{prefix}.bim"
    File fam = "~{prefix}.fam"
  }

  runtime {
    docker: "mussmann/admixpipe:3.0"
    memory: "31 GB"
    cpu: "4"
    disks: "local-disk 500 HDD"
  }
}
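
Cromwell-style runners take workflow inputs as a JSON object keyed by `<workflow>.<input>`. The sketch below builds such an object for the three inputs declared in the workflow above. The bucket paths are placeholders and `build_inputs` is a hypothetical helper, not something shipped with this commit.

```python
import json


def build_inputs(prefix: str, vcf: str, vcf_idx: str) -> dict[str, str]:
    """Map the workflow's declared inputs to fully qualified JSON keys."""
    wf = "convert_vcf_to_plink_bed"
    return {
        f"{wf}.prefix": prefix,
        f"{wf}.merged_vcf_shards": vcf,
        f"{wf}.merged_vcf_shards_idx": vcf_idx,
    }


if __name__ == "__main__":
    inputs = build_inputs(
        "aou_ref",
        "gs://example-bucket/merged.vcf.gz",      # placeholder path
        "gs://example-bucket/merged.vcf.gz.tbi",  # placeholder path
    )
    print(json.dumps(inputs, indent=2))
```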
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# aou_9.0.1
2026-01-29 (Date of Last Commit)

* Parameterized memory for local ancestry reference

# aou_9.0.0
2025-11-24 (Date of Last Commit)

* Added run admixture wdl for aou 9.0.0 processing to allow for unsupervised clustering
* Added pipeline_version as an input string
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
version 1.0

# For more information on the files and the contents, see: http://dalexander.github.io/admixture/admixture-manual.pdf
workflow run_admixture {
  input {
    File bed
    File bim
    File fam
  }

  String pipeline_version="aou_9.0.2"

  call run_admixture {
    input:
      bed=bed,
      bim=bim,
      fam=fam
  }

  output {
    File admixture_Q = run_admixture.Q
    File admixture_P = run_admixture.P
  }
}
task run_admixture {
  input {
    File bed
    File bim
    File fam
    Int? K_in
    Int? num_cpus_in
    Int mem_gb = 120
  }
  Int K = select_first([K_in, 6])
  Int num_cpus = select_first([num_cpus_in, 4])
  String basename = basename(bed, ".bed")

  command <<<
    set -e
    /app/bin/admixture ~{bed} ~{K} -j~{num_cpus}

    ls -la

  >>>

  output {
    File Q = "~{basename}.~{K}.Q"
    File P = "~{basename}.~{K}.P"
  }

  runtime {
    docker: "mussmann/admixpipe:3.0"
    memory: mem_gb + " GB" # Was 31 GB originally, increased for local ancestry
    cpu: "~{num_cpus}"
    disks: "local-disk 500 HDD"
  }
}
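
ADMIXTURE writes its results to the working directory, named after the input file's basename and K, which is why the task builds its outputs from `basename(bed, ".bed")` rather than from the full input path. The Python sketch below mirrors that resolution, including the `select_first` default for K. `resolve_outputs` is a hypothetical helper for illustration, not part of the workflow.

```python
import os
from typing import Optional


def resolve_outputs(bed: str, k_in: Optional[int] = None) -> tuple[str, str]:
    """Mirror the task's output naming: <basename>.<K>.Q and <basename>.<K>.P."""
    # Same fallback as select_first([K_in, 6]) in the task.
    k = k_in if k_in is not None else 6
    base = os.path.basename(bed)
    # WDL basename(bed, ".bed") strips the suffix when it is present.
    if base.endswith(".bed"):
        base = base[: -len(".bed")]
    return f"{base}.{k}.Q", f"{base}.{k}.P"
```

For example, an input localized at any directory depth still yields names relative to the working directory, so the `output` section can find the files without knowing where the `.bed` file was staged.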

all_of_us/cmrg/FixItFelixAndVariantCall.changelog.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+# aou_9.0.1
+2026-01-29 (Date of Last Commit)
+* Added `set -euo pipefail` to FixItFelix
+
 # aou_9.0.0
 2025-08-11 (Date of Last Commit)

all_of_us/cmrg/FixItFelixAndVariantCall.wdl

Lines changed: 4 additions & 1 deletion
@@ -26,7 +26,7 @@ workflow FixItFelixAndVariantCall {
   }

   Int original_cram_size = ceil(size(cram_file, "GB"))
-  String pipeline_version = "aou_9.0.0"
+  String pipeline_version = "aou_9.0.1"

   call subset_cram {
     input:
@@ -106,6 +106,7 @@ task subset_cram {
   String output_index = sample_name + ".bai"

   command {
+    set -euo pipefail
     ~{gatk_path} --java-options "-Xmx2G" \
       PrintReads \
       -R ~{ref_fasta} \
@@ -148,6 +149,7 @@ task FixItFelix {
   }

   command <<<
+    set -euo pipefail
     bam=~{reads}
     bed=~{intervals}
     ref=~{masked_ref_fasta}
@@ -250,6 +252,7 @@ task call_variants {
   String output_filename = output_name + (if generate_gvcf then ".g.vcf.gz" else ".vcf.gz")

   command <<<
+    set -euo pipefail
     ~{gatk_path} --java-options "-Xmx3G" \
       HaplotypeCaller \
       -R ~{ref_fasta} \
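
The effect of the added `set -euo pipefail` line is easiest to see on a pipeline whose first stage fails. The sketch below assumes `bash` is available on the PATH; it runs the same pipeline with and without the prelude and returns the shell's exit status. It is an illustration only and not part of the workflow.

```python
import subprocess


def pipeline_exit_code(use_pipefail: bool) -> int:
    """Run `false | true` in bash and return the shell's exit status."""
    prelude = "set -euo pipefail; " if use_pipefail else ""
    result = subprocess.run(["bash", "-c", prelude + "false | true"])
    return result.returncode


if __name__ == "__main__":
    # Without the prelude only the last stage's status is reported.
    print(pipeline_exit_code(False))
    # With the prelude the failing first stage makes the whole script fail.
    print(pipeline_exit_code(True))
```

In short, `-e` stops the script at the first failing command, `-u` treats unset variables as errors, and `pipefail` makes a pipeline fail if any of its stages fails, so a task no longer reports success after an intermediate failure.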
