Skip to content

Commit e66465e

Browse files
authored
Merge pull request #176 from dvitale199/docs
model training and ref panel prep docs
2 parents 564aa4b + d6f1ddf commit e66465e

File tree

2 files changed

+153
-0
lines changed

2 files changed

+153
-0
lines changed

docs/prep_reference_panel.md

+105
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,105 @@
1+
# GenoTools Reference Panel Preparation
2+
3+
## Overview
4+
This documentation provides a detailed description of how to prepare a reference panel for use in the `GenoTools` ancestry module. Preparing a custom reference panel will allow for the prediction of samples within ancestry groups outside of the 11 included in the provided reference panel.
5+
6+
---
7+
8+
### Format
9+
In order to run the proper pruning steps, as well as use the custom reference panel in the ancestry module, please make sure the reference genotypes are in PLINK1.9 binary format (.bed/.bim/.fam): https://www.cog-genomics.org/plink/1.9/formats. In addition, you will need to create a tab-delimited file containing the sample IDs as well as the reference ancestry groups (see Preparing the `--ref_labels` File section).
10+
11+
---
12+
13+
### Pruning Script
14+
Please run the following script in order to prune the custom reference panel for use in the `GenoTools` ancestry model:
15+
16+
```python
17+
import subprocess
18+
import sys
19+
import os
20+
import shutil
21+
import pandas as pd
22+
import numpy as np
23+
24+
from genotools.dependencies import check_plink, check_plink2
25+
from genotools.utils import shell_do
26+
27+
def prune_ref_panel(ref_path, ld_path):
28+
prune_path = f'{ref_path}_maf_geno_hwe'
29+
final_path = f'{ref_path}_pruned'
30+
31+
# read bim and get palindromes
32+
bim = pd.read_csv(f'{ref_path}.bim', sep='\s+', header=None)
33+
bim.columns = ['chr','id','kb','pos','alt','ref']
34+
palindromes = bim.loc[((bim.alt == 'A') & (bim.ref == 'T')) | ((bim.alt == 'T') & (bim.ref == 'A')) | ((bim.alt == 'C') & (bim.ref == 'G')) | ((bim.alt == 'G') & (bim.ref == 'C'))]
35+
palindromes['id'].to_csv(f'{ref_path}_palindromes.snplist', header=False, index=False, sep='\t')
36+
37+
38+
# maf, geno, hwe command
39+
plink_cmd1 = f'{plink_exec} --bfile {ref_path}\
40+
--maf 0.05\
41+
--geno 0.01\
42+
--hwe 0.0001\
43+
--indep-pairwise 1000 10 0.02\
44+
--exclude {geno_path}_palindromes.snplist\
45+
--autosome\
46+
--allow-extra-chr\
47+
--make-bed\
48+
--out {prune_path}'
49+
50+
shell_do(plink_cmd1)
51+
52+
# ld command
53+
plink_cmd2 = f'{plink2_exec} --bfile {prune_path}\
54+
--exclude range {ld_path}\
55+
--autosome\
56+
--allow-extra-chr\
57+
--make-bed\
58+
--out {final_path}'
59+
60+
shell_do(plink2_cmd2)
61+
62+
if __name__ == '__main__':
63+
ref_path = '/path/to/reference/reference/genotypes'
64+
ld_path = '/path/to/ld/exclusion_regions.txt'
65+
66+
prune_ref_panel(ref_path, ld_path)
67+
```
68+
69+
### Parameters
70+
- **ref_path**: Path to PLINK 1.9 format reference panel genotype file (before the *.bed/bim/fam).
71+
- **ld_path**: Path to exclusion regions file in tab-delimited .txt format (see LD Exclusion Region File Format section).
72+
73+
---
74+
75+
### Description of Pruning Steps
76+
1. Find palindrome SNPs from .bim file, and write to ```{ref_path}_palindromes.snplist```.
77+
2. Prune reference genotypes for ```--maf```, ```--geno```, ```--hwe```, and linkage disequilibrium, as well as removing the previously identified palindrome SNPs.
78+
3. Exclude high-LD regions from reference genotypes.
79+
80+
---
81+
82+
### LD Exclusion Region File Format (hg38)
83+
Copy this to a .txt file and pass to the ```prune_ref_panel()``` function in the provided Python script:
84+
```
85+
6 24999772 33532223 r1
86+
6 26726947 26726981 r2
87+
8 8142478 12142491 r3
88+
20 28534739 28534795 r4
89+
11 50809938 50809993 r5
90+
2 94531009 94531062 r6
91+
11 54525074 55026708 r7
92+
12 34504916 34504974 r8
93+
11 55104000 55104074 r9
94+
```
95+
96+
---
97+
98+
### Preparing the `--ref_labels` File
99+
In order to render predictions using the new reference panel, the `GenoTools` ancestry module needs information on which ancestry group each sample belongs to. The file passed to the `--ref_labels` flag when running the ancestry module must be tab-delimited, and contain three columns corresponding to `FID`, `IID`, and `label`. Please note this file should not have a header row and the IDs must match those in the `.fam` file for the process to run correctly. For more information on how to use the `--ref_labels` flag, please see the `train_new_model.md` documentation. Below is an example of how the `--ref_labels` file should look:
100+
```
101+
FID1 IID1 ANCESTRY1
102+
FID2 IID2 ANCESTRY2
103+
FID3 IID3 ANCESTRY2
104+
FID4 IID4 ANCESTRY3
105+
```

docs/train_new_model.md

+48
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,48 @@
1+
# GenoTools Train New Ancestry Prediction Models
2+
3+
## Overview
4+
This documentation provides a detailed description of how to train a new model with the `GenoTools` ancestry module. There are two main use cases for training a new model:
5+
1. When none of the available pretrained models are suited for your data but you would like to use the provided reference panel and
6+
2. When you want to train a model using a new reference panel with different ancestry groups than those available through the provided reference panel.
7+
For the second use case, please see the `prep_reference_panel.md` documentation on how to properly prepare that reference panel and labels file for use in the ancestry module.
8+
9+
---
10+
11+
### 1. Train New Model with the Provided Reference Panel
12+
To download the provided reference panel, you can run the following command:
13+
```
14+
genotools-download --ref 1kg_30x_hgdp_ashk_ref_panel
15+
```
16+
By default, this will be downloaded to ~/.genotools/ref/ref_panel, but can be downloaded to a location of choice with the --destination flag:
17+
```
18+
genotools-download --ref 1kg_30x_hgdp_ashk_ref_panel --destination /path/to/desired/download/location
19+
```
20+
21+
To train a new model using this downloaded reference panel, use the following command:
22+
```
23+
genotools \
24+
--pfile /path/to/genotypes/for/ancestry/prediction \
25+
--out /path/to/ancestry/prediction/output \
26+
--ancestry \
27+
--ref_panel /path/to/downloaded/reference/panel \
28+
--ref_labels /path/to/downloaded/reference/ancestry/labels \
29+
```
30+
This command will train a model and render predictions for the provided genotypes. Please note that currently model training takes ~5 hours so this process is best done on HPC with a batch job submission system.
31+
The model will be saved to: `{out}_umap_linearsvc_ancestry_model.pkl` and the SNPs used to train the model will be saved to `{out}_umap_linearsvc_ancestry_model.common_snps`. Please save these files and keep them in the same directory for future use on your genotypes!
32+
33+
---
34+
35+
### 2. Train New Model with a Custom Reference Panel
36+
Once again, please see the `prep_reference_panel.md` documentation for instructions on how to properly prep your reference panel.
37+
38+
To train a new model using the custom reference panel, use the following command:
39+
```
40+
genotools \
41+
--pfile /path/to/genotypes/for/ancestry/prediction \
42+
--out /path/to/ancestry/prediction/output \
43+
--ancestry \
44+
--ref_panel /path/to/created/reference/panel \
45+
--ref_labels /path/to/created/reference/ancestry/labels \
46+
```
47+
This command will train a model and render predictions for the provided genotypes. Please note that currently model training takes ~5 hours so this process is best done on HPC with a batch job submission system.
48+
The model will be saved to: `{out}_umap_linearsvc_ancestry_model.pkl` and the SNPs used to train the model will be saved to `{out}_umap_linearsvc_ancestry_model.common_snps`. Please save these files and keep them in the same directory for future use on your genotypes!

0 commit comments

Comments
 (0)