|
| 1 | +# GenoTools Reference Panel Preparation |
| 2 | + |
| 3 | +## Overview |
| 4 | +This documentation provides a detailed description of how to prepare a reference panel for use in the `GenoTools` ancestry module. Preparing a custom reference panel will allow for the prediction of samples within ancestry groups outside of the 11 included in the provided reference panel. |
| 5 | + |
| 6 | +--- |
| 7 | + |
| 8 | +### Format |
| 9 | +In order to run the proper pruning steps, as well as use the custom reference panel in the ancestry module, please make sure the reference genotypes are in PLINK1.9 binary format (.bed/.bim/.fam): https://www.cog-genomics.org/plink/1.9/formats. In addition, you will need to create a tab-delimited file containing the sample IDs as well as the reference ancestry groups (see Preparing the `--ref_labels` File section). |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +### Pruning Script |
| 14 | +Please run the following script in order to prune the custom reference panel for use in the `GenoTools` ancestry model: |
| 15 | + |
| 16 | +```python |
| 17 | +import subprocess |
| 18 | +import sys |
| 19 | +import os |
| 20 | +import shutil |
| 21 | +import pandas as pd |
| 22 | +import numpy as np |
| 23 | + |
| 24 | +from genotools.dependencies import check_plink, check_plink2 |
| 25 | +from genotools.utils import shell_do |
| 26 | + |
| 27 | +def prune_ref_panel(ref_path, ld_path): |
| 28 | + prune_path = f'{ref_path}_maf_geno_hwe' |
| 29 | + final_path = f'{ref_path}_pruned' |
| 30 | + |
| 31 | + # read bim and get palindromes |
| 32 | + bim = pd.read_csv(f'{ref_path}.bim', sep='\s+', header=None) |
| 33 | + bim.columns = ['chr','id','kb','pos','alt','ref'] |
| 34 | + palindromes = bim.loc[((bim.alt == 'A') & (bim.ref == 'T')) | ((bim.alt == 'T') & (bim.ref == 'A')) | ((bim.alt == 'C') & (bim.ref == 'G')) | ((bim.alt == 'G') & (bim.ref == 'C'))] |
| 35 | + palindromes['id'].to_csv(f'{ref_path}_palindromes.snplist', header=False, index=False, sep='\t') |
| 36 | + |
| 37 | + |
| 38 | + # maf, geno, hwe command |
| 39 | + plink_cmd1 = f'{plink_exec} --bfile {ref_path}\ |
| 40 | + --maf 0.05\ |
| 41 | + --geno 0.01\ |
| 42 | + --hwe 0.0001\ |
| 43 | + --indep-pairwise 1000 10 0.02\ |
| 44 | + --exclude {geno_path}_palindromes.snplist\ |
| 45 | + --autosome\ |
| 46 | + --allow-extra-chr\ |
| 47 | + --make-bed\ |
| 48 | + --out {prune_path}' |
| 49 | + |
| 50 | + shell_do(plink_cmd1) |
| 51 | + |
| 52 | + # ld command |
| 53 | + plink_cmd2 = f'{plink2_exec} --bfile {prune_path}\ |
| 54 | + --exclude range {ld_path}\ |
| 55 | + --autosome\ |
| 56 | + --allow-extra-chr\ |
| 57 | + --make-bed\ |
| 58 | + --out {final_path}' |
| 59 | + |
| 60 | + shell_do(plink2_cmd2) |
| 61 | + |
| 62 | +if __name__ == '__main__': |
| 63 | + ref_path = '/path/to/reference/reference/genotypes' |
| 64 | + ld_path = '/path/to/ld/exclusion_regions.txt' |
| 65 | + |
| 66 | + prune_ref_panel(ref_path, ld_path) |
| 67 | +``` |
| 68 | + |
| 69 | +### Parameters |
| 70 | +- **ref_path**: Path to PLINK 1.9 format reference panel genotype file (before the *.bed/bim/fam). |
| 71 | +- **ld_path**: Path to exclusion regions file in tab-delimited .txt format (see LD Exclusion Region File Format section). |
| 72 | + |
| 73 | +--- |
| 74 | + |
| 75 | +### Description of Pruning Steps |
| 76 | +1. Find palindrome SNPs from .bim file, and write to ```{ref_path}_palindromes.snplist```. |
| 77 | +2. Prune reference genotypes for ```--maf```, ```--geno```, ```--hwe```, and linkage disequilibrium, as well as removing the previously identified palindrome SNPs. |
| 78 | +3. Exclude high-LD regions from reference genotypes. |
| 79 | + |
| 80 | +--- |
| 81 | + |
| 82 | +### LD Exclusion Region File Format (hg38) |
| 83 | +Copy this to a .txt file and pass to the ```prune_ref_panel()``` function in the provided Python script: |
| 84 | +``` |
| 85 | +6 24999772 33532223 r1 |
| 86 | +6 26726947 26726981 r2 |
| 87 | +8 8142478 12142491 r3 |
| 88 | +20 28534739 28534795 r4 |
| 89 | +11 50809938 50809993 r5 |
| 90 | +2 94531009 94531062 r6 |
| 91 | +11 54525074 55026708 r7 |
| 92 | +12 34504916 34504974 r8 |
| 93 | +11 55104000 55104074 r9 |
| 94 | +``` |
| 95 | + |
| 96 | +--- |
| 97 | + |
| 98 | +### Preparing the `--ref_labels` File |
| 99 | +In order to render predictions using the new reference panel, the `GenoTools` ancestry module needs information on which ancestry group each sample belongs to. The file passed to the `--ref_labels` flag when running the ancestry module must be tab-delimited, and contain three columns corresponding to `FID`, `IID`, and `label`. Please note this file should not have a header row and the IDs must match those in the `.fam` file for the process to run correctly. For more information on how to use the `--ref_labels` flag, please see the `train_new_model.md` documentation. Below is an example of how the `--ref_labels` file should look: |
| 100 | +``` |
| 101 | +FID1 IID1 ANCESTRY1 |
| 102 | +FID2 IID2 ANCESTRY2 |
| 103 | +FID3 IID3 ANCESTRY2 |
| 104 | +FID4 IID4 ANCESTRY3 |
| 105 | +``` |
0 commit comments