1 | 1 | ## HESS (Heritability Estimation from Summary Statistics) |
2 | 2 |
3 | | -HESS estimates the amount of variance in trait explained by typed SNPs at |
4 | | -each single locus on the genome (local SNP-heritability) from GWAS summary |
5 | | -statistics, while accounting for linkage disequilibrium (LD). |
6 | | - |
7 | | ---- |
8 | | - |
9 | | -#### Releases |
10 | | - |
11 | | -[version 0.3-beta](https://github.com/huwenboshi/hess/releases/tag/v0.3-beta) |
12 | | - |
13 | | ---- |
14 | | - |
15 | | -#### Software requirement |
16 | | - |
17 | | -HESS requires [NumPy](http://www.numpy.org/) and |
18 | | -[Python 2.7](https://www.python.org/download/releases/2.7/). |
19 | | -We recommend using [NumPy with Intel MKL]( |
20 | | -https://software.intel.com/en-us/articles/numpyscipy-with-intel-mkl) |
21 | | -for maximum speed. |
22 | | - |
23 | | ---- |
24 | | - |
25 | | -#### <a name="input_file_format"></a> Input file format |
26 | | - |
27 | | -HESS requires as input |
28 | | -(1) GWAS summary statistics |
29 | | -(2) reference panel matching the GWAS population |
30 | | -(3) bed files specifying start and end positions of each locus. |
31 | | - |
32 | | -###### Summary statistics |
33 | | - |
34 | | -**To improve computational efficiency and parallelizability, HESS requires |
35 | | -that users split summary statistics into chromosomes**. For each SNP, HESS |
36 | | -requires 6 fields (in the listed order): (1) rs ID (2) position |
37 | | -(3) reference allele (4) alternative allele |
38 | | -(5) Z-score (6) sample size. HESS internally filters out strand-ambiguous |
39 | | -SNPs and flips signs of Z-scores based on alleles in the reference panel. |
40 | | -However, we highly recommend that users be aware of these details. The |
41 | | -following is an example of a summary statistics file. |
42 | | - |
43 | | -``` |
44 | | -rsID pos A0 A1 Z-score N |
45 | | -rs1000 29321 G A -1.6434 89834 |
46 | | -rs1001 29478 T C -0.0152 91021 |
47 | | -rs1002 30500 G A 0.7238 95831 |
48 | | -``` |
49 | | - |
50 | | -###### Input checklist |
51 | | - |
52 | | -Although HESS provides functionality to filter and sort SNPs, we recommend |
53 | | -that users go through the following checklist before applying HESS. |
54 | | - |
55 | | -1. Make sure that the coordinate of SNP positions in the summary |
56 | | -statistics file matches the reference panel (NCBI b37). |
57 | | -2. Make sure that strand-ambiguous SNPs (SNPs with alleles A/T or C/G) |
58 | | -are removed. |
59 | | -3. Make sure that summary statistics are split into chromosomes and |
60 | | -that SNPs are sorted by their positions. |
61 | | - |
62 | | -###### Reference panel |
63 | | - |
64 | | -1000 Genomes Project (phase 3) reference panel for SNPs with MAF > 5% in the |
65 | | -EUR population can be downloaded |
66 | | -[here](https://drive.google.com/open?id=0B0OmLzMQAvWqc3FPcVRDWkdvc2c). |
67 | | - |
68 | | -###### Partition file (bed format) |
69 | | - |
70 | | -Can be downloaded [here](https://bitbucket.org/nygcresearch/ldetect-data/src). |
71 | | - |
72 | | ---- |
73 | | - |
74 | | -#### Pipeline |
75 | | - |
76 | | -HESS estimates local heritability in 2 steps. In step 1, HESS computes |
77 | | -the eigenvalues of LD matrices, and the squared projections of GWAS effect |
78 | | -size vector onto the eigenvectors of LD matrices. In step 2, HESS computes |
79 | | -local SNP heritability estimates and their standard errors, using results |
80 | | -from step 1. |
81 | | - |
82 | | -###### Step 1 - compute eigenvalues and projections |
83 | | - |
84 | | -In this step, HESS computes the eigenvalues of LD matrices, and the squared |
85 | | -projections of GWAS effect size vector onto the eigenvectors of LD matrices. |
86 | | -The following code snippet illustrates the 1st step of HESS. |
87 | | - |
88 | | -```{r, engine='sh', count_lines} |
89 | | -# this for loop can be parallelized, i.e. one CPU for each chromosome |
90 | | -for i in $(seq 22) |
91 | | -do |
92 | | - python hess.py \ |
93 | | - --chrom $i \ |
94 | | - --h2g zscore.chr"$i" \ |
95 | | - --reference-panel refpanel_genotype_chr"$i".gz \ |
96 | | - --legend-file refpanel_legend_chr"$i".gz \ |
97 | | - --partition-file partition_chr"$i".bed \ |
98 | | - --out step1 |
99 | | -done |
100 | | -``` |
101 | | - |
102 | | -In the command above, `--chrom` specifies the chromosome number; |
103 | | -`--zscore-file` specifies the summary statistics for SNPs in the |
104 | | -corresponding chromosome; `--reference-panel` specifies the genotype file |
105 | | -for the reference panel; `--legend-file` specifies the legend file for the |
106 | | -reference panel; `--partition-file` specifies start and end positions |
107 | | -of the loci; `--output-file-step1` specifies the prefix of the output for step 1. |
108 | | -For input file format, please refer to |
109 | | -[Input file format](#input_file_format). |
110 | | - |
111 | | -After executing the command above, 4 files will be created for each |
112 | | -chromosome (i.e. 88 files in total), taking up ~10MB of space for the entire |
113 | | -genome. Here's an example obtained for chromosome 22. |
114 | | - |
115 | | -* step1\_chr22.info.gz - contains the information of each locus (start and |
116 | | - end positions, number of SNPs, rank of LD matrices, sample size) |
117 | | -``` |
118 | | -16050408 17674294 371 274 91273 |
119 | | -17674295 18296087 419 306 89182 |
120 | | -18296088 19912357 947 502 90231 |
121 | | -... ... ... ... ... |
122 | | -``` |
123 | | -* step1\_chr22.eig.gz - contains the positive eigenvalues of LD matrix at |
124 | | -each locus, one line per locus |
125 | | -``` |
126 | | -39.31792281 31.23990243 23.81549256 23.47296559 20.45343550 ... |
127 | | -48.73186142 26.95692375 25.32769526 22.11750791 20.55766423 ... |
128 | | -82.58157342 67.42588424 59.52766188 43.10471854 32.15181631 ... |
129 | | -... ... ... ... ... ... |
130 | | -``` |
131 | | -* step1\_chr22.prjsq.gz - contains the squared projections of effect |
132 | | -size vector onto the eigenvectors of LD matrix at each locus, one |
133 | | -line per locus |
134 | | -``` |
135 | | -0.00008940 0.00001401 0.00013805 0.00009906 0.00007841 ... |
136 | | -0.00054948 0.00001756 0.00008532 0.00002303 0.00004706 ... |
137 | | -0.00008693 0.00005737 0.00070234 0.00008411 0.00004001 ... |
138 | | -... ... ... ... ... ... |
139 | | -``` |
140 | | -* step1\_chr22.log - contains logging information (e.g. number of SNPs, |
141 | | -number of SNPs filtered, etc.) |
142 | | -``` |
143 | | -Command started at: ... |
144 | | -Command issued: hess.py ... |
145 | | -Number of SNPs in reference panel: ... |
146 | | -Number of SNPs in Z-score file: ... |
147 | | -Number of SNPs in Z-score file after filtering: ... |
148 | | -Number of loci in partition file: ... |
149 | | -Command finished at: ... |
150 | | -``` |
151 | | - |
152 | | -###### Step 2 - compute local SNP heritability |
153 | | -**Step 2 should be run after step 1 finishes for all chromosomes.** |
154 | | -In this step, HESS uses results from step 1 across all chromosomes |
155 | | -(step1\_chr{1..22}.info.gz, step1\_chr{1..22}.eig.gz, |
156 | | -step1\_chr{1..22}.prjsq.gz) to compute local SNP heritability estimates |
157 | | -and their standard error. The following command automatically looks for |
158 | | -results from step 1 across all chromosomes with the prefix "step1" to |
159 | | -obtain local SNP-heritability estimates. |
160 | | - |
161 | | -```{r, engine='sh', count_lines} |
162 | | -python hess.py \ |
163 | | - --prefix step1 \ |
164 | | - --k 50 \ |
165 | | - --out step2.txt |
166 | | -``` |
167 | | - |
168 | | -In the command above, `--prefix` specifies prefix of the files generated |
169 | | -during step 1, "step1", in this case; `--k`, default at 50, specifies the |
170 | | -maximum number of eigenvectors to use in estimating local SNP heritability; |
171 | | -`--output-file-step2` specifies the name of the output file. |
172 | | - |
173 | | -After executing the command above, 2 files will be created. |
174 | | - |
175 | | -* step2.txt - contains local SNP heritability estimates for loci across all |
176 | | -chromosomes (including chromosome number, locus start position, locus end |
177 | | -position, number of SNPs in locus, number of eigenvectors used, local SNP |
178 | | -heritability, variance) |
179 | | -``` |
180 | | -chr start end num_snp k local_h2g var |
181 | | -1 10583 1892606 158 24 0.0001786340 0.000000011374 |
182 | | -1 1892607 3582735 814 40 0.0004164805 0.000000039661 |
183 | | -1 3582736 4380810 558 40 0.0001844619 0.000000027595 |
184 | | -1 4380811 5913892 1879 40 0.0000738749 0.000000032164 |
185 | | -... ... ... ... ... ... ... |
186 | | -22 46470495 47596317 899 50 0.0004263759 0.000000005798 |
187 | | -22 47596318 48903702 1580 50 0.0000899976 0.000000003539 |
188 | | -22 48903703 49824533 1344 50 0.0000695594 0.000000003439 |
189 | | -22 49824534 51243297 740 50 0.0001590363 0.000000004160 |
190 | | -``` |
191 | | -* step2.txt.log - contains logging information (e.g. estimated genomic |
192 | | -control factor, total SNP heritability, etc.) |
193 | | -``` |
194 | | -Command started at: ... |
195 | | -Command issued: ... |
196 | | -Number of loci from step 1: ... |
197 | | -Total number of SNPs: ... |
198 | | -Using lambda gc: ... |
199 | | -Estimated total h2g: ... |
200 | | -Command finished at: ... |
201 | | -``` |
202 | | - |
203 | | -###### Additional flags for step 2 |
204 | | - |
205 | | -For step 2, HESS has 5 additional flags: |
206 | | -* `--lambda_gc` allows users to specify their own genomic control factor to |
207 | | -re-inflate the summary statistics; if not specified, HESS will estimate |
208 | | -the genomic control factor from data |
209 | | -* `--tot-h2g <h2g> <s.e.>` allows users to specify total SNP heritability |
210 | | -of the trait |
211 | | -* `--sense-threshold-joint` (default 2.0) allows users to control the standard |
212 | | -error of the estimates when total SNP heritability is not known; the smaller |
213 | | -the threshold, the smaller the standard error (at the cost of downward bias) |
214 | | -* `--sense-threshold-indep` (default 0.5) allows users to control the standard |
215 | | -error of the estimates when total SNP heritability is available; the smaller |
216 | | -the threshold, the smaller the standard error (at the cost of downward bias) |
217 | | -* `--eig-threshold` default at 1.0, allows users to filter eigenvectors |
218 | | -based on magnitude of eigenvalues |
219 | | - |
220 | | ---- |
221 | | - |
222 | | -#### Contact |
223 | | - |
224 | | -Please contact Huwenbo Shi (shihuwenbo\_AT\_ucla.edu) for questions |
225 | | -related to HESS. |
226 | | - |
227 | | ---- |
228 | | - |
229 | | -#### Reference |
230 | | - |
231 | | -Manuscript describing HESS can be found |
232 | | -[here](http://www.cell.com/ajhg/abstract/S0002-9297(16)30148-3). |
| 3 | +HESS is a Python package that provides utilities for estimating and analyzing |
| 4 | +local SNP-heritability and genetic covariance from GWAS summary |
| 5 | +association data. |
| 6 | +[](https://huwenboshi.github.io/hess-0.5) |