high-dimensional-stats-r/episodes/02-high-dimensional-regression.Rmd at 3d8a1e1c26362f10712476bd8ed40b49c1897e46 · carpentries-incubator/high-dimensional-stats-r · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: Regression with many outcomes
source: Rmd
teaching: 70
exercises: 50
math: yes
editor_options:
  markdown:
    wrap: 72
---

::::::::::::::::::::::::::::::::::::::: objectives

- Perform and critically analyse high-dimensional regression.
- Understand methods for shrinkage of noise parameters in high-dimensional regression.
- Perform multiple testing adjustment.

::::::::::::::::::::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::: questions

- How can we apply linear regression in a high-dimensional setting?
- How can we benefit from the fact that we have many outcomes?
- How can we control for the fact that we do many tests?

::::::::::::::::::::::::::::::::::::::::::::::::::


## DNA methylation data

For the following few episodes, we will be working with human DNA
methylation data from flow-sorted blood samples, described in [data](https://carpentries-incubator.github.io/high-dimensional-stats-r/data/index.html). DNA methylation assays
measure, for each of many sites in the genome, the proportion of DNA
that carries a methyl mark (a chemical modification that does not alter the
DNA sequence). In this case, the methylation data come in
the form of a matrix of normalised methylation levels (M-values), where negative
values correspond to unmethylated DNA and positive values correspond to
methylated DNA. Along with this, we have a number of sample phenotypes
(eg, age in years, BMI).

Let's read in the data for this episode:

```{r loadmethy}
library("SummarizedExperiment")
methylation <- readRDS("data/methylation.rds")
```

Note: if you want to view the code used to download these data, the code is available in this repository in the [`methylation.R` script](https://github.com/carpentries-incubator/high-dimensional-stats-r/blob/main/episodes/data/methylation.R)

`methylation` is actually a special Bioconductor `SummarizedExperiment`
object that summarises lots of different information about the data.
These objects are very useful for storing all of the information
about a dataset in a high-throughput context. The structure of `SummarizedExperiment`
objects is described in the [vignettes on
Bioconductor](https://www.bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html).
Here, we show how to extract the information for analysis.

We can extract

- the dimensions of the dataset using `dim()`. Importantly, in these objects and
  data structures for computational biology in R generally, observations are stored as
  columns and features (in this case, sites in the genome) are stored as rows.
  This is in contrast to usual tabular data, where features or variables
  are stored as columns and observations are stored as rows;
- assays, (normalised methylation levels here), using `assay()`;
- sample-level information using `colData()`.

```{r showmethy}
dim(methylation)
```

You can see in this output that this object has a `dim()` of
$`r nrow(methylation)` \\times `r ncol(methylation)`$, meaning it has
`r nrow(methylation)` *features* and `r ncol(methylation)` *observations*. To
extract the matrix of methylation M-values, we can use the
`assay()` function.

```{r grabx}
methyl_mat <- assay(methylation)
```

The distribution of these M-values looks like this:

```{r histx, fig.cap="Methylation levels are generally bimodally distributed.", fig.alt="Histogram of M-values for all features. The distribution appears to be bimodal, with a large number of unmethylated features as well as many methylated features, and many intermediate features."}
hist(methyl_mat, xlab = "M-value")
```

You can see that there are two peaks in this distribution, corresponding
to features which are largely unmethylated and methylated, respectively.

Similarly, we can examine the `colData()`, which represents the
sample-level metadata we have relating to these data. In this case, the
metadata, phenotypes, and groupings in the `colData` look like this for
the first 6 samples:

```{r, eval=FALSE}
head(colData(methylation))
```

```{r datatable, echo=FALSE}
knitr::kable(head(colData(methylation)), row.names = FALSE)
```

In this episode, we will focus on the association between age and
methylation. The following heatmap summarises age and methylation levels
available in the methylation dataset:

```{r heatmap, fig.cap="Heatmap of methylation values across all features.", fig.alt="Heatmap of methylation values across all features showing that there are many features. Samples are ordered according to age."}
library("pheatmap")

age <- methylation$Age

# sort methylation values by age
order <- order(age)
age_ord <- age[order]
methyl_mat_ord <- methyl_mat[, order]

## create annotation df with rownames matching columns of methylation matrix
ann_df <- data.frame(age = age_ord, row.names=colnames(methyl_mat_ord))

# plot heatmap
pheatmap(methyl_mat_ord,
         cluster_cols = FALSE,
         show_rownames = FALSE,
         show_colnames = FALSE,
         legend_title = "M-value",
         main = "Feature vs Sample",
         annotation_col = ann_df)
```

:::::::::::::::::::::::::::::::::::::::  challenge

### Challenge 1

Why can we not just fit many linear regression models relating every combination of feature
(`colData` and assays) and draw conclusions by associating all variables with significant model p-values?

:::::::::::::::  solution

### Solution

There are a number of problems that this kind of approach presents.
For example:

1. If we perform `r nrow(methylation)` tests for each of
  `r ncol(colData(methylation))` variables, even if there were no true
  associations in the data, we'd be likely to observe some strong
  spurious associations that arise just from random noise.
2. We may not have a representative
  sample for each of these covariates. For example, we may have very
  small sample sizes for some ethnicities, leading to spurious
  findings.
3. Without a research question in mind when creating a
  model, it's not clear how we can interpret each model, and
  rationalising the results after the fact can be dangerous; it's easy
  to make up a "story" that isn't grounded in anything but the fact
  that we have significant findings.

:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::

In general, it is scientifically interesting to explore two modelling problems using the three types of data:

1. Predicting methylation levels using age as a predictor. In this case, we
  would have 5000 outcomes (methylation levels across the genome) and a single
  covariate (age).

2. Predicting age using methylation levels as predictors. In this case, we would
  have a single outcome (age) which will be predicted using 5000 covariates
  (methylation levels across the genome).

The examples in this episode will focus on the first type of problem, whilst
the next episode will focus on the second.

:::::::::::::::::::::::::::::::::::::::::  callout

### Measuring DNA Methylation

DNA methylation is an epigenetic modification of DNA. Generally, we
are interested in the proportion of methylation at many sites or
regions in the genome. DNA methylation microarrays, as we are using
here, measure DNA methylation using two-channel microarrays, where one
channel captures signal from methylated DNA and the other captures
unmethylated signal. These data can be summarised as "Beta values"
($\\beta$ values), which is the ratio of the methylated signal to the
total signal (methylated plus unmethylated). The $\\beta$ value for
site $i$ is calculated as

$$
\beta_i = \frac{
m_i
} {
u_{i} + m_{i}
}
$$

where $m\_i$ is the methylated signal for site $i$ and $u\_i$ is the
unmethylated signal for site $i$. $\\beta$ values take on a value in
the range $[0, 1]$, with 0 representing a completely unmethylated site
and 1 representing a completely methylated site.

The M-values we use here are the $\\log\_2$ ratio of methylated versus
unmethylated signal:

$$
M_i = \log_2\left(\frac{m_i}{u_i}\right)
$$

M-values are not bounded to an interval as Beta values are, and
therefore can be easier to work with in statistical models.


::::::::::::::::::::::::::::::::::::::::::::::::::

## Regression with many outcomes

In high-throughput studies, it is common to have one or more phenotypes
or groupings that we want to relate to features of interest (eg, gene
expression, DNA methylation levels). In general, we want to identify
differences in the features of interest that are related to a phenotype
or grouping of our samples. Identifying features of interest that vary
along with phenotypes or groupings can allow us to understand how
phenotypes arise or manifest. Analysis of this type is sometimes referred
to using the term *differential analysis*.

For example, we might want to identify genes that are expressed at a
higher level in mutant mice relative to wild-type mice to understand the
effect of a mutation on cellular phenotypes. Alternatively, we might
have samples from a set of patients, and wish to identify epigenetic
features that are different in young patients relative to old patients,
to help us understand how ageing manifests.

Using linear regression, it is possible to identify differences like
these. However, high-dimensional data like the ones we're working with
require some special considerations. A first consideration, as we saw
above, is that there are far too many features to fit each one-by-one as
we might do when analysing low-dimensional datasets (for example using
`lm()` on each feature and checking the linear model assumptions). A
second consideration is that statistical approaches may behave
slightly differently when applied to very high-dimensional data, compared to
low-dimensional data. A third consideration is the speed at which we can
actually compute statistics for data this large -- methods optimised for
low-dimensional data may be very slow when applied to high-dimensional
data.

Ideally when performing regression, we want to identify cases like this,
where there is a clear association, and we probably "don't need"
statistics:

```{r example1, echo=FALSE, fig.cap="A scatter plot of age and a feature of interest.", fig.alt="An example of a strong linear association between a continuous phenotype (age) on the x-axis and a feature of interest (DNA methylation at a given locus) on the y-axis. A strong linear relationship with a positive slope exists between the two.", fig.width=6, fig.height=6}
library("ggplot2")
theme_set(theme_bw())
set.seed(42)
n <- 10
x <- c(rnorm(n, 0), rnorm(n, 3))
group <- (2 * x) + rnorm(n)
ggplot() +
    aes(x = group, y = x) +
    geom_point() +
    labs(x = "Age", y = "DNA methylation")
```

or equivalently for a discrete covariate:

```{r example2, echo=FALSE, fig.cap="A scatter plot of a grouping and a feature of interest.", fig.alt="An example of a strong linear association between a discrete phenotype (group) on the x-axis and a feature of interest (DNA methylation at a given locus) on the y-axis. The two groups clearly differ with respect to DNA methylation.", fig.width=6, fig.height=6}
library("ggplot2")
set.seed(42)
n <- 10
x <- c(rnorm(n, 0), rnorm(n, 5))
group <- c(rep("A", n), rep("B", n))
ggplot() +
    aes(x = group, y = x, colour = group) +
    # geom_violin() +
    # geom_boxplot(width = 0.25) +
    geom_jitter(height = 0, width = 0.2) +
    labs(y = "DNA methylation")
```

However, often due to small differences and small sample sizes, the
problem is more difficult:

```{r example3, echo=FALSE, fig.cap="A scatter plot of a grouping and a feature of interest.", fig.alt="An example of a strong linear association between a discrete phenotype (group) on the x-axis and a feature of interest (DNA methylation at a given locus) on the y-axis. The two groups seem to differ with respect to DNA methylation, but the relationship is weak.", fig.width=6, fig.height=6}
library("ggplot2")
set.seed(66)
n <- 5
x <- c(rnorm(n, 0), rnorm(n, 1))
group <- c(rep("A", n), rep("B", n))
ggplot() +
    aes(x = group, y = x, colour = group) +
    # geom_violin() +
    # geom_boxplot(width = 0.25) +
    geom_jitter(height = 0, width = 0.2) +
    labs(y = "DNA methylation")
```

And, of course, we often have an awful lot of features and need to
prioritise a subset of them! We need a rigorous way to prioritise genes
for further analysis.

## Fitting a linear model

So, in the data we have read in, we have a matrix of methylation values
$X$ and a vector of ages, $y$. One way to model this is to see if we can
use age to predict the expected (average) methylation value for sample
$j$ at a given locus $i$, which we can write as $X\_{ij}$. We can write
that model as:

$$
\mathbf{E}(X_{ij}) = \beta_0 + \beta_1 \text{Age}_j
$$

where $\\text{Age}\_j$ is the age of sample $j$. In this model, $\\beta\_1$
represents the unit change in mean methylation level for each unit
(year) change in age. For a specific CpG, we can fit this model and get more
information from the model object. For illustration purposes, here we
arbitrarily select the first CpG in the `methyl_mat` matrix (the one on its first row).

```{r fit1}
age <- methylation$Age
# methyl_mat[1, ] indicates that the 1st CpG will be used as outcome variable
lm_age_methyl1 <- lm(methyl_mat[1, ] ~ age)
lm_age_methyl1
```

We now have estimates for the expected methylation level when age equals
0 (the intercept) and the change in methylation level for a unit change
in age (the slope). We could plot this linear model:

```{r plot-lm-methyl1, fig.cap="A scatter plot of age versus the methylation level for an arbitrarily selected CpG side (the one stored as the first column of methyl_mat). Each dot represents an individual. The black line represents the estimated linear model.", fig.alt="An example of the relationship between age (x-axis) and methylation levels (y-axis) for an arbitrarily selected CpG. In this case, the y-axis shows methylation levels for the first CpG in our data. The black line shows the fitted regression line (based on the intercept and slope estimates shown above). For this feature, we can see that there is no strong relationship between methylation and age."}
plot(age, methyl_mat[1, ], xlab = "Age", ylab = "Methylation level", pch = 16)
abline(lm_age_methyl1)
```

For this feature, we can see that there is no strong relationship
between methylation and age. We could try to repeat this for every
feature in our dataset; however, we have a lot of features! We need an
approach that allows us to assess associations between all of these
features and our outcome while addressing the three considerations we
outlined previously. Before we introduce this approach, let's go into
detail about how we generally check whether the results of a linear
model are statistically significant.

## Hypothesis testing in linear regression

Using the linear model we defined above, we can ask questions based on the
estimated value for the regression coefficients. For example, do individuals
with different age have different methylation values for a given CpG? We usually
do this via *hypothesis testing*. This framework compares the results that we
observed (here, estimated linear model coefficients) to the results you would
expect under a *null hypothesis* associated to our question. In the example above,
a suitable null hypothesis would test whether the regression coefficient associated
to age ($\\beta\_1$) is equal to zero or not. If $\\beta\_1$ is equal to zero,
the linear model indicates that there is no linear relationship between age
and the methylation level for the CpG  (remember: as its name suggests, linear
regression can only be used to model linear relationships between predictors and
outcomes!). In other words, the answer to our question would be: no!

The output of a linear model typically returns the results associated
with the null hypothesis described above (this may not always be the most realistic
or useful null hypothesis, but it is the one we have by default!). To be
more specific, the test compares our observed results with a set of
hypothetical counter-examples of what we would expect to observe if we repeated
the same experiment and analysis over and over again under the null hypothesis.

For this linear model, we can use `tidy()` from the **`broom`** package to
extract detailed information about the coefficients and the associated
hypothesis tests in this model:

```{r tidyfit}
library("broom")
tidy(lm_age_methyl1)
```

```{r rbindfits, echo=FALSE}
# Instead of repeating this model fitting and coefficient extraction by hand, we
# could write a function to fit this kind of model for any given row of the
# matrix:

lm_feature <- function(i, methyl_mat, age) {
    tidy(lm(methyl_mat[i, ] ~ age))[2, ]
}

coefs <- lapply(
    seq_len(nrow(methyl_mat)),
    lm_feature,
    age = age,
    methyl_mat = methyl_mat
)
## bind together all of our small tables to make one big table
coef_df <- do.call(rbind, coefs)
## add a "feature name" column
coef_df$feature <- rownames(methyl_mat)
# plot(coef_df$estimate, -log10(coef_df$p.value),
#     xlab = "Effect size", ylab = bquote(-log[10](p)),
#     pch = 19
# )
```

The standard errors (`std.error`) represent the statistical uncertainty in our
regression coefficient estimates (often referred to as *effect size*). The test
statistics and p-values (`statistic` and `p.value`) represent measures of how (un)likely it would be to observe
results like this under the "null hypothesis". For coefficient $k$ in a
linear model (in our case, it would be the slope), the test statistic calculated in `statistic` above is
a t-statistic given by:

$$
t_{k} = \frac{\hat{\beta}_{k}}{SE\left(\hat{\beta}_{k}\right)}
$$

$SE\\left(\\hat{\\beta}\_{k}\\right)$ measures the uncertainty we have in our
effect size estimate. Knowing what distribution these t-statistics
follow under the null hypothesis allows us to determine how unlikely it
would be for us to observe what we have under those circumstances, if we
repeated the experiment and analysis over and over again. To
demonstrate how the t-statistics are calculated, we can compute them "by hand":

```{r simtval}
table_age_methyl1 <- tidy(lm_age_methyl1)
tvals <- table_age_methyl1$estimate / table_age_methyl1$std.error
all.equal(tvals, table_age_methyl1$statistic)
```

We can see that the t-statistic is just the ratio between the coefficient estimate
and the standard error.

Calculating the p-values is a bit more tricky. Specifically, it is the
proportion of the distribution of the test statistic under the null
hypothesis that is *as extreme or more extreme* than the observed value
of the test statistic. This is easy to observe visually, by plotting the
theoretical distribution of the test statistic under the null hypothesis
(see next call-out box for more details about it):

```{r tdist, echo=FALSE, fig.cap="The p-value for a regression coefficient represents how often it'd be observed under the null.", fig.alt="Density plot of a t-distribution showing the observed test statistics (here, t-statistics). The p-values, visualised here with shaded regions, represent the portion of the null distribution that is as extreme or more extreme as the observed test statistics, which are shown as dashed lines."}
ggplot() +
    geom_function(fun = function(x) dt(x, df = lm_age_methyl1$df)) +
    xlim(-4, 4) +
    geom_vline(
        aes(xintercept = abs(tvals), color = c("Intercept", "Slope")),
        lty = "dashed"
    ) +
    stat_function(fun = function(x) dt(x, df = lm_age_methyl1$df),
        aes(fill = "Intercept"),
        xlim = c(abs(tvals)[[1]], 4),
        alpha = 0.25,
        geom = "area"
    ) +
    stat_function(fun = function(x) dt(x, df = lm_age_methyl1$df),
        aes(fill = "Slope"),
        xlim = c(abs(tvals)[[2]], 4),
        alpha = 0.25,
        geom = "area"
    ) +
    scale_color_discrete("Parameter", aesthetics = c("fill", "colour")) +
    labs(x = "t-statistic", y = "Density")
```

The red-ish shaded region represents the portion of the distribution of
the test statistic under the null hypothesis that is equal or greater to
the value we observe for the intercept term. As our null hypothesis
relates to a 2-tailed test (as the null hypothesis states that the regression
coefficient is equal to zero, we would reject it if the regression
coefficient is substantially larger **or** smaller than zero), the p-value for
the test is twice the value of the shaded region. In this case, the shaded region
is small relative to the total area of the null distribution; therefore, the
p-value is small ($p=`r round(table_age_methyl1$p.value[[1]], digits = 3)`$). The blue-ish shaded region represents the same measure for the slope term;
this is larger, relative to the total area of the distribution, therefore the
p-value is larger than the one for the intercept term
($p=`r round(table_age_methyl1$p.value[[2]], digits = 3)`$). The
p-value is a function of the test statistic: the ratio between the effect size
we're estimating and the uncertainty we have in that effect. A large effect with large
uncertainty may not lead to a small p-value, and a small effect with
small uncertainty may lead to a small p-value.

:::::::::::::::::::::::::::::::::::::::::  callout

### Calculating p-values from a linear model

Manually calculating the p-value for a linear model is a little bit
more complex than calculating the t-statistic. The intuition posted
above is definitely sufficient for most cases, but for completeness,
here is how we do it:

Since the statistic in a linear model is a t-statistic, it follows a
student t distribution under the null hypothesis, with degrees of
freedom (a parameter of the student t-distribution) given by the
number of observations minus the number of coefficients fitted, in
this case
$`r ncol(methylation)` - `r length(coef(lm_age_methyl1))` = `r lm_age_methyl1$df`$.
We want to know what portion of the distribution function of the test
statistic is as extreme as, or more  extreme than, the value we observed.
The function `pt()`(similar to`pnorm()`, etc) can give us this information.

Since we're not sure if the coefficient will be larger or smaller than
zero, we want to do a 2-tailed test. Therefore we take the absolute
value of the t-statistic, and look at the upper rather than lower
tail. In the figure above the shaded areas are only looking at "half" of the
t-distribution (which is symmetric around zero), therefore we multiply the
shaded area by 2 in order to calculate the p-value.

Combining all of this gives us:

```{r simpval}
pvals <- 2 * pt(abs(tvals), df = lm_age_methyl1$df, lower.tail = FALSE)
all.equal(table_age_methyl1$p.value, pvals)
```

::::::::::::::::::::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::  challenge

### Challenge 2

In the model we fitted, the estimate for the intercept is 0.902 and its associated
p-value is 0.0129. What does this mean?

:::::::::::::::  solution

### Solution

The first coefficient in a linear model like this is the intercept, which measures
the mean of the outcome (in this case, the methylation value for the first CpG)
when age is zero. In this case, the intercept estimate is 0.902. However, this is
not a particularly noteworthy finding as we do not have any observations with age
zero (nor even any with age \< 20!).

The reported p-value is associated to the following null hypothesis:
the intercept ($\\beta\_0$ above) is equal to zero. Using the usual
significance threshold of 0.05, we reject the null hypothesis as
the p-value is smaller than 0.05. However, it is not really interesting
if this intercept is zero or not, since we probably do not care what the
methylation level is when age is zero. In fact, this question does not
even make much sense! In this example, we are more interested
in the regression coefficient associated to age, as that can tell us
whether there is a linear relationship between age and methylation for the CpG.

:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::

## Fitting a lot of linear models

In the linear model above, we are generally interested in the second regression
coefficient (often referred to as *slope*) which measures the linear relationship
between age and methylation levels. For the first CpG, here is its estimate:

```{r tidyfit2}
coef_age_methyl1 <- tidy(lm_age_methyl1)[2, ]
coef_age_methyl1
```

In this case, the p-value is equal to 0.381 and therefore we cannot reject the null
hypothesis: there is no statistical evidence to suggest that the regression
coefficient associated to age is not equal to zero.

Now, we could do this for every feature (CpG) in the dataset and rank the
results based on their test statistic or associated p-value. However, fitting
models in this way to `r nrow(methylation)` features is not very computationally
efficient, and it would also be laborious to do programmatically. There are ways
to get around this, but first let us talk about what exactly we are doing when
we look at significance tests in this context.

## Sharing information across outcome variables

We are going to introduce an idea that allows us to
take advantage of the fact that we carry out many tests at once on
structured data. We can leverage this fact to *share information*
between model parameters. The insight that we use to perform
*information pooling* or sharing is derived from our knowledge about the
structure of the data. For example, in a high-throughput experiment like
a DNA methylation assay, we know that all of the features were measured
simultaneously, using the same technique. This means that generally, we
expect the base-level variability for each feature to be broadly
similar.

This can enable us to get a better estimate of the uncertainty of model
parameters than we could get if we consider each feature in isolation.
So, to share information between features allows us to get more robust
estimators. Remember that the t-statistic for coefficient $\\beta\_k$ in a
linear model is the ratio between the coefficient estimate and its standard
error:

$$
t_{k} = \frac{\hat{\beta}_{k}}{SE\left(\hat{\beta}_{k}\right)}
$$

It is clear that large effect sizes will likely lead to small p-values,
as long as the standard error for the coefficent is not large. However,
the standard error is affected by the amount of noise, as we saw
earlier. If we have a small number of observations, it is common for the
noise for some features to be extremely small simply by chance. This, in turn,
causes small p-values for these features, which may give us unwarranted
confidence in the level of certainty we have in the results (false positives).

There are many statistical methods in genomics that use this type of
approach to get better estimates by pooling information between features
that were measured simultaneously using the same techniques. Here we
will focus on the package **`limma`**, which is an established software
package used to fit linear models, originally for the gene expression
micro-arrays that were common in the 2000s, but which is still in use in
RNAseq experiments, among others. The authors of **`limma`** made some
assumptions about the distributions that these follow, and pool
information across genes to get a better estimate of the uncertainty in
effect size estimates. It uses the idea that noise levels should be
similar between features to *moderate* the estimates of the test
statistic by shrinking the estimates of standard errors towards a common
value. This results in a *moderated t-statistic*.

The process of running a model in **`limma`** is somewhat different to what you
may have seen when running linear models. Here, we define a *model matrix* or
*design matrix*, which is a way of representing the
coefficients that should be fit in each linear model. These are used in
similar ways in many different modelling libraries.

:::::::::::::::::::::::::::::::::::::::::  callout

### What is a model matrix?

R fits a regression model by choosing the vector of regression coefficients
that minimises the differences between outcome values and predicted values
using the covariates (or predictor variables). To get predicted values,
we multiply the matrix of predictors by the coefficients. The latter
matrix is called the
**model matrix** (or design matrix). It has one row for each observation and
one column for each predictor, plus one additional column of ones
(the intercept column). Many R libraries contruct the
model matrix behind the scenes, but **`limma`** does not. Usually, the model
matrix can be extracted from a model fit
using the function `model.matrix()`. Here is an example of a model matrix
for the methylation model:

```{r design-age}
design_age <- model.matrix(lm_age_methyl1) # model matrix
head(design_age)
dim(design_age)
```

As you can see, the model matrix has the same number of rows as our
methylation data has samples. It also has two columns - one for the
intercept (similar to the linear model we fit above) and one for age.
This happens "under the hood" when fitting a linear model with `lm()`, but
here we have to specify it directly. The [limma user
manual](https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf)
has more detail on how to make model matrices for different types of
experimental design, but here we are going to stick with this simple two-variable case.

::::::::::::::::::::::::::::::::::::::::::::::::::

We pass our methylation data to `lmFit()`, specifying
the model matrix. Internally, this function runs `lm()` on each row of
the data in an efficient way. The function `eBayes()`, when applied to the
output of `lmFit()`, performs the pooled estimation of standard errors
that results in the moderated t-statistics and resulting p-values.

```{r lmfit-age}
library("limma")
design_age <- model.matrix(lm_age_methyl1) # model matrix
fit_age <- lmFit(methyl_mat, design = design_age)
fit_age <- eBayes(fit_age)
```

To obtain the results of the linear models, we can use the `topTable()`
function. By default, this returns results for the first coefficient in
the model. As we saw above when using `lm()`, and when we defined
`design_age` above, the first coefficient relates to the intercept term,
which we are not particularly interested in here; therefore we specify
`coef = 2`. Further, `topTable()` by default only returns the top 10
results. To see all of the results in the data, we specify
`number = nrow(fit_age)` to ensure that it returns a row for every row
of the input matrix.

```{r ebayes-toptab}
toptab_age <- topTable(fit_age, coef = 2, number = nrow(fit_age))
head(toptab_age)
```

The output of `topTable` includes the coefficient, here termed a log
fold change `logFC`, the average level (`aveExpr`), the t-statistic `t`,
the p-value (`P.Value`), and the *adjusted* p-value (`adj.P.Val`). We'll
cover what an adjusted p-value is very shortly. The table also includes
`B`, which represents the log-odds that a feature is signficantly
different, which we won't cover here, but which will generally be a 1-1
transformation of the p-value. The coefficient estimates here are termed
`logFC` for legacy reasons relating to how microarray experiments were
traditionally performed. There are more details on this topic in many
places, for example [this tutorial by Kasper D.
Hansen](https://kasperdanielhansen.github.io/genbioconductor/html/limma.html)

Now we have estimates of effect sizes and p-values for the association
between methylation level at each locus and age for our 37 samples. It's
useful to create a plot of effect size estimates (model coefficients)
against p-values for each of these linear models, to visualise the
magnitude of effects and the statistical significance of each. These
plots are often called "volcano plots", because they resemble an
eruption.

```{r limmavolc1, fig.cap="Plotting p-values against effect sizes using limma; the results are similar to a standard linear model.", fig.alt="A plot of -log10(p) against effect size estimates for a regression of age against methylation using limma."}
plot(toptab_age$logFC, -log10(toptab_age$P.Value),
    xlab = "Effect size", ylab = bquote(-log[10](p-value)),
    pch = 19
)
```

In this figure, every point represents a feature of interest. The x-axis
represents the effect size observed for that feature in a linear model,
while the y-axis is the $-\\log\_{10}(\\text{p-value})$, where larger
values indicate increasing statistical evidence of a non-zero effect
size. A positive effect size represents increasing methylation with
increasing age, and a negative effect size represents decreasing
methylation with increasing age. Points higher on the y-axis represent
features for which we think the results we observed would be very
unlikely under the null hypothesis.

Since we want to identify features that have different methylation levels
in different age groups, in an ideal case there would be clear
separation between "null" and "non-null" features. However, usually we
observe results as we do here: there is a continuum of effect sizes and
p-values, with no clear separation between these two classes of
features. While statistical methods exist to derive insights from
continuous measures like these, it is often convenient to obtain a list
of features which we are confident have non-zero effect sizes. This is
made more difficult by the number of tests we perform.

:::::::::::::::::::::::::::::::::::::::  challenge

### Challenge 3

The effect size estimates are very small, and yet many of the p-values
are well below a usual significance level of p \< 0.05. Why is this?

:::::::::::::::  solution

### Solution

Because age has a much larger range than methylation levels, the
unit change in methylation level even for a strong relationship is
very small!

As we mentioned, the p-value is a function of both the effect size
estimate and the uncertainty (standard error) of that estimate.
Because the uncertainty in our estimates is much smaller than the
estimates themselves, the p-values are also small.

If we predicted age using methylation level, it is likely we would see
much larger coefficients, though broadly similar p-values!

:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::

It is worthwhile considering what exactly the effect of the *moderation*
or information sharing that **`limma`** performs has on our results. To do
this, let us compare the effect sizes estimates and p-values from the two
approaches.

```{r plot-limma-lm-effect, echo=FALSE, fig.cap="Plot of effect sizes using limma vs. those using lm.", fig.alt="A scatter plot of the effect size using limmma vs. those using lm. The plot also shows a straight line through all points showing that the effect sizes are the same."}
plot(
    coef_df[["estimate"]],
    toptab_age[coef_df[["feature"]], "logFC"],
    pch = 16,
    main = "Comparison of effect sizes from limma and lm",
    xlab = "Effect size from lm",
    ylab = "Effect size from limma"
)
abline(0:1, lty = "dashed")
```

These are exactly identical! This is because **`limma`** does not perform
any sharing of information when estimating effect sizes. This is in
contrast to similar packages that apply shrinkage to the effect size
estimates, like **`DESeq2`**. These often use information sharing to shrink
or moderate the effect size estimates, in the case of **`DESeq2`** by again
sharing information between features about sample-to-sample variability.
In contrast, let us look at the p-values from **`limma`** and R's built-in `lm()` function:

```{r plot-limma-lm-pval, echo=FALSE, fig.cap="Plot of p-values using limma vs. those using lm.", fig.alt="A scatter plot of the p-values using limma vs. those using lm. A straight line is also displayed, showing that the p-values for limma tend to be smaller than those using lm towards the left of the plot and higher towards the right of the plot."}
plot(
    coef_df[["p.value"]],
    toptab_age[coef_df[["feature"]], "P.Value"],
    pch = 16,
    main = "Comparison of p-values from limma and lm",
    xlab = "p-value from lm",
    ylab = "p-value from limma",
    log = "xy"
)
abline(0:1, lty = "dashed")
```

We can see that for the vast majority of features, the results are
broadly similar. There seems to be a minor general tendency for **`limma`**
to produce smaller p-values, but for several features, the p-values from
limma are considerably larger than the p-values from `lm()`. This is
because the information sharing tends to shrink large standard error
estimates downwards and small estimates upwards. When the degree of
statistical significance is due to an abnormally small standard error
rather than a large effect, this effect results in this prominent
reduction in statistical significance, which has been shown to perform
well in case studies. The degree of shrinkage generally depends on the
amount of pooled information and the strength of the evidence
independent of pooling. For example, with very few samples and many
features, information sharing has a larger effect, because there are a
lot of genes that can be used to provide pooled estimates, and the
evidence from the data that this is weighed against is relatively
sparse. In contrast, when there are many samples and few features, there
is not much opportunity to generate pooled estimates, and the evidence
of the data can easily outweigh the pooling.

Shrinkage methods like these ones can be complex to implement and
understand, but it is useful to develop an intuition about why these approaches may be more
precise and sensitive than the naive approach of fitting a model to each
feature separately.

:::::::::::::::::::::::::::::::::::::::  challenge

### Challenge 4

1. Try to run the same kind of linear model with smoking status as
  covariate instead of age, and making a volcano plot. *Note:
  smoking status is stored as* `methylation$smoker`.
2. We saw in the example in the lesson that this information sharing
  can lead to larger p-values. Why might this be preferable?

:::::::::::::::  solution

### Solution

1. The following code runs the same type of model with smoking
  status:

  ```{r limmavolc2, fig.cap="A plot of significance against effect size for a regression of smoking against methylation.", fig.alt="A plot of -log10(p) against effect size estimates for a regression of smoking status against methylation using limma."}
  design_smoke <- model.matrix(~methylation$smoker)
  fit_smoke <- lmFit(methyl_mat, design = design_smoke)
  fit_smoke <- eBayes(fit_smoke)
  toptab_smoke <- topTable(fit_smoke, coef = 2, number = nrow(fit_smoke))
  plot(toptab_smoke$logFC, -log10(toptab_smoke$P.Value),
      xlab = "Effect size", ylab = bquote(-log[10](p)),
      pch = 19
  )
  ```

2. Being a bit more conservative when identifying features can help
  to avoid false discoveries. Furthermore, when rejecting the null
  hypothesis is based more on a small standard error resulting
  from abnormally low levels of variability for a given feature,
  we might want to be a bit more conservative in our expectations.


:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::


:::::::::::::::::::::::::::::::::::::::::  callout

### Shrinkage

Shrinkage is an intuitive term for an effect of information sharing,
and is something observed in a broad range of statistical models.
Often, shrinkage is induced by a *multilevel* modelling approach or by
*Bayesian* methods.

The general idea is that these models incorporate information about
the structure of the data into account when fitting the parameters. We
can share information between features because of our knowledge about
the data structure; this generally requires careful consideration
about how the data were generated and the relationships within.

An example people often use is estimating the effect of attendance on
grades in several schools. We can assume that this effect is similar
in different schools (but maybe not identical), so we can *share
information* about the effect size between schools and shrink our
estimates towards a common value.

For example in **`DESeq2`**, the authors used the observation that genes
with similar expression counts in RNAseq data have similar
*dispersion*, and a better estimate of these dispersion parameters
makes estimates of fold changes much more stable. Similarly, in
**`limma`** the authors made the assumption that in the absence of
biological effects, we can often expect the technical variation in the
measurement of the expression of each of the genes to be broadly
similar. Again, better estimates of variability allow us to prioritise
genes in a more reliable way.

There are many good resources to learn about this type of approach,
including:

- [a blog post by TJ
  Mahr](https://www.tjmahr.com/plotting-partial-pooling-in-mixed-effects-models/)
- [a book by David Robinson](https://gumroad.com/l/empirical-bayes)
- [a (relatively technical) book by Gelman and
  Hill](https://www.stat.columbia.edu/~gelman/arm/)


::::::::::::::::::::::::::::::::::::::::::::::::::


## The problem of multiple tests

With such a large number of features, it would be useful to decide which
features are "interesting" or "significant" for further study. However,
if we were to apply a normal significance threshold of 0.05, it would be likely
we end up with a lot of false positives. This is because a p-value
threshold like this represents a $\\frac{1}{20}$ chance that we observe
results as extreme or more extreme under the null hypothesis (that there
is no assocation between age and methylation level). If we carry out many more
than 20 such tests, we can expect to see situations where, despite the null
hypothesis being true, we observe observe signifiant p-values due to random chance. To
demonstrate this, it is useful to see what happens if we permute (scramble) the age values and
run the same test again:

```{r volcplotfake, fig.cap="Plotting p-values against effect sizes for a randomised outcome shows we still observe 'significant' results.", fig.alt="Plot of -log10(p) against effect size estimates for a regression of a made-up feature against methylation level for each feature in the data. A dashed line represents a 0.05 significance level."}

set.seed(123)
age_perm <- age[sample(ncol(methyl_mat), ncol(methyl_mat))]
design_age_perm <- model.matrix(~age_perm)

fit_age_perm <- lmFit(methyl_mat, design = design_age_perm)
fit_age_perm <- eBayes(fit_age_perm)
toptab_age_perm <- topTable(fit_age_perm, coef = 2, number = nrow(fit_age_perm))

plot(toptab_age_perm$logFC, -log10(toptab_age_perm$P.Value),
    xlab = "Effect size", ylab = bquote(-log[10](p)),
    pch = 19
)
abline(h = -log10(0.05), lty = "dashed", col = "red")
```

Since we have generated a random sequence of ages, we have no reason to
suspect that there is a true association between methylation levels and
this sequence of random numbers. However, you can see that the p-value
for many features is still lower than a traditional significance level
of $p=0.05$. In fact, here `r sum(toptab_age_perm$P.Value < 0.05)`
features are significant at p \< 0.05. If we were to use this fixed
threshold in a real experiment, it is likely that we would identify many
features as associated with age, when the results we are observing are
simply due to chance.

:::::::::::::::::::::::::::::::::::::::  challenge

### Challenge 5

1. If we run `r nrow(methylation)` tests, even if there are no true differences,
  how many of them (on average) will be statistically significant at
  a threshold of $p \< 0.05$?
2. Why would we want to be conservative in labelling features as
  significantly different? By conservative, we mean to err towards
  labelling true differences as "not significant" rather than vice
  versa.
3. How could we account for a varying number of tests to ensure
  "significant" changes are truly different?

:::::::::::::::  solution

### Solution

1. By default we expect
  $`r nrow(methylation)` \\times 0.05 = `r nrow(methylation) * 0.05`$
  features to be statistically significant under the null
  hypothesis, because p-values should always be uniformly
  distributed under the null hypothesis.
2. Features that we label as "significantly different" will often
  be reported in manuscripts. We may also spend time and money
  investigating them further, computationally or in the lab.
  Therefore, spurious results have a real cost for ourselves and
  for others.
3. One approach to controlling for the number of tests is to divide
  our significance threshold by the number of tests performed.
  This is termed "Bonferroni correction" and we'll discuss this
  further now.


:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::

## Adjusting for multiple tests

When performing many statistical tests to categorise features, we are
effectively classifying features as "non-significant" or "significant", that latter meaning those for
which we reject the null hypothesis. We also
generally hope that there is a subset of features for which the null
hypothesis is truly false, as well as many for which the null truly does
hold. We hope that for all features for which the null hypothesis is
true, we accept it, and for all features for which the null hypothesis
is not true, we reject it. As we showed in the example with permuted
age, with a large number of tests it is inevitable that we will get some of
these wrong.

We can think of these features as being "truly different" or "not truly
different"[^1]. Using this idea, we can see that each categorisation we
make falls into four categories:

[^1]: "True difference" is a hard category to rigidly define. As we've
seen, with a lot of data, we can detect tiny differences, and with
little data, we can't detect large differences. However, both can be
argued to be "true".