partialeffects/README.Rmd at main · bansell/partialeffects · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
---
title: "Seeing is believing"
subtitle: '— correcting datasets for covariates to reveal effects of interest'
output: rmarkdown::github_document
---

## — correcting datasets for covariates to reveal effects of interest {.toc-ignore}

## Part 1 {.toc-ignore}

Linear modelling is a widely used method for testing the effects of predictors on outcomes. In plain English, linear models can help us to understand, for example, the effect of a drug treatment on a disease, or the effect of age on body weight, etc.

## The experiment

Say we are interested in whether a new drug can treat high blood pressure (hypertension). A clinical trial is set up to run for 30 days with 200 patients suffering high blood pressure. Half of the patients are given the drug daily, and half are given a placebo (sugar pill). After 30 days, the blood pressure of all patients is measured.

Linear modelling allows us both to estimate the average difference in blood pressure between treatment groups, and to infer whether the effect is 'significant', i.e., whether the difference between groups could simply occur by chance*.
Importantly for this post, linear modelling also allows us to _correct for confounding variables_ to better resolve (and plot) the effect of interest— which in this example is the treatment.


Let's first load the required packages and check the experimental data

```{r,eval=F,message=F}
devtools::install_github("bansell/partialeffects")

```

```{r,echo=F,message=F}
library(tidyverse)
library(moderndive)
library(skimr)
devtools::load_all()
```

```{r,message=F,error=F,eval=F}
library(tidyverse)
library(moderndive)
library(skimr)
library(partialeffects)
```


```{r}
systolic_trial
```

```{r}
skim(systolic_trial)
```


## The simplest model

For this scenario we can construct a simple linear model in R using the `lm()` function. Here we assume that the drug treatment is the only variable that affects blood pressure.
(For a great introduction to linear modelling in R, check out the excellent (and free) [moderndive](https://moderndive.com/5-regression.html) book.)


```{r}
mod <- lm(systolic ~ treatment, data = systolic_trial)
```


Using the moderndive package, we can call the `get_regression_table()` function to output a clean table of summary statistics:
```{r}
get_regression_table(mod)
```

From this we can see that the mean systolic blood pressure in the placebo-treated ('control') group is 125, whereas the drug treatment lowers the mean pressure by 11.4 units. This is a non-significant difference as indicated by `r paste0('p_value = ', round(summary(mod)$coefficients[2,4],3))`.

## Correcting for confounding

The disappointing result from the above model may be due to other factors or covariates that affect patient blood pressure which are confounding the true affect of treatment. In fact, the trial involved patients of different ages, enrolled patients from two different cities ('sites'), and used a variety of blood pressure machines to take readings. Any or all of these factors/covariates may have confounded the effect of drug treatment.

<br>
</br>
 <aside>*"Factor"* in linear modelling terminology refers to a discrete variable such as treatment status (drug vs placebo). <br>
 *"Covariate"* denotes a continuous variable, such as age.
 *'Factor'*,*'covariate'*,*'term'*,*'beta'* and *'variable'* are often used interchangeably in the lm field. For more information, see the Terminology table in [this paper](https://f1000research.com/articles/9-1444).

 </aside>

<br>
</br>
To account for these possible confounds, we create a 'multivariate' linear model to account for 'multiple variables' as the name suggests.


```{r}
mod_multi <- lm(systolic ~ treatment + age + site + machine,
                data = systolic_trial)

get_regression_table(mod_multi)
```
### Understanding summary statistics

Let's look closely at the summary statistics table above, ignoring the intercept for now. Compared to the simplest model, the drug treatment now appears to reduce blood pressure by nearly 15 units, a highly significant difference.
Among the other factors and covariates:

 * age has a significant positive relationship blood pressure: a 2.4 unit increase in blood pressure is observed _for every standard deviation unit increase_ in age.
 * testing site 2 is associated with vastly lower blood pressure, and
 * machine 2 is associated with significantly higher readings, but not machine 3.

Somewhat confusingly, there is no estimate for the effect of placebo, site 1 or machine 1. Why is this? The convention with linear modelling is to set the intercept to represent the blood pressure measurement when all factors/covariates are set to 0. Practically, this means that the intercept represents blood pressure for those patients at site 1 measured with machine 1. But crucially, it also represents the estimated blood pressure in patients aged *0 years*. The intercept has no real meaning in this case.

Here, the summary statistics for the factors in the model (i.e., ignoring age), would be reported as: _'Relative to patients at site 1, tested with machine 1 and receiving the placebo...'_

### Scaling continuous covariates

When (continuous) covariates are involved, in order to create more interpretable models its recommended to scale these variables before beginning the modelling process. Also called 'de-meaning' or 'taking the z score', scaling subtracts the mean from each value and divides by the standard deviation.

NB although it's bad practice to overwrite variables, we bend the rules in the code below for simplicity.
NBB scaling is tricky to implement in tidy R. A wrapper function `scale_this()` available [here](https://stackoverflow.com/a/35776313/5308040) could be useful if you are using this often.

```{r}
systolic_trial_scaled <- systolic_trial %>%
  mutate(age = (age-mean(age)) / sd(age)) %>%
  print()
```


Now we create a lm using the scaled covariate:
```{r}
mod_scaled <- lm(systolic ~ treatment + age + site + machine,
                 data = systolic_trial_scaled)

get_regression_table(mod_scaled)
```
Note that the estimates for the factors are identical. All that has changed is the intercept and the estimate for age. Now the intercept represents the average blood pressure for patients _of the average age of all study participants_ at site1, measured with machine1. This works because 0 is the average of the scaled age values, and the intercept represents the estimated blood pressure when all covariates are set to 0.

Incidentally, the average age of participants is `r mean(systolic_trial$age)`, so the intercept represents the average blood pressure for patients of this age at the end of the study _correcting for all other variables_ including drug treatment.


## Plotting the treatment effect

For scientific reporting and visual presentations we often want to show the audience the result graphically rather than print a table of summary statistics.

We see in the multivariate model that the treatment is effective. Can we plot this result?

```{r}
ggplot(systolic_trial_scaled, aes(x=treatment, y=systolic, col=treatment)) +
  geom_boxplot() +
  ggpubr::stat_compare_means(ref.group = 'placebo', method = 't.test', col='red') +
  geom_jitter(height=0,width=0.15)
```

The above plot doesnt fit the above reported p value for the drug treatment (p ≈ 0). As a general rule, any box plot where the upper and lower box boundaries overlap cannot be a highly significant difference. Further, the t.test p value calculated from the data in the plot is unchanged from our first simple model.

### How can we show the 'real' treatment effect??

To do this we have to make adjustments the study data to remove the confounding effects we identified. This is called 'residualization'. The systolic blood pressure for every patient in the study is predicted based on the estimates in the linear model. In formal terms, as detailed in [moderndive chapter 5](https://moderndive.com/5-regression.html), the average blood pressure for patients in each treatment/site/machine group, is calculated by linear equation

$$\hat{y} = \beta_{0} + \beta_{1} \cdot x_{1} + \beta_{2} \cdot x_{2} ... +  \beta_{n} \cdot x_{n} $$
Here,
$\hat{y}$ is the estimated blood pressure for patient $x$,
$\beta_{0}$ is the intercept (i.e., average blood pressure for patients when all other $\beta$ covariates are set to 0),
$\beta_{1}$ is the first covariate in the model (treatment), and
$x_{1}$ is the treatment status for patient $x$ expressed in interger terms (0 for placebo, 1 for drug).
$\beta_{2}$ and $x_{2}$ represent the model estimate and measured value for patient ${x}$ for the second term, and so on to the $n^{th}$ term.


If we apply this to Patient 1 in the scaled experimental data:

```{r}
systolic_trial_scaled %>% head()
```
Patient 1 is in the placebo group at site 1, measured with machine 1. The numeric values for Patient 1 for the terms 'treatment: drug', 'site: 2','machine: 2' and 'machine: 3' , are all 0. The age of the patient is 0.73 standard units higher than the average age. To estimate the systolic pressure for patient 1, we substitute the data specific to Patient 1 the linear model as follows:

_systolic = intercept_
_+ (-14.3 \* 'treatment: drug')_
_+ ( 34.57 \* age )_
_+ (-44.86 \* 'site: 2')_
_+ ( 38.737 \* 'machine: 2')_
_+ (-0.307 \* 'machine: 3')_

__which is:__

_systolic = 140.1_
_+ (-14.3 \* 0 )_
_+ ( 34.57 \* 0.738  )_
_+ (-44.86 \* 0 )_
_+ ( 38.74 \* 0 )_
_+ (-0.307 \* 0)_

__giving:__

_systolic = `r round(140.1    + (34.57 * 0.738  ),2)`_

The actual measured value for Patient 1 is 172.89. The difference between the predicted and measured values (`r 172.89 - 165.61`) is the 'residual' or 'error' which is not accounted for by the current model. This is generally attributed to 'random variation' between individuals. Error is denoted __$\epsilon$__ in the formal linear model. Whereas the algebra above solves for the estimated blood pressure for each patient, to calculate the actual measured blood pressure requires adding in the random error which is 'left over' when all other factors/covariates have been set to 0:

$$y=\beta_{0} + \beta_{1} \cdot x_{1} + \beta_{2} \cdot x_{2} ... + \beta_{n} \cdot x_{n} + \epsilon$$
The error can not be estimated by the linear model, and must be calculated afterwards by subtracting the estimated blood pressure from the measured blood pressure (i.e., $y - \hat{y}$ ) for each subject.


## Residualization

At this point it is also worth clarifying the terms 'residuals', 'residualization' and 'partial residuals'.
'Residuals' or 'error' ($\epsilon$) values are those that remain once the effects of all factors and covariates including the intercept, have been removed for every patient in the data set.
'Residualization' is the process of removing the effects of _unwanted_ or _uninteresting_ variables to clarify an effect of interest.
After 'residualization', the set of adjusted measurements are called 'partial residuals'.

In this example the effect of interest is the drug treatment. For Patient 1, we want to calculate the effect of drug treatment by solving the linear model by _adding only the intercept and treatment effects, and then adding the error_.
Patient 1 in fact belongs to the 'intercept' group, so there are no confounding factors to be corrected. They are already set to 0 in the linear equation. However the patient's blood pressure is affected by age, and so to remove this confound from the effect of drug treatment, the residualized data for Patient 1 would be the   _intercept + residual_ (or $\beta_{0} + \epsilon$ ), which is:

Partial treatment effect = _140.1 + `r 172.89 - 165.61` = `r 140.1 + (172.89 - 165.61)`_.

See how the estimated effect of age is simply dropped from the equation during this residualization process.


### extract_partial()

To calculate the partial residuals for drug treatment requires solving the linear model using intercept and treatment terms, then adding the pre-calculated error, for every patient in the study.

```{r,echo=F,include=F}
#source('R/extract_partial.R')
```

`extract_partial()` is a function to do just this. The function is modelled on `moderndive::get_regression_points()` but further calculates partial residuals (`partial_resid`) for effects of interest.

The entire `systolic_trial_scaled` data is captured by the `mod_scaled` object created above. Therefore this function needs only a model object and the column name of the effect of interest:
```{r}

partial_treatment <- extract_partial(mod_scaled, 'treatment')
print(partial_treatment %>% select(-ID))
```

## Plotting the _partial_ treatment effect

```{r}

ggplot(partial_treatment, aes(x=treatment, y=partial_resid, col=treatment)) +
  geom_boxplot() +
  geom_jitter(height=0,width=0.15) +
  ggpubr::stat_compare_means(ref.group = 'placebo', method = 't.test',col='red') +
  ylab('corrected systolic')
```

__Now the summary statistics table, the plot and the displayed p value are all in agreement!__

This approach will scale to handle any number of covariates, and can accommodate mixed effects and generalized linear models. To see examples of plotting the progressive removal of confounding effects, and how `extract_partial()` can work for gene expression analysis, [read on](link).