---
title: "Untitled"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This is the second session covering tidyverse commands. We will show you how to use some new functions, but we will also get extra practice with some of the functions we used last week. If you missed the previous session, that should not be a problem.
As last time, we will occasionally refer to this [cheat sheet](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) to guide us.
It's typical to load all the libraries you need at the top of your code.
```{r}
library(tidyverse)
```
Let's quickly reload the two data sets we used last time and clean them up. As a reminder, we converted the mask use data from wide format into long format, but let's keep the data in both versions today.
```{r, include=FALSE}
# load mask data
mask_use_wide <- read_delim(file="mask-use-by-county.csv", delim=",")
mask_use_long <- mask_use_wide %>%
pivot_longer(cols = -COUNTYFP, names_to = "MaskUseResponse", values_to = "MaskUseProportion") %>%
rename("fips" = COUNTYFP) %>%
arrange(fips)
# load case count data, filter based on most recent date
cases <- read_delim(file="https://github.com/nytimes/covid-19-data/raw/master/us-counties.csv", delim=",")
cases_latest <- cases %>% filter(date == "2020-08-03") %>%
dplyr::select(-date) %>%
arrange(fips)
```
The data from last time are all at the county level, and counties vary considerably in population size. So, for instance, if we wanted to know the fraction of positive cases in a state, or what mask-wearing behavior looks like in a particular state, we should really take this variability in county size into account. It's also useful to see which counties have been hit particularly hard relative to their size, since many positive cases are to be expected if the county is quite large.
Let's load in some additional data on population size by county, which I obtained from www.census.gov.
```{r, include=FALSE}
pop_sizes <- read_delim(file="co-est2019-annres.csv.xz", delim=",")
```
When I'm working with larger files and I want to make sure they look like I expect them to, instead of opening up the file and scrolling through all the lines, checking each by eye, I typically use a few commands to get an idea of what the data look like before proceeding. Oftentimes this is sufficient; we don't need to check every line.
These are just a few commands I might use to look at the first and last few lines, how many rows there are overall, and also get an idea of how much missing data there might be:
```{r}
head(pop_sizes)
nrow(pop_sizes)
sum(!is.na(pop_sizes$'2019')) # how many counties have size estimates for 2019? even closer to the 3,142 we expect than the total row count was
tail(pop_sizes)
```
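As an aside, glimpse() (loaded with the tidyverse) is another quick way to inspect a table: it shows every column's name, type, and first few values in one compact view. A minimal sketch:
```{r}
# compact overview of column names, types, and the first few values of each
glimpse(pop_sizes)
```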
OK, so these data are a little messy and need to be cleaned up.
From these commands, we can see that a row for the entire US was included in this table. That is not a county, so we should remove it. Also, county names are all preceded by a ".", something we'll deal with in a moment. The number of rows in the table is close to what we expect (3,142 counties according to Wikipedia), but larger by about 7 rows. The row for the entire US contributes to this excess.
Looking at the number of non-missing entries for the year 2019, it is extremely close to what we expect.
Looking at the bottom of the table, we see some footnotes were left in. These are probably contributing to the extra rows and should definitely be removed.
Let's remove these rows at the beginning and end with the slice() function, which selects rows by position (unlike select(), which works on columns).
```{r, include=FALSE}
pop_sizes <- pop_sizes %>% dplyr::slice(2:3143) # this also gets rid of the notes at the bottom
```
As you can see in the cheat sheet, there are many ways to select rows, even at random! We won't do this, but this could be useful if you wanted to set up an analysis on a smaller subset of your data to make sure everything runs correctly but quickly. Then, after everything is set up, run your analyses on the entire data, which could take a while with huge data sets.
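As a quick illustration (we won't use the result), slice_sample() from dplyr (or sample_n() in older dplyr versions) draws rows at random; a minimal sketch:
```{r}
# draw 100 rows at random, e.g. to prototype an analysis on a small subset
pop_sizes_subset <- pop_sizes %>% slice_sample(n = 100)
nrow(pop_sizes_subset)
```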
The 'Geographic Area' column has two pieces of information in it: county and state. We should split this column into two columns so that we can access each piece separately. For instance, later on we will do an analysis not by county but by state, summing across a state's counties, and this requires having the state information in its own separate column. To split this information into two separate columns, we will use the separate() function:
```{r, include=FALSE}
pop_sizes <- pop_sizes %>% separate(col = 'Geographic Area',
into = c("County", "State"),
sep = ", ")
```
This illustrates that even though we split our data on commas when we loaded it in above, we can further split columns on specific characters. Above, we asked the read_delim() function to split on commas, so you might think this column/field would also get separated, but if you look at the raw data file you can see that each field is enclosed in quotes, and a comma inside quotes is treated as part of the field rather than as a delimiter.
And just so you know, there's a related function called unite() that does just the opposite: combining columns into a single column.
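For instance, a minimal sketch of unite() that stitches the County and State columns back together (just for illustration; we don't keep the result):
```{r}
# combine County and State back into a single column, separated by ", "
pop_sizes %>% unite(col = "CountyState", County, State, sep = ", ") %>% head()
```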
Let's get rid of the extra characters in the new 'County' column. We don't need the "." that precedes each name, and there's no point in every individual county name being followed by "County"... we know they're counties from the name of the column. We can use the mutate() function to add new columns with new names, and if the name of the new column is the same as an existing one, it simply replaces it!
```{r, include=FALSE}
pop_sizes <- pop_sizes %>%
  # fixed(".") matches a literal period; a bare "." is a regular expression that matches any single character
  mutate(County = str_remove(string = County, pattern = fixed("."))) %>%
  mutate(County = str_remove(string = County, pattern = " County"))
```
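To see why fixed(".") matters, compare what a bare "." does on a made-up string (hypothetical values, just for illustration):
```{r}
# "." as a regex matches ANY single character, so the first character is removed
str_remove("abc.def", pattern = ".")
# fixed(".") matches only the literal period
str_remove("abc.def", pattern = fixed("."))
```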
This table includes population size data for quite a few years. Let's just get the most recent population estimate, selecting the 2019 column with the select() function:
```{r, include=FALSE}
pop_sizes <- pop_sizes %>%
dplyr::select(County, State, "2019")
```
As we did above for the data sets we used last week, we can also combine all of these individual commands into one compact command that might look something like this:
```{r, include=FALSE}
pop_sizes <- read_delim(file="co-est2019-annres.csv", delim=",") %>%
  dplyr::slice(2:3143) %>%
  separate(col = 'Geographic Area', into = c("County", "State"), sep = ", ") %>%
  mutate(County = str_remove(string = County, pattern = fixed("."))) %>%
  mutate(County = str_remove(string = County, pattern = " County")) %>%
  dplyr::select(County, State, "2019")
```
We can take a quick peek at these population size data, now that they're cleaned, using the summary() function, which shows there's a ton of variability in county sizes!
```{r}
summary(pop_sizes)
```
Let's combine these population size data with our previous data on case counts and mask usage, as doing so may reveal interesting dynamics that vary by county size.
We will first use inner_join() on the case count data and the county population size data, joining by BOTH county and state. Why can't we just join by county? (See the quick check after the next code chunk.)
```{r, include=FALSE}
cases_popsize <- inner_join(x=cases_latest, y=pop_sizes,
by=c("county"="County", "state"="State"))
```
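To see why county alone isn't enough, we can count how many times each county name appears in the population table; any count above 1 means the same name occurs in more than one state (a quick check, not something we need to keep):
```{r}
# county names shared by multiple states (e.g. many states have a "Washington" county)
pop_sizes %>% count(County) %>% filter(n > 1) %>% arrange(desc(n))
```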
Using this table with case counts and population sizes, let's add in the mask data. For now, let's not add in all of the mask data; let's only incorporate the proportion of people who are "ALWAYS" masked in each county.
```{r, include=FALSE}
cases_popsize_masked <- inner_join(x=cases_popsize,
y=dplyr::select(mask_use_wide, c(COUNTYFP, ALWAYS)),
by=c("fips" = "COUNTYFP"))
```
Looking at this data table, for each county we now have cases, deaths, population size, and the proportion of people who are always masked.
You may also notice that the population size data came with a column entitled "2019", holding the size estimates for that year. Let's rename it to something more informative, "pop_size", and rename the ALWAYS column so that we know it refers to the proportion always masked.
```{r, include=FALSE}
cases_popsize_masked <- cases_popsize_masked %>%
rename("pop_size" = "2019", "AlwaysMasked" = ALWAYS)
```
# Visualizing data and finding outliers
Let's quickly probe these data just to get an idea of what they look like. The main goal here is to teach you how to use R/tidyverse commands to quickly and easily explore your data. Again, due to the context of these data, let's keep any results as descriptive observations. We can't say anything conclusive without more complicated analyses that we'll leave to the public health officials.
Let's assume these case count data are accurately measuring the number of people who are getting coronavirus infections (which is a horrible assumption because TONS of younger people may be completely asymptomatic and never get tested). Assuming this, what fraction of a county's population is testing positive, and how does this fraction vary with population size?
```{r}
cases_popsize_masked <- cases_popsize_masked %>%
mutate(FracPos = cases/pop_size)
```
Since the fraction of positive cases is just the case count divided by the population size, the chunk above creates a new column called 'FracPos' that holds this information. Do counties with large populations have a higher proportion of cases? Let's plot population size on the x axis and the proportion of cases on the y axis.
We can make a quick plot of these data using a base R (i.e. not tidyverse) plotting function:
```{r}
plot(x=cases_popsize_masked$pop_size,
y=cases_popsize_masked$FracPos,
log="x",
ylab="Fraction of cases",
xlab="pop size")
# DON'T RUN CORRELATION ANALYSES WITH HETEROSCEDASTICITY
```
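As an aside, since we're already in the tidyverse, the same plot can be made with ggplot2; a minimal sketch using the same variables and a log-scaled x axis:
```{r}
ggplot(cases_popsize_masked, aes(x = pop_size, y = FracPos)) +
  geom_point() +
  scale_x_log10() +
  labs(x = "pop size", y = "Fraction of cases")
```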
Looks interesting. Counties with larger population sizes may appear to have a higher proportion of cases, but it's a little complicated because there's a lot of variability among counties with smaller population sizes.
One thing that might catch our eye is those outliers with very high rates of positive cases. What counties are these, and in what states? We can very easily check this with the following:
```{r}
cases_popsize_masked %>%
filter(FracPos > 0.08) %>%
arrange(desc(FracPos))
```
Let's add another column for death rate, and visualize the relationship between the fraction of positive cases and the death rate:
```{r}
cases_popsize_masked <- cases_popsize_masked %>%
mutate(FracDeaths = deaths/cases)
plot(x=cases_popsize_masked$FracPos,
y=cases_popsize_masked$FracDeaths,
log="",
xlab="FracCases",
ylab="FracDeaths")
```
What are these counties that have been severely impacted by deaths? Is this something to be concerned about, or is this just noise from counties with small sizes and/or small case counts?
```{r}
cases_popsize_masked %>%
filter(FracDeaths > 0.15) %>%
arrange(desc(FracDeaths))
```
# group_by() and summarize() functions are extremely easy and powerful
The group_by() and summarize() functions allow us to quickly pull extremely informative summaries out of our data.
We'll show how they're useful in many ways, but I'd like to start by showing the importance of changing your data from wide format to long format.
Let's go back to our original mask use data table, not the combined one. What if we wanted to quickly get the overall mean values of ALWAYS, ..., NEVER across all the counties? We could use the original data in wide format and calculate the mean of each of these columns. We could also use the simple summary() function, which outputs a table:
```{r}
mean(mask_use_wide$NEVER)
mean(mask_use_wide$RARELY)
mean(mask_use_wide$SOMETIMES)
mean(mask_use_wide$FREQUENTLY)
mean(mask_use_wide$ALWAYS)
summary(mask_use_wide)
```
There's nothing wrong with doing it this way, but it's not very elegant, and it would get more cumbersome if our data were more complicated. Also, the table that summary() outputs is not the easiest to use in further analyses.
Another way of calculating these values is to convert the mask data into long format, where the mask-wearing frequency categories (NEVER, ..., ALWAYS) become values of a single categorical variable (MaskUseResponse). We then tell R to run an analysis separately for each value of that categorical variable, where here the analysis is just calculating a mean. In other words: for all the rows where MaskUseResponse is NEVER, calculate a mean, and repeat for each of the other MaskUseResponse values. We tell R to treat the different category values separately using group_by(), and the summarize() function then uses these groupings to perform the analysis, here calculating the mean.
```{r}
mask_use_long %>%
group_by(MaskUseResponse) %>%
summarize(mean(MaskUseProportion))
```
As you can see, this also outputs a tibble, which we can use for other downstream analyses.
Looking at the cheat sheet, you will see summarize can take a variety of functions.
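For instance, a minimal sketch of passing several functions to summarize() at once (mean, standard deviation, and a group size via n()):
```{r}
mask_use_long %>%
  group_by(MaskUseResponse) %>%
  summarize(MeanProp = mean(MaskUseProportion),
            SdProp = sd(MaskUseProportion),
            NCounties = n())
```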
Let's explore another example of group_by() and summarize() using our tibble that has all our combined information.
Instead of focusing on counties as our unit of analysis, we can use group_by() to easily switch to a state-level analysis: we tell summarize() to perform its calculation once per state, combining all the rows that share the same value in the "state" column. Combining group_by() with summarize() lets us do some pretty powerful analyses that would take a lot of effort in other programs.
```{r}
cases_popsize_masked %>%
group_by(state) %>%
summarize(MeanMasked = mean(AlwaysMasked)) %>%
arrange(desc(MeanMasked))
```
You can comment out the line containing the group_by() statement to see its effect (shown below).
Note that this creates a new tibble with a new column called "MeanMasked", holding the mean calculated separately for each value of whatever variable was given to group_by(). So what is happening here is: because we gave "state" to group_by(), R gathers all the rows (counties) that share the same value of "state" and uses all of their AlwaysMasked values to calculate one mean per state.
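To make the comparison concrete, here is the same summarize() call without the group_by() step; it collapses the whole table into a single overall mean:
```{r}
# without group_by(), summarize() returns a single row for the entire data set
cases_popsize_masked %>%
  summarize(MeanMasked = mean(AlwaysMasked))
```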
I'd like to point out that this is potentially a misleading analysis, because we are calculating a mean for a state from its counties, and some of those counties represent far fewer people than others. If we truly wanted the average behavior of a state, we should weight each county by its population size. Why should a county of 400 people contribute as much to the mean as a county of 40,000?
The following code calculates a custom mean value, where each county is weighted by its population size:
```{r}
cases_popsize_masked %>%
group_by(state) %>%
summarize(WeightedMeanMasked = sum(AlwaysMasked*pop_size)/sum(pop_size)) %>%
arrange(desc(WeightedMeanMasked))
```
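The sum(AlwaysMasked*pop_size)/sum(pop_size) expression is just a weighted mean written out by hand; base R's weighted.mean() should give the same result (a minimal sketch):
```{r}
# weighted.mean() weights each county's AlwaysMasked value by its population size
cases_popsize_masked %>%
  group_by(state) %>%
  summarize(WeightedMeanMasked = weighted.mean(AlwaysMasked, w = pop_size)) %>%
  arrange(desc(WeightedMeanMasked))
```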
See the element-wise vector math figure provided in the GitHub repository for a graphical explanation of what's being done here.
Did this weighting by county size actually make a difference? Let's store these analyses as 'x' and 'y' and compare them:
```{r}
x <- cases_popsize_masked %>%
group_by(state) %>%
summarize(MeanMasked = mean(AlwaysMasked)) %>%
arrange(state)
y <- cases_popsize_masked %>%
group_by(state) %>%
summarize(WeightedMeanMasked = sum(AlwaysMasked*pop_size)/sum(pop_size)) %>%
arrange(state)
hist(x$MeanMasked - y$WeightedMeanMasked)
```
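Besides eyeballing the histogram, we can also compute the average difference between the two estimates directly (a quick check):
```{r}
# average difference between unweighted and weighted state-level estimates
mean(x$MeanMasked - y$WeightedMeanMasked)
```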
It looks like on average, AlwaysMasked estimates that don't take pop_size into account are systematically lower than those that do. When we don't take county size into account, we effectively give more weight to smaller counties. Thus, if by doing this we produce lower estimates of AlwaysMasked by state, this suggests that smaller counties in general have lower values for AlwaysMasked. Is this true?
```{r}
plot(x=cases_popsize_masked$pop_size,
y=cases_popsize_masked$AlwaysMasked,
log="x",
xlab="pop size",
ylab="AlwaysMasked")
```
# Summary:
## New tidyverse functions used today:
- slice() to select specific rows by index number
- separate() to take a column with multiple pieces of information and split it into multiple columns
- mutate() with str_remove() to get rid of characters or words we don't want; there are many similar functions in the stringr package, which has its own cheat sheet
- select() to keep only the columns we want (or drop the ones we don't)
- group_by() to categorize our data by the values in a particular column (here, by state)
- summarize() to calculate simple but very informative statistics, such as mean and variance