CHM210-Assignment-3-Tutorial/CHM210_Assignment3 tutorial.rmd at main · UofTChem-Teaching/CHM210-Assignment-3-Tutorial · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
---
title: "CHM210 Tutorial 3"
output:
  html_document:
    toc: false
    number_sections: true

---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction

In this tutorial we will be using R to generate interactive pollutant maps for ozone ($O_3$), nitrogen dioxide ($NO_2$), and odd oxygen ($O_X$) in Ontario. We will use pollutant data from the National Air Pollution Surveillance (NAPS) system, which measures pollutant concentrations at 40 stations throughout Ontario. Previously, we looked at NAPS data from a single measurement station over a period of several days. Here we are looking at pollutant measurements from all stations across Ontario averaged for a single 24-hour period.

As was noted in the previous tutorial, the [*R for Environmental Chemistry*](https://davidrosshall.github.io/R4EnvChem/) textbook is available if additional information is needed on the basic functioning of the R programming language.

# Part A: Importing packages and data sets

## Importing the required packages

In order to complete this tutorial and follow along with the examples, you will need to install the `leaflet` and `tidyverse` packages. If you have not installed these packages in R previously, please run the following code in the R console.

```{r install-pkgs, eval = FALSE}
install.packages("leaflet")
install.packages("tidyverse")
```

If you already have these packages installed, you can simply load them into your R session using the `library()` function, as shown below.

```{r load-pkgs, message = FALSE, warning = FALSE}
x <- c("leaflet", "tidyverse")

lapply(x, library, character.only = TRUE)

```

Now that we have the required packages loaded, let's move on to importing our data set.

## Importing the data sets

As usual, our first step will be importing our data using the `read.csv()` function.

```{r import-data, message = FALSE}
on <- read_csv("2018-01-01_ON.csv")
```

Let's take a quick look at our data.

```{r}
head(on)
```

Oops! You may see that our imported data doesn't look right!

It is very common for data files to be distributed with meta-data in the headers which explain the parameters of how the data was collected. This is bad for importing our data, because R then tries to read that meta data into data-frame, which generally doesn't work out. Fortunately, there's an easy way to fix this. If you open the original data file (either in your file browser, or in RStudio, you'll see that the file contains 45 lines of meta-data, followed by the table header. By adding the `skip = 46` setting to our `read.csv()` call, we can tell R to ignore the meta-data and just read our file's data.

```{r}
on <- read.csv("2018-01-01_ON.csv", skip = 46)
```


Let's take a quick look at our data. The first column indicates the NAPS station ID number. The next few columns are pretty self-explanatory. Concentration measurements for $NO_2$ and $O_3$ are reported as 24-hour averages in units of parts-per-billion (ppb). You may note that like last time, the date column is considered as a `chr` rather than as a date object, we don't really care about this because we will not be plotting a time series in this tutorial.

```{r view-dat}
head(on)
```

## Removing NA values

As we did previously, we want to check our data for error-coded data.

```{r}
sum(is.na(on))
min(on$O3, na.rm = TRUE)
min(on$NO2, na.rm = TRUE)
```
You will see that there are 6 NAs, while the `min()` gives normal values.

You might notice that station #60106 in Ottawa did not record any ozone or nitrogen dioxide concentrations on this date. A few other stations have also failed to record data on this date.

Let's remove rows that don't have any concentration data recorded. We will use `na.omit()` to remove rows containing NA values. An example is shown below.

```{r remove-na}
# Remove rows containing NAs
on <- na.omit(on)
head(on)
```

The rows containing NA values have now been removed. We can proceed with generating some odd oxygen data.

## Generate odd oxygen data

Recall from the last assignment that $NO_2$ and $O_3$ are two pollutants which are intimately linked.

$O_3$ can be converted to $NO_2$ through reaction with NO and $NO_2$ may be subsequently cycled back to $O_3$ via photolysis. This null cycle happens quickly enough during the day that it is convenient to define the sum of $O_3$ and $NO_2$ as *odd oxygen*.

To calculate the odd oxygen concentration for each time point, you will need to add the $NO_2$ concentration and $O_3$ concentration for each row. Let's use `mutate()` again to add a new column to `on`. This time we'll call our new column `Ox`, and it will be filled with the sum of the `NO2` and `O3` columns. An example is shown below.

```{r add-Ox}
on <- mutate(on, Ox = NO2 + O3)
head(on)
```

We can see that the odd oxygen column of `on` is now populated with odd oxygen concentrations for each time point. Now we can proceed with creating our pollutant maps.

# Part B: Creating pollutant maps

The package `leaflet` allows us to create pollutant maps in R. We will use the function `leaflet()` to make our map object. An example is shown in the following section.

## Creating a basic map

`leaflet()` is a very powerful function that can create a simple map in just a single line of code, as shown in the code block below.

```{r}
m <- leaflet(on) %>% addTiles() %>% addCircleMarkers(lat = ~latitude, lng = ~longitude, label = ~paste(on$O3, "ppb"), radius = 1)

m

```

Let's break the code above down step-by-step. Our `leaflet()` call creates the map. This function simply creates an interactive map widget in R, but does not add any visual attributes of the map (i.e., map tiles or pollutant markers), as shown below.

```{r leaflet-test}
leaflet(on)
```

In order to add visual attributes to our map widget, we will use `addTiles()`. This will add a base map to our map widget, so we can see where in the world our pollutant markers are being plotted. An example is shown below.

```{r base-map}
ONT <- leaflet(on) %>% addTiles()
```

Note that when we call `leaflet()`, we need to specify the name of our data set within the brackets. We then use the pipe operator, `%>%`, to feed the output of our call of `leaflet()` into our call of `addTiles()`. This is similar to the use of the `+` operator when creating `ggplot` objects, which we used in the previous computational assignment. We will call our map object `ONT`, using the assignment operator `<-`.

Now let's view our base map.

```{r base-map-show}
ONT
```

You can see that we have added a world map tile to our map widget. You can zoom in and out of the map by scrolling, or by using the `+` and `-` buttons in the top left corner of the widget.

Let's proceed to adding some pollutant markers.

## Add pollutant markers

As mentioned in Part A, our data set contains ozone and nitrogen dioxide measurements in parts-per-billion (ppb) collected at 40 NAPS stations in Ontario. Let's plot the ozone measurements on our `ONT` map, using the latitude and longitude data for each station to inform the location of the measurement.

We will use `addCircleMarkers()` to add pollutant markers to our `ONT` map. Just like when we called `addTiles()`, we will use the pipe operator, `%>%`, to add the pollutant markers onto our existing map widget.

Inside of the brackets of `addCircleMarkers()` we will need to specify a few arguments. First, we will specify our data set using `data = on`. We will then specify the latitude and longitude columns of the data set using `lat = ~latitude` and `long = ~longitude`. Notice that we use `~` before specifying the name of the latitude and longitude columns. This is similar to specifying `aes()` when using `ggplot` functions; it maps the data in the specified columns to the pollutant markers which we are trying to plot.

We also need to specify what the pollutant marker labels should say. We want the labels to indicate the ozone concentration measured at a given site, in units of ppb. We will specify this by indicating `label = paste(on$O3, "ppb")`. We need to use `paste()` to indicate that we want the labels to show the concentration data from the `O3` column of `on`, followed by units in ppb.

An example is shown below.

``` {r add-ozone}
ONT %>% addCircleMarkers(data = on, lat = ~latitude, lng = ~longitude,
                   label = paste(on$O3, "ppb"), radius = 1)
```

As you can see, the output of our original code-block gives the same result as this more extended explanation.

```{r}
m <- leaflet(on) %>% addTiles() %>% addCircleMarkers(lat = ~latitude, lng = ~longitude, label = ~paste(on$O3, "ppb"), radius = 1)

m
```

Do you understand why the two methods give the same result? If you are unsure, try creating a new codeblock below and copying all of the code we used for the `ONT` map into it so that you can see it all together.

## Colour markers by pollutant concentration

You can see that our map now has circular markers at each location where ozone concentrations are measured by NAPS, which show the ozone concentration measured at that location when you hover your mouse over the circles. This is great, but the map would be more visually informative if we colored the circles by pollutant concentration. We'll learn to do this in the following section.

Let's map the color of our ozone pollutant markers to the ozone concentration. We will do this by dividing ozone concentrations into bins, based on how high or low the concentration is at a given site.

Before we decide what concentration ranges we should use for our bins, we should try to get a better idea of how our data is distributed. We can do this using the `hist()` function to quickly plot a histogram, and using `min()` and `max()` to know our data's outer boundaries. let's look at the minimum and maximum ozone concentrations logged for this day in Ontario. We can do this using `min()` and `max()`. Recall from the previous assignment that we can refer to a specific column of our data set using the `$` operator; we can refer to the `O3` column of `on` by writing `on$O3` inside of the brackets of the min and max functions.

Examples are shown below.

```{r conc-range}
hist(on$O3)

min(on$O3)
max(on$O3)
```

As you can see from the output above, the minimum ozone concentration measured in Ontario on this day was 13.16 ppb, and the maximum concentration was 37.68 ppb. Let's divide our colour scheme into 4 levels:

1. Concentrations less than 10 ppb (<10 ppb)
2. Concentrations between 10 and 20 ppb (10-20 ppb)
3. Concentrations between 20 and 30 ppb (20-30 ppb)
4. Concentrations greater than 30 ppb (30+ ppb)

We will divide our ozone concentration data using `cut()`. Inside the brackets we need to specify the data set and column of data we are binning (`on$O3`), a vector including the breakpoints for each concentration bin (`c(0, 10, 20, 30, 40)`), whether we want to include the lowest value (`include.lowest = T`), and a vector including the names of each bin (`c()`). An example is shown below.

```{r conc-bins}
on$O3Lvl <- cut(on$O3,
                c(0,10,20,30,40), include.lowest = T,
                labels = c('<10 ppb', '10-20 ppb',
                           '20-30 ppb', '30+ ppb'))
```

Note that we assign the output of `cut()` to a new column of `on` called `O3Lvl` using the assignment operator, `<-`. This is required for us to map our new ozone concentration bins to our color scheme.

We will assign a color palette to our concentration levels using `colorFactor()`. Inside the brackets we will need to specify the color palette we want to use, in this case let's use 'Blues' by indicating `palette = 'Blues'`, and the concentration levels which we are mapping the color scheme to (`on$O3Lvl`). We will assign our color scheme to a variable titled `O3Col` using the assignment operator. An example is shown below.

```{r conc-col}
O3Col <- colorFactor(palette = 'Blues', on$O3Lvl)
```

We can now replot our pollutant markers with our new color scheme, by adding `color = ~O3Col(O3Lvl)` to our previous call of `addCircleMarkers()`. We also need to specify that the pollutant we are plotting in this call is ozone, by indicating `group = "Ozone"`. This is important as we will be adding $NO_2$ data to the same map, and if we do not specify which group the concentrations belong to, then you will not be able to distinguish between $O_3$ and $NO_2$ readings collected at the same site.

You may also want to customize the appearance of the circles, such as the border or the opacity of the markers, by using `stroke = FALSE` or `fillOpacity = 0.7`.

```{r plot-ozone-col}
ONT <- ONT %>% addCircleMarkers(data = on, lat = ~latitude, lng = ~longitude,
                                label = paste(on$O3, "ppb"), color = ~O3Col(O3Lvl),
                                stroke = FALSE, fillOpacity = 0.7,
                                group = "Ozone")
```

Now let's look at our updated map.

``` {r show-ozone}
ONT
```

Our markers are now colored according to the ozone concentrations, with higher concentrations represented by circles in darker hues of blue. Let's add a legend to ease interpretation of the color scheme. We will do so in the following section.

## Add legend

To add a legend to our map, we will use `addLegend()`. Again, we need to use the pipe operator, `%>%`, to add our new element to our existing map widget, `ONT`.

Inside of the brackets, there are a few things we need to specify. First, we need to indicate where we want the legend to appear in our map widget (`'bottomright'`). We then need to specify the color palette we created, `pal = O3Col`, and the values which were used to create the palette, `values = on$O3LVL`. We also want to add a title to our legend, so we know which pollutant the legend corresponds to. You can add a title by specifying `title = 'Daily average ozone concentration'` inside of the brackets of `addLegend()`. You might want to add `<br>` somewhere in the title so that it is written on two lines instead of one, making it easier to fit into the legend. An example is shown below.

```{r add-legend}
ONT <- ONT %>% addLegend('bottomright', pal = O3Col, values = on$O3Lvl,
                         title = 'Daily average<br>ozone concentration')
```

Now let's view our map.

```{r oz-legend}
ONT
```

Beautiful! Let's add our nitrogen dioxide data to the map.

## Add nitrogen dioxide pollutant markers

We will add nitrogen dioxide concentrations to our map using the same steps we performed above. Let's start out by creating concentration bins for the $NO_2$ data. We will pick our bins based on the range of concentration values in the `on$NO2` column. Let's determine the range using `hist()` `min()` and `max()`.

```{r no2-range}
hist(on$NO2)
min(on$NO2)
max(on$NO2)
```

The minimum $NO_2$ concentration measured on this day was 1 ppb, while the maximum concentration was 27.72 ppb. Let's divide our $NO_2$ concentrations into 6 bins:

1. Concentrations lower than 5 ppb (<5 ppb)
2. Concentrations between 5 and 10 ppb (5-10 ppb)
3. Concentrations between 10 and 15 ppb (10-15 ppb)
4. Concentrations between 15 and 20 ppb (15-20 ppb)
5. Concentrations between 20 and 25 ppb (20-25 ppb)
6. Concentrations greater than 25 ppb (25+ ppb)

We divide our $NO_2$ concentrations into bins using `cut()`, the same way we binned our ozone data. An example is shown below.

```{r bin-no2}
on$NO2Lvl <- cut(on$NO2,
                c(0,5,10,15,20,25,30), include.lowest = T,
                labels = c('<5 ppb', '5-10 ppb', '10-15 ppb',
                           '15-20 ppb', '20-25 ppb', '25+ ppb'))
```

Now that we have divided our concentration data into bins, let's map the concentrations to a color scheme. We will do this using `colorFactor()`, the same way we did with our ozone data. Let's use the color palette `Greens` instead of `Blues`, so we can distinguish between ozone and $NO_2$ pollutant markers on our map. An example is shown below.

```{r no2-col}
NO2Col <- colorFactor(palette = 'Greens', on$NO2Lvl)
```

Now that we've created our color scheme, we can add $NO_2$ markers to our existing map. We will use code similar to what we used for the ozone data, with a few changes to inputs where we previously referenced the ozone column of the `on` data set. An example is shown below.

```{r add-NO2}
ONT <- ONT %>% addCircleMarkers(data = on, lng = ~longitude,
                                lat = ~latitude, stroke = F,
                                color = ~NO2Col(NO2Lvl),
                                label = paste(on$NO2, "ppb"),
                                fillOpacity = 0.7,
                                group = "Nitrogen dioxide") %>%
  addLegend('bottomleft', pal = NO2Col, values = on$NO2Lvl,
            title = 'Daily average<br>nitrogen dioxide<br>concentration')

ONT
```

## Add odd oxygen pollutant markers

Let's repeat the steps above for the odd oxygen data. First we'll determine the minimum and maximum odd oxygen concentrations measured on this day across Ontario.

```{r Ox-range}
hist(on$Ox)
min(on$Ox)
max(on$Ox)
```

The minimum $O_X$ concentration measured on this day was 34.8 ppb, while the maximum concentration was 43.48 ppb. Let's divide our $O_X$ concentrations into 4 bins:

1. Concentrations lower than 35 ppb (<35 ppb)
2. Concentrations between 35 and 40 ppb (35-40 ppb)
3. Concentrations between 40 and 45 ppb (40-45 ppb)
4. Concentrations greater than 45 ppb (45+ ppb)

We divide our $O_X$ concentrations into bins using `cut()`, the same as done previously. An example is shown below.

```{r bin-Ox}
on$OxLvl <- cut(on$Ox,
                c(0,35,40,45,50), include.lowest = T,
                labels = c('<35 ppb', '35-40 ppb', '40-45 ppb',
                           '45+ ppb'))
```

Now that we have divided our concentration data into bins, let's map the concentrations to a color scheme. We will do this using `colorFactor()`. Let's use the color palette `Reds` so we can distinguish between pollutant markers on our map. An example is shown below.

```{r Ox-col}
OxCol <- colorFactor(palette = 'Reds', on$OxLvl)
```

Now that we've created our color scheme, we can add $O_X$ markers to our existing map.

```{r add-Ox-map}
ONT <- ONT %>% addCircleMarkers(data = on, lng = ~longitude,
                                lat = ~latitude, stroke = F,
                                color = ~OxCol(OxLvl),
                                label = paste(on$Ox, "ppb"),
                                fillOpacity = 0.7,
                                group = "Odd oxygen") %>%
  addLegend('topleft', pal = OxCol, values = on$OxLvl,
            title = 'Daily average<br>odd oxygen<br>concentration')

ONT
```

Looks great, but our pollutant markers are overlapping, as the concentration measurements were collected at the same locations. We will add another legend which controls which pollutant is plotted on the map in the following section.

## Show/hide layers

To add a legend which controls the pollutant layers, we will use `addLayersControl()`. Inside of the brackets we need to specify what pollutants are plotted, using `overlayGroups = c("Nitrogen dioxide", "Ozone", "Odd oxygen")`. Note that the groups listed in this vector must match the `group` argument from `addCircleMarkers()`.

```{r show-hide}
ONT <- ONT %>% addLayersControl(overlayGroups = c("Nitrogen dioxide", "Ozone",
                                                  "Odd oxygen"))

ONT
```

You can see from the above output that our map now has a legend in the top right corner which allows us to choose between which pollutant we want to view.

##Setting Color palettes automatically

While the approach above gives the user a great deal of control over their figure, it is also possible to allow R to set the colour palette automatically. A brief example is shown below.

```{r}
pal <- colorNumeric(
  palette = "Blues", domain = on$O3
)

m <- leaflet(on) %>% addTiles() %>% addCircleMarkers(lat = ~latitude, lng = ~longitude,
                                                     label = ~paste(on$O3, "ppb"), color = ~pal(O3),
                                                     group = "Ozone", stroke = FALSE, fillOpacity = 0.7)%>%
  addLegend("bottomright", pal = pal, values = on$O3, title = "Daily average ozone<br>concentration (ppb)")

m

```

Compare with:

```{r}
ONT <- leaflet(on) %>% addTiles() %>% addCircleMarkers(data = on, lat = ~latitude, lng = ~longitude,
                                label = paste(on$O3, "ppb"), color = ~O3Col(O3Lvl),
                                stroke = FALSE, fillOpacity = 0.7,
                                group = "Ozone")%>%addLegend('bottomright', pal = O3Col, values = on$O3Lvl,
                         title = 'Daily average<br>ozone concentration')

ONT

```

As you can see, while the two figures are formatted slightly differently, substantively they are quite similar, and indeed, by setting the colors automatically, you can make a figure with better colour resolution, which would be tedious to achieve with manual formatting. Both methods are useful, automatically formatting the colours can be more convenient, but if more control is desired, it is useful to know how to format the colours manually.

##Setting colours on non-linear scales

While the data values in this tutorial followed a semi-Gaussian distribution, which was nicely behaved with equally distributed color bins, often one can have non-Gaussian distributions of data, which are more convenient to plot using a different distribution. A brief example is shown below for a logarithmic distribution.
```{r}
pal <- colorNumeric(
  palette = "Blues", domain = log(on$O3,10)
)

n <- leaflet(on) %>% addTiles() %>% addCircleMarkers(lat = ~latitude, lng = ~longitude,
                                                     label = ~paste(on$O3, "ppb"), color = ~pal(log(O3,10)),
                                                     group = "Ozone", stroke = FALSE, fillOpacity = 0.7)%>%
  addLegend("bottomright", pal = pal, values = log(on$O3,10), title = "Daily average<br>Ozone<br>concentration log(ppb)", labFormat = labelFormat(transform = function(x) 10^x))

n
```

Note the use of `transform = function(x) 10^x`, which is used so that the legend values can be set on `log(on$o3,10)`, but then reversed so that the labels use the original values of on$o3 instead of the transformed values.

If you zoom in on Toronto in plot n, and compare with the same level of zoom on plot m, you will see that in plot n the 5 sites within Toronto are slightly different shades of blue, while in plot m the 5 sites are almost indistinguishable visually. This is due to the logarithmic transformation increasing the resolution used to plot the colours. One should be cautious to avoid over-accentuating the differences between similar measurements by making frivolous transformations of your data, but in some cases you may find it useful to plot log-transformed versions of your data.