Commit 50d850a

committed
differences for PR #117
1 parent 2f9363f commit 50d850a

13 files changed

+4428
-199
lines changed

03-data-cleaning-and-transformation.md

+125-83
Original file line number | Diff line number | Diff line change
@@ -7,48 +7,63 @@ source: Rmd
77

88
::::::::::::::::::::::::::::::::::::::: objectives
99

10-
- Describe the purpose of an R package and the **`dplyr`** and **`tidyr`** packages.
11-
- Select certain columns in a data frame with the **`dplyr`** function `select`.
12-
- Select certain rows in a data frame according to filtering conditions with the **`dplyr`** function `filter`.
13-
- Link the output of one **`dplyr`** function to the input of another function with the 'pipe' operator `%>%`.
14-
- Add new columns to a data frame that are functions of existing columns with `mutate`.
10+
- Describe the functions available in the **`dplyr`** and **`tidyr`** packages.
11+
- Recognise and use the following functions: `select()`, `filter()`, `rename()`,
12+
`recode()`, `mutate()` and `arrange()`.
13+
- Combine one or more functions using the 'pipe' operator `%>%`.
1514
- Use the split-apply-combine concept for data analysis.
16-
- Use `summarize`, `group_by`, and `count` to split a data frame into groups of observations, apply a summary statistics for each group, and then combine the results.
17-
- Describe the concept of a wide and a long table format and for which purpose those formats are useful.
18-
- Describe what key-value pairs are.
19-
- Reshape a data frame from long to wide format and back with the `pivot_wider` and `pivot_longer` commands from the **`tidyr`** package.
2015
- Export a data frame to a csv file.
2116

2217
::::::::::::::::::::::::::::::::::::::::::::::::::
2318

2419
:::::::::::::::::::::::::::::::::::::::: questions
2520

26-
- How can I select specific rows and/or columns from a data frame?
21+
- How can I create new columns or remove existing columns from a data frame?
22+
- How can I rename variables or assign new values to a variable?
2723
- How can I combine multiple commands into a single command?
28-
- How can create new columns or remove existing columns from a data frame?
29-
- How can I reformat a dataframe to meet my needs?
24+
- How can I export my data frame to a csv file?
3025

3126
::::::::::::::::::::::::::::::::::::::::::::::::::
3227

3328

3429

35-
### Getting set up
30+
## Transforming data with `dplyr`
3631

37-
#### Open your R Project file
32+
We are now entering the data cleaning and transforming phase. While it is
33+
possible to do much of the following using Base R functions (in other words,
34+
without loading an external package) `dplyr` makes it much easier. Like many of
35+
the most useful R packages, `dplyr` was developed by data scientist
36+
[Hadley Wickham](http://hadley.nz/).
3837

39-
If you have not already done so, open your R Project file (`library_carpentry.Rproj`) created in the `Before We Start` lesson.
38+
`dplyr` is a package for making tabular data manipulation easier by using a
39+
limited set of functions that can be combined to extract and summarize insights
40+
from your data. It pairs nicely with **`tidyr`** which enables you to swiftly
41+
convert between different data formats (long vs. wide) for plotting and
42+
analysis.
4043

41-
**If you did not complete that step** then do the following:
44+
`dplyr` is also part of the `tidyverse`. Let's make sure we are all on the same
45+
page by loading the `tidyverse` and the `books` dataset we downloaded earlier.
4246

43-
- Under the `File` menu, click on `New project`, choose `New directory`, then
44-
`New project`
45-
- Enter the name `library_carpentry` for this new folder (or "directory"). This
46-
will be your **working directory** for the rest of the day.
47-
- Click on `Create project`
48-
- Create a new file where we will type our scripts. Go to File > New File > R
49-
script. Click the save icon on your toolbar and save your script as
50-
"`script.R`".
51-
- Copy and paste the below lines of code to create three new subdirectories and download the data:
47+
We're going to learn some of the most common **`dplyr`** functions:
48+
49+
- `rename()`: rename columns
50+
- `recode()`: recode values in a column
51+
- `select()`: subset columns
52+
- `filter()`: subset rows on conditions
53+
- `mutate()`: create new columns by using information from other columns
54+
- `group_by()` and `summarize()`: create summary statistics on grouped data
55+
- `arrange()`: sort results
56+
- `count()`: count discrete values
57+
58+
<br>
59+
60+
::::::::::::::::::::::::::::::::::::::::: prereq
61+
62+
### Getting set up
63+
64+
Make sure you have your Rproj and directories set up from the previous episode.
65+
66+
**If you have not completed that step**, run the following in your Rproj:
5267

5368

5469
``` r
@@ -60,9 +75,7 @@ download.file("https://ndownloader.figshare.com/files/22031487",
6075
"data/books.csv", mode = "wb")
6176
```
6277

63-
#### Load the `tidyverse` and data frame into your R session
64-
65-
Load the `tidyverse`
78+
Then load the `tidyverse` package
6679

6780

6881
``` r
@@ -81,42 +94,17 @@ library(tidyverse)
8194
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
8295
```
8396

84-
And the `books` data we saved in the previous lesson.
97+
and the `books` data we saved in the previous lesson.
8598

8699

87100
``` r
88101
books <- read_csv("data/books.csv") # load the data and assign it to books
89102
```
90103

91-
### Transforming data with `dplyr`
92-
93-
We are now entering the data cleaning and transforming phase. While it is
94-
possible to do much of the following using Base R functions (in other words,
95-
without loading an external package) `dplyr` makes it much easier. Like many of
96-
the most useful R packages, `dplyr` was developed by data scientist
97-
[http://hadley.nz/](Hadley Wickham).
98-
99-
`dplyr` is a package for making tabular data manipulation easier by using a
100-
limited set of functions that can be combined to extract and summarize insights
101-
from your data. It pairs nicely with **`tidyr`** which enables you to swiftly
102-
convert between different data formats (long vs. wide) for plotting and
103-
analysis.
104-
105-
`dplyr` is also part of the `tidyverse.` Let's make sure we are all on the same
106-
page by loading the `tidyverse` and the `books` dataset we downloaded earlier.
107-
108-
We're going to learn some of the most common **`dplyr`** functions:
104+
:::::::::::::::::::::::::::::::::::::::::
109105

110-
- `rename()`: rename columns
111-
- `recode()`: recode values in a column
112-
- `select()`: subset columns
113-
- `filter()`: subset rows on conditions
114-
- `mutate()`: create new columns by using information from other columns
115-
- `group_by()` and `summarize()`: create summary statistics on grouped data
116-
- `arrange()`: sort results
117-
- `count()`: count discrete values
118106

119-
### Renaming variables
107+
## Renaming variables
120108

121109
It is often necessary to rename variables to make them more meaningful. If you
122110
print the names of the sample `books` dataset you can see that some of the
@@ -161,6 +149,15 @@ books <- rename(books,
161149
title = X245.ab)
162150
```
163151

152+
::::::::::::::::::::::::::::::::::::::::: callout
153+
154+
### Tip:
155+
156+
When using the `rename()` function, the new variable name comes first.
157+
158+
::::::::::::::::::::::::::::::::::::::::::::::::::
159+
160+
164161
::::::::::::::::::::::::::::::::::::::::: callout
165162

166163
### Side note:
@@ -170,10 +167,12 @@ because R variables cannot start with a number, R automatically inserted an X,
170167
and because pipes | are not allowed in variable names, R replaced it with a
171168
period.
172169

173-
174170
::::::::::::::::::::::::::::::::::::::::::::::::::
175171

176172

173+
You can also rename multiple variables at once by running:
174+
175+
177176
``` r
178177
# rename multiple variables at once
179178
books <- rename(books,
@@ -229,6 +228,7 @@ books <- rename(books,
229228

230229
::::::::::::::::::::::::::::::::::::::::::::::::::
231230

231+
232232
### Recoding values
233233

234234
It is often necessary to recode or reclassify values in your data. For example,
@@ -245,9 +245,9 @@ and `format` (formerly `BCODE2`) variables contain single characters.
245245
<p class="caption">Format (formerly BCODE2) export from Sierra</p>
246246
</div>
247247

248-
You can do this easily using the `recode()` function, also in the `dplyr`
249-
package. Unlike `rename()`, the old value comes first here. Also notice that we
250-
are overwriting the `books$subCollection` variable.
248+
You can reassign names to the values easily using the `recode()` function
249+
from the `dplyr` package. Unlike `rename()`, the old value comes first here.
250+
Also notice that we are overwriting the `books$subCollection` variable.
251251

252252

253253
``` r
@@ -323,9 +323,18 @@ books$format <- recode(books$format,
323323
"4" = "online video")
324324
```
325325

326-
### Subsetting dataframes
326+
Once you have finished recoding the values for the two variables, examine
327+
the dataset again and check to see if you have different values this time:
328+
329+
330+
``` r
331+
distinct(books, subCollection, format)
332+
```
333+
327334

328-
#### Subsetting using `filter()` in the `dplyr` package
335+
## Subsetting dataframes
336+
337+
### **Subset rows with `filter()`**
329338

330339
In the last lesson we learned how to subset a data frame using brackets. As with
331340
other R functions, the `dplyr` package makes it much more straightforward, using
@@ -383,7 +392,7 @@ serial_microform <- filter(books, format %in% c("serial", "microform"))
383392

384393
::::::::::::::::::::::::::::::::::::::: challenge
385394

386-
### Filtering with `filter()`
395+
### Exercise: Subsetting with `filter()`
387396

388397
1. Use `filter()` to create a data frame called `booksJuv` consisting of `format` books and `subCollection` juvenile materials.
389398

@@ -409,7 +418,10 @@ mean(booksJuv$tot_chkout)
409418

410419
::::::::::::::::::::::::::::::::::::::::::::::::::
411420

412-
### Selecting variables
421+
<br>
422+
423+
424+
### **Subset columns with `select()`**
413425

414426
The `select()` function allows you to keep or remove specific columns. It also
415427
provides a convenient way to reorder variables.
@@ -446,7 +458,25 @@ books <- select(books, -location)
446458
booksReordered <- select(books, title, tot_chkout, loutdate, everything())
447459
```
448460

449-
### Ordering data
461+
::::::::::::::::::::::::::::::::::::::: callout
462+
463+
### Tips:
464+
465+
If your variable name contains spaces, enclose it in backticks:
466+
467+
468+
``` r
469+
select(df, -`my variable`)
470+
```
471+
472+
``` error
473+
Error in UseMethod("select"): no applicable method for 'select' applied to an object of class "function"
474+
```
475+
476+
:::::::::::::::::::::::::::::::::::::::
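The error in the tip above is what R reports when `df` does not refer to a data frame. In a fresh session, `df` is the base R density function for the F distribution. As a minimal sketch, assuming a small made-up tibble called `df_example` (not part of the `books` data), the backtick syntax works like this:

```r
library(dplyr)

# Hypothetical data frame with a space in one column name
df_example <- tibble(`my variable` = c(1, 2, 3),
                     title = c("A", "B", "C"))

# Backticks let select() refer to a name that contains a space;
# the leading minus sign drops that column
kept <- select(df_example, -`my variable`)

names(kept)
```

Only the `title` column should remain in `kept`.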
477+
478+
479+
## Ordering data with `arrange()`
450480

451481
The `arrange()` function in the `dplyr` package allows you to sort your data in alphabetical or numerical order.
452482

@@ -482,7 +512,8 @@ booksHighestChkout
482512
booksChkoutYear <- arrange(books, desc(tot_chkout), desc(pubyear))
483513
```
484514

485-
### Creating new variables
515+
516+
## Creating new variables with `mutate()`
486517

487518
The `mutate()` function allows you to create new variables. Here, we use the
488519
`str_sub()` function from the `stringr` package to extract the first character
@@ -514,7 +545,9 @@ Caused by warning:
514545

515546
We see the warning message `NAs introduced by coercion`. This is because non-numerical values become `NA` and the remainder become integers.
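The coercion behaviour described above can be reproduced on a small vector. This is only an illustrative sketch: `first_chars` is a made-up set of values, not taken from the `books` data.

```r
# Hypothetical first characters of call numbers; some are not digits
first_chars <- c("0", "3", "E", "9", NA)

# as.integer() converts digit strings and returns NA for everything else.
# Without suppressWarnings() this raises "NAs introduced by coercion".
converted <- suppressWarnings(as.integer(first_chars))

converted

# Count how many values could not be converted (or were already missing)
sum(is.na(converted))
```

Counting the `NA` values afterwards is a quick way to check how much data the conversion could not interpret.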
516547

517-
### Putting it all together with %>%
548+
549+
550+
## Putting it all together with %>%
518551

519552
The [Pipe Operator](https://www.datacamp.com/community/tutorials/pipe-r-tutorial) `%>%` is
520553
loaded with the `tidyverse`. It takes the output of one statement and makes it
@@ -557,7 +590,7 @@ myBooks
557590

558591
::::::::::::::::::::::::::::::::::::::: challenge
559592

560-
### Playing with pipes `%>%`
593+
### Exercise: Playing with pipes `%>%`
561594

562595
1. Create a new data frame `booksKids` with these conditions:
563596

@@ -589,14 +622,18 @@ mean(booksKids$tot_chkout)
589622

590623
::::::::::::::::::::::::::::::::::::::::::::::::::
591624

592-
### Split-apply-combine data analysis and the `summarize()` function
625+
626+
### Split-apply-combine and the `summarize()` function
593627

594628
Many data analysis tasks can be approached using the *split-apply-combine*
595629
paradigm: split the data into groups, apply some analysis to each group, and
596630
then combine the results. **`dplyr`** makes this very easy through the use of
597631
the `group_by()` function.
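As a rough sketch of the three steps, the following uses a small made-up tibble (`toy_books`, with assumed `format` and `tot_chkout` columns) rather than the full `books` data:

```r
library(dplyr)

# Hypothetical data: five items in three formats
toy_books <- tibble(
  format = c("book", "book", "serial", "serial", "video"),
  tot_chkout = c(4, 6, 1, 3, 10)
)

chkout_by_format <- toy_books %>%
  group_by(format) %>%                        # split into one group per format
  summarize(mean_chkout = mean(tot_chkout),   # apply summaries to each group
            n_items = n()) %>%                # results are combined, one row per group
  arrange(desc(mean_chkout))

chkout_by_format
```

Each format ends up as a single row holding its mean checkouts and item count.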
598632

599-
##### The `summarize()` function
633+
<br>
634+
635+
636+
#### The `summarize()` function
600637

601638
`group_by()` is often used together with `summarize()`, which collapses each
602639
group into a single-row summary of that group. `group_by()` takes as arguments
@@ -673,7 +710,10 @@ Let's break this down step by step:
673710
- Finally, we arrange `sum_tot_chkout` in descending order, so we can see the
674711
class with the most total checkouts. We can see it is the `E` class (History of America), followed by `NA` (items with no call number data), followed by `H` (Social Sciences) and `P` (Language and Literature).
675712

676-
### Pattern matching
713+
<br>
714+
715+
716+
### **Pattern matching**
677717

678718
Cleaning text with the `stringr` package is easier when you have a basic understanding of 'regex', or regular expression pattern matching. Regex is especially useful for manipulating strings (alphanumeric data), and is the backbone of search-and-replace operations in most applications. Pattern matching is common to all programming languages, but regex syntax often differs between languages. Below is an example of using pattern matching to find and replace data in R:
679719

@@ -709,7 +749,9 @@ books %>%
709749
# ℹ 9,990 more rows
710750
```
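For a smaller illustration of the same idea, here are a few `stringr` calls on made-up vectors. The values in `call_numbers` and `titles` are hypothetical, chosen only to show how the patterns behave:

```r
library(stringr)

# Hypothetical call numbers and titles
call_numbers <- c("E 185.61 .M5", "PS3554 .A2", "HQ 1236", NA)
titles <- c("Sound and fury /", "Beloved", "Dune /")

# "^[A-Z]+" matches one or more capital letters at the start of the string
class_letters <- str_extract(call_numbers, "^[A-Z]+")

# "\\s*/$" matches optional whitespace followed by a slash at the end
clean_titles <- str_remove(titles, "\\s*/$")

# str_detect() returns TRUE or FALSE for each element (NA stays NA)
starts_with_e <- str_detect(call_numbers, "^E")

class_letters
clean_titles
starts_with_e
```

Missing values pass through as `NA` in each of these functions, so they do not need to be filtered out first.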
711751

712-
### Exporting data
752+
753+
754+
## Exporting data
713755

714756
Now that you have learned how to use **`dplyr`** to extract information from
715757
or summarize your raw data, you may want to export these new data sets to share
@@ -718,14 +760,11 @@ them with your collaborators or for archival.
718760
Similar to the `read_csv()` function used for reading CSV files into R, there is
719761
a `write_csv()` function that generates CSV files from data frames.
720762

721-
Before using `write_csv()`, we are going to create a new folder, `data_output`,
722-
in our working directory that will store this generated dataset. We don't want
723-
to write generated datasets in the same directory as our raw data. It's good
724-
practice to keep them separate. The `data` folder should only contain the raw,
763+
We have previously created the directory `data_output` which we will use to
764+
store our generated dataset. It is good practice to keep your input and output
765+
data in separate folders--the `data` folder should only contain the raw,
725766
unaltered data, and should be left alone to make sure we don't delete or modify
726-
it. In contrast, our script will generate the contents of the `data_output`
727-
directory, so even if the files it contains are deleted, we can always
728-
re-generate them.
767+
it.
729768

730769
In preparation for our next lesson on plotting, we are going to create a
731770
version of the dataset with most of the changes we made above. We will first read in the original, then make all the changes with pipes.
@@ -784,22 +823,25 @@ We now write it to a CSV and put it in the `data/output` sub-directory:
784823
write_csv(books_reformatted, "./data_output/books_reformatted.csv")
785824
```
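If you want to confirm that an export worked, one option is to read the file straight back in. The sketch below writes a made-up data frame (`toy`) to a temporary folder so it does not touch your project. The file name `toy_books.csv` is hypothetical. The output folder is created first, since `write_csv()` is not expected to create missing directories for you.

```r
library(readr)

# Hypothetical data to export
toy <- data.frame(title = c("A", "B"), tot_chkout = c(3, 0))

# Use a temporary location for this demonstration
out_dir <- file.path(tempdir(), "data_output")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "toy_books.csv")

# Write the data frame, then read it back to check the round trip
write_csv(toy, out_file)
check <- read_csv(out_file, show_col_types = FALSE)

file.exists(out_file)
names(check)
```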
786825

826+
787827
## Help with dplyr
788828

789829
- Read more about `dplyr` at [https://dplyr.tidyverse.org/](https://dplyr.tidyverse.org/).
790830
- In your console, after loading `library(dplyr)`, run `vignette("dplyr")` to read an extremely helpful explanation of how to use it.
791831
- See the ["Data Transformation" chapter](http://r4ds.had.co.nz/transform.html) in Garrett Grolemund and Hadley Wickham's book *R for Data Science.*
792832
- Watch this Data School video: [Hands-on dplyr tutorial for faster data manipulation in R](https://www.youtube.com/watch?v=jWjqLW-u3hc).
793833

794-
## Wrangling dataframes with tidyr
795834

796835
:::::::::::::::::::::::::::::::::::::::: keypoints
797836

798837
- Use the `dplyr` package to manipulate dataframes.
799-
- Use `select()` to choose variables from a dataframe.
800-
- Use `filter()` to choose data based on values.
801-
- Use `group_by()` and `summarize()` to work with subsets of data.
838+
- Subset data frames using `select()` and `filter()`.
839+
- Rename variables in a data frame using `rename()`.
840+
- Recode values in a data frame using `recode()`.
802841
- Use `mutate()` to create new variables.
842+
- Sort data using `arrange()`.
843+
- Use `group_by()` and `summarize()` to work with subsets of data.
844+
- Use the pipe (`%>%`) to combine multiple commands.
803845

804846
::::::::::::::::::::::::::::::::::::::::::::::::::
805847
