changes to episode-03 structure #117

Open
wants to merge 2 commits into
base: main
200 changes: 118 additions & 82 deletions episodes/03-data-cleaning-and-transformation.Rmd
@@ -7,26 +7,21 @@ source: Rmd

::::::::::::::::::::::::::::::::::::::: objectives

- Describe the purpose of an R package and the **`dplyr`** and **`tidyr`** packages.
- Select certain columns in a data frame with the **`dplyr`** function `select`.
- Select certain rows in a data frame according to filtering conditions with the **`dplyr`** function `filter`.
- Link the output of one **`dplyr`** function to the input of another function with the 'pipe' operator `%>%`.
- Add new columns to a data frame that are functions of existing columns with `mutate`.
- Describe the functions available in the **`dplyr`** and **`tidyr`** packages.
- Recognise and use the following functions: `select()`, `filter()`, `rename()`,
`recode()`, `mutate()` and `arrange()`.
- Combine one or more functions using the 'pipe' operator `%>%`.
- Use the split-apply-combine concept for data analysis.
- Use `summarize`, `group_by`, and `count` to split a data frame into groups of observations, apply summary statistics to each group, and then combine the results.
- Describe the concept of a wide and a long table format and for which purpose those formats are useful.
- Describe what key-value pairs are.
- Reshape a data frame from long to wide format and back with the `pivot_wider` and `pivot_longer` commands from the **`tidyr`** package.
- Export a data frame to a csv file.

::::::::::::::::::::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::: questions

- How can I select specific rows and/or columns from a data frame?
- How can I create new columns or remove existing columns from a data frame?
- How can I rename variables or assign new values to a variable?
- How can I combine multiple commands into a single command?
- How can I create new columns or remove existing columns from a data frame?
- How can I reformat a dataframe to meet my needs?
- How can I export my data frame to a csv file?

::::::::::::::::::::::::::::::::::::::::::::::::::

@@ -36,48 +31,7 @@ library(readr)
books <- read_csv("./data/books.csv")
```

### Getting set up

#### Open your R Project file

If you have not already done so, open your R Project file (`library_carpentry.Rproj`) created in the `Before We Start` lesson.

**If you did not complete that step** then do the following:

- Under the `File` menu, click on `New project`, choose `New directory`, then
`New project`
- Enter the name `library_carpentry` for this new folder (or "directory"). This
will be your **working directory** for the rest of the day.
- Click on `Create project`
- Create a new file where we will type our scripts. Go to File > New File > R
script. Click the save icon on your toolbar and save your script as
"`script.R`".
- Copy and paste the below lines of code to create three new subdirectories and download the data:

```{r create-dirs, eval=FALSE}
library(fs) # https://fs.r-lib.org/. fs is a cross-platform, uniform interface to file system operations via R.
dir_create("data")
dir_create("data_output")
dir_create("fig_output")
download.file("https://ndownloader.figshare.com/files/22031487",
"data/books.csv", mode = "wb")
```

#### Load the `tidyverse` and data frame into your R session

Load the `tidyverse`

```{r load-data, purl=FALSE}
library(tidyverse)
```

And the `books` data we saved in the previous lesson.

```{r, eval=FALSE, purl=FALSE}
books <- read_csv("data/books.csv") # load the data and assign it to books
```

### Transforming data with `dplyr`
## Transforming data with `dplyr`

We are now entering the data cleaning and transforming phase. While it is
possible to do much of the following using Base R functions (in other words,
@@ -105,7 +59,41 @@ We're going to learn some of the most common **`dplyr`** functions:
- `arrange()`: sort results
- `count()`: count discrete values

### Renaming variables
<br>

::::::::::::::::::::::::::::::::::::::::: prereq

### Getting set up

Make sure you have your R Project (`.Rproj`) and directories set up from the previous episode.

**If you have not completed that step**, run the following in your R Project:

```{r create-dirs, eval=FALSE}
library(fs) # https://fs.r-lib.org/. fs is a cross-platform, uniform interface to file system operations via R.
dir_create("data")
dir_create("data_output")
dir_create("fig_output")
download.file("https://ndownloader.figshare.com/files/22031487",
"data/books.csv", mode = "wb")
```

Then load the `tidyverse` package

```{r load-data, purl=FALSE}
library(tidyverse)
```

and the `books` data we saved in the previous lesson.

```{r, eval=FALSE, purl=FALSE}
books <- read_csv("data/books.csv") # load the data and assign it to books
```

:::::::::::::::::::::::::::::::::::::::::


## Renaming variables

It is often necessary to rename variables to make them more meaningful. If you
print the names of the sample `books` dataset you can see that some of the
@@ -131,6 +119,15 @@ books <- rename(books,
title = X245.ab)
```

::::::::::::::::::::::::::::::::::::::::: callout

### Tip:

When using the `rename()` function, the new variable name comes first.

::::::::::::::::::::::::::::::::::::::::::::::::::
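
As a minimal sketch of the argument order, assume a toy data frame (`toy`, made up for illustration) whose column names mimic the auto-generated style seen in `books`. The `X260.c` column and all values are hypothetical:

```r
library(dplyr)

# Toy data frame (hypothetical values); X260.c is an assumed column name
toy <- data.frame(
  X245.ab = c("Dune", "Emma"),
  X260.c = c("1965", "1815"),
  stringsAsFactors = FALSE
)

# rename() takes new_name = old_name pairs
toy <- rename(toy,
              title = X245.ab,
              pubyear = X260.c)

names(toy)
```

Reading each pair as "call this `title`, taken from `X245.ab`" may help keep the order straight.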


::::::::::::::::::::::::::::::::::::::::: callout

### Side note:
@@ -140,9 +137,11 @@ because R variables cannot start with a number, R automatically inserted an X,
and because pipes | are not allowed in variable names, R replaced it with a
period.


::::::::::::::::::::::::::::::::::::::::::::::::::


You can also rename multiple variables at once by running:

```{r, purl=FALSE}
# rename multiple variables at once
books <- rename(books,
@@ -178,6 +177,7 @@ books <- rename(books,

::::::::::::::::::::::::::::::::::::::::::::::::::


### Recoding values

It is often necessary to recode or reclassify values in your data. For example,
@@ -192,9 +192,9 @@ knitr::include_graphics("fig/BCODE1.png")
knitr::include_graphics("fig/BCODE2.png")
```

You can do this easily using the `recode()` function, also in the `dplyr`
package. Unlike `rename()`, the old value comes first here. Also notice that we
are overwriting the `books$subCollection` variable.
You can reassign names to the values easily using the `recode()` function
from the `dplyr` package. Unlike `rename()`, the old value comes first here.
Also notice that we are overwriting the `books$subCollection` variable.

```{r, comment=FALSE}
# first print to the console all of the unique values you will need to recode
@@ -231,9 +231,17 @@ books$format <- recode(books$format,
"4" = "online video")
```
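
To see the `old = new` order in isolation, here is a small sketch on a stand-alone character vector. The vector `codes` and the `"1" = "book"` mapping are assumptions made for illustration; only the `"4" = "online video"` pair is taken from the lesson code:

```r
library(dplyr)

# Hypothetical format codes, stored as character strings
codes <- c("1", "4", "1")

# recode() takes "old value" = "new value" pairs
labels <- recode(codes,
                 "1" = "book",          # assumed mapping, for illustration only
                 "4" = "online video")

labels
```

Note that `recode()` returns a new vector, which is why the lesson assigns the result back to the column it came from.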

### Subsetting dataframes
Once you have finished recoding the values for the two variables, examine
the dataset again and check to see if you have different values this time:

#### Subsetting using `filter()` in the `dplyr` package
```{r, purl=FALSE, eval=FALSE}
distinct(books, subCollection, format)
```


## Subsetting dataframes

### **Subset rows with `filter()`**

In the last lesson we learned how to subset a data frame using brackets. As with
other R functions, the `dplyr` package makes it much more straightforward, using
@@ -277,7 +285,7 @@ serial_microform <- filter(books, format %in% c("serial", "microform"))

::::::::::::::::::::::::::::::::::::::: challenge

### Filtering with `filter()`
### Exercise: Subsetting with `filter()`

1. Use `filter()` to create a data frame called `booksJuv` containing only the rows whose `format` is books and whose `subCollection` is juvenile materials.

@@ -298,7 +306,10 @@ mean(booksJuv$tot_chkout)

::::::::::::::::::::::::::::::::::::::::::::::::::
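
As a small sketch of how filtering conditions combine, the following uses a made-up data frame (`toy`; all values are hypothetical) rather than `books`. Conditions separated by commas must all be true, while `%in%` matches any value in a set:

```r
library(dplyr)

# Hypothetical data, for illustration only
toy <- data.frame(
  format = c("book", "serial", "microform", "book"),
  tot_chkout = c(12, 0, 3, 7),
  stringsAsFactors = FALSE
)

# Both conditions must hold: books checked out more than 5 times
popular_books <- filter(toy, format == "book", tot_chkout > 5)

# Either value is accepted: serials or microforms
serial_microform <- filter(toy, format %in% c("serial", "microform"))

nrow(popular_books)
nrow(serial_microform)
```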

### Selecting variables
<br>


### **Subset columns with `select()`**

The `select()` function allows you to keep or remove specific columns. It also
provides a convenient way to reorder variables.
@@ -315,7 +326,20 @@ books <- select(books, -location)
booksReordered <- select(books, title, tot_chkout, loutdate, everything())
```

### Ordering data
::::::::::::::::::::::::::::::::::::::: callout

### Tip:

If a variable name contains spaces, enclose it in backticks (`` ` ``). Here `df` stands in for your own data frame:

```{r, comment=NA, eval=FALSE}
select(df, -`my variable`)
```

:::::::::::::::::::::::::::::::::::::::
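
A runnable sketch of the backtick tip, assuming a made-up data frame `toy` with one column name that contains a space (`check.names = FALSE` stops `data.frame()` from rewriting that name):

```r
library(dplyr)

# Hypothetical data; "my variable" deliberately contains a space
toy <- data.frame(
  `my variable` = 1:2,
  title = c("Dune", "Emma"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Drop the column with the awkward name
dropped <- select(toy, -`my variable`)

# Move title to the front and keep everything else after it
reordered <- select(toy, title, everything())

names(dropped)
names(reordered)
```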


## Ordering data with `arrange()`

The `arrange()` function in the `dplyr` package allows you to sort your data in alphabetical or numerical order.

@@ -330,7 +354,8 @@ booksHighestChkout
booksChkoutYear <- arrange(books, desc(tot_chkout), desc(pubyear))
```
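
To illustrate how a second sort key breaks ties, here is a sketch on a made-up data frame (`toy`; all values are hypothetical). Rows are sorted by `tot_chkout` first, and `pubyear` only decides the order among rows with equal checkouts:

```r
library(dplyr)

# Hypothetical data: B and C are tied on tot_chkout
toy <- data.frame(
  title = c("A", "B", "C"),
  tot_chkout = c(5, 9, 9),
  pubyear = c(2001, 1990, 2010),
  stringsAsFactors = FALSE
)

# Highest checkouts first; among ties, most recent publication year first
sorted <- arrange(toy, desc(tot_chkout), desc(pubyear))

sorted
```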

### Creating new variables

## Creating new variables with `mutate()`

The `mutate()` function allows you to create new variables. Here, we use the
`str_sub()` function from the `stringr` package to extract the first character
@@ -353,7 +378,9 @@ books <- mutate(books, pubyear = as.integer(pubyear))

We see the warning message `NAs introduced by coercion`. This is because values that cannot be read as numbers become `NA`, while the remaining values are converted to integers.
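
The coercion behaviour can be reproduced on a tiny made-up example. The data frame `toy` and its values are hypothetical; `"c2005"` stands in for the kind of non-numeric entry that cannot be converted:

```r
library(dplyr)

# Hypothetical publication years stored as text; "c2005" is not a clean number
toy <- data.frame(
  pubyear = c("1999", "c2005", "2010"),
  stringsAsFactors = FALSE
)

# as.integer() warns about the entry it cannot convert and returns NA for it
toy <- mutate(toy, pubyear = as.integer(pubyear))

toy$pubyear
is.na(toy$pubyear)
```

Checking `is.na()` after a conversion like this is a quick way to find out how many values were lost.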

### Putting it all together with %>%


## Putting it all together with %>%

The [Pipe Operator](https://www.datacamp.com/community/tutorials/pipe-r-tutorial) `%>%` is
loaded with the `tidyverse`. It takes the output of one statement and makes it
@@ -378,7 +405,7 @@ myBooks

::::::::::::::::::::::::::::::::::::::: challenge

### Playing with pipes `%>%`
### Exercise: Playing with pipes `%>%`

1. Create a new data frame `booksKids` with these conditions:

@@ -405,14 +432,18 @@ mean(booksKids$tot_chkout)

::::::::::::::::::::::::::::::::::::::::::::::::::
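
As a sketch of what the pipe buys you, the two versions below perform the same three steps on a made-up data frame (`toy`; all values are hypothetical). The nested call has to be read from the inside out, while the piped version reads top to bottom:

```r
library(dplyr)

# Hypothetical data, for illustration only
toy <- data.frame(
  title = c("A", "B", "C", "D"),
  format = c("book", "serial", "book", "book"),
  tot_chkout = c(3, 8, 11, 0),
  stringsAsFactors = FALSE
)

# Nested calls: filter, then select, then arrange, written inside out
nested <- arrange(select(filter(toy, format == "book"), title, tot_chkout),
                  desc(tot_chkout))

# The same steps with the pipe, in the order they happen
piped <- toy %>%
  filter(format == "book") %>%
  select(title, tot_chkout) %>%
  arrange(desc(tot_chkout))

all.equal(nested, piped)
```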

### Split-apply-combine data analysis and the `summarize()` function

### Split-apply-combine and the `summarize()` function

Many data analysis tasks can be approached using the *split-apply-combine*
paradigm: split the data into groups, apply some analysis to each group, and
then combine the results. **`dplyr`** makes this very easy through the use of
the `group_by()` function.

##### The `summarize()` function
<br>


#### The `summarize()` function

`group_by()` is often used together with `summarize()`, which collapses each
group into a single-row summary of that group. `group_by()` takes as arguments
@@ -454,7 +485,10 @@ Let's break this down step by step:
- Finally, we arrange `sum_tot_chkout` in descending order, so we can see the
class with the most total checkouts. We can see it is the `E` class (History of America), followed by `NA` (items with no call number data), followed by `H` (Social Sciences) and `P` (Language and Literature).
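
The split-apply-combine steps described above can be sketched on a made-up data frame. The name `toy`, the grouping column name `call_class`, and all values are assumptions for illustration; only `sum_tot_chkout` is taken from the lesson text:

```r
library(dplyr)

# Hypothetical data; one row has no call number class (NA)
toy <- data.frame(
  call_class = c("E", "E", "H", NA),
  tot_chkout = c(10, 5, 7, 2),
  stringsAsFactors = FALSE
)

by_class <- toy %>%
  group_by(call_class) %>%                         # split into one group per class
  summarize(sum_tot_chkout = sum(tot_chkout)) %>%  # apply: one summary row per group
  arrange(desc(sum_tot_chkout))                    # combine, then sort the result

by_class
```

Rows with a missing class are kept together as their own `NA` group rather than being dropped, which matches the `NA` row discussed above.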

### Pattern matching
<br>


### **Pattern matching**

Cleaning text with the `stringr` package is easier when you have a basic understanding of 'regex', or regular expression pattern matching. Regex is especially useful for manipulating strings (alphanumeric data), and is the backbone of search-and-replace operations in most applications. Pattern matching is available in most programming languages, but regex syntax varies somewhat between them. Below is an example of using pattern matching to find and replace data in R:

@@ -472,7 +506,9 @@ books %>%
select(title_modified, title)
```
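
For a self-contained look at regex-based find and replace, here is a sketch on a made-up character vector. The vector `titles` and both patterns are assumptions for illustration, not the patterns used on `books`:

```r
library(stringr)

# Hypothetical titles with a trailing slash and irregular spacing
titles <- c("Dune /", "Emma  :  a novel")

# "\\s*/$" matches optional whitespace followed by a slash at the end of the string
no_slash <- str_remove(titles, "\\s*/$")

# "\\s+" matches one or more whitespace characters; replace each run with one space
cleaned <- str_replace_all(no_slash, "\\s+", " ")

cleaned
```

The doubled backslashes are needed because R first interprets the string literal and only then passes the pattern to the regex engine.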

### Exporting data


## Exporting data

Now that you have learned how to use **`dplyr`** to extract information from
or summarize your raw data, you may want to export these new data sets to share
@@ -481,14 +517,11 @@ them with your collaborators or for archival.
Similar to the `read_csv()` function used for reading CSV files into R, there is
a `write_csv()` function that generates CSV files from data frames.

Before using `write_csv()`, we are going to create a new folder, `data_output`,
in our working directory that will store this generated dataset. We don't want
to write generated datasets in the same directory as our raw data. It's good
practice to keep them separate. The `data` folder should only contain the raw,
We have previously created the directory `data_output` which we will use to
store our generated dataset. It is good practice to keep your input and output
data in separate folders--the `data` folder should only contain the raw,
unaltered data, and should be left alone to make sure we don't delete or modify
it. In contrast, our script will generate the contents of the `data_output`
directory, so even if the files it contains are deleted, we can always
re-generate them.
it.

In preparation for our next lesson on plotting, we are going to create a
version of the dataset with most of the changes we made above. We will first read in the original, then make all the changes with pipes.
@@ -545,22 +578,25 @@ We now write it to a CSV and put it in the `data_output` sub-directory:
write_csv(books_reformatted, "./data_output/books_reformatted.csv")
```
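
To check that an export worked, one option is to write a file and read it straight back. The sketch below does this with a made-up data frame (`toy`) in a temporary directory, so it does not touch your project folders; the file name `toy_books.csv` is hypothetical:

```r
library(readr)

# Hypothetical data, for illustration only
toy <- data.frame(
  title = c("Dune", "Emma"),
  tot_chkout = c(12, 3),
  stringsAsFactors = FALSE
)

# Write into a throwaway data_output folder inside R's temporary directory
out_dir <- file.path(tempdir(), "data_output")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "toy_books.csv")

write_csv(toy, out_file)

# Read the file back to confirm the contents survived the round trip
round_trip <- read_csv(out_file, show_col_types = FALSE)
round_trip
```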


## Help with dplyr

- Read more about `dplyr` at [https://dplyr.tidyverse.org/](https://dplyr.tidyverse.org/).
- In your console, after loading `library(dplyr)`, run `vignette("dplyr")` to read an extremely helpful explanation of how to use it.
- See the ["Data Transformation" chapter](http://r4ds.had.co.nz/transform.html) in Garrett Grolemund and Hadley Wickham's book *R for Data Science*.
- Watch this Data School video: [Hands-on dplyr tutorial for faster data manipulation in R](https://www.youtube.com/watch?v=jWjqLW-u3hc).

## Wrangling dataframes with tidyr

:::::::::::::::::::::::::::::::::::::::: keypoints

- Use the `dplyr` package to manipulate dataframes.
- Use `select()` to choose variables from a dataframe.
- Use `filter()` to choose data based on values.
- Use `group_by()` and `summarize()` to work with subsets of data.
- Subset data frames using `select()` and `filter()`.
- Rename variables in a data frame using `rename()`.
- Recode values in a data frame using `recode()`.
- Use `mutate()` to create new variables.
- Sort data using `arrange()`.
- Use `group_by()` and `summarize()` to work with subsets of data.
- Use pipe (`%>%`) to combine multiple commands.

::::::::::::::::::::::::::::::::::::::::::::::::::
