-- Describe the purpose of an R package and the **`dplyr`** and **`tidyr`** packages.
-- Select certain columns in a data frame with the **`dplyr`** function `select`.
-- Select certain rows in a data frame according to filtering conditions with the **`dplyr`** function `filter`.
-- Link the output of one **`dplyr`** function to the input of another function with the 'pipe' operator `%>%`.
-- Add new columns to a data frame that are functions of existing columns with `mutate`.
+- Describe the functions available in the **`dplyr`** and **`tidyr`** packages.
+- Recognise and use the following functions: `select()`, `filter()`, `rename()`,
+`recode()`, `mutate()` and `arrange()`.
+- Combine one or more functions using the 'pipe' operator `%>%`.
 - Use the split-apply-combine concept for data analysis.
-- Use `summarize`, `group_by`, and `count` to split a data frame into groups of observations, apply a summary statistics for each group, and then combine the results.
-- Describe the concept of a wide and a long table format and for which purpose those formats are useful.
-- Describe what key-value pairs are.
-- Reshape a data frame from long to wide format and back with the `pivot_wider` and `pivot_longer` commands from the **`tidyr`** package.
-### Split-apply-combine data analysis and the `summarize()` function
+
+### Split-apply-combine and the `summarize()` function

 Many data analysis tasks can be approached using the *split-apply-combine*
 paradigm: split the data into groups, apply some analysis to each group, and
 then combine the results. **`dplyr`** makes this very easy through the use of
 the `group_by()` function.

-##### The `summarize()` function
+<br>
+
+
+#### The `summarize()` function

 `group_by()` is often used together with `summarize()`, which collapses each
 group into a single-row summary of that group. `group_by()` takes as arguments
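
For readers of this hunk, the `group_by()` and `summarize()` pairing it describes can be illustrated with a minimal sketch. The `books_demo` tibble and its column names `call_class` and `tot_chkout` are made up for illustration and are not taken from the lesson's data; only `sum_tot_chkout` echoes a name used in the lesson text.

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch (not the lesson's dataset).
books_demo <- tibble(
  call_class = c("E", "E", "H", "P", "P", "P"),
  tot_chkout = c(10, 5, 3, 2, 4, 1)
)

# Split the rows by call_class, apply sum() within each group, and
# combine the results into one row per group. Then sort so the class
# with the most total checkouts comes first.
checkouts_by_class <- books_demo %>%
  group_by(call_class) %>%
  summarize(sum_tot_chkout = sum(tot_chkout)) %>%
  arrange(desc(sum_tot_chkout))

print(checkouts_by_class)
```

The result has one row per distinct `call_class`, which is the "collapse each group into a single-row summary" behaviour the paragraph refers to.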
@@ -673,7 +710,10 @@ Let's break this down step by step:
 - Finally, we arrange `sum_tot_chkout` in descending order, so we can see the
 class with the most total checkouts. We can see it is the `E` class (History of America), followed by `NA` (items with no call number data), followed by `H` (Social Sciences) and `P` (Language and Literature).

-### Pattern matching
+<br>
+
+
+### **Pattern matching**

 Cleaning text with the `stringr` package is easier when you have a basic understanding of 'regex', or regular expression pattern matching. Regex is especially useful for manipulating strings (alphanumeric data), and is the backbone of search-and-replace operations in most applications. Pattern matching is common to all programming languages but regex syntax is often code-language specific. Below, find an example of using pattern matching to find and replace data in R:

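
Independently of the lesson's own example (which falls outside this hunk), the find-and-replace idea in the paragraph above can be sketched with two `stringr` calls. The `titles` vector is invented for illustration and does not come from the lesson's data.

```r
library(stringr)

# Hypothetical strings, invented for this sketch.
titles <- c("Intro to R  (2nd ed.)", "data   science:basics")

# "\\s+" matches a run of one or more whitespace characters;
# str_replace_all() replaces every such run with a single space.
cleaned <- str_replace_all(titles, "\\s+", " ")

# str_detect() returns TRUE for each string that contains the pattern;
# "\\d" matches any single digit.
has_digit <- str_detect(cleaned, "\\d")

print(cleaned)
print(has_digit)
```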
@@ -709,7 +749,9 @@ books %>%
 # ℹ 9,990 more rows
 ```

-### Exporting data
+
+
+## Exporting data

 Now that you have learned how to use **`dplyr`** to extract information from
 or summarize your raw data, you may want to export these new data sets to share
@@ -718,14 +760,11 @@ them with your collaborators or for archival.
 Similar to the `read_csv()` function used for reading CSV files into R, there is
 a `write_csv()` function that generates CSV files from data frames.

-Before using `write_csv()`, we are going to create a new folder, `data_output`,
-in our working directory that will store this generated dataset. We don't want
-to write generated datasets in the same directory as our raw data. It's good
-practice to keep them separate. The `data` folder should only contain the raw,
+We have previously created the directory `data_output` which we will use to
+store our generated dataset. It is good practice to keep your input and output
+data in separate folders--the `data` folder should only contain the raw,
 unaltered data, and should be left alone to make sure we don't delete or modify
-it. In contrast, our script will generate the contents of the `data_output`
-directory, so even if the files it contains are deleted, we can always
-re-generate them.
+it.

 In preparation for our next lesson on plotting, we are going to create a
 version of the dataset with most of the changes we made above. We will first read in the original, then make all the changes with pipes.
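
The `write_csv()` round trip this hunk describes can be sketched as follows. This is only an illustration under stated assumptions: `summary_demo` and the file name `summary_demo.csv` are hypothetical, and the sketch writes under `tempdir()` rather than the project's real `data_output` folder so that it cannot touch any existing files.

```r
library(readr)

# Hypothetical summary table standing in for a data frame built with dplyr.
summary_demo <- data.frame(
  call_class = c("E", "H"),
  sum_tot_chkout = c(15, 3)
)

# Assumption: a throwaway output folder under tempdir(); in the lesson the
# target would be the separate data_output directory, never the raw data folder.
out_dir <- file.path(tempdir(), "data_output")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "summary_demo.csv")

# Write the data frame to CSV, then read it back to confirm the round trip.
write_csv(summary_demo, out_file)
read_back <- read_csv(out_file, show_col_types = FALSE)

print(read_back)
```

Keeping generated files in their own folder means they can be deleted and regenerated from the script at any time without risk to the raw inputs.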
@@ -784,22 +823,25 @@ We now write it to a CSV and put it in the `data/output` sub-directory: