Cleaning_sum_25 #714

Open: wants to merge 10 commits into base: main

160 changes: 87 additions & 73 deletions modules/Data_Cleaning/Data_Cleaning.Rmd
@@ -74,16 +74,16 @@ Types of "missing" data:
## Finding Missing data {.small}

- `is.na` - looks for `NaN` and `NA`
- `is.nan` - looks for `NaN`
- `is.infinite` - looks for `Inf` or `-Inf`

```{r, echo=FALSE}
NA_vect<- c(0,NA, -1)
NA_vect <- NA_vect/0
```

```{r}
test <- c(0,NA, -1)
test/0
test <- test/0
is.na(test)
is.nan(test)
is.infinite(test)
is.na(NA_vect)
is.infinite(NA_vect)
```
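
As a minimal sketch of how the three checks differ, the base R example below divides a small vector by zero. The vector name `x` is an assumption made for illustration and is not part of the slides.

```r
# Dividing by zero yields three different "missing-like" values
x <- c(0, NA, -1) / 0
x                # NaN  NA -Inf

is.na(x)         # TRUE  TRUE FALSE  (NaN also counts as NA)
is.nan(x)        # TRUE FALSE FALSE  (NA is not NaN)
is.infinite(x)   # FALSE FALSE TRUE
any(is.na(x))    # TRUE
```

Note that `is.na()` catches both `NA` and `NaN`, while `is.nan()` catches only `NaN`.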


@@ -92,8 +92,8 @@ is.infinite(test)
`any()` can help you check if there are any `NA` values in a vector

```{r}
test
any(is.na(test))
NA_vect
any(is.na(NA_vect))
```


@@ -378,7 +378,7 @@ B. include `& is.na()`

## Summary

- `is.na()`,`any(is.na())`, `all(is.na())`,`count()`, and functions from `naniar` like `gg_miss_var()` and `miss_var_summary` can help determine if we have `NA` values
- `is.na()`,`any(is.na())`, `count()`, and functions from `naniar` like `gg_miss_var()` and `miss_var_summary` can help determine if we have `NA` values
- `miss_var_which()` can help you drop columns that have any missing values.
- `filter()` automatically removes `NA` values - can't confirm or deny
if condition is met (need `| is.na()` to keep them)
@@ -387,7 +387,7 @@ B. include `& is.na()`
- `NA` values can change your calculation results
- think about what `NA` values represent - don't drop them if you shouldn't
- `na_if()` will make `NA` values for a particular value
- `replace_na()` will replace `NA values with a particular value
- `replace_na()` will replace `NA` values with a particular value
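
A minimal sketch of the last two bullets, assuming `dplyr` and `tidyr` are installed. The vector `x` and the placeholder value `"unknown"` are hypothetical and not taken from the course data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy vector: "unknown" stands in for a missing value
x <- c("Ginger", "unknown", "Mint")

# na_if(): turn one particular value into NA
x_na <- na_if(x, "unknown")
x_na

# replace_na(): turn NA into one particular value
x_filled <- replace_na(x_na, "Other")
x_filled
```

On a data frame, the same calls would normally sit inside `mutate()`.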

## Lab Part 1

@@ -512,15 +512,15 @@ Note that automatically values not reassigned explicitly by
{data_input} %>%
mutate({variable_to_fix} = case_when({Variable_fixing}
/some condition/ ~ {value_for_con},
TRUE ~ {value_for_not_meeting_condition})
.default = {value_for_not_meeting_condition})

```
:::

{value_for_not_meeting_condition} could be something new
or it can be the original values of the column

## case_when with TRUE ~ original variable name
## case_when with .default = original variable name

```{r}
data_ginger_mint %>%
@@ -529,7 +529,7 @@ data_ginger_mint %>%
Treatment == "Mint" ~ "Peppermint",
Treatment == "mint" ~ "Peppermint",
Treatment == "peppermint" ~ "Peppermint",
TRUE ~ Treatment)) %>%
.default = Treatment)) %>%
count(Treatment, Treatment_recoded)
```

@@ -544,35 +544,23 @@ data_ginger_mint %>%
Treatment == "Mint" ~ "Peppermint",
Treatment == "mint" ~ "Peppermint",
Treatment == "peppermint" ~ "Peppermint",
TRUE ~ Treatment)) %>%
.default = Treatment)) %>%
count(Treatment, Treatment_recoded)
```


## But maybe we want NA?

Perhaps we want values that are O or Other to actually be NA, then `case_when` can be helpful for this. We simply specify everything else.

```{r}
data_ginger_mint %>%
mutate(Treatment_recoded = case_when(
Treatment == "Ginger" ~ "Ginger",
Treatment == "Mint" ~ "Peppermint",
Treatment == "mint" ~ "Peppermint",
Treatment == "peppermint" ~ "Peppermint")) %>%
count(Treatment, Treatment_recoded)
```
## case_when() can also overwrite/update a variable

You need to specify what we want in the first part of `mutate`.

```{r}
data_ginger_mint %>%
mutate(Treatment = case_when(
Treatment == "Ginger" ~ "Ginger",
Treatment == "Mint" ~ "Peppermint",
Treatment == "mint" ~ "Peppermint",
Treatment == "peppermint" ~ "Peppermint")) %>%
Treatment == "peppermint" ~ "Peppermint",
.default = Treatment)) %>%
count(Treatment)

```
@@ -584,16 +572,29 @@ data_ginger_mint %>%
```{r}
data_ginger_mint %>%
mutate(Treatment_recoded = case_when(
Treatment == "Ginger" ~ "Ginger", # keep it the same!
Treatment %in%
c("Mint", "mint", "Peppermint", "peppermint") ~ "Peppermint",
Treatment %in% c("O", "Other") ~ "Other")) %>%
Treatment %in% c("O", "Other") ~ "Other",
.default = Treatment)) %>%

count(Treatment, Treatment_recoded)

```

## But maybe we want NA?

Perhaps we want values that are O or Other to actually be NA; `case_when` can help with this. We could list every value we want to keep and drop `.default = Treatment`, or we could assign NA directly with `NA_character_`.

```{r}
data_ginger_mint %>%
mutate(Treatment_recoded = case_when(
Treatment %in%
c("Mint", "mint", "Peppermint", "peppermint") ~ "Peppermint",
Treatment %in% c("O", "Other") ~ NA_character_,
.default = Treatment)) %>%

count(Treatment, Treatment_recoded)
```
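
The two routes to `NA` described on this slide can be compared side by side. This is a sketch under assumptions: `toy` is a hypothetical stand-in for `data_ginger_mint`, the column names `explicit_na` and `implicit_na` are invented for illustration, and `.default` requires dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical stand-in for data_ginger_mint
toy <- tibble(Treatment = c("Ginger", "mint", "Peppermint", "O", "Other"))

recoded <- toy %>%
  mutate(
    # Option 1: name "O"/"Other" explicitly and send them to NA
    explicit_na = case_when(
      Treatment %in% c("Mint", "mint", "Peppermint", "peppermint") ~ "Peppermint",
      Treatment %in% c("O", "Other") ~ NA_character_,
      .default = Treatment),
    # Option 2: list every value to keep and omit .default,
    # so anything unmatched becomes NA
    implicit_na = case_when(
      Treatment == "Ginger" ~ "Ginger",
      Treatment %in% c("Mint", "mint", "Peppermint", "peppermint") ~ "Peppermint"))

recoded
```

Option 1 is safer when new, unexpected values may appear later, because only the named values become `NA`.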

## Another reason for `case_when()`

@@ -619,13 +620,26 @@ data_ginger_mint %>%
count(Group, Effect)
```

## GUT CHECK: If we want all unspecified values to remain the same with `case_when()`, how should we complete the `TRUE ~` statement?
## GUT CHECK: If we want all unspecified values to remain the same with `case_when()`, how should we complete the `.default =` statement?

A. With the name of the variable we are modifying or using as source

B. With the word "same"


## Other Functions/Arguments you might see

`.default = ` replaced the older `TRUE ~` syntax (as of dplyr 1.1.0), so you may still see `TRUE ~` in existing code.

```{r}
data_ginger_mint %>%
mutate(Treatment_recoded = case_when(
Treatment %in%
c("Mint", "mint", "Peppermint", "peppermint") ~ "Peppermint",
Treatment %in% c("O", "Other") ~ NA_character_,
TRUE ~ Treatment))
```
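
To check that the two spellings behave the same, a small comparison can help. This sketch assumes dplyr 1.1.0 or later; `toy`, `new_style`, and `old_style` are hypothetical names used only here.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(Treatment = c("Ginger", "Mint", "O"))

# Newer syntax: .default
new_style <- toy %>%
  mutate(Treatment_recoded = case_when(
    Treatment == "Mint" ~ "Peppermint",
    .default = Treatment))

# Older syntax: TRUE ~ as a catch-all final condition
old_style <- toy %>%
  mutate(Treatment_recoded = case_when(
    Treatment == "Mint" ~ "Peppermint",
    TRUE ~ Treatment))

# Both columns hold the same recoded values
identical(new_style$Treatment_recoded, old_style$Treatment_recoded)
```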

# Working with strings

## Strings in R
@@ -726,19 +740,18 @@ data_ginger_mint %>%
Treatment %in%
c("Mint", "mint", "Peppermint", "peppermint") ~ "Peppermint",
Treatment %in% c("O", "Other") ~ "Other",
TRUE ~ Treatment))
.default = Treatment))
```

## `case_when()` improved with `stringr`
`^` indicates the beginning of a character string
`$` indicates the end


```{r}
data_ginger_mint %>%
mutate(Treatment_recoded = case_when(
str_detect(string = Treatment, pattern = "int") ~ "Peppermint",
str_detect(string = Treatment, pattern = "^o|^O") ~ "Other",
TRUE ~ Treatment)) %>%
str_detect(string = Treatment, pattern = "o|O") ~ "Other",
.default = Treatment)) %>%
count(Treatment, Treatment_recoded)
```

@@ -759,6 +772,37 @@ B. `str_find()`

C. `str_detect()`

## A bit on Regular Expressions

- <http://www.regular-expressions.info/reference.html>
- They can be used to match a large number of strings in one statement
- `.` matches any single character
- `*` repeats the preceding character zero or more times
- `?` makes the preceding character optional
- `^` matches the start of a string: `^a` - starts with "a"
- `$` matches the end of a string: `b$` - ends with "b"
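
Each symbol in the list above can be tried with `str_detect()`. This is a minimal sketch assuming `stringr` is installed; the example strings are invented for illustration.

```r
library(stringr)

starts_a  <- str_detect(c("apple", "banana"), "^a")       # starts with "a"
ends_b    <- str_detect(c("crab", "bat"), "b$")           # ends with "b"
any_char  <- str_detect(c("abc", "ac"), "a.c")            # "." needs exactly one character
zero_more <- str_detect(c("ac", "abbc", "adc"), "ab*c")   # zero or more "b" between "a" and "c"
optional  <- str_detect(c("color", "colour"), "colou?r")  # the "u" is optional

starts_a
ends_b
any_char
zero_more
optional
```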


## Things you might see

`^` indicates the beginning of a character string (placed before)
`$` indicates the end (placed after)

```{r}
data_ginger_mint %>%
mutate(Treatment_recoded = case_when(
str_detect(string = Treatment, pattern = "t$") ~ "Peppermint",
.default = Treatment)) %>%
count(Treatment, Treatment_recoded)

data_ginger_mint %>%
mutate(Treatment_recoded = case_when(
str_detect(string = Treatment, pattern = "^m") ~ "Mint Tea",
.default = Treatment)) %>%
count(Treatment, Treatment_recoded)

```

# Separating and uniting data

## Uniting columns
@@ -790,11 +834,10 @@ data_comb <- data_comb %>%
data_comb
```
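
For reference, `unite()` and `separate()` undo each other when the separator is unambiguous. The sketch below assumes `dplyr` and `tidyr` are installed; `toy` and its columns are hypothetical and unrelated to `data_comb`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data
toy <- tibble(id = 1:2,
              treatment = c("Ginger", "Mint"),
              group = c("A", "B"))

# unite(): paste two columns into one, joined by "_"
united <- toy %>%
  unite("treatment_group", treatment, group, sep = "_")

# separate(): split the combined column back into two
separated <- united %>%
  separate(treatment_group, into = c("treatment", "group"), sep = "_")

united
separated
```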


## Summary
- `case_when()` requires `mutate()` when working with dataframes/tibbles
- `case_when()` can recode **entire values** based on **conditions** (need quotes for conditions and new values)
- remember `case_when()` needs `TRUE ~ varaible` to keep values that aren't specified by conditions, otherwise will be `NA`
- remember `case_when()` needs `.default = variable` to keep values that aren't specified by conditions, otherwise they will be `NA`

**Note:** you might see the `recode()` function, it only does some of what `case_when()` can do, so we skipped it, but it is in the extra slides at the end.

@@ -825,36 +868,16 @@ knitr::include_graphics("images/case_when.png")

📃 [Posit's `stringr` Cheatsheet](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf)

```{r, fig.alt="The End", out.width = "50%", echo = FALSE, fig.align='center'}
knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg"))
```

Image by <a href="https://pixabay.com/users/geralt-9301/?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=812226">Gerd Altmann</a> from <a href="https://pixabay.com//?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=812226">Pixabay</a>

# Extra Slides
📃 [Posit's `dplyr` Cheatsheet](https://rstudio.github.io/cheatsheets/data-transformation.pdf)

## `recode()` function

This is similar to `case_when()` but it can't do as much.

::: {style="color: red;"}
(need `mutate` for data frames/tibbles!)
:::
::: codeexample
```{r, eval = FALSE}
# General Format - this is not code!
{data_input} %>%
mutate({variable_to_fix_or_new} = recode({Variable_fixing}, {old_value} = {new_value},
{another_old_value} = {new_value}))

```{r, fig.alt="The End", out.width = "50%", echo = FALSE, fig.align='center'}
knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg"))
```
:::

## recode() function
## Things you might see

::: {style="color: red;"}
Need quotes for new values! Tolerates quotes for old values.
:::
`recode` function - not as powerful as `case_when`

```{r, eval = FALSE}

@@ -909,15 +932,6 @@ y
length(y)
```

## A bit on Regular Expressions

- <http://www.regular-expressions.info/reference.html>
- They can use to match a large number of strings in one statement
- `.` matches any single character
- `*` means repeat as many (even if 0) more times the last character
- `?` makes the last thing optional
- `^` matches start of vector `^a` - starts with "a"
- `$` matches end of vector `b$` - ends with "b"

## Let's look at modifiers for `stringr`

6 changes: 3 additions & 3 deletions modules/Data_Cleaning/lab/Data_Cleaning_Lab_Key.Rmd
@@ -148,7 +148,7 @@ NEW_TIBBLE <- OLD_TIBBLE %>%
mutate(NEW_COLUMN = case_when(
OLD_COLUMN %in% c( ... ) ~ ... ,
OLD_COLUMN %in% c( ... ) ~ ... ,
TRUE ~ OLD_COLUMN
.default = OLD_COLUMN
))
```

@@ -158,7 +158,7 @@ BloodType <- BloodType %>%
mutate(exposure = case_when(
exposure %in% c("N", "n", "No", "no") ~ "No",
exposure %in% c("Y", "y", "Yes", "yes") ~ "Yes",
TRUE ~ exposure # the only other value is an NA so we could include this or we don't need to (it's generally good practice unless we want to create NAs)
.default = exposure # the only other value is an NA so we could include this or we don't need to (it's generally good practice unless we want to create NAs)
))

count(BloodType, exposure)
@@ -181,7 +181,7 @@ BloodType <- BloodType %>%
mutate(type = case_when(
type == "o.-" ~ "O.-",
type == "o.+" ~ "O.+",
TRUE ~ type))
.default = type))
BloodType
```

6 changes: 3 additions & 3 deletions modules/Manipulating_Data_in_R/Manipulating_Data_in_R.Rmd
@@ -14,7 +14,7 @@ library(tidyverse)

## Recap of Data Cleaning

- `is.na()`,`any(is.na())`, `all(is.na())`,`count()`, and functions from `naniar` like `gg_miss_var()` and `miss_var_summary` can help determine if we have `NA` values
- `is.na()`,`any(is.na())`, `count()`, and functions from `naniar` like `gg_miss_var()` and `miss_var_summary` can help determine if we have `NA` values
- `miss_var_which()` can help you drop columns that have any missing values.
- `filter()` automatically removes `NA` values
- `drop_na()` can help you remove `NA` values
@@ -28,8 +28,8 @@ library(tidyverse)
- remember `case_when()` needs `.default = variable` (older code uses `TRUE ~ variable`) to keep values that aren't specified by conditions, otherwise they will be `NA`
- `stringr` package has great functions for looking for specific **parts of values** especially `filter()` and `str_detect()` combined
- also has other useful string manipulation functions like `str_replace()` and more!
- `separate()` can split columns into additional columns
- `unite()` can combine columns
- `separate()` can split columns into additional columns
- `unite()` can combine columns

📃[Day 5 Cheatsheet](https://jhudatascience.org/intro_to_r/modules/cheatsheets/Day-5.pdf)

2 changes: 1 addition & 1 deletion modules/cheatsheets/Day-5.md
@@ -20,7 +20,7 @@ number) by 0.
|`naniar`| [`pct_complete(x)`](https://www.rdocumentation.org/packages/naniar/versions/0.6.1/topics/pct_complete)|`pct_complete(x)`| Reports the percentage of data that is complete in `x`. |
|`naniar`| [`gg_miss_var(x)`](https://www.rdocumentation.org/packages/naniar/versions/0.6.1/topics/gg_miss_var)|`gg_miss_var(x)`| Plots the number of missing values for each variable in `x`. |
|`tidyr`| [`drop_na(df)`](https://tidyr.tidyverse.org/reference/drop_na.html)|`drop_na(df)`| Drops rows of `NA` from a given data frame/tibble |
| `dplyr`| [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html)| `df <- df %>% mutate(variable_recoded = case_when(variable > 2 ~ "large", TRUE ~ variable) `|This function allows you to recode data based on certain conditions. If no cases match, NA is returned, unless the TRUE statement specifies otherwise.|
| `dplyr`| [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html)| `df <- df %>% mutate(variable_recoded = case_when(variable > 2 ~ "large", .default = as.character(variable)))`|This function allows you to recode data based on certain conditions. If no cases match, `NA` is returned, unless the `.default` argument specifies otherwise.|
| `dplyr`| [`mutate()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/mutate)| `df <- mutate(df, newcol = wt/2.2)`| Adds a new column that is a function of existing columns|
| `tidyr`| [`separate()`](https://tidyr.tidyverse.org/reference/separate.html)| `df %>% separate(x, c("A", "B"))`| Separate a character column into multiple columns with a regular expression or numeric locations|
| `tidyr`| [`unite()`](https://tidyr.tidyverse.org/reference/unite.html)| `df %>% unite("z", x:y, remove = FALSE)`| Unite multiple columns together into one column|