Description
Currently, case_when() does not provide a built-in way to validate categorical inputs and throw an error when an unexpected value is encountered. The function requires all return values to have the same type, so there is no safe way to signal an unexpected value from inside it. It is also incompatible in most cases with stop(), because .default is evaluated whether or not any value actually falls through to it.
This makes case_when() unsafe in cases where developers need both:
- A normal transformation for known values
- A hard error for unknown values
Reproducible Example:
library(dplyr)
replace_func <- function(x) {
case_when(
x == "A" ~ 1,
x == "B" ~ 2,
x == "C" ~ 3,
# If there is a different value I want the function to throw an error
# and stop the execution
.default = stop(paste0("Invalid value: ", x))
)
}
data <- tibble(x = c("A", "B", "A", "C"))
# This will throw an error - even though all values are specified in the function
data %>% mutate(new_x = replace_func(x))
# Expected behavior would be to return something like:
# A tibble: 4 x 2
#   x     new_x
#   <chr> <dbl>
# 1 A         1
# 2 B         2
# 3 A         1
# 4 C         3
# But for it to fail if there is a value not specified in the function
data1 <- tibble(x = c("A", "B", "A", "C", "D"))
# This should throw an error because the default value is stop() and the value
# "D" is not specified in the function
data1 %>% mutate(new_x = replace_func(x))
Currently, the only alternatives for handling unknown values in case_when() are:
- A manual check after executing case_when(), which is an imperfect solution that adds unnecessary complexity, or
- Leaving .default = NA, which can lead to silent failures: an unknown value that should have been handled explicitly might be mistakenly transformed into NA instead of triggering an error.
Neither of these solutions is ideal.
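For reference, the manual-check workaround from the first bullet might look roughly like the sketch below. The helper name replace_func_checked() is made up for illustration, and letting missing inputs pass through as NA is an assumption about the desired behavior, not something dplyr prescribes.

```r
library(dplyr)

# Hypothetical helper: same mapping as replace_func(), followed by a
# manual check for values that matched none of the conditions.
replace_func_checked <- function(x) {
  out <- case_when(
    x == "A" ~ 1,
    x == "B" ~ 2,
    x == "C" ~ 3,
    .default = NA
  )
  # Assumption: a missing input may stay NA; only non-missing inputs
  # that ended up as NA are treated as unknown values.
  unmatched <- is.na(out) & !is.na(x)
  if (any(unmatched)) {
    stop("Invalid value(s): ", paste(unique(x[unmatched]), collapse = ", "))
  }
  out
}

data <- tibble(x = c("A", "B", "A", "C"))
data %>% mutate(new_x = replace_func_checked(x))

data1 <- tibble(x = c("A", "B", "A", "C", "D"))
# Errors, because "D" matches no condition
try(data1 %>% mutate(new_x = replace_func_checked(x)))
```

This works, but every call site has to repeat the check, and it cannot tell an unknown value apart from a mapping that deliberately returns NA.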
Proposed Solution
I believe the default behavior should be something along the lines of .default = stop(paste0("Unknown value: ", x)). This would force users to explicitly handle unknown values within their program, ensuring safer data transformations. If users want to allow unknown values to default to NA, they should be required to specify it explicitly by using .default = NA or TRUE ~ NA. This approach would provide better safety by default, preventing unintended NA values from propagating due to missing mappings in case_when().
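To make the idea concrete, here is a rough sketch of how error-by-default could behave, written as a user-level wrapper. case_when_strict() is a hypothetical name and not part of dplyr. The sketch assumes every argument is a two-sided formula, and it treats a condition that evaluates to NA as "no match".

```r
library(dplyr)

# Hypothetical wrapper (not a dplyr function): errors if any position
# matches none of the conditions, otherwise defers to case_when().
case_when_strict <- function(...) {
  cases <- list(...)
  # Evaluate each left-hand side in the environment where the formula
  # was written; NA conditions are counted as not matching.
  hits <- lapply(cases, function(f) {
    cond <- eval(f[[2]], environment(f))
    !is.na(cond) & cond
  })
  matched <- Reduce(`|`, hits)
  if (!all(matched)) {
    stop(
      "Unknown value: no condition matched at position(s) ",
      paste(which(!matched), collapse = ", ")
    )
  }
  case_when(...)
}

replace_func_strict <- function(x) {
  case_when_strict(
    x == "A" ~ 1,
    x == "B" ~ 2,
    x == "C" ~ 3
  )
}

tibble(x = c("A", "B", "A", "C")) %>% mutate(new_x = replace_func_strict(x))

# Errors, because "D" matches no condition
try(tibble(x = c("A", "B", "D")) %>% mutate(new_x = replace_func_strict(x)))
```

The wrapper evaluates each condition twice, so it only illustrates the requested semantics. A built-in version could reuse the conditions that case_when() already computes.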
Would love to hear your thoughts on this!