Description
The problem
I'm working with a dataset where I use importance weights to specify the misclassification costs of instances. Because the target class in the dataset is severely unbalanced, I would like to use a resampling method (e.g. SMOTE) to mitigate this issue. However, step_smote() and friends do not impute the importance weights for the rows they generate, and I cannot impute them later because the step_impute_*() steps do not allow it.
I understand that generating new weights should not be the default behaviour, as this might lead to unexpected results, but I do not see why the algorithms in this package could not do this at all.
Reproducible example
Here is an example using hpc_data:
library(tidymodels)
library(themis)
# First, get rid of the nominal predictors as these cannot be used by `step_smote`
hpc_data <- select(hpc_data, -c(protocol, day))
# Now specify the importance weights, for example input_fields
hpc_data <- mutate(hpc_data, input_fields = importance_weights(input_fields))
# Specify a simple recipe to use with `step_smote`
rec <- recipe(class ~ ., data = hpc_data) |>
  step_smote(class)
# Now prep and bake as training data to see the result
rec |>
  prep() |>
  bake(NULL)
#> # A tibble: 8,844 × 6
#> compounds input_fields iterations num_pending hour class
#> <dbl> <imp_wts> <dbl> <dbl> <dbl> <fct>
#> 1 997 137 20 0 14 F
#> 2 97 103 20 0 13.8 VF
#> 3 101 75 10 0 13.8 VF
#> 4 93 76 20 0 10.1 VF
#> 5 100 82 20 0 10.4 VF
#> 6 100 82 20 0 16.5 VF
#> 7 105 88 20 0 16.4 VF
#> 8 98 95 20 0 16.7 VF
#> 9 101 91 20 0 16.2 VF
#> 10 95 92 20 0 10.8 VF
#> # ℹ 8,834 more rows
# This leaves us with 8,844 rows, but input_fields has many missing values
rec |>
  prep() |>
  bake(NULL) |>
  drop_na(input_fields) # Only 4,331 rows left, the same number as in the original dataset
#> # A tibble: 4,331 × 6
#> compounds input_fields iterations num_pending hour class
#> <dbl> <imp_wts> <dbl> <dbl> <dbl> <fct>
#> 1 997 137 20 0 14 F
#> 2 97 103 20 0 13.8 VF
#> 3 101 75 10 0 13.8 VF
#> 4 93 76 20 0 10.1 VF
#> 5 100 82 20 0 10.4 VF
#> 6 100 82 20 0 16.5 VF
#> 7 105 88 20 0 16.4 VF
#> 8 98 95 20 0 16.7 VF
#> 9 101 91 20 0 16.2 VF
#> 10 95 92 20 0 10.8 VF
#> # ℹ 4,321 more rows
# On the other hand, `step_upsample()` does work
rec <- recipe(class ~ ., data = hpc_data) |>
  step_upsample(class)
rec |>
  prep() |>
  bake(NULL) |>
  drop_na(input_fields)
#> # A tibble: 8,844 × 6
#> compounds input_fields iterations num_pending hour class
#> <dbl> <imp_wts> <dbl> <dbl> <dbl> <fct>
#> 1 97 103 20 0 13.8 VF
#> 2 101 75 10 0 13.8 VF
#> 3 93 76 20 0 10.1 VF
#> 4 100 82 20 0 10.4 VF
#> 5 100 82 20 0 16.5 VF
#> 6 105 88 20 0 16.4 VF
#> 7 98 95 20 0 16.7 VF
#> 8 101 91 20 0 16.2 VF
#> 9 95 92 20 0 10.8 VF
#> 10 102 96 20 0 9.97 VF
#> # ℹ 8,834 more rows
Created on 2024-04-15 with reprex v2.1.0
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.3.3 (2024-02-29 ucrt)
#> os Windows 10 x64 (build 19045)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United Kingdom.utf8
#> ctype English_United Kingdom.utf8
#> tz Europe/Brussels
#> date 2024-04-15
#> pandoc 3.1.1 @ C:/Workdir/MyApps/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> backports 1.4.1 2021-12-13 [1] CRAN (R 4.1.2)
#> broom * 1.0.5 2023-06-09 [1] CRAN (R 4.3.1)
#> class 7.3-22 2023-05-03 [1] CRAN (R 4.3.0)
#> cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.2)
#> codetools 0.2-19 2023-02-01 [1] CRAN (R 4.2.2)
#> colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.2.2)
#> data.table 1.15.0 2024-01-30 [1] CRAN (R 4.3.2)
#> dials * 1.2.1 2024-02-22 [1] CRAN (R 4.3.2)
#> DiceDesign 1.10 2023-12-07 [1] CRAN (R 4.3.2)
#> digest 0.6.34 2024-01-11 [1] CRAN (R 4.3.2)
#> dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.2)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.0.5)
#> evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.2)
#> fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.2)
#> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.2.2)
#> foreach 1.5.2 2022-02-02 [1] CRAN (R 4.1.3)
#> fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.1)
#> furrr 0.3.1 2022-08-15 [1] CRAN (R 4.2.1)
#> future 1.33.1 2023-12-22 [1] CRAN (R 4.3.2)
#> future.apply 1.11.1 2023-12-21 [1] CRAN (R 4.3.2)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.1)
#> ggplot2 * 3.5.0 2024-02-23 [1] CRAN (R 4.3.2)
#> globals 0.16.2 2022-11-21 [1] CRAN (R 4.2.2)
#> glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.2)
#> gower 1.0.1 2022-12-22 [1] CRAN (R 4.2.2)
#> GPfit 1.0-8 2019-02-08 [1] CRAN (R 4.0.0)
#> gtable 0.3.4 2023-08-21 [1] CRAN (R 4.3.1)
#> hardhat 1.3.1 2024-02-02 [1] CRAN (R 4.3.2)
#> htmltools 0.5.7 2023-11-03 [1] CRAN (R 4.3.2)
#> infer * 1.0.6 2024-01-31 [1] CRAN (R 4.3.2)
#> ipred 0.9-14 2023-03-09 [1] CRAN (R 4.2.2)
#> iterators 1.0.14 2022-02-05 [1] CRAN (R 4.1.3)
#> knitr 1.45 2023-10-30 [1] CRAN (R 4.3.2)
#> lattice 0.22-5 2023-10-24 [1] CRAN (R 4.3.2)
#> lava 1.7.3 2023-11-04 [1] CRAN (R 4.3.2)
#> lhs 1.1.6 2022-12-17 [1] CRAN (R 4.2.2)
#> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.1)
#> listenv 0.9.1 2024-01-29 [1] CRAN (R 4.3.2)
#> lubridate 1.9.3 2023-09-27 [1] CRAN (R 4.3.2)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.1.3)
#> MASS 7.3-60.0.1 2024-01-13 [1] CRAN (R 4.3.2)
#> Matrix 1.6-5 2024-01-11 [1] CRAN (R 4.3.2)
#> modeldata * 1.3.0 2024-01-21 [1] CRAN (R 4.3.2)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.0)
#> nnet 7.3-19 2023-05-03 [1] CRAN (R 4.3.0)
#> parallelly 1.37.1 2024-02-29 [1] CRAN (R 4.3.2)
#> parsnip * 1.2.0 2024-02-16 [1] CRAN (R 4.3.2)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.2.3)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
#> prodlim 2023.08.28 2023-08-28 [1] CRAN (R 4.3.2)
#> purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.1)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.2.1)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0)
#> R.oo 1.26.0 2024-01-24 [1] CRAN (R 4.3.2)
#> R.utils 2.12.3 2023-11-18 [1] CRAN (R 4.3.2)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1)
#> RANN 2.6.1 2019-01-08 [1] CRAN (R 4.0.0)
#> Rcpp 1.0.12 2024-01-09 [1] CRAN (R 4.3.2)
#> recipes * 1.0.10 2024-02-18 [1] CRAN (R 4.3.2)
#> reprex 2.1.0 2024-01-11 [1] CRAN (R 4.3.2)
#> rlang 1.1.3 2024-01-10 [1] CRAN (R 4.3.2)
#> rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.2)
#> ROSE 0.0-4 2021-06-14 [1] CRAN (R 4.3.3)
#> rpart 4.1.23 2023-12-05 [1] CRAN (R 4.3.2)
#> rsample * 1.2.0 2023-08-23 [1] CRAN (R 4.3.1)
#> rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.1)
#> scales * 1.3.0 2023-11-28 [1] CRAN (R 4.3.2)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.2)
#> styler 1.10.2 2023-08-29 [1] CRAN (R 4.3.1)
#> survival 3.5-8 2024-02-14 [1] CRAN (R 4.3.2)
#> themis * 1.0.2 2023-08-14 [1] CRAN (R 4.3.3)
#> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.2.3)
#> tidymodels * 1.1.1 2023-08-24 [1] CRAN (R 4.3.1)
#> tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.3.2)
#> tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.3.3)
#> timechange 0.3.0 2024-01-18 [1] CRAN (R 4.3.2)
#> timeDate 4032.109 2023-12-14 [1] CRAN (R 4.3.2)
#> tune * 1.1.2 2023-08-23 [1] CRAN (R 4.3.1)
#> utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.2)
#> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.2)
#> withr 3.0.0 2024-01-16 [1] CRAN (R 4.3.2)
#> workflows * 1.1.4 2024-02-19 [1] CRAN (R 4.3.2)
#> workflowsets * 1.0.1 2023-04-06 [1] CRAN (R 4.2.3)
#> xfun 0.42 2024-02-08 [1] CRAN (R 4.3.2)
#> yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.2)
#> yardstick * 1.3.0 2024-01-19 [1] CRAN (R 4.3.2)
#>
#> [1] C:/Workdir/MyApps/R-Library/4.0
#> [2] C:/Workdir/MyApps/R/R-4.3.3/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
Proposed solution
I'm wondering whether this behaviour could be implemented in the functions of this package, as an opt-in rather than the default if necessary. If there is some other solution that I've missed, I'd be more than happy to learn about it.
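To make the request concrete, here is a minimal base R sketch of one possible fill-in rule, applied outside the recipe. It is not something themis does today. It assumes that a synthetic row may inherit the mean observed weight of its class, which may or may not suit a given cost structure. The `baked` data frame and its `wts` column are hypothetical stand-ins for the output of bake(NULL), with the weights held as plain numerics rather than as an importance-weights vector.

```r
# Toy stand-in for a baked SMOTE result (hypothetical data): the last two rows
# play the role of synthetic rows whose weight came back as NA.
baked <- data.frame(
  class = factor(c("VF", "VF", "F", "F", "F", "F")),
  wts   = c(100, 80, 120, 140, NA, NA)
)

# Mean observed weight per class, repeated for every row of that class.
class_mean <- ave(
  baked$wts, baked$class,
  FUN = function(w) mean(w, na.rm = TRUE)
)

# Fill only the missing weights; observed weights are left untouched.
baked$wts <- ifelse(is.na(baked$wts), class_mean, baked$wts)

print(baked)
```

With real data, the filled column would still need to be converted back to importance weights before modelling, and the rule should only ever be applied to the training set.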