Description
First of all, this tutorial is amazing. Content, pace, level of detail. I love it.
I encountered one issue with random forest when following the code of Chapter 1 locally. I'm on tidymodels 0.1.2 and randomForest 4.6-14, running on Windows.
I found a workaround by converting the chr columns in car_vars to factors, but I have no idea why the code that works on netlify did not work locally.
Running predict(fit_rf, car_train) returned:
Error in predict.randomForest(object = object$fit, newdata = new_data) :
New factor levels not present in the training data
To reproduce:
install.packages(c("tidymodels","randomForest"))
library(tidymodels)
csv_url <- "https://raw.githubusercontent.com/juliasilge/supervised-ML-case-studies-course/master/data/cars2018.csv"
download.file(csv_url,"cars.csv")
cars <- readr::read_csv("cars.csv")
set.seed(1234)
car_vars <- cars %>%
  select(-model, -model_index)
car_split <- car_vars %>%
  initial_split(prop = 0.8,
                strata = aspiration)
car_train <- training(car_split)
rf_mod <- rand_forest() %>%
  set_mode("regression") %>%
  set_engine("randomForest")
fit_rf <- rf_mod %>%
  fit(log(mpg) ~ .,
      data = car_train)
results <- car_train %>%
  mutate(mpg = log(mpg)) %>%
  bind_cols(predict(fit_rf, car_train) %>%
              rename(.pred_rf = .pred))
What I noticed is that all levels in str(fit_rf[["fit"]][["forest"]][["xlevels"]]) were numeric, unlike in the model stored in data/c1_fit_rf.rds. Maybe someone here could explain why, since randomForest is new to me?
The fix for the error was to convert the character columns to factors: car_vars <- mutate(car_vars, across(where(is.character), as.factor))
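A minimal base-R sketch of the same check and conversion, in case it helps anyone debugging this without the full dataset. The toy data frame and its values (including the `drive` column) are made up for illustration and only stand in for car_vars:

```r
# Toy stand-in for car_vars; column names and values are hypothetical.
toy <- data.frame(
  mpg = c(21, 30, 18),
  drive = c("4WD", "FWD", "RWD"),
  stringsAsFactors = FALSE
)

# Find the columns that are still plain character vectors.
chr_cols <- names(toy)[vapply(toy, is.character, logical(1))]
print(chr_cols)

# Convert them to factors; this has the same effect as
# mutate(toy, across(where(is.character), as.factor)).
toy[chr_cols] <- lapply(toy[chr_cols], as.factor)

# Confirm the classes after conversion.
str(toy)
```

Running the first half on car_vars right after read_csv() should show which columns need converting before the split and fit.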