有时候我不想评估模型在预测单个观测值方面的性能,而是想评估模型在群体预测方面的性能。rsample中的群体重采样工具(如group_vfold_cv
)非常适合确保所有数据拆分都能将群体保持在一起。但我想评估模型的群体性能,而不是单个观测值的性能。
例如,也许我想用一个模型来预测房地产价格,但我最终要用这个模型来估计一个社区的价值。
以艾姆斯数据集为例,我们可以构建模型来预测房屋的销售价格,但我不是根据模型预测单个房屋的性能来调整模型,而是根据模型预测某个社区的房价总和的性能来调整模型(我假设Ames数据集对于每个社区都是“完整的”)。
我在下面提供了一个示例代码。出于速度的原因,我保持了最小的重采样和网格。
#Load in data and transform Neighborhood variable a little
library(tidymodels)
df <- ames
df <- recipe(Sale_Price ~ ., data = df) %>%
step_other(Neighborhood, threshold = .04) %>%
prep() %>%
bake(new_data = df)
#Split data based off nieghborhoods
set.seed(1)
df_splits <- group_initial_split(df, group = Neighborhood)
df_train <- training(df_splits)
df_test <- testing(df_splits)
set.seed(2)
df_folds <- group_vfold_cv(df_train, group = Neighborhood, v = 5, repeats = 1)
#Simple recipe for modeling Sale_Price
rec <- recipe(Sale_Price ~ Lot_Area + Year_Built + Gr_Liv_Area, data = df_train)
#Setting up specification for MARS and RF
mars_earth_spec <-
mars(prod_degree = tune()) %>%
set_engine('earth') %>%
set_mode('regression')
rand_forest_ranger_spec <-
rand_forest(mtry = tune(), min_n = tune()) %>%
set_engine('ranger') %>%
set_mode('regression')
#Setting up the workflow that pairs our recipe with models
no_pre_proc <-
workflow_set(
preproc = list(simple = rec),
models = list(MARS = mars_earth_spec, RF = rand_forest_ranger_spec)
)
#Tune the models
grid_ctrl <-
control_grid(
save_pred = TRUE,
parallel_over = "everything",
save_workflow = TRUE
)
grid_results <-
no_pre_proc %>%
workflow_map(
seed = 1503,
resamples = df_folds,
grid = 5,
control = grid_ctrl
)
#Ranking the models by RMSE for models based off their performance estimating individual houses
grid_results %>%
rank_results() %>%
filter(.metric == "rmse") %>%
select(model, .config, rmse = mean, rank)
#This is not what I want
#I want to rank the models by RMSE of aggregate predictions per neighborhood against the aggregate sale price
#Maybe I need something like... Truth = sum(Sale_Price, by = Neighborhood), estimate = sum(.pred, by Nieghborhood)
我可以评估单个房屋的模型RMSE,但我想评估社区价值的模型RMSE。
1条答案
按热度按时间r3i60tvu1#
对于这个目标没有内置的支持,但是您应该能够手动完成它。
因为
control_grid()
中有save_pred = TRUE
,我们可以用collect_predictions()
和summarize = FALSE
得到所有这些预测。然后是一系列{dplyr}函数和
rmse()
,它们可以应用于分组数据。帧应该会给予你想要的。