使用尺度计算每组预测聚合的RMSE

pkln4tw6  于 2023-03-05  发布在  其他
关注(0)|答案(1)|浏览(140)

有时候我不想评估模型在预测单个观测值方面的性能,而是想评估模型在群体预测方面的性能。rsample中的群体重采样工具(如group_vfold_cv)非常适合确保所有数据拆分都能将群体保持在一起。但我想评估模型的群体性能,而不是单个观测值的性能。
例如,也许我想用一个模型来预测房地产价格,但我最终要用这个模型来估计一个社区的价值。
以艾姆斯数据集为例,我们可以构建模型来预测房屋的销售价格,但我不是根据模型预测单个房屋的性能来调整模型,而是根据模型预测某个社区的房价总和的性能来调整模型(我假设Ames数据集对于每个社区都是“完整的”)。
我在下面提供了一个示例代码。出于速度的原因,我保持了最小的重采样和网格。

#Load in data and transform Neighborhood variable a little
library(tidymodels)
df <- ames
df <- recipe(Sale_Price ~ ., data = df) %>% 
  step_other(Neighborhood, threshold = .04) %>% 
  prep() %>% 
  bake(new_data = df)

#Split data based off nieghborhoods
set.seed(1)
df_splits <- group_initial_split(df, group = Neighborhood)
df_train <- training(df_splits)
df_test <- testing(df_splits)
set.seed(2)
df_folds <- group_vfold_cv(df_train, group = Neighborhood, v = 5, repeats = 1)

#Simple recipe for modeling Sale_Price
rec <- recipe(Sale_Price ~ Lot_Area + Year_Built + Gr_Liv_Area, data = df_train)

#Setting up specification for MARS and RF
mars_earth_spec <-
  mars(prod_degree = tune()) %>%
  set_engine('earth') %>%
  set_mode('regression')
rand_forest_ranger_spec <-
  rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine('ranger') %>%
  set_mode('regression')

#Setting up the workflow that pairs our recipe with models
no_pre_proc <- 
  workflow_set(
    preproc = list(simple = rec), 
    models = list(MARS = mars_earth_spec, RF = rand_forest_ranger_spec)
  )

#Tune the models
grid_ctrl <-
  control_grid(
    save_pred = TRUE,
    parallel_over = "everything",
    save_workflow = TRUE
  )
grid_results <-
  no_pre_proc %>%
  workflow_map(
    seed = 1503,
    resamples = df_folds,
    grid = 5,
    control = grid_ctrl
  )

#Ranking the models by RMSE for models based off their performance estimating individual houses
grid_results %>% 
  rank_results() %>% 
  filter(.metric == "rmse") %>% 
  select(model, .config, rmse = mean, rank)
#This is not what I want
#I want to rank the models by RMSE of aggregate predictions per neighborhood against the aggregate sale price
#Maybe I need something like... Truth = sum(Sale_Price, by = Neighborhood), estimate = sum(.pred, by Nieghborhood)

我可以评估单个房屋的模型RMSE,但我想评估社区价值的模型RMSE。

r3i60tvu

r3i60tvu1#

对于这个目标没有内置的支持,但是您应该能够手动完成它。
因为control_grid()中有save_pred = TRUE,我们可以用collect_predictions()summarize = FALSE得到所有这些预测。
然后是一系列{dplyr}函数和rmse(),它们可以应用于分组数据。帧应该会给予你想要的。

#Load in data and transform Neighborhood variable a little
library(tidymodels)
df <- ames
df <- recipe(Sale_Price ~ ., data = df) %>% 
  step_other(Neighborhood, threshold = .04) %>% 
  prep() %>% 
  bake(new_data = df)

#Split data based off nieghborhoods
set.seed(1)
df_splits <- group_initial_split(df, group = Neighborhood)
df_train <- training(df_splits)
df_test <- testing(df_splits)
set.seed(2)
df_folds <- group_vfold_cv(df_train, group = Neighborhood, v = 5, repeats = 1)

#Simple recipe for modeling Sale_Price
rec <- recipe(Sale_Price ~ Lot_Area + Year_Built + Gr_Liv_Area, data = df_train)

#Setting up specification for MARS and RF
mars_earth_spec <-
  mars(prod_degree = tune()) %>%
  set_engine('earth') %>%
  set_mode('regression')
rand_forest_ranger_spec <-
  rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine('ranger') %>%
  set_mode('regression')

#Setting up the workflow that pairs our recipe with models
no_pre_proc <- 
  workflow_set(
    preproc = list(simple = rec), 
    models = list(MARS = mars_earth_spec, RF = rand_forest_ranger_spec)
  )

#Tune the models
grid_ctrl <-
  control_grid(
    save_pred = TRUE,
    parallel_over = "everything",
    save_workflow = TRUE
  )
grid_results <-
  no_pre_proc %>%
  workflow_map(
    seed = 1503,
    resamples = df_folds,
    grid = 5,
    control = grid_ctrl
  )
#> i Creating pre-processing data to finalize unknown parameter: mtry

grid_results %>%
  collect_predictions(summarize = FALSE) %>%
  mutate(Neighborhood = df_train$Neighborhood[.row]) %>%
  group_by(id, model, .config, Neighborhood) %>%
  summarise(Sale_Price = sum(Sale_Price), .pred = sum(.pred), .groups = "drop") %>%
  group_by(id, model, .config) %>%
  rmse(truth = Sale_Price, estimate = .pred) %>%
  group_by(model, .config) %>%
  summarize(mean_rmse = mean(.estimate), .groups = "drop") %>%
  arrange(mean_rmse)
#> # A tibble: 7 × 3
#>   model       .config              mean_rmse
#>   <chr>       <chr>                    <dbl>
#> 1 rand_forest Preprocessor1_Model1  2667177.
#> 2 mars        Preprocessor1_Model2  2695526.
#> 3 rand_forest Preprocessor1_Model4  2819628.
#> 4 rand_forest Preprocessor1_Model5  2824109.
#> 5 rand_forest Preprocessor1_Model3  2845252.
#> 6 rand_forest Preprocessor1_Model2  3059321.
#> 7 mars        Preprocessor1_Model1  3563432.

相关问题