Tuning K for k-means clustering using Tidymodels, Workflowsets, and Recipes

mzmfm0qo  posted on 2023-03-20  in  Other

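I want to use Tidymodels to choose the best value of K for k-means clustering. I am exploring workflow sets as a way to supply several preprocessing methods, and I want to compare their performance when choosing K.
I tried to combine this tutorial with this one, which discusses how to use workflow sets to compare models.
I am using the mtcars data, and I am stuck at the hyperparameter tuning step, where I try to collect the tuning results.
Specifically, I am stuck at the point where I have a workflow set and pass it to tune_cluster. I get the following error:
tune_cluster(): The first argument to [tune_cluster()] should be either a model or workflow.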

# INSTALL PACKAGES
pacman::p_load(tidyverse, tidymodels, tidyclust, janitor, ClusterR, knitr, moments, visdat, skimr, DescTools)

mtcars <- mtcars %>%
  mutate(
    `am` = factor(`am`, labels = c(`0` = "auto", `1` = "man")),
    `vs` = factor(`vs`, labels = c(`0` = "V-shaped", `1` = "straight")),
    `cyl` = factor(`cyl`),
    `gear` = factor(`gear`),
    `carb` = factor(`carb`)
  )

# SET SEED FOR REPRODUCIBILITY (must come before the random resampling below)
set.seed(123)

# SET UP 10 FOLD CROSS VALIDATION
mtcars_cv <- vfold_cv(mtcars, v = 10)

# EDA ---------------------------------------------------------------------

#skimr::skim(mtcars)

#DescTools::Desc(mtcars)

# MODEL SPEC --------------------------------------------------------------

kmeans_spec <- k_means(num_clusters = tune())

# PREPROCESSING RECIPES ---------------------------------------------------

rec1 <- recipe(~., data = mtcars) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

rec2 <- recipe(~., data = mtcars) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 2)

rec3 <- recipe(~., data = mtcars) %>%
  step_log(all_numeric_predictors()) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())

clust_num_grid <- grid_regular(num_clusters(),
  levels = 10
)

# WORKFLOW ----------------------------------------------------------------

wf_set <- workflow_set(
  preproc = list(rec1, rec2, rec3),
  models = list(kmeans_spec)
)

# TUNE HYPER-PARAMETERS ---------------------------------------------------

tune_results <- wf_set %>%
  workflow_map(
    resamples = mtcars_cv,
    grid = clust_num_grid
  ) %>%
  tune_cluster(
    resamples = mtcars_cv,
    grid = clust_num_grid,
    metrics = cluster_metric_set(sse_within_total, sse_total, sse_ratio),
    control = tune::control_grid(save_pred = TRUE, extract = identity)
  )

best_wf <- tune_results %>%
  select_best("sse_ratio")

Any help resolving this would be greatly appreciated.

x6yk4ghg  #1

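Thanks for the post! Emil (the tidyclust maintainer) and I (the workflowsets maintainer) just chatted about this.
The workflow set idiom for tuning these models would look like this: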

tune_results <-
   wf_set %>% 
   workflow_map(
      "tune_cluster",
      resamples = mtcars_cv,
      grid = clust_num_grid,
      metrics = cluster_metric_set(sse_within_total, sse_total, sse_ratio),
      control = tune::control_grid(save_pred = TRUE, extract = identity)
   )

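...but workflowsets currently prevents you from passing "tune_cluster" as your tuning function. I have filed an issue on the package repository to remind myself to add support for it.
In the meantime, you can approximate this process with the following: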

tune_cluster_wf <- function(id) {
   tune_cluster(
      extract_workflow(wf_set, id),
      resamples = mtcars_cv,
      grid = clust_num_grid,
      metrics = cluster_metric_set(sse_within_total, sse_total, sse_ratio),
      control = tune::control_grid(save_pred = TRUE, extract = identity)
   )
}

wf_set$result <- lapply(wf_set$wflow_id, tune_cluster_wf)
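
Once every workflow has its own tune_cluster() result, the metrics can be stacked into one table to compare preprocessors across values of K. The following is only a minimal sketch, not part of the original answer: it uses the built-in all-numeric mtcars, two simplified recipes, five folds, and a small grid, and the names `recs`, `k_grid`, `tune_one`, and `all_metrics` are illustrative assumptions.

```r
library(tidymodels)
library(tidyclust)

set.seed(123)
cars_cv <- vfold_cv(mtcars, v = 5)  # built-in mtcars, all columns numeric

kmeans_spec <- k_means(num_clusters = tune())

# Two illustrative preprocessors (assumed here, simpler than rec1..rec3)
recs <- list(
  normalized = recipe(~ ., data = mtcars) %>%
    step_normalize(all_numeric_predictors()),
  pca = recipe(~ ., data = mtcars) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_pca(all_numeric_predictors(), num_comp = 2)
)

# Small hand-written grid of candidate K values
k_grid <- tibble(num_clusters = 2:6)

# Tune one workflow at a time, as in the lapply() workaround
tune_one <- function(rec) {
  tune_cluster(
    workflow(rec, kmeans_spec),
    resamples = cars_cv,
    grid = k_grid,
    metrics = cluster_metric_set(sse_within_total, sse_total, sse_ratio)
  )
}

results <- lapply(recs, tune_one)

# One row per workflow, K, and metric; list names become the wflow_id column
all_metrics <- lapply(results, collect_metrics) %>%
  bind_rows(.id = "wflow_id")

all_metrics %>%
  filter(.metric == "sse_ratio") %>%
  arrange(wflow_id, num_clusters)
```

Because sse_ratio typically keeps shrinking as K grows, it is usually more informative to look for an elbow in this table (or a plot of it) than to take the minimum.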

The third element throws an error from the recipe, but I'll let you troubleshoot from there. :)
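
One way to start troubleshooting a recipe is to prep and bake it outside of tuning and look at what the model would actually receive. The sketch below rebuilds the third recipe on its own and lists the columns that are still non-numeric after preprocessing. It does not claim to identify the cause of the error; it only shows that, unlike the first two recipes, this one has no step_dummy(), so the factor columns pass through unchanged. The names `cars`, `baked`, `non_numeric`, and `n_bad` are illustrative assumptions.

```r
library(tidymodels)

# Same factor conversions as in the question (labels omitted for brevity)
cars <- mtcars %>%
  mutate(
    am = factor(am),
    vs = factor(vs),
    cyl = factor(cyl),
    gear = factor(gear),
    carb = factor(carb)
  )

rec3 <- recipe(~ ., data = cars) %>%
  step_log(all_numeric_predictors()) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())

# Estimate the steps and return the processed training data
baked <- rec3 %>%
  prep() %>%
  bake(new_data = NULL)

# Columns that are still not numeric after preprocessing
is_num <- vapply(baked, is.numeric, logical(1))
non_numeric <- names(baked)[!is_num]
non_numeric

# Count of NA/NaN/Inf values in the numeric columns (e.g. from log of values <= 0)
n_bad <- sum(!is.finite(as.matrix(baked[is_num])))
n_bad
```

If non-numeric columns turn out to be the problem, adding step_dummy(all_nominal_predictors()) before the numeric steps, as in the first recipe, would be one thing to try.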
