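I want to use tidymodels to choose the best K for K-means clustering. I am exploring workflow sets as a way to try many preprocessing methods, and I would like to compare their performance when choosing K.
I have tried to combine this tutorial with this one, which discusses how to use workflow sets to compare models.
I am using the mtcars data, and I am stuck at the hyperparameter tuning step, where I try to collect the tuning results.
Specifically, I am stuck at the point where I take a workflow set and pass it to tune_cluster. I get the following error: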
tune_cluster(): The first argument to [tune_cluster()] should be either a model or workflow.
# LOAD PACKAGES (pacman::p_load installs any that are missing) ------------
pacman::p_load(tidyverse, tidymodels, tidyclust, janitor, ClusterR, knitr, moments, visdat, skimr, DescTools)
mtcars <- mtcars %>%
  mutate(
    `am` = factor(`am`, labels = c(`0` = "auto", `1` = "man")),
    `vs` = factor(`vs`, labels = c(`0` = "V-shaped", `1` = "straight")),
    `cyl` = factor(`cyl`),
    `gear` = factor(`gear`),
    `carb` = factor(`carb`)
  )
# SET SEED FOR REPRODUCIBILITY (before the random fold assignment)
set.seed(123)
# SET UP 10 FOLD CROSS VALIDATION
mtcars_cv <- vfold_cv(mtcars, v = 10)
# EDA ---------------------------------------------------------------------
#skimr::skim(mtcars)
#DescTools::Desc(mtcars)
# MODEL SPEC --------------------------------------------------------------
kmeans_spec <- k_means(num_clusters = tune())
# PREPROCESSING RECIPES ---------------------------------------------------
rec1 <- recipe(~., data = mtcars) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())
rec2 <- recipe(~., data = mtcars) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 2)
rec3 <- recipe(~., data = mtcars) %>%
  step_log(all_numeric_predictors()) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())
clust_num_grid <- grid_regular(num_clusters(),
  levels = 10
)
# WORKFLOW ----------------------------------------------------------------
wf_set <- workflow_set(
  preproc = list(rec1, rec2, rec3),
  models = list(kmeans_spec)
)
# TUNE HYPER-PARAMETERS ---------------------------------------------------
tune_results <- wf_set %>%
  workflow_map(
    resamples = mtcars_cv,
    grid = clust_num_grid
  ) %>%
  tune_cluster(
    resamples = mtcars_cv,
    grid = clust_num_grid,
    metrics = cluster_metric_set(sse_within_total, sse_total, sse_ratio),
    control = tune::control_grid(save_pred = TRUE, extract = identity)
  )
best_wf <- tune_results %>%
  select_best("sse_ratio")
Any help resolving this would be greatly appreciated.
1 Answer
Thanks for the post! Emil (the tidyclust maintainer) and I (the workflowsets maintainer) just chatted about this.
The workflow set idiom for tuning these models is to pass "tune_cluster" as the tuning function to workflow_map().
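A minimal sketch of what that call might look like, assuming workflow_map() forwards `resamples`, `grid`, and `metrics` to the tuning function as it does for tune_grid(). The names `folds`, `rec_norm`, and the `norm`/`kmeans` labels are illustrative, and the untransformed, all-numeric mtcars is used to keep the recipe short. Because support for "tune_cluster" depends on the installed workflowsets version, the call is wrapped in tryCatch() so the sketch runs either way.

```r
library(tidymodels)
library(tidyclust)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)  # illustrative: 5 folds on the raw, all-numeric mtcars

kmeans_spec <- k_means(num_clusters = tune())

# One simple recipe is enough to show the shape of the call
rec_norm <- recipe(~., data = mtcars) %>%
  step_normalize(all_numeric_predictors())

wf_set <- workflow_set(
  preproc = list(norm = rec_norm),
  models = list(kmeans = kmeans_spec)
)

# On workflowsets versions that reject "tune_cluster", this returns the
# error condition instead of stopping the script
res <- tryCatch(
  workflow_map(
    wf_set,
    fn = "tune_cluster",
    resamples = folds,
    grid = tibble(num_clusters = 2:4),
    metrics = cluster_metric_set(sse_ratio)
  ),
  error = function(e) e
)
```

If `res` comes back as an error condition, conditionMessage(res) shows why the installed version refused the function name.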
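...but workflowsets currently prevents you from passing "tune_cluster" as your tuning function. I have filed an issue in the package repository to remind myself to add support for it. In the meantime, you can approximate the process by tuning each workflow in the set individually.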
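One way to approximate it, as a sketch rather than the exact code from this answer: pull each workflow out of the set with extract_workflow(), call tune_cluster() on it directly, and stack the metrics with the workflow id attached. The helper `tune_one`, the two small recipes, and the 2 to 4 cluster grid are illustrative choices, and the untransformed mtcars is used so that every column is numeric.

```r
library(tidymodels)
library(tidyclust)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)

kmeans_spec <- k_means(num_clusters = tune())

rec_norm <- recipe(~., data = mtcars) %>%
  step_normalize(all_numeric_predictors())
rec_pca <- rec_norm %>%
  step_pca(all_numeric_predictors(), num_comp = 2)

wf_set <- workflow_set(
  preproc = list(norm = rec_norm, pca = rec_pca),
  models = list(kmeans = kmeans_spec)
)

# Hypothetical helper: tune a single workflow from the set by its id
tune_one <- function(id) {
  wf_set %>%
    extract_workflow(id = id) %>%
    tune_cluster(
      resamples = folds,
      grid = tibble(num_clusters = 2:4),
      metrics = cluster_metric_set(sse_ratio)
    )
}

ids <- wf_set$wflow_id
tune_results <- lapply(ids, tune_one)
names(tune_results) <- ids

# One row per workflow, metric, and candidate number of clusters
all_metrics <- bind_rows(lapply(tune_results, collect_metrics), .id = "wflow_id")
```

Keeping the results in a named list means a failure in one recipe can be isolated by running tune_one() on that id alone.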
The third element throws an error from the recipe, but I will let you troubleshoot from there. :)
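As a possible starting point for that troubleshooting (a guess, not a confirmed diagnosis): unlike rec1 and rec2, rec3 never encodes the factor columns, so non-numeric data may reach k-means. The sketch below adds step_dummy() after the log transform, so the 0/1 indicator columns are never logged, and then checks that the baked data is entirely numeric. The names `cars` and `rec3_fixed` are illustrative.

```r
library(tidymodels)

# Rebuild the factor columns as in the question (labels omitted for brevity)
cars <- mtcars
for (col in c("cyl", "vs", "am", "gear", "carb")) {
  cars[[col]] <- factor(cars[[col]])
}

rec3_fixed <- recipe(~., data = cars) %>%
  step_log(all_numeric_predictors()) %>%    # log only the original numeric columns
  step_dummy(all_nominal_predictors()) %>%  # then encode the factors
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Inspect what the model would actually receive
baked <- bake(prep(rec3_fixed), new_data = NULL)
all_numeric <- all(vapply(baked, is.numeric, logical(1)))
```

Running prep() and bake() on a recipe by itself like this is a quick way to see whether a failure comes from preprocessing or from the model fit.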