Stratified k-fold cross-validation in R

ruyhziif, posted 2023-01-10 in Other

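Suppose I have a multi-class dataset (for example iris) and want to run stratified 10-fold CV to evaluate model performance. I found a function called stratified in the splitstackshape package that returns a stratified sample of whatever proportion of the data I ask for, so a single test fold would be 0.1 of the data rows.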

# One fold
library(splitstackshape)
stratified(iris, c("Species"), 0.1)

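I would like to know how to use this function, or any other form of stratified CV, inside a 10-fold loop. I cannot work out the logic behind it. Here is a reproducible example.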

library(splitstackshape)
data = iris
names(data)[ncol(data)] = c("Y")
nFolds = 10

for (i in 1:nFolds){
  testing = stratified(data, c("Y"), 0.1, keep.rownames = TRUE)
  rn = testing$rn
  testing = testing[, -"rn"]
  row.names(testing) = rn
  trainingRows = setdiff(1:nrow(data), as.numeric(row.names(testing)))
  training = data[trainingRows, ]
  names(training)[ncol(training)] = "Y"
}
slmsl1lt #1

Use the caret package for n-fold CV. I would recommend this very informative link on caret.
You may find the following solution useful.

library(tidyverse)
library(splitstackshape)
library(caret)
library(randomForest)

set.seed(123) # set before sampling so the split is reproducible

data = iris

## split data into train and test using stratified sampling
d <- rownames_to_column(data, var = "id") %>% mutate(id = as.integer(id))
training <- d %>% stratified(., group = "Species", size = 0.90)
dim(training)

## proportion check
prop.table(table(training$Species))

testing <- d[-training$id, ]
dim(testing)
prop.table(table(testing$Species))

## Modelling

tControl <- trainControl(
  method = "cv", # cross-validation
  number = 10, # 10 folds
  search = "random" # random search over tuning parameters
)

trRf <- train(
  Species ~ ., # formula
  data = training[, -1], # data without id field
  method = "rf", # random forest model
  trControl = tControl # train control from previous step
)
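As a sketch of how the stratification can be made explicit, the folds can be built with caret's createFolds() and handed to trainControl() through its index argument, which takes the training rows for each resample. The object names folds, ctrl and fit, the single mtry = 2 value and the use of the full iris data are assumptions chosen for illustration, not part of the answer above.

```r
library(caret)
library(randomForest)

set.seed(123)

# Training-row indices for 10 folds, stratified on the class label
folds <- createFolds(iris$Species, k = 10, list = TRUE, returnTrain = TRUE)

# Use exactly these folds for resampling
ctrl <- trainControl(method = "cv", index = folds)

fit <- train(
  Species ~ .,
  data = iris,
  method = "rf",
  tuneGrid = data.frame(mtry = 2),  # single assumed value to keep the run short
  trControl = ctrl
)

# One Accuracy/Kappa row per held-out fold
fit$resample
```

Building the folds yourself makes it possible to reuse the same splits when comparing several models.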
qgelzfjb #2

This is late, but I hope it helps someone. The following sample code works for me:

library(splitstackshape)
library(dplyr) # for %>% and filter()

dat1 <- data.frame(ID = 1:100,
              A = sample(c("AA", "BB", "CC", "DD", "EE"), 100, replace = TRUE),
              B = rnorm(100), C = abs(round(rnorm(100), digits=1)),
              D = sample(c("CA", "NY", "TX"), 100, replace = TRUE),
              E = sample(c("M", "F"), 100, replace = TRUE))

flds = list()
dat = dat1

for(i in 1:10){
  j = 10-(i-1) # number of folds still to be drawn
  if(j > 1){
    # take 1/j of the remaining rows, stratified by E and D
    a = stratified(dat, c("E", "D"), size = 1/j)
    flds[[i]] = a$ID
    # drop the sampled rows so that folds do not overlap
    dat = dat %>% filter(!ID %in% a$ID)
  } else{
    flds[[i]] = dat$ID # the last fold gets whatever is left
  }
}
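The same idea (disjoint folds that each preserve the class proportions) can also be sketched in base R by shuffling fold labels within each class. This is a minimal illustration on iris rather than the splitstackshape approach above; the names k, fold and acc are arbitrary, and MASS::lda is only an assumed stand-in for whatever model is being evaluated.

```r
library(MASS)  # lda(); stand-in model, an assumption for illustration

set.seed(1)
k <- 10
y <- iris$Species

# Within each class, deal out the labels 1..k and shuffle them, so every
# row lands in exactly one fold and each class is spread across the folds
fold <- ave(seq_along(y), y,
            FUN = function(idx) sample(rep_len(seq_len(k), length(idx))))

table(fold, y)  # per-fold class counts

acc <- numeric(k)
for (i in seq_len(k)) {
  train_set <- iris[fold != i, ]  # the other k - 1 folds
  test_set  <- iris[fold == i, ]  # the held-out fold
  model     <- lda(Species ~ ., data = train_set)
  acc[i]    <- mean(predict(model, test_set)$class == test_set$Species)
}
mean(acc)  # cross-validated accuracy
```

Because a fold label is assigned once per row, the test folds cannot overlap, which is what the repeated independent calls to stratified() in the question do not guarantee.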
xj3cbfub #3

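The caret package is a good choice.
createFolds(y, k = 10, list = TRUE, returnTrain = FALSE) returns the indices of y that make up the test folds of a 10-fold CV.
createMultiFolds(y, k = 10, times = 5) returns the training indices of y for 5 repeats of 10-fold CV.
Both stratify the data by the label y.
More information: https://www.r-bloggers.com/2020/11/caretcreatefolds-vs-createmultifolds/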
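A short sketch of what the two calls return, using iris$Species as the label y (the variable names test_folds and train_folds are illustrative):

```r
library(caret)

set.seed(42)
y <- iris$Species

# Held-out (test) row indices, one list element per fold
test_folds <- createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)

# Class counts inside each test fold, to check the stratification
sapply(test_folds, function(idx) table(y[idx]))

# Training row indices for 5 repeats of 10-fold CV (one element per fold and repeat)
train_folds <- createMultiFolds(y, k = 10, times = 5)
length(train_folds)

# The test rows for any element are simply the rows not listed in it
test_rows_1 <- setdiff(seq_along(y), train_folds[[1]])
table(y[test_rows_1])
```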
