如何从使用R中的CARET训练的模型计算95%置信区间？

qxsslcnc 于 2022-12-20 发布在其他

关注(0)|答案(1)|浏览(303)

我已经使用R包caret建立了不同的回归模型。我们如何计算预测的95%置信区间？我已经按照here的讨论进行了讨论，但是，它不起作用。

rm(list = ls())
library(caret)

data("mtcars")
Train_data = mtcars[1:26, -c(8,9)]
Test_data = mtcars[27:32, -c(8,9)]

set.seed(100)
model_pls <- train(
  hp ~ ., 
  data = Train_data, 
  tuneLength = 5, 
  method = "pls", 
  metric = "RMSE", 
  preProcess = c('center', 'scale'), 
  trControl = trainControl(
    method = "repeatedcv", 
    number = 5, 
    repeats = 3, 
    savePredictions = "final"
  )
)

model_rf <- train(
  hp ~ ., 
  data = Train_data, 
  tuneLength = 5, 
  method = "ranger", 
  metric = "RMSE", 
  preProcess = c('center', 'scale'), 
  trControl = trainControl(
    method = "repeatedcv", 
    number = 5, 
    repeats = 3, 
    savePredictions = "final"
  )
)

model_svmr <- train(
  hp ~ ., 
  data = Train_data, 
  tuneLength = 8, 
  method = "svmRadial", 
  metric = "RMSE", 
  preProcess = c('center', 'scale'),
  trControl = trainControl(
    method = "repeatedcv", 
    number = 5, 
    repeats = 3,
  )
)

# This does not generate confidence interval
PLS.pred = predict(model_pls, subset(Test_data, select = -hp))  
RF.pred = predict(model_rf, subset(Test_data, select = -hp)) 
RF.svm = predict(model_svmr , subset(Test_data, select = -hp)) 

# This is not working
predict(model_pls$finalModel, subset(Test_data, select = -hp), interval = "confidence")
predict(model_rf$finalModel, subset(Test_data, select = -hp), interval = "confidence")
predict(model_svmr$finalModel, subset(Test_data, select = -hp), interval = "confidence")

按照Michael Matta的建议，我尝试了下面的代码，但是，它没有按预期工作。

confint(model_pls, level = 0.95)
# Error in UseMethod("vcov"): no applicable method for 'vcov'

predict(model_pls, subset(Test_data, select = -hp), interval = "confidence")
# 64.47807  57.97479 151.59713 130.24356 183.20296  88.50035
# This does not show the CI.

r

来源：https://stackoverflow.com/questions/74696700/how-to-calculate-the-95-confidence-interval-from-a-model-trained-using-caret-in

1条答案

按热度按时间

yeotifhr1#

置信区间来自已知分布和以下统计量或使用重采样构建。RBF SVM、随机森林等不具有已知分布，即，它们不能以与线性模型（lm）相同的方式在任何事物上产生置信区间。
从这类模型中获取置信区间的方法是对训练/测试数据集进行重采样，重新训练，收集你需要的值（例如，使用for循环），然后，从这样收集的数据中，通过已知的均值分布估计期望值的置信区间。
下面的伪代码几乎适用于您想要的任何分数（准确度、RMSE......;评论见下文）：

predictionsTrainAll <- c()
predictionsTestAll <- c() 
scoresTrain <- c()
scoresTest <- c()

for( i in 1:1000){
    d <- shuffle the original dataset,
    training <- draw training dataset from d,
    testing  <- draw testing datassetfrom d (such that training and testing do not have any intersection),
    
    model <- train a model on training data,
    predictionsTrain <- make predictions for training data,
    predictionsTest  <- make predictions for testing data,
    scoreTrain <- evaulate model and obtain any score you like on train,
    scoreTest  <- evaluate model and obtain any score you like on test,
    
    predictionsTrainAll <- append(predictionsTrainAll, predictionsTrain)
    predictionsTestAll <- append(predictionsTestAll, predictionsTest)
    scoresTrain <- append(scoresTrain, scoreTrain)
    scoresTest  <- append(scoresTest, scoreTest)
}

现在，我们可以估计scoresTrain和scoresTest的期望值。由于中心极限定理，我们可以假设期望值服从正态分布（或t分布，因为我们这里有有限的样本）用途：

# scores should be /somehow/ normally distributed (symmetric by mean, meadian close to the mean)
hist(predictionsTrainAll)
hist(predictionsTestAll)
hist(scoresTrain)    
hist(scoresTest)     

# if the histogram are /somehow/ normal:
t.test(predictionsTrainAll)
t.test(predictionsTestAll)
t.test(scoresTrain)
t.test(scoresTest)

它将计算预测值的期望值（真实平均值）的95%置信区间和您想要的任何分数。但是要注意，如果直方图是倾斜的，平均值的估计可能有缺陷，并产生错误的置信区间。
二元分类器的示例：预测的估计 * 真实平均值 * 为0，95% CI = [-0.32，0.32]，因为模型预测为零。但是，预测值只能介于[0; 1]，因此CI的负值部分没有意义。这种CI是正态/t分布所暗示的对称性的结果。当检查的评分/预测的直方图不呈正态分布时，可能发生这种情况。

赞(0）回复(0）举报 2022-12-20

我来回答

如何从使用R中的CARET训练的模型计算95%置信区间？

1条答案

相关问题

热门标签

最新问答