我已经使用R包caret
建立了不同的回归模型。我们如何计算预测的95%置信区间?我已经按照here的讨论进行了讨论,但是,它不起作用。
rm(list = ls())
library(caret)
data("mtcars")
Train_data = mtcars[1:26, -c(8,9)]
Test_data = mtcars[27:32, -c(8,9)]
set.seed(100)
model_pls <- train(
hp ~ .,
data = Train_data,
tuneLength = 5,
method = "pls",
metric = "RMSE",
preProcess = c('center', 'scale'),
trControl = trainControl(
method = "repeatedcv",
number = 5,
repeats = 3,
savePredictions = "final"
)
)
model_rf <- train(
hp ~ .,
data = Train_data,
tuneLength = 5,
method = "ranger",
metric = "RMSE",
preProcess = c('center', 'scale'),
trControl = trainControl(
method = "repeatedcv",
number = 5,
repeats = 3,
savePredictions = "final"
)
)
model_svmr <- train(
hp ~ .,
data = Train_data,
tuneLength = 8,
method = "svmRadial",
metric = "RMSE",
preProcess = c('center', 'scale'),
trControl = trainControl(
method = "repeatedcv",
number = 5,
repeats = 3,
)
)
# This does not generate confidence interval
PLS.pred = predict(model_pls, subset(Test_data, select = -hp))
RF.pred = predict(model_rf, subset(Test_data, select = -hp))
RF.svm = predict(model_svmr , subset(Test_data, select = -hp))
# This is not working
predict(model_pls$finalModel, subset(Test_data, select = -hp), interval = "confidence")
predict(model_rf$finalModel, subset(Test_data, select = -hp), interval = "confidence")
predict(model_svmr$finalModel, subset(Test_data, select = -hp), interval = "confidence")
按照Michael Matta的建议,我尝试了下面的代码,但是,它没有按预期工作。
confint(model_pls, level = 0.95)
# Error in UseMethod("vcov"): no applicable method for 'vcov'
predict(model_pls, subset(Test_data, select = -hp), interval = "confidence")
# 64.47807 57.97479 151.59713 130.24356 183.20296 88.50035
# This does not show the CI.
1条答案
按热度按时间yeotifhr1#
置信区间来自已知分布和以下统计量或使用重采样构建。RBF SVM、随机森林等不具有已知分布,即,它们不能以与线性模型(lm)相同的方式在任何事物上产生置信区间。
从这类模型中获取置信区间的方法是对训练/测试数据集进行重采样,重新训练,收集你需要的值(例如,使用for循环),然后,从这样收集的数据中,通过已知的均值分布估计期望值的置信区间。
下面的伪代码几乎适用于您想要的任何分数(准确度、RMSE......;评论见下文):
现在,我们可以估计scoresTrain和scoresTest的期望值。由于中心极限定理,我们可以假设期望值服从正态分布(或t分布,因为我们这里有有限的样本)用途:
它将计算预测值的期望值(真实平均值)的95%置信区间和您想要的任何分数。但是要注意,如果直方图是倾斜的,平均值的估计可能有缺陷,并产生错误的置信区间。
二元分类器的示例:预测的估计 * 真实平均值 * 为0,95% CI = [-0.32,0.32],因为模型预测为零。但是,预测值只能介于[0; 1],因此CI的负值部分没有意义。这种CI是正态/t分布所暗示的对称性的结果。当检查的评分/预测的直方图不呈正态分布时,可能发生这种情况。