How can I implement ROC curve analysis for a Naive Bayes classification algorithm in R?

Posted by 7dl7o3gd on 2023-04-18 in Other

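The examples I find online are very complicated and I cannot apply them to my code. I have a dataset with 14 independent variables and 1 dependent variable, and I am doing classification in R. Here is my code: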

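# stringsAsFactors = TRUE: since R 4.0 strings are no longer converted to
# factors by default, but levels() and confusionMatrix() below need factors
dataset <- read.table("adult.data", sep = ",", na.strings = c(" ?"), stringsAsFactors = TRUE)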
colnames(dataset) <- c( "age",
                        "workclass",
                        "fnlwgt",
                        "education",
                        "education.num",
                        "marital.status",
                        "occupation",
                        "relationship",
                        "race",
                        "sex",
                        "capital.gain",
                        "capital.loss",
                        "hours.per.week",
                        "native.country",
                        "is.big.50k")
dataset = na.omit(dataset)

library(caret)
set.seed(1)
traning.indices <- createDataPartition(y = dataset$is.big.50k, p = 0.7, list = FALSE)
training.set <- dataset[traning.indices,]
test.set <- dataset[-traning.indices,]

###################################################################
## Naive Bayes
library(e1071)
classifier = naiveBayes(x = training.set[,-15],
                                    y = training.set$is.big.50k)

prediction = predict(classifier, newdata = test.set[,-15])

cm <- confusionMatrix(data = prediction, reference = test.set[,15],
                      positive = levels(test.set$is.big.50k)[2])

accuracy <- sum(diag(as.matrix(cm))) / sum(as.matrix(cm))

sensitivity <- sensitivity(prediction, test.set[,15],
                           positive = levels(test.set$is.big.50k)[2])

specificity <- specificity(prediction, test.set[,15],
                           negative = levels(test.set$is.big.50k)[1])

I tried the following and it runs. Is there a mistake in it? Is there a problem with the conversion (the as.numeric() calls)?

library(ROCR)
pred <- prediction(as.numeric(prediction), as.numeric(test.set[,15]))
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main = "ROC curve for NB",
     col = "blue", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2)

Source: https://stackoverflow.com/questions/47883541/how-can-i-implement-roc-curve-analysis-for-this-naive-bayes-classification-algor

2 Answers

fkaflof61#

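For a ROC curve to work, you need some threshold or hyperparameter.
The numeric output of Bayes classifiers tends to be too unreliable (while the binary decision is usually fine), and there is no obvious hyperparameter. You could try treating the prior probability (in a binary problem only!) as the parameter and plot a ROC curve for that.
Either way, for the curve to exist you need a map from some curve parameter t to (TPR, FPR). For example, t could be your prior.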
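
The idea of a map from a curve parameter t to (FPR, TPR) can be sketched in base R as below. This is only an illustration under assumptions: `roc_point`, `scores`, and `labels` are hypothetical names, the data are made up, and t is taken to be a cutoff on a numeric score rather than the prior suggested above.

```r
# Hypothetical helper: maps a curve parameter t (here a score cutoff)
# to one point (FPR, TPR) of the ROC curve.
roc_point <- function(t, scores, labels) {
  predicted <- scores > t                          # binary decision at cutoff t
  tpr <- sum(predicted & labels) / sum(labels)     # true positive rate
  fpr <- sum(predicted & !labels) / sum(!labels)   # false positive rate
  c(fpr = fpr, tpr = tpr)
}

# Made-up scores and true labels, for illustration only
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
labels <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Sweep t to trace the curve; each column is one (FPR, TPR) point
ts <- seq(0, 1, by = 0.25)
curve_points <- sapply(ts, roc_point, scores = scores, labels = labels)
colnames(curve_points) <- ts
print(curve_points)
```

Without such a sweep there is only a single (FPR, TPR) point, which is why hard class labels alone do not give a curve.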

2023-04-18

1zmg4dgp2#

Try this:

set.seed(1)
library(data.table)
amount = 100
dataset = data.table(
  x = runif(amount, -1, 1)
  ,y = runif(amount, -1, 1)
)
# inside the circle with radius 0.5? -> true, otherwise false
dataset = dataset[, target := (sqrt(x^2 + y^2) < 0.5)]
plot(dataset[target == F]$x, dataset[target == F]$y, col="red", xlim = c(-1, 1), ylim = c(-1, 1))
points(dataset[target == T]$x, dataset[target == T]$y, col="green")

library(caret)

traning.indices <- createDataPartition(y = dataset$target, p = 0.7, list = FALSE)
training.set <- dataset[traning.indices,]
test.set <- dataset[-traning.indices,]

###################################################################
## Naive Bayes
library(e1071)
classifier = naiveBayes(x = training.set[,.(x,y)],
                        y = training.set$target)

prediction = predict(classifier, newdata = test.set[,.(x,y)], type="raw")
prediction = prediction[, 2]
test.set = test.set[, prediction := prediction]

TPrates = c()
TNrates = c()
# candidate decision thresholds: 0, 0.1, ..., 1
thresholds = seq(0, 1, by = 0.1)
for (threshold in thresholds) {
  # true positive rate (recall): share of TRUE examples predicted TRUE at this threshold
  TPrateForThisThreshold = test.set[target == TRUE & prediction > threshold, .N]/test.set[target == TRUE, .N]
  # true negative rate (specificity): share of FALSE examples predicted FALSE at this threshold
  TNrateForThisThreshold = test.set[target == FALSE & prediction <= threshold, .N]/test.set[target == FALSE, .N]

  TPrates = c(TPrates, TPrateForThisThreshold)
  TNrates = c(TNrates, TNrateForThisThreshold)
}

# ROC curve: false positive rate (1 - specificity) on x, true positive rate on y
plot(1-TNrates, TPrates, type="l")
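
A curve like the hand-rolled one above can probably be cross-checked with ROCR, the package used in the question, by passing it the class probabilities instead of `as.numeric()` of the hard labels. The following is a self-contained sketch under assumptions: the data are regenerated here, the split uses `sample()` rather than caret, and the names `d`, `fit`, and `scores` are illustrative.

```r
library(e1071)
library(ROCR)

set.seed(1)
n <- 300
# Made-up data: points in a square, positive if inside a circle of radius 0.5
d <- data.frame(x = runif(n, -1, 1), y = runif(n, -1, 1))
d$target <- factor(sqrt(d$x^2 + d$y^2) < 0.5, levels = c(FALSE, TRUE))

# Simple 70/30 split (an assumption; the thread uses caret::createDataPartition)
train <- sample(n, size = 0.7 * n)
fit <- naiveBayes(x = d[train, c("x", "y")], y = d$target[train])

# type = "raw" returns posterior probabilities; keep the column for class TRUE
scores <- predict(fit, newdata = d[-train, c("x", "y")], type = "raw")[, "TRUE"]

# ROCR sweeps all cutoffs on the scores itself
pred <- ROCR::prediction(scores, d$target[-train])
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main = "ROC curve for NB", col = "blue", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2)

# Area under the curve as a single summary number
auc <- performance(pred, measure = "auc")@y.values[[1]]
print(auc)
```

With hard 0/1 labels as the first argument, ROCR only has one meaningful cutoff to try, so the plot collapses to a single corner point joined to (0, 0) and (1, 1).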

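Notes:
You can only plot a ROC curve if you have "numeric-probabilistic" predictions (i.e. numbers between 0 and 1), even if what you want to predict can only be TRUE or FALSE. That is why the prediction line prediction = predict(classifier, newdata = test.set[,.(x,y)], type="raw") needs type="raw": the predictions are then numbers between 0 and 1 instead of 'TRUE' or 'FALSE'. The earlier TRUE/FALSE output is equivalent to "numeric prediction >= 0.5", i.e. if the probability exceeds a threshold, the prediction is "TRUE", otherwise "FALSE".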
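
The difference between the default class output and type="raw" can be seen in a small sketch. It assumes the e1071 package as used in this thread; the data and the names `model`, `hard`, `soft`, and `p_yes` are made up for illustration.

```r
library(e1071)

set.seed(1)
# Made-up two-class data, for illustration only
n <- 200
x <- data.frame(a = rnorm(n), b = rnorm(n))
y <- factor(ifelse(x$a + rnorm(n, sd = 0.5) > 0, "yes", "no"),
            levels = c("no", "yes"))

model <- naiveBayes(x = x, y = y)

hard <- predict(model, newdata = x)                 # factor with levels "no"/"yes"
soft <- predict(model, newdata = x, type = "raw")   # matrix of posterior probabilities
p_yes <- soft[, "yes"]                              # probability of the positive class

print(head(data.frame(hard, p_yes)))

# With two classes, the hard label should agree with cutting p_yes at 0.5
print(table(hard, cut_at_0.5 = ifelse(p_yes > 0.5, "yes", "no")))
```

Only `p_yes` can be cut at thresholds other than 0.5, which is what a ROC curve requires.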
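Who says that 0.5 is the right cutoff for our predictions? Couldn't it be 0.7 or 0.1? Correct! Ad hoc, without further knowledge of the problem, we do not know which threshold is right. That is why we simply try all of them (I only tried 0, 0.1, 0.2, ..., 0.9, 1) and build a confusion matrix for each threshold. This shows how the predictor performs independently of the threshold. The more the line bows toward the perfect classifier (the rectangle, i.e. 100% recall at 0% 1-specificity), the better the classifier performs.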
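
How strongly the curve bows toward the top-left corner is often summarised by the area under the curve (AUC), a measure this answer does not itself mention. A minimal base-R sketch, assuming vectors of false positive and true positive rates like 1-TNrates and TPrates from the loop; `auc_trapezoid` is a hypothetical helper and the numbers are made up:

```r
# Hypothetical helper: area under a ROC curve by the trapezoidal rule.
auc_trapezoid <- function(fpr, tpr) {
  ord <- order(fpr, tpr)   # sort the points from left to right
  fpr <- fpr[ord]
  tpr <- tpr[ord]
  # sum of trapezoid areas between neighbouring points
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

# Made-up ROC points, for illustration only
fpr <- c(1, 0.5, 0.25, 0, 0)
tpr <- c(1, 1, 0.75, 0.5, 0)

auc <- auc_trapezoid(fpr, tpr)
print(auc)
```

An area of 1 corresponds to the perfect "rectangle" classifier, and 0.5 to the diagonal of random guessing.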
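Explaining the axes:
The Y axis means: how many of the actually positive examples did the predictor detect?
The X axis means: how wastefully did the predictor spend its positive predictions?
Suppose you want a high detection rate for the positive examples (e.g. when predicting a disease, you have to make sure that every patient who actually has the disease really gets detected, otherwise the predictor loses its whole point). However, simply predicting "TRUE" for everyone does not help: the treatment may be harmful, or it may just be costly. So there are two opposing players (recall = rate of detected true positives, 1-spec = rate of "wastefulness" of the predictor), and every point on the ROC curve is one possible predictor. You now have to choose the point you want on the ROC curve, look up the threshold that produces that point, and use that threshold in the end.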
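
Choosing a point on the curve and reading off its threshold might look like the sketch below. The table of rates is made up, and both selection rules are assumptions rather than something the answer prescribes: Youden's J (TPR minus FPR) is one common default, and the second rule mirrors the disease example by fixing a minimum recall first.

```r
# Made-up ROC summary: one row per candidate threshold
roc <- data.frame(
  threshold = c(0.1, 0.3, 0.5, 0.7, 0.9),
  tpr       = c(1.00, 0.95, 0.80, 0.55, 0.20),
  fpr       = c(0.90, 0.40, 0.15, 0.05, 0.00)
)

# Rule 1 (assumed): maximise Youden's J = TPR - FPR
roc$youden <- roc$tpr - roc$fpr
best <- roc[which.max(roc$youden), ]
print(best)

# Rule 2 (assumed): require a minimum recall, then minimise "waste" (FPR)
min_recall <- 0.95
ok <- roc[roc$tpr >= min_recall, ]
chosen <- ok[which.min(ok$fpr), ]
print(chosen)
```

The two rules can pick different thresholds, which reflects the trade-off between recall and wastefulness described above.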

2023-04-18
