Scala: training a classifier with data within a partition

Posted by luaexgnf on 2021-05-29 in Spark


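How can I train a classifier on the examples within a partition when the choice of classification algorithm depends on the partition index? For example, consider the following snippet: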

val data = MLUtils.loadLibSVMFile(sc, "path to SVM file")
val r = data.mapPartitionsWithIndex((index, localdata) => {
  if (index % 2 == 0) {
    // train a NaiveBayes with localdata
    NaiveBayes.train(localdata)    // Error => found: Iterator[LabeledPoint], required: RDD[LabeledPoint]
  } else {
    // train a DecisionTree classifier with localdata
    DecisionTree.train(localdata)  // Error => found: Iterator[LabeledPoint], required: RDD[LabeledPoint]
  }
})

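As I see it, the error is justified: these tasks run in their own separate JVMs, and a new distributed job cannot be launched from inside a map task. That is also why I cannot access the SparkContext from within my task. Still, does anyone have another suggestion for achieving my goal?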

mapreduce scala apache-spark

Source: https://stackoverflow.com/questions/62244864/training-a-classifier-with-data-within-a-partition

1 Answer


vsdwdz23 (#1)

Based on the discussion in the comments above, you can try this:

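import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

val rdd = MLUtils.loadLibSVMFile(sc, "path to SVM file")

// Approach 1: draw two independent samples of about 50% of the records each
// (the two samples may overlap)
val nb = rdd.sample(withReplacement = false, fraction = 0.5)
val dt = rdd.sample(withReplacement = false, fraction = 0.5)

// Or approach 2 (use instead of approach 1): split into two disjoint halves.
// randomSplit returns an Array[RDD[...]], so destructure with Array(...), not a tuple.
// val Array(nb, dt) = rdd.randomSplit(Array(0.5, 0.5))

// Apply the algorithms
NaiveBayes.train(nb)
DecisionTree.train(dt, strategy = ..)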
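
If the split really has to follow the partition index, as in the question, one possible variation is to do the filtering on the driver side: `mapPartitionsWithIndex` can return an RDD that keeps only the records from even (or odd) partitions, and that RDD can then be passed to the MLlib trainers. The following is only a sketch under assumptions. The object name `PartitionParityTraining`, the helper `byPartition`, the local master setting, and the decision tree settings (two classes, Gini impurity, depth 5, 32 bins) are all illustrative and do not come from the thread. Note also that which records land in which partition depends on how the input file is split, and that `NaiveBayes.train` expects non-negative feature values.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// Hypothetical object name, used only for this illustration.
object PartitionParityTraining {

  // Hypothetical helper: keep only the records whose partition index satisfies `keep`.
  // The result is still an RDD, so it can be handed to the MLlib trainers on the driver.
  def byPartition(data: RDD[LabeledPoint], keep: Int => Boolean): RDD[LabeledPoint] =
    data.mapPartitionsWithIndex[LabeledPoint](
      (index, records) => if (keep(index)) records else Iterator.empty,
      preservesPartitioning = true
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, for experimentation only.
    val conf = new SparkConf().setAppName("partition-parity-training").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Cache because the data is scanned once per filtered subset.
      val data = MLUtils.loadLibSVMFile(sc, "path to SVM file").cache()

      val evenPartitions = byPartition(data, _ % 2 == 0)
      val oddPartitions = byPartition(data, _ % 2 != 0)

      // Both trainers are called on the driver with proper RDDs.
      val nbModel = NaiveBayes.train(evenPartitions)

      // Assumed settings: binary labels (0.0 / 1.0), no categorical features,
      // Gini impurity, maxDepth = 5, maxBins = 32.
      val dtModel = DecisionTree.trainClassifier(
        oddPartitions, 2, Map[Int, Int](), "gini", 5, 32)

      println(s"NaiveBayes labels: ${nbModel.labels.mkString(", ")}")
      println(s"DecisionTree nodes: ${dtModel.numNodes}")
    } finally {
      sc.stop()
    }
  }
}
```

This yields one model per parity class (one NaiveBayes model for all even partitions, one decision tree for all odd partitions), not one model per partition. Training a separate model for every partition would require one filtered RDD per partition index or a non-distributed learner inside the task.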

2021-05-29
