This information is readily available in the docs.

Preprocessing: For a set of documents in sequence file format in ${PATH_TO_SEQUENCE_FILES}, the mahout seq2sparse command performs the TF-IDF transformation (the -wt tfidf option) and L2 length normalization (the -n 2 option), as follows:

$ mahout seq2sparse
-i ${PATH_TO_SEQUENCE_FILES}
-o ${PATH_TO_TFIDF_VECTORS}
-nv
-n 2
-wt tfidf
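
If you are starting from a directory of raw text documents, the input sequence files are typically produced first with mahout seqdirectory (standard Mahout usage; ${PATH_TO_RAW_TEXT} is a placeholder):

$ mahout seqdirectory
-i ${PATH_TO_RAW_TEXT}
-o ${PATH_TO_SEQUENCE_FILES}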
Training: The model is then trained using mahout spark-trainnb. The default is to train a Bayes model; the -c option is given to train a CBayes model:
$ mahout spark-trainnb
-i ${PATH_TO_TFIDF_VECTORS}
-o ${PATH_TO_MODEL}
-ow
-c
Label Assignment/Testing: Classification and testing on a holdout set can then be performed via mahout spark-testnb. Again, the -c option indicates that the model is CBayes:
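A sketch of that invocation, assuming flags symmetric with spark-trainnb (per the Mahout docs; ${PATH_TO_TFIDF_TEST_VECTORS} stands in for the held-out TFIDF vectors):

$ mahout spark-testnb
-i ${PATH_TO_TFIDF_TEST_VECTORS}
-m ${PATH_TO_MODEL}
-c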
Looking at the mahout command-line script, we see that it actually invokes the org.apache.mahout.drivers.TrainNBDriver class. We are interested in using TFIDF vectors of type <Text, VectorWritable>: looking closely, we see that the input is read by the drm.drmDfsRead(inputPath) call and then converted as follows (example from the Spark engine bindings):

/** Read the training set from the inputPath/part-x-00000 sequence file of form <Text, VectorWritable> */
private def readTrainingSet: DrmLike[_] = {
  val inputPath = parser.opts("input").asInstanceOf[String]
  val trainingSet = drm.drmDfsRead(inputPath)
  trainingSet
}
override def process(): Unit = {
  start()

  val complementary = parser.opts("trainComplementary").asInstanceOf[Boolean]
  val outputPath = parser.opts("output").asInstanceOf[String]

  val trainingSet = readTrainingSet
  val (labelIndex, aggregatedObservations) = SparkNaiveBayes.extractLabelsAndAggregateObservations(trainingSet)
  // Pass the -c flag through so a complementary (CBayes) model can be trained.
  val model = NaiveBayes.train(aggregatedObservations, labelIndex, complementary)
  model.dfsWrite(outputPath)

  stop()
}
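
To make the flow concrete, here is a minimal sketch of driving the same pipeline by hand (e.g. from the Mahout Spark shell), assuming the Mahout 0.10-era Spark bindings quoted above; the object name and both paths are placeholders, and the third train argument is assumed to be the complementary flag:

import org.apache.mahout.math.drm._
import org.apache.mahout.sparkbindings._
import org.apache.mahout.classifier.naivebayes.{NaiveBayes, SparkNaiveBayes}

object TrainCBayesByHand extends App {
  // Spin up a Mahout-aware Spark context (DistributedContext is Mahout's wrapper).
  implicit val ctx: DistributedContext =
    mahoutSparkContext(masterUrl = "local[*]", appName = "cbayes-train")

  // The same read the driver performs: <Text, VectorWritable> sequence files -> DRM.
  val trainingSet = drmDfsRead("/tmp/tfidf-vectors")

  // Turn the string document keys into an integer label index plus per-label aggregates.
  val (labelIndex, aggregated) =
    SparkNaiveBayes.extractLabelsAndAggregateObservations(trainingSet)

  // Third argument: trainComplementary; true corresponds to the CLI's -c (CBayes) flag.
  val model = NaiveBayes.train(aggregated, labelIndex, true)
  model.dfsWrite("/tmp/nb-model")

  ctx.close()
}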
/**
 * Load a DRM from HDFS (in Mahout DRM format).
 *
 * @param path   HDFS path of the DRM
 * @param parMin minimum number of partitions to read with
 * @param sc     Spark context (wanted to make that implicit, doesn't work in the current
 *               version of Scala with the type bounds, sorry)
 *
 * @return DRM[Any] where Any is automatically translated to the value type
 */
def drmDfsRead(path: String, parMin: Int = 0)(implicit sc: DistributedContext): CheckpointedDrm[_] = {
  val drmMetadata = hdfsUtils.readDrmHeader(path)
  val k2vFunc = drmMetadata.keyW2ValFunc

  // Load the RDD and convert all Writables to value types right away (due to the reuse of
  // Writables in Hadoop we must do it right after the read operation).
  val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], minPartitions = parMin)
    // Immediately convert key and value Writables into value types.
    .map { case (wKey, wVec) => k2vFunc(wKey) -> wVec.get() }

  // Wrap into a DRM type with the correct matrix row key class tag evident.
  drmWrap(rdd = rdd, cacheHint = CacheHint.NONE)(drmMetadata.keyClassTag.asInstanceOf[ClassTag[Any]])
}
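
The "reuse of writables" comment deserves emphasis: Hadoop's SequenceFile reader recycles a single key object and a single value object across records, so deferring the conversion can leave every RDD element pointing at the last record read. A minimal illustration in plain Spark (hypothetical path, existing SparkContext sc assumed; not Mahout source):

import org.apache.hadoop.io.Text
import org.apache.mahout.math.VectorWritable

// Raw read: Hadoop mutates the same Text/VectorWritable instances in place for
// each record, so caching or collecting `raw` as-is is unsafe.
val raw = sc.sequenceFile("/tmp/tfidf-vectors/part-r-00000",
                          classOf[Text], classOf[VectorWritable])

// Safe: materialize plain value types per record before any cache/collect step,
// which is exactly what the map inside drmDfsRead does via k2vFunc and wVec.get().
val safe = raw.map { case (k, v) => k.toString -> v.get() }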