I am a Chinese student who has just started working with Mahout (please forgive my poor English :-p). I have several thousand formatted Chinese articles in files, and I want to cluster them.
(Mahout 1.0, Hadoop 2.5.1)
First, I write them into a sequence file with the following code:
SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        Writer.file(outPath), Writer.keyClass(Text.class),
        Writer.valueClass(Text.class));
File[] news = input.listFiles();
BufferedReader br;
String data;
Pattern pattern = Pattern.compile("^.*?@(?!http).*?@.*?(?=\\t)");
Matcher matcher;
for (int i = 0; i < news.length; i++) {
    br = new BufferedReader(new FileReader(news[i]));
    while ((data = br.readLine()) != null) {
        matcher = pattern.matcher(data);
        if (matcher.find()) {
            // key: matcher.group() returns the title
            // value: the content, with the header fields and trailing URL stripped
            writer.append(new Text(matcher.group()), new Text(data
                    .replaceAll("^.*?@.*?@.*?\\t|http.*?$", "")
                    .replaceAll("@|\\s*", " ")));
        }
    }
    br.close();
}
writer.sync();
writer.close();
Then I get a sequence file that contains all the articles.
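To double-check this step, the file can be read back and the records counted; a minimal sketch (seqPath is just a placeholder for the outPath used above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// rough check: count the records and print the first few titles
Configuration conf = new Configuration();
Path seqPath = new Path("articles.seq"); // placeholder for the outPath used above
int count = 0;
try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
        SequenceFile.Reader.file(seqPath))) {
    Text key = new Text();
    Text value = new Text();
    while (reader.next(key, value)) {
        if (count < 5) {
            System.out.println("title: " + key + " | content length: " + value.getLength());
        }
        count++;
    }
}
System.out.println("total articles: " + count);

If the count were already 0, the regular expression would be the problem rather than the Mahout part; I believe bin/mahout seqdumper -i <path> shows the same information.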
Next, I vectorize and cluster the documents with the following code:
int minSupport = 2;
int minDf = 1;
int maxDFPercent = 96;
int maxNGramSize = 1;
float minLLRValue = LLRReducer.DEFAULT_MIN_LLR;
int reduceTasks = 1;
int chunkSize = 200;
float norm = 2;
boolean sequentialAccessOutput = false;
boolean namedVector = false;
boolean logNormalize = false;
// here I neglect something inessential
Class<? extends Analyzer> analyzerClass = IKAnalyzer.class;
DocumentProcessor.tokenizeDocuments(new Path(inputDir),
        analyzerClass.asSubclass(Analyzer.class), tokenizedPath, conf);
DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
        new Path(outputDir), tfDirName, conf, minSupport, maxNGramSize,
        minLLRValue, norm, logNormalize, reduceTasks, chunkSize,
        sequentialAccessOutput, namedVector);
Pair<Long[], List<Path>> docFrequenciesFeatures = TFIDFConverter
        .calculateDF(new Path(tfDirName), new Path(outputDir), conf, chunkSize);
TFIDFConverter.processTfIdf(new Path(tfDirName), new Path(outputDir),
        conf, docFrequenciesFeatures, minDf, maxDFPercent, norm,
        logNormalize, sequentialAccessOutput, namedVector, reduceTasks);
Path vectorsFolder = new Path(outputDir, "tfidf-vectors");
Path canopyCentroids = new Path(outputDir, "canopy-centroids");
Path clusterOutput = new Path(outputDir, "clusters");
CanopyDriver.run(conf, vectorsFolder, canopyCentroids,
        new CosineDistanceMeasure(), 0.7, 0.3, true, 0.1, false);
KMeansDriver.run(conf, vectorsFolder,
        new Path(canopyCentroids, "clusters-0-final"), clusterOutput,
        0.01, 20, true, 0.1, false);
A few minutes later the program reaches TFIDFConverter.processTfIdf(...), and processTfIdf completes, but the part-r-00000 file I get is only 90 bytes.
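90 bytes looks like little more than a sequence file header, so it seems that almost no vectors were written. To see where the documents get lost I would dump the tokenized output first, roughly like this (only a sketch, assuming the usual <Text, StringTuple> output of DocumentProcessor; the path is a placeholder for the tokenizedPath used above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.StringTuple;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;

// rough check: did the tokenization step produce any tokens at all?
Configuration conf = new Configuration();
Path tokenized = new Path("output/tokenized-documents"); // placeholder for tokenizedPath
int docs = 0;
for (Pair<Text, StringTuple> record :
        new SequenceFileDirIterable<Text, StringTuple>(tokenized, PathType.LIST, conf)) {
    if (docs < 3) {
        System.out.println(record.getFirst() + " -> "
                + record.getSecond().getEntries().size() + " tokens");
    }
    docs++;
}
System.out.println("tokenized documents: " + docs);

If the documents show up but with zero tokens, I would suspect my IKAnalyzer setup; if the tokens are there, the pruning parameters (minSupport, minDf, maxDFPercent) would be my next guess.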
Does anyone know what mistake I made? Thanks a lot :)