Mahout: after calling TFIDFConverter.processTfIdf(…) I get nothing

xoshrz7s · posted 2021-05-30 in Hadoop

I am a Chinese student who has just started using Mahout (please forgive my poor English :-P). I have a few thousand formatted Chinese articles stored in files and I want to cluster them.
(Mahout 1.0, Hadoop 2.5.1)
First, I write them into a SequenceFile:

    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            Writer.file(outPath), Writer.keyClass(Text.class),
            Writer.valueClass(Text.class));

    File[] news = input.listFiles();
    BufferedReader br;
    String data;
    Pattern pattern = Pattern.compile("^.*?@(?!http).*?@.*?(?=\\t)");
    Matcher matcher;
    for (int i = 0; i < news.length; i++) {
        br = new BufferedReader(new FileReader(news[i]));
        while ((data = br.readLine()) != null) {
            matcher = pattern.matcher(data);
            if (matcher.find()) {
                // matcher.group() returns the title, which becomes the key;
                // the value is the article content with the metadata prefix and URLs stripped
                writer.append(new Text(matcher.group()), new Text(data
                        .replaceAll("^.*?@.*?@.*?\\t|http.*?$", "")
                        .replaceAll("@|\\s*", " ")));
            }
        }
        br.close();
    }
    writer.sync();
    writer.close();

Then I have a SequenceFile containing all of the articles.
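For reference, a minimal sketch like the following (reusing the conf and outPath variables from the writer code above; it is only an illustration, not part of my pipeline) could read a few records back and confirm the SequenceFile is populated:

    // Sanity check (sketch): read the first few key/value pairs back from
    // the SequenceFile written above, using the Hadoop 2.x Reader API.
    SequenceFile.Reader reader = new SequenceFile.Reader(conf,
            SequenceFile.Reader.file(outPath));
    Text key = new Text();
    Text value = new Text();
    int shown = 0;
    while (reader.next(key, value) && shown < 5) {
        String content = value.toString();
        // key = article title, value = cleaned article content
        System.out.println(key + " => "
                + content.substring(0, Math.min(50, content.length())));
        shown++;
    }
    reader.close();

If nothing prints here, the problem would already be in the SequenceFile step rather than in the vectorization.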
Next I run the following code:

    int minSupport = 2;
    int minDf = 1;
    int maxDFPercent = 96;
    int maxNGramSize = 1;
    float minLLRValue = LLRReducer.DEFAULT_MIN_LLR;
    int reduceTasks = 1;
    int chunkSize = 200;
    float norm = 2;
    boolean sequentialAccessOutput = false;
    boolean namedVector = false;
    boolean logNormalize = false;
    // here I omit some inessential setup
    Class<? extends Analyzer> analyzerClass = IKAnalyzer.class;

    DocumentProcessor.tokenizeDocuments(new Path(inputDir),
            analyzerClass.asSubclass(Analyzer.class), tokenizedPath, conf);

    DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
            new Path(outputDir), tfDirName, conf, minSupport, maxNGramSize,
            minLLRValue, norm, logNormalize, reduceTasks, chunkSize,
            sequentialAccessOutput, namedVector);

    Pair<Long[], List<Path>> docFrequenciesFeatures = TFIDFConverter
            .calculateDF(new Path(tfDirName), new Path(outputDir), conf,
                    chunkSize);

    TFIDFConverter.processTfIdf(new Path(tfDirName), new Path(outputDir),
            conf, docFrequenciesFeatures, minDf, maxDFPercent, norm,
            logNormalize, sequentialAccessOutput, namedVector, reduceTasks);

    Path vectorsFolder = new Path(outputDir, "tfidf-vectors");
    Path canopyCentroids = new Path(outputDir, "canopy-centroids");
    Path clusterOutput = new Path(outputDir, "clusters");

    CanopyDriver.run(conf, vectorsFolder, canopyCentroids,
            new CosineDistanceMeasure(), 0.7, 0.3, true, 0.1, false);
    KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids,
            "clusters-0-final"), clusterOutput, 0.01, 20, true, 0.1, false);

After a few minutes the program runs up to TFIDFConverter.processTfIdf(…) and processTfIdf completes, but the part-r-00000 file I get is only 90 bytes in size.
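For what it's worth, a sketch like the following (assuming the conf and vectorsFolder variables from the code above, and that the job wrote a single part-r-00000 file) could count how many TF-IDF vectors were actually written; the Mahout seqdumper tool (mahout seqdumper -i <path>) should show the same thing:

    // Diagnostic sketch: count the TF-IDF vectors written under outputDir/tfidf-vectors
    // (Text key = document id, org.apache.mahout.math.VectorWritable value = vector).
    Path tfidfPart = new Path(vectorsFolder, "part-r-00000");
    SequenceFile.Reader tfidfReader = new SequenceFile.Reader(conf,
            SequenceFile.Reader.file(tfidfPart));
    Text docId = new Text();
    VectorWritable vector = new VectorWritable();
    int count = 0;
    while (tfidfReader.next(docId, vector)) {
        count++;
    }
    tfidfReader.close();
    System.out.println("tfidf vectors found: " + count);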
Does anyone know what mistake I made? Thanks a lot :)

No answers yet.
