Hadoop word count job runs, but does not sum the counts

Posted by ddrv8njm on 2021-06-03 in Hadoop

I am using Hadoop 1.2.1, and for some reason the Word Count output looks strange:

Input file:

this is sparta this was sparta hello world goodbye world

HDFS output:

goodbye 1
hello   1
is  1
sparta  1
sparta  1
this    1
this    1
was 1
world   1
world   1

Code:

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }

}

Here is some of the relevant console output:

14/01/04 16:17:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/01/04 16:17:37 INFO input.FileInputFormat: Total input paths to process : 1
14/01/04 16:17:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/01/04 16:17:37 WARN snappy.LoadSnappy: Snappy native library not loaded
14/01/04 16:17:38 INFO mapred.JobClient: Running job: job_201401041506_0013
14/01/04 16:17:39 INFO mapred.JobClient:  map 0% reduce 0%
14/01/04 16:17:45 INFO mapred.JobClient:  map 100% reduce 0%
14/01/04 16:17:52 INFO mapred.JobClient:  map 100% reduce 33%
14/01/04 16:17:54 INFO mapred.JobClient:  map 100% reduce 100%
14/01/04 16:17:55 INFO mapred.JobClient: Job complete: job_201401041506_0013
14/01/04 16:17:55 INFO mapred.JobClient: Counters: 26
14/01/04 16:17:55 INFO mapred.JobClient:   Job Counters 
14/01/04 16:17:55 INFO mapred.JobClient:     Launched reduce tasks=1
14/01/04 16:17:55 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=6007
14/01/04 16:17:55 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/04 16:17:55 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/01/04 16:17:55 INFO mapred.JobClient:     Launched map tasks=1
14/01/04 16:17:55 INFO mapred.JobClient:     Data-local map tasks=1
14/01/04 16:17:55 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=9167
14/01/04 16:17:55 INFO mapred.JobClient:   File Output Format Counters 
14/01/04 16:17:55 INFO mapred.JobClient:     Bytes Written=77
14/01/04 16:17:55 INFO mapred.JobClient:   FileSystemCounters
14/01/04 16:17:55 INFO mapred.JobClient:     FILE_BYTES_READ=123
14/01/04 16:17:55 INFO mapred.JobClient:     HDFS_BYTES_READ=169
14/01/04 16:17:55 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=122037
14/01/04 16:17:55 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=77
14/01/04 16:17:55 INFO mapred.JobClient:   File Input Format Counters 
14/01/04 16:17:55 INFO mapred.JobClient:     Bytes Read=57
14/01/04 16:17:55 INFO mapred.JobClient:   Map-Reduce Framework
14/01/04 16:17:55 INFO mapred.JobClient:     Map output materialized bytes=123
14/01/04 16:17:55 INFO mapred.JobClient:     Map input records=10
14/01/04 16:17:55 INFO mapred.JobClient:     Reduce shuffle bytes=123
14/01/04 16:17:55 INFO mapred.JobClient:     Spilled Records=20
14/01/04 16:17:55 INFO mapred.JobClient:     Map output bytes=97
14/01/04 16:17:55 INFO mapred.JobClient:     Total committed heap usage (bytes)=269619200
14/01/04 16:17:55 INFO mapred.JobClient:     Combine input records=0
14/01/04 16:17:55 INFO mapred.JobClient:     SPLIT_RAW_BYTES=112
14/01/04 16:17:55 INFO mapred.JobClient:     Reduce input records=10
14/01/04 16:17:55 INFO mapred.JobClient:     Reduce input groups=7
14/01/04 16:17:55 INFO mapred.JobClient:     Combine output records=0
14/01/04 16:17:55 INFO mapred.JobClient:     Reduce output records=10
14/01/04 16:17:55 INFO mapred.JobClient:     Map output records=10

What could cause this? I am very new to Hadoop, so I don't know where to look. Thanks!

vdzxcuhz #1

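You are using the old API's method signature. In the new API (org.apache.hadoop.mapreduce, which your code uses), the reduce method takes an Iterable rather than an Iterator. The Iterator form belongs to the old 0.x API, which is why you will see it in many examples in books and on the web. Because your method takes an Iterator, it overloads Reducer.reduce instead of overriding it, so the framework calls the default implementation, which writes every key/value pair through unchanged. Your counters are consistent with this: Reduce input groups=7, but Reduce output records=10.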
http://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/mapreduce/reducer.html#reduce%28keyin,%20java.lang.iterable,%20org.apache.hadoop.mapreduce.reducer.context%29
Try:

@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) 
throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
}

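The @Override annotation tells the compiler to check that the reduce method really overrides a method with the same signature in the parent class. With the annotation in place, the Iterator version would have failed to compile instead of being silently ignored.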
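To see the overload-versus-override effect in isolation, here is a minimal sketch that runs without Hadoop. The classes BaseReducer, BrokenReducer, and FixedReducer are hypothetical stand-ins, not Hadoop classes; BaseReducer only imitates the pass-through behavior of the default Reducer.reduce, on the assumption that the framework always calls the Iterable version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class OverloadDemo {

    // Hypothetical stand-in for Hadoop's Reducer: the default reduce()
    // passes every (key, value) pair through unchanged.
    static class BaseReducer {
        final List<String> output = new ArrayList<>();

        protected void reduce(String key, Iterable<Integer> values) {
            for (Integer v : values) {
                output.add(key + " " + v);
            }
        }

        // The "framework" entry point: it only ever calls the Iterable version.
        final void run(String key, List<Integer> values) {
            reduce(key, values);
        }
    }

    // Takes an Iterator, so this is a new overload, not an override.
    // run() never calls it, and the default pass-through is used instead.
    static class BrokenReducer extends BaseReducer {
        public void reduce(String key, Iterator<Integer> values) {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next();
            }
            output.add(key + " " + sum);
        }
    }

    // Matches the parent's signature, so @Override compiles and run() calls it.
    static class FixedReducer extends BaseReducer {
        @Override
        protected void reduce(String key, Iterable<Integer> values) {
            int sum = 0;
            for (int v : values) {
                sum += v;
            }
            output.add(key + " " + sum);
        }
    }

    // Feeds one key with two values of 1 through the given reducer.
    static List<String> reduceWith(BaseReducer reducer) {
        reducer.run("sparta", Arrays.asList(1, 1));
        return reducer.output;
    }

    public static void main(String[] args) {
        System.out.println(reduceWith(new BrokenReducer())); // [sparta 1, sparta 1]
        System.out.println(reduceWith(new FixedReducer()));  // [sparta 2]
    }
}
```

The broken variant reproduces the pattern from the question: one output line per input value, each with a count of 1. Adding @Override to BrokenReducer.reduce would turn the silent mismatch into a compile error.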
