如何在hadoop map reduce中编写avro输出？

dly7yett 于 2021-06-03 发布在 Hadoop

关注(0)|答案(1)|浏览(331)

我写了一个hadoop字数计算程序 TextInputFormat 输入并以avro格式输出字数。
map reduce作业运行良好，但使用unix命令（如 more 或者 vi . 我希望这个输出是不可读的，因为avro输出是二进制格式的。
我只使用了mapper，reducer不存在。我只想用avro做实验，所以我不担心内存或堆栈溢出。遵循mapper的代码

public class WordCountMapper extends Mapper<LongWritable, Text, AvroKey<String>, AvroValue<Integer>> {

    private Map<String, Integer> wordCountMap = new HashMap<String, Integer>();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] keys = value.toString().split("[\\s-*,\":]");
        for (String currentKey : keys) {
            int currentCount = 1;
            String currentToken = currentKey.trim().toLowerCase();
            if(wordCountMap.containsKey(currentToken)) {
                currentCount = wordCountMap.get(currentToken);
                currentCount++;
            }
            wordCountMap.put(currentToken, currentCount);
        }
        System.out.println("DEBUG : total number of unique words = " + wordCountMap.size());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> currentKeyValue : wordCountMap.entrySet()) {
            AvroKey<String> currentKey = new AvroKey<String>(currentKeyValue.getKey());
            AvroValue<Integer> currentValue = new AvroValue<Integer>(currentKeyValue.getValue());
            context.write(currentKey, currentValue);
        }
    }
}

司机代码如下：

public int run(String[] args) throws Exception {

    Job avroJob = new Job(getConf());
    avroJob.setJarByClass(AvroWordCount.class);
    avroJob.setJobName("Avro word count");

    avroJob.setInputFormatClass(TextInputFormat.class);
    avroJob.setMapperClass(WordCountMapper.class);

    AvroJob.setInputKeySchema(avroJob, Schema.create(Type.INT));
    AvroJob.setInputValueSchema(avroJob, Schema.create(Type.STRING));

    AvroJob.setMapOutputKeySchema(avroJob, Schema.create(Type.STRING));
    AvroJob.setMapOutputValueSchema(avroJob, Schema.create(Type.INT));

    AvroJob.setOutputKeySchema(avroJob, Schema.create(Type.STRING));
    AvroJob.setOutputValueSchema(avroJob, Schema.create(Type.INT));

    FileInputFormat.addInputPath(avroJob, new Path(args[0]));
    FileOutputFormat.setOutputPath(avroJob, new Path(args[1]));

    return avroJob.waitForCompletion(true) ? 0 : 1;
}

我想知道avro输出是什么样子的，我在这个程序中做错了什么。

Java hadoop avro word-count

来源：https://stackoverflow.com/questions/21876062/how-to-write-avro-output-in-hadoop-map-reduce